The problem with most conversational AI agents right now is that they're optimized for exactly one thing, and that one thing is usually the wrong thing. If you want speed, you get shallow responses from an 8B model that can barely keep up with the nuance of what you're asking. If you want depth from a 70B+ model, the responses are just too slow, and the conversational piece breaks down. The medium itself is constrained too: spoken audio is great for casual conversation but completely inadequate when you need to see a proof worked out step by step or understand a complex graph.
Three weeks ago, I built what we called the "Scratchboard" for an internal hackathon at Tavus, and the core challenge was this: how do you build a conversational agent that's both fast enough to feel natural and smart enough to actually teach you something difficult, like walking through a proof in Math 217 or debugging a recursive algorithm in EECS 203? The answer, it turned out, was to stop thinking about this as a single-model problem and start thinking about it as an architecture problem where two fundamentally different models work in parallel, each doing what it does best.
The Architecture
The yapper-thinker (lmao) architecture is built on a simple premise that becomes complex in execution. You have two models running simultaneously: the yapper, which is your fast conversational agent (think 8B or 70B), and the thinker, which is your slow, methodical reasoning engine (think o1 or similar). The yapper handles all the conversational flow, maintaining that snappy back-and-forth you expect from a good chat interface, while the thinker works in parallel on the hard problems, generating artifacts that get rendered on the scratchboard.
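To make the split concrete, here's a minimal asyncio sketch of the idea. Both model calls are stand-in stubs (the real system talks to actual inference endpoints), and the keyword check is a placeholder for the fine-tuned routing I get into later, not how the production logic works.

```python
import asyncio

async def yapper_reply(history: list[str]) -> str:
    # Stand-in for the fast conversational model (8B/70B class).
    await asyncio.sleep(0.2)                       # ~200 ms feels conversational
    return f"yapper: responding to '{history[-1]}'"

async def thinker_artifact(directive: str) -> str:
    # Stand-in for the slow reasoning model; real thoughts take seconds.
    await asyncio.sleep(5.0)
    return f"<artifact for: {directive}>"

async def handle_turn(user_msg: str, history: list[str], scratchboard: list[str]) -> str:
    history.append(user_msg)
    if "prove" in user_msg or "walk me through" in user_msg:   # placeholder routing
        # Fire-and-forget: the thinker works while the conversation continues.
        task = asyncio.create_task(thinker_artifact(user_msg))
        task.add_done_callback(lambda t: scratchboard.append(t.result()))
    return await yapper_reply(history)             # never blocks on the thinker
```

The important part is the last line: a user's turn only ever waits on the yapper.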
The beauty of this split is that each model can be optimized for its specific role without compromise. The yapper doesn't need to be great at mathematical reasoning because that's not its job; it needs to be witty, responsive, and capable of directing the conversation. The thinker doesn't need to be conversational because it's not talking to the user directly; it needs to be thorough, accurate, and capable of producing structured outputs that can be parsed and rendered.
The Context Problem
Soo, here's where things get interesting, because the yapper and the thinker can't operate in complete isolation from each other. When a user says "walk me through this proof," the yapper needs to know what proof it's talking about, but the proof was generated by the thinker; the yapper never produced any of those tokens and has no inherent knowledge of the proof's contents. Similarly, when the user pivots to a new topic mid-conversation, the thinker needs to know when to abort its current line of reasoning and start thinking about something new, but it's running in parallel and doesn't have direct access to the conversational flow.
The solution we built involves what I call context injection, which is a carefully orchestrated dance of shared state between the two models. When the thinker completes a thought and populates the scratchboard with a new artifact (a LaTeX proof, a React component, a graph), it doesn't just render that artifact to the frontend. It also injects a compressed representation of that artifact into the yapper's context window, giving the yapper just enough information to reference and discuss the artifact intelligently without having to understand how it was generated.
This injection has to be surgical because context windows, even large ones, are finite and expensive. We can't just dump the entire thinker output into the yapper's context; we need to extract the semantic essence of what was generated and present it in a format that the yapper can work with naturally. For a mathematical proof, this might be the theorem statement, the key steps, and the conclusion. For a code snippet, it might be the function signature, the main logic flow, and any edge cases. The yapper gets enough to talk about the artifact as if it understands it deeply, even though it never actually generated it.
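In code, the injection is roughly this shape. It's a sketch under the assumption that artifacts carry a kind, a title, and a full body; the real summarizer does more than truncate, pulling out theorem statements, key steps, or function signatures depending on the artifact type.

```python
from dataclasses import dataclass

@dataclass
class Artifact:
    kind: str     # "proof", "code", "graph", ...
    title: str
    body: str     # the full LaTeX / source / spec rendered on the scratchboard

def summarize_for_yapper(artifact: Artifact, max_chars: int = 600) -> str:
    """Compress an artifact into something the yapper can talk about.
    This sketch just trims and labels; a real version extracts the
    theorem statement, key steps, signatures, edge cases, and so on."""
    essence = artifact.body[:max_chars]
    return (
        f"[scratchboard:{artifact.kind}] {artifact.title}\n"
        f"Key content: {essence}\n"
        f"(Refer to this artifact by title; do not restate it in full.)"
    )

def inject_artifact(yapper_context: list[dict], artifact: Artifact) -> None:
    # The summary rides along as a system message so it survives turn-taking.
    yapper_context.append({"role": "system",
                           "content": summarize_for_yapper(artifact)})
```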
Parallel Inference and Abortion
The thinker runs parallel inferences, which means it's constantly generating potential thoughts based on the conversation state, but not all of those thoughts make it to the scratchboard. Some get aborted because the conversation moved on, some get aborted because they're taking too long, and some get aborted because the yapper explicitly signals that they're no longer relevant. This abortion mechanism is critical because without it, you'd have the thinker burning compute on thoughts that are no longer useful, and you'd have a scratchboard cluttered with outdated or irrelevant artifacts.
The way we handle abortion is through a timeout system combined with explicit signals from the yapper. Each thought has a maximum time budget, and if the thinker hasn't completed that thought within the budget, it gets aborted and the next thought in the queue takes its place. The yapper can also send explicit abort tokens when it detects that the user has moved on to a new topic, which immediately kills any in-progress thoughts related to the old topic. This comes back to context hygiene: keeping the yapper's context window coherent so it doesn't go berserk.
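A rough sketch of one budgeted thought, reusing the thinker_artifact stub from the earlier snippet; the real abort signaling is richer than a single asyncio.Event, but the shape is the same.

```python
import asyncio

async def run_thought(directive: str, budget_s: float, scratchboard: list[str],
                      abort_event: asyncio.Event) -> None:
    """Run one thinker thought under a time budget, honoring explicit aborts."""
    thought = asyncio.create_task(thinker_artifact(directive))   # stub from earlier
    abort = asyncio.create_task(abort_event.wait())
    done, _ = await asyncio.wait(
        {thought, abort}, timeout=budget_s, return_when=asyncio.FIRST_COMPLETED
    )
    if thought in done and not abort_event.is_set():
        scratchboard.append(thought.result())    # thought survived: render it
    else:
        thought.cancel()    # timed out, or the yapper sent an abort token
    abort.cancel()
```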
The Fine-Tuning Layer
None of this works out of the box with off-the-shelf models because the yapper needs to learn how to direct the thinker effectively. We fine-tuned the yapper model on synthetic data that we generated specifically for this architecture, teaching it how to issue directives that the thinker can act on. The thinker, by contrast, was already pretty reliable at structuring its outputs so they could be parsed.
The LoRA adapter we fine-tuned for the yapper focused on learning when to invoke the thinker versus when to handle something conversationally. If a user asks "what's the weather like," the yapper doesn't need to invoke the thinker; it can handle that with a simple conversational response. But if a user asks "can you prove that the square root of 2 is irrational," the yapper needs to recognize that this requires deep reasoning and issue a directive to the thinker. The fine-tuning data included thousands of examples of both types of queries, helping the yapper develop an intuition for when to escalate.
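The training pairs looked roughly like this (illustrative shape, not the actual dataset): each target is either a direct conversational response or a structured directive for the thinker.

```python
# Illustrative fine-tuning pairs; the label is whether the yapper answers
# directly or emits a directive for the thinker.
escalation_examples = [
    {"user": "what's the weather like?",
     "target": {"action": "respond",
                "text": "No idea, I live in a datacenter, but happy to guess."}},
    {"user": "can you prove that the square root of 2 is irrational?",
     "target": {"action": "escalate",
                "directive": "Produce a LaTeX proof by contradiction that sqrt(2) is irrational."}},
    {"user": "walk me through dijkstra on this graph",
     "target": {"action": "escalate",
                "directive": "Generate a step-by-step Dijkstra trace for the provided graph."}},
]
```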
The UX
From the user's perspective, the complexity of the yapper-thinker architecture is completely invisible. They start a conversation, they ask questions, and they get fast conversational responses paired with rich artifacts that appear on the scratchboard as needed. The yapper maintains the conversational flow, keeping things snappy and natural, while the thinker works in the background on the hard problems, producing artifacts that enhance the conversation without slowing it down.
The scratchboard itself is the key to making this work because it gives the thinker a place to put its outputs that's separate from the conversational stream. The user isn't waiting for the thinker to finish before they can continue the conversation; they're talking to the yapper while the thinker works in parallel, and when the thinker finishes, the artifact just appears on the scratchboard. This decoupling of conversational flow from artifact generation is what makes the system feel fast even when the thinker is doing slow, complex reasoning.
The Technical Challenges
Building this required solving several gnarly technical problems that weren't obvious at the outset. The first was managing shared state between two models running in parallel without introducing race conditions or state inconsistencies. We ended up using a message queue architecture where both models publish events to a central queue and subscribe to events from the other model, which gives us a clean separation of concerns and makes it easy to reason about the system's behavior.
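Conceptually the queue is nothing exotic. Here's a toy in-process version just to show the publish/subscribe shape; a real deployment would put a proper broker here.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Event:
    source: str      # "yapper" | "thinker"
    kind: str        # "directive", "artifact_ready", "abort", ...
    payload: dict = field(default_factory=dict)

class Bus:
    """Tiny in-process stand-in for the central queue."""
    def __init__(self) -> None:
        self._subscribers: list[asyncio.Queue] = []

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue[Event] = asyncio.Queue()
        self._subscribers.append(q)
        return q

    def publish(self, event: Event) -> None:
        for q in self._subscribers:
            q.put_nowait(event)   # each side consumes events at its own pace
```

Neither model ever sees the other's internals, only events, which is what keeps the system easy to reason about.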
The second challenge was handling context window limitations. Both models have finite context windows, and as the conversation progresses, we need to decide what to keep in context and what to evict. For the yapper, we prioritize recent conversational turns and injected artifact summaries. For the thinker, we prioritize the current directive and any relevant background information that might be needed to complete the thought. We use a sliding window approach where old context gets compressed into summaries before being evicted, which preserves the semantic content while reducing the token count.
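The eviction logic looks something like this, with the tokenizer and the summarization call passed in as placeholders rather than any particular library.

```python
def compact_context(turns: list[dict], max_tokens: int,
                    count_tokens, summarize) -> list[dict]:
    """Sliding-window eviction: keep recent turns verbatim and fold the
    overflow into a single summary message. `count_tokens` and `summarize`
    stand in for a tokenizer and a cheap summarization call."""
    total, keep = 0, []
    for turn in reversed(turns):                  # walk newest-first
        total += count_tokens(turn["content"])
        if total > max_tokens:
            break
        keep.append(turn)
    keep.reverse()
    evicted = turns[: len(turns) - len(keep)]
    if evicted:
        summary = summarize(evicted)              # compress, don't discard
        keep.insert(0, {"role": "system",
                        "content": f"Earlier in this conversation: {summary}"})
    return keep
```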
The third challenge was latency management. The yapper needs to respond within a few hundred milliseconds to feel natural, but the thinker might take several seconds or even tens of seconds to complete a thought. We handle this by having the yapper acknowledge the user's request immediately ("let me work through that proof for you") and then continuing the conversation while the thinker works in the background. When the thinker finishes, the yapper smoothly transitions to discussing the artifact without making the user feel like they were waiting.
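The yapper side of that handoff is basically an event consumer. Here's a sketch that ties together the bus and the injection helper from the earlier snippets; the nudge message is illustrative, not the actual prompt.

```python
async def yapper_event_loop(bus: Bus, yapper_context: list[dict]) -> None:
    """When the thinker posts an artifact, inject its summary and nudge the
    yapper to transition to it on its next turn instead of making the user wait."""
    inbox = bus.subscribe()
    while True:
        event = await inbox.get()
        if event.kind == "artifact_ready":
            inject_artifact(yapper_context, event.payload["artifact"])
            yapper_context.append({
                "role": "system",
                "content": "A new artifact just landed on the scratchboard; "
                           "bring it up naturally on your next turn.",
            })
```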
What This Enables
The yapper-thinker architecture opens up a class of applications that weren't really feasible before. You can build tutoring systems that walk students through complex proofs or algorithms with the conversational fluidity of a human tutor. You can build data analysis tools that let users explore datasets through natural conversation while generating visualizations and statistical analyses in real time. You can build coding assistants that discuss architecture decisions conversationally while generating and refactoring code in the background.
The key insight is that conversation and deep reasoning are fundamentally different tasks that require different optimizations, and trying to do both with a single model forces you into uncomfortable tradeoffs. By splitting them into separate models that communicate through a well-defined interface, you get the best of both worlds: fast, natural conversation paired with deep, thorough reasoning.
The Future
This is still early, and there's a lot of room for improvement. The context injection mechanism could be smarter about what information to include and how to compress it. The abortion logic could be more sophisticated about predicting which thoughts are likely to be useful and which should be killed early. The fine-tuning could be more targeted, focusing on specific domains where the architecture provides the most value.
I dunno, the core idea feels right. Conversational AI shouldn't be a choice between fast and shallow or slow and deep; it should be both, and the yapper-thinker architecture is one way to get there. There's also nothing stopping us from adding more thinkers to the mix, turning the system into a discrete, auditable mixture of experts (MoE), given the right orchestration framework. I'm building the Lego blocks for this at Tavus, and I'm excited to see what people build with them.
If you're working on similar problems or have thoughts on multi-model architectures, I'd love to hear from you.