How Conversational AI Services Like Claude Code Are Built
When you use conversational AI services like Claude Code or ChatGPT, you might wonder: how is this actually built?
It looks like magic from the outside, but the core architecture is surprisingly straightforward. With just an LLM API, anyone can build something similar. Making it work "well" is a different story, of course.
Three Core Components: LLM + Tool Use + Agent Loop
The skeleton of a conversational AI service looks like this:
- LLM: A language model that receives user messages and generates responses
- Tool Use (Function Calling): The capability for LLMs to invoke external tools when needed
- Agent Loop: A loop that repeats "think → act → observe → think again"
When these three components combine, you get more than a chatbot that just outputs text—you get an "agent" that can read and write files, search the web, and execute code.
Tool Use: Giving the LLM Hands and Feet
By default, LLMs can only output text. But with Tool Use, LLMs can express intentions like "I want to read a file" or "I want to execute code."
Here's how it works. When you call the LLM API, you send along a list of available tools.
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await anthropic.messages.create({
model: "claude-sonnet-4-5-20250929",
max_tokens: 1024, // required by the Messages API
messages: [
{ role: "user", content: "Read the README.md file for me" }
],
tools: [
{
name: "read_file",
description: "Tool for reading file contents",
input_schema: {
type: "object",
properties: {
path: { type: "string", description: "File path" }
},
required: ["path"]
}
}
]
});
Instead of returning plain text, the LLM sends back a tool_use block like this:
{
"type": "tool_use",
"id": "toolu_01A09q90qw90lq917835lq9",
"name": "read_file",
"input": {
"path": "README.md"
}
}
The LLM doesn't execute the tool directly. Instead, it communicates in JSON: "I want to use this tool like this." Actually executing the tool is the developer's responsibility.
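A minimal dispatcher for that responsibility might look like the sketch below. The names `executeTool` and `toolHandlers` are my own; the key idea is that errors are returned as text so the LLM can see them and recover.

```typescript
// Hypothetical dispatcher: maps tool names from tool_use blocks to handlers.
type ToolHandler = (input: Record<string, unknown>) => string;

const toolHandlers: Record<string, ToolHandler> = {
  read_file: (input) => {
    // A real service would call fs.readFileSync(input.path, "utf-8") here.
    return `(contents of ${input.path})`;
  },
};

function executeTool(name: string, input: Record<string, unknown>): string {
  const handler = toolHandlers[name];
  if (!handler) {
    // Returning the error as text lets the LLM notice and try something else
    return `Error: unknown tool "${name}"`;
  }
  try {
    return handler(input);
  } catch (err) {
    return `Error: ${(err as Error).message}`;
  }
}
```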
Agent Loop: Feeding Results Back to the LLM
Once you've executed a tool, you need to tell the LLM the results. This is the core of the Agent Loop.
// 1. LLM requests a tool_use
const response1 = await anthropic.messages.create({
model: "claude-sonnet-4-5-20250929",
messages: [
{ role: "user", content: "Read the README.md file for me" }
],
tools: [...]
});
// 2. Developer executes the tool (fs is Node's built-in "node:fs" module)
const fileContent = fs.readFileSync("README.md", "utf-8");
// 3. Add execution result back to messages and call LLM again
const response2 = await anthropic.messages.create({
model: "claude-sonnet-4-5-20250929",
messages: [
{ role: "user", content: "Read the README.md file for me" },
{ role: "assistant", content: response1.content }, // includes tool_use block
{
role: "user",
content: [
{
type: "tool_result",
tool_use_id: "toolu_01A09q90qw90lq917835lq9",
content: fileContent
}
]
}
],
tools: [...]
});
LLM APIs are stateless. They don't remember previous conversations. So you need to send the entire conversation history every time.
You build up the messages array with user messages, LLM responses, and tool execution results, repeatedly calling the API. That's the Agent Loop.
ReAct Pattern: Think, Act, Observe
This loop approach is called the "ReAct pattern"—Reasoning and Acting in a cycle.
- Thought: LLM plans what to do next
- Action: Requests a tool call
- Observation: Reviews tool execution results
- Back to Thought: Decides the next action based on results
This process repeats until the stop_reason becomes "end_turn," at which point you exit the loop.
let messages = [{ role: "user", content: "Read README.md and summarize it" }];
while (true) {
const response = await anthropic.messages.create({
model: "claude-sonnet-4-5-20250929",
messages,
tools: [...]
});
messages.push({ role: "assistant", content: response.content });
if (response.stop_reason === "end_turn") {
// Final response reached
break;
}
if (response.stop_reason === "tool_use") {
// A response can contain several tool_use blocks; answer each one
const toolUses = response.content.filter(block => block.type === "tool_use");
messages.push({
role: "user",
content: toolUses.map(toolUse => ({
type: "tool_result",
tool_use_id: toolUse.id,
content: executeTool(toolUse.name, toolUse.input)
}))
});
}
}
This creates an autonomous agent where the LLM calls necessary tools on its own, reviews results, and decides the next action.
Streaming: Showing Responses in Real-Time
In real services, streaming functionality is essential to show LLM responses in real-time. Waiting for the complete response creates a poor user experience.
Streaming implementation happens in two stages.
Stage 1: LLM API → Server (Token Streaming)
const stream = await anthropic.messages.stream({
model: "claude-sonnet-4-5-20250929",
messages,
tools: [...]
});
for await (const event of stream) {
if (event.type === "content_block_delta") {
if (event.delta.type === "text_delta") {
process.stdout.write(event.delta.text);
}
}
if (event.type === "content_block_start") {
if (event.content_block.type === "tool_use") {
console.log(`\n[Tool call: ${event.content_block.name}]`);
}
}
}
The LLM API sends events token by token. Text and tool_use arrive interleaved.
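Tool inputs in particular arrive as JSON fragments (`input_json_delta` events) that you concatenate and parse once the block is complete. The helper below is illustrative, but the event names (`content_block_start`, `content_block_delta`, `input_json_delta`) come from Anthropic's streaming API.

```typescript
// Sketch: assemble a tool call from a sequence of streaming events.
interface StreamEvent {
  type: string;
  content_block?: { type: string; name?: string };
  delta?: { type: string; text?: string; partial_json?: string };
}

function collectToolUse(events: StreamEvent[]): { name: string; input: unknown } | null {
  let name: string | null = null;
  let partialJson = "";
  for (const event of events) {
    if (event.type === "content_block_start" && event.content_block?.type === "tool_use") {
      name = event.content_block.name ?? null;
    }
    if (event.type === "content_block_delta" && event.delta?.type === "input_json_delta") {
      // The tool's input arrives in pieces; concatenate before parsing
      partialJson += event.delta.partial_json ?? "";
    }
  }
  if (!name) return null;
  return { name, input: JSON.parse(partialJson || "{}") };
}
```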
Stage 2: Server → Client (SSE)
The server forwards the LLM's stream straight to the browser, typically via SSE (Server-Sent Events).
// Server (Next.js API Route)
export async function POST(req) {
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
const llmStream = await anthropic.messages.stream({...});
for await (const event of llmStream) {
controller.enqueue(
encoder.encode(`data: ${JSON.stringify(event)}\n\n`)
);
}
controller.close();
}
});
return new Response(stream, {
headers: { "Content-Type": "text/event-stream" }
});
}
// Client
const response = await fetch("/api/chat", {
method: "POST",
body: JSON.stringify({ message: "Read README.md for me" })
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
const { done, value } = await reader.read();
if (done) break;
// An SSE event can be split across chunks, so buffer until a complete "\n\n"-terminated event arrives
buffer += decoder.decode(value, { stream: true });
const parts = buffer.split("\n\n");
buffer = parts.pop(); // keep the (possibly incomplete) last piece
for (const part of parts) {
if (!part.startsWith("data: ")) continue;
const data = JSON.parse(part.slice(6));
if (data.type === "content_block_delta") {
// Update UI
appendToChat(data.delta.text);
}
}
}
This makes the response appear on screen in real-time, as if someone is typing.
Multiple Agent Loops in a Single Stream
The key here is connecting multiple LLM calls within a single SSE connection.
const stream = new ReadableStream({
async start(controller) {
let messages = [{ role: "user", content: userMessage }];
while (true) {
const llmStream = await anthropic.messages.stream({
messages,
tools: [...]
});
// Forward LLM stream to client
for await (const event of llmStream) {
controller.enqueue(encoder.encode(`data: ${JSON.stringify(event)}\n\n`));
}
const finalMessage = await llmStream.finalMessage();
messages.push({ role: "assistant", content: finalMessage.content });
if (finalMessage.stop_reason === "end_turn") break;
if (finalMessage.stop_reason === "tool_use") {
// Execute tool
const toolResults = await executeTools(finalMessage.content);
messages.push({ role: "user", content: toolResults });
// Notify client of tool execution results
controller.enqueue(encoder.encode(`data: ${JSON.stringify({
type: "tool_result",
results: toolResults
})}\n\n`));
}
}
controller.close();
}
});
From the user's perspective, it looks like one long conversation, but internally the system is calling the LLM API multiple times, executing tools, and receiving results.
Same Pattern for All LLMs
This pattern is identical across Claude, OpenAI, Google Gemini, and even open-source models.
// OpenAI
const response = await openai.chat.completions.create({
model: "gpt-4",
messages: [...],
tools: [
{
type: "function",
function: {
name: "read_file",
parameters: { ... }
}
}
]
});
// Google Gemini
const response = await model.generateContent({
contents: [...],
tools: [
{
functionDeclarations: [
{
name: "read_file",
parameters: { ... }
}
]
}
]
});
The field names differ slightly, but the structure is the same. You can build agents this way with any LLM that supports Tool Use.
Libraries like Vercel AI SDK abstract away even these differences. You can use the same code with different LLMs by just swapping the model.
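To make the "same structure, different field names" point concrete, here is a sketch that maps one generic tool definition into both formats. The `GenericTool` shape is invented for this example; the output shapes follow the Anthropic and OpenAI snippets above.

```typescript
// One generic tool definition, mapped into two providers' formats.
interface GenericTool {
  name: string;
  description: string;
  schema: object; // JSON Schema describing the tool's inputs
}

function toAnthropic(tool: GenericTool) {
  // Anthropic: flat object with input_schema
  return { name: tool.name, description: tool.description, input_schema: tool.schema };
}

function toOpenAI(tool: GenericTool) {
  // OpenAI: wrapped in { type: "function", function: {...} } with parameters
  return {
    type: "function",
    function: { name: tool.name, description: tool.description, parameters: tool.schema },
  };
}
```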
Context Management: The Really Hard Problem
Here's a question that comes up naturally. Coding agents like Claude Code need to understand multiple files in a project. Do they dump all file contents into the LLM at once?
No. Agents read files on demand.
Turn 1: LLM → "Let me check the project structure" → calls list_directory
Turn 2: LLM → (sees structure) "Let me read the main file" → read_file("src/index.ts")
Turn 3: LLM → "Let me check related files too" → read_file("src/utils.ts")
Turn 4: LLM → "Now I'll make the changes" → write_file(...)
The LLM decides which files to read on its own and progressively builds up context through tool calls.
The Growing Messages Array Problem
But every time a file is read, its contents get added to the messages array. Since LLM APIs are stateless, you have to send the entire history every time:
messages = [
// user: "Refactor this project" → 50 tokens
// assistant: tool_use(list_directory) → 100 tokens
// user: tool_result(directory structure) → 2,000 tokens
// assistant: tool_use(read_file index.ts) → 100 tokens
// user: tool_result(file contents) → 5,000 tokens
// assistant: tool_use(read_file utils.ts) → 100 tokens
// user: tool_result(file contents) → 3,000 tokens
// ...keeps growing
]
Tokens accumulate with every file read, and eventually you hit the context window limit. Claude has a 200K token limit, GPT-4o has 128K. Sounds like a lot, but reading a few dozen files fills it up fast.
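You also need to know when you are getting close to that limit. A rough sketch follows; the four-characters-per-token ratio is only a common rule of thumb for English text, and a real service would use the provider's token-counting API instead of guessing.

```typescript
// Rough token budgeting: estimate usage and decide when to compact history.
function estimateTokens(texts: string[]): number {
  const chars = texts.reduce((sum, t) => sum + t.length, 0);
  return Math.ceil(chars / 4); // ~4 chars per token, English-text heuristic
}

function needsCompaction(texts: string[], contextLimit = 200_000, headroom = 0.8): boolean {
  // Compact well before the hard limit so the next response still fits
  return estimateTokens(texts) > contextLimit * headroom;
}
```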
Solution 1: Summarization
Replace old tool_results with condensed summaries.
// Original: full file contents (5,000 tokens)
{ type: "tool_result", content: "import React from..." /* full contents */ }
// Replaced with summary (200 tokens)
{ type: "tool_result", content: "[Summary] index.ts: React app entry point. Renders App component and configures router." }
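A sketch of that replacement step. `summarizeFile` is a stand-in for whatever produces the short summary, which in practice is often another LLM call.

```typescript
// Replace oversized tool_result blocks with short summaries.
interface ContentBlock {
  type: string;
  tool_use_id?: string;
  content?: string;
}

function compactToolResults(
  blocks: ContentBlock[],
  summarizeFile: (full: string) => string,
  maxChars = 1000
): ContentBlock[] {
  return blocks.map(block => {
    // Only compact tool_results that are actually large
    if (block.type === "tool_result" && (block.content?.length ?? 0) > maxChars) {
      return { ...block, content: `[Summary] ${summarizeFile(block.content!)}` };
    }
    return block;
  });
}
```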
Solution 2: Sliding Window
Trim old messages and keep only the most recent N messages.
if (messages.length > MAX_MESSAGES) {
const systemSummary = await summarize(messages.slice(0, -MAX_MESSAGES));
messages = [
{ role: "user", content: `[Previous conversation summary] ${systemSummary}` },
...messages.slice(-MAX_MESSAGES)
];
}
Solution 3: Read Only What You Need
Instead of reading entire files, create tools that fetch only specific line ranges or search results.
tools: [
{ name: "read_file" }, // entire file
{ name: "read_lines" }, // specific line range
{ name: "grep" }, // pattern matching results
{ name: "search_codebase" }, // keyword search
]
This is exactly what Claude Code does. It uses Grep to search for relevant files first, then reads only the necessary parts with Read. It doesn't blindly read entire files.
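A line-range tool is trivial to build but pays off immediately. The sketch below is a hypothetical `read_lines` implementation; line numbers are 1-indexed and inclusive, matching how editors display them.

```typescript
// Return only the requested line range so the tool_result stays small.
function readLines(content: string, start: number, end: number): string {
  const lines = content.split("\n");
  // Clamp the range to the file's actual length
  return lines.slice(Math.max(start - 1, 0), end).join("\n");
}
```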
Solution 4: Sub-Agents
Like Claude Code's Task tool, you can spawn separate agents for different parts of the work.
Main Agent (context: full conversation)
├── Sub-Agent A: "Analyze src/auth/" → independent context
├── Sub-Agent B: "Analyze src/api/" → independent context
└── Sub-Agent C: "Analyze tests" → independent context
Each sub-agent has its own messages array. When finished, it returns only a summary to the main agent. This way you don't need to fit everything into a single context window.
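The mechanism reduces to a few lines. In this sketch, `runAgentLoop` is a stand-in for a full Agent Loop against a real LLM API; the point is that the sub-agent's messages array is created fresh and discarded, with only the summary surviving.

```typescript
// Minimal sub-agent sketch: fresh context in, short summary out.
interface Message { role: "user" | "assistant"; content: string; }
type AgentLoop = (messages: Message[]) => string;

function runSubAgent(task: string, runAgentLoop: AgentLoop): string {
  // Independent context: the sub-agent never sees the main conversation
  const messages: Message[] = [{ role: "user", content: task }];
  const summary = runAgentLoop(messages);
  return summary; // only this summary enters the main agent's context
}
```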
Context management is one of the hardest parts of building agents "well." The Agent Loop itself is simple, but deciding what information to keep and what to discard as conversations grow is what separates good services from great ones.
Where Does Differentiation Come From?
If the basic architecture is similar, where do service differences come from? A few key factors stand out.
Model performance is the most important. Given the same tools, some models use them effectively in the right situations, while others misuse them. That's why recent models like Claude Sonnet 4.5 and GPT-4 receive good reviews as coding agents.
System prompt design matters too. How you instruct the LLM with guidelines like "you're a coding expert," "always read files before modifying them," or "retry on errors" completely changes behavior patterns.
Tool design is another differentiator. Even with the same file-reading tool, error handling, permission management, and performance optimization affect user experience.
Context management is also important. As conversations grow longer, so does the messages array. You need to decide what to summarize, what to keep, and when to start fresh context.
Finally, UX matters. Even with identical functionality, UI/UX design completely changes the perceived quality for users.
Lower Barriers to Entry, But...
As LLM APIs have evolved, the barriers to building conversational AI services have dropped significantly. You can implement the core logic in a few hundred lines.
But building it "well" is still hard. Designing good tools, writing appropriate prompts, and implementing stable loops require experience and know-how.
One thing is certain though: we've entered an era where anyone can build their own AI agent. All you need is an LLM API.