Observability in Agentic Flows - Debugging the "Black Box"

observability · tracing · debugging · agent-internals

An agent deletes a staging database table. The logs say: DELETE FROM config_overrides;.

What they don't say is why.

Turns out the agent couldn't find a configuration file. It was looking in the wrong directory. It concluded the environment was corrupted. It decided a "clean reset" was the best course of action.

Perfectly logical reasoning. Absolutely catastrophic outcome.

The file existed. It was in /opt/app/config.yaml. The agent just didn't look there.

This is the observability problem in agentic systems. Traditional logging doesn't solve it.

A Fourth Pillar

In regular software, observability has three pillars: logs, metrics, and traces. They tell you what happened, how often, and in what order.

Usually, that's enough.

For agents, you need a fourth: reasoning logs.

A regular log: Tool called: deleteTable(config_overrides)

A reasoning log: "Searched for config.yaml in /etc/app/ and /home/app/. Neither path contained the file. This suggests the environment is corrupted. Will reset by clearing the overrides table."

The first tells you what happened. The second tells you why — and more importantly, where the reasoning went wrong.

Externalizing the Chain of Thought

The biggest mistake: treating the agent's reasoning as an internal implementation detail.

It's not. In production, the chain of thought is your most important debugging signal.

Force every agent to output its reasoning into a structured field captured by the logging infrastructure. Not as a debug log level that gets filtered out in production — as a first-class field on every agent action.

interface AgentAction {
  actionId: string;
  timestamp: number;
  reasoning: string;      // WHY the agent chose this action
  plan: string[];         // What it intends to do next
  toolName: string;
  toolArgs: Record<string, unknown>;
  result: unknown;
  durationMs: number;
}

When something goes wrong, don't grep for errors. Grep for the reasoning.

The reasoning tells you where the agent's mental model diverged from reality. That's almost always the root cause.
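Making "grep for the reasoning" practical means the reasoning has to land in a structured, machine-filterable log entry, not free-form text. Here is a minimal sketch of that idea; `emitAction` and `findByReasoning` are illustrative names (not a real logging library), and the in-memory `sink` stands in for stdout or a log shipper.

```typescript
// Sketch: structured reasoning logs as JSON lines, filterable by phrase.
// emitAction / findByReasoning are hypothetical helpers, not a library API.

interface AgentActionLog {
  actionId: string;
  timestamp: number;
  reasoning: string; // WHY the agent chose this action
  toolName: string;
}

const sink: string[] = []; // stands in for stdout or a log shipper

function emitAction(action: AgentActionLog): void {
  // One JSON object per line so log tooling can filter on fields.
  sink.push(JSON.stringify(action));
}

// The "grep for the reasoning" step: filter structured entries by phrase.
function findByReasoning(lines: string[], phrase: string): AgentActionLog[] {
  return lines
    .map((line) => JSON.parse(line) as AgentActionLog)
    .filter((a) => a.reasoning.toLowerCase().includes(phrase.toLowerCase()));
}

emitAction({
  actionId: "a1",
  timestamp: Date.now(),
  reasoning: "Config missing in /etc/app/; environment looks corrupted",
  toolName: "deleteTable",
});

const hits = findByReasoning(sink, "corrupted");
```

With this in place, diagnosing the database incident from the intro is a single field filter on "corrupted" rather than a hunt through unstructured output.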

Spans for Tool Calls

Every tool execution should be its own OpenTelemetry span. This isn't optional.

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('agent-tools');

async function instrumentedToolCall(
  toolName: string,
  args: Record<string, unknown>,
  reasoning: string
) {
  return tracer.startActiveSpan(`agent.tool.${toolName}`, async (span) => {
    span.setAttribute('agent.reasoning', reasoning);
    span.setAttribute('tool.name', toolName);
    span.setAttribute('tool.args', JSON.stringify(args));
    
    try {
      // executeTool is the application's own tool dispatcher (assumed to exist)
      const result = await executeTool(toolName, args);
      span.setAttribute('tool.result', JSON.stringify(result));
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({ 
        code: SpanStatusCode.ERROR, 
        message: error instanceof Error ? error.message : 'Unknown error' 
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Now when you look at a trace, you don't just see "tool A called tool B." You see why tool A was called, what it expected to happen, and what actually happened.

That context makes the difference between a 5-minute diagnosis and a 5-hour one.
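The span wrapper above assumes an executeTool dispatcher. A minimal sketch of one, with an in-memory registry of handlers (the registry shape and the readFile handler here are illustrative assumptions, not part of the post):

```typescript
// Sketch of the executeTool dispatcher assumed by the span wrapper.
// The registry and its handlers are hypothetical examples.

type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

const toolRegistry: Record<string, ToolHandler> = {
  // Example handler; a real one would hit the filesystem.
  readFile: async (args) => `contents of ${String(args.path)}`,
};

async function executeTool(
  toolName: string,
  args: Record<string, unknown>
): Promise<unknown> {
  const handler = toolRegistry[toolName];
  if (!handler) {
    // Unknown tools reject, which the wrapper records as a span error.
    throw new Error(`Unknown tool: ${toolName}`);
  }
  return handler(args);
}
```

Keeping dispatch separate from instrumentation means every tool, present or future, gets a span without touching the tool code itself.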

Agents Don't Execute Linearly

Here's something that catches people off guard: agents backtrack. They branch. They retry.

A flat list of log lines is nearly useless for understanding what an agent actually did.

Picture the execution tree for a run like that, with one branch abandoned partway through: the agent decided to backtrack and try a different approach. In a flat log, this looks like the agent did the research twice for no reason. In a tree view, you can see the decision point and why it changed course.

Tools like LangSmith visualize this automatically. But even a custom solution — storing parent-child relationships between actions and rendering them as a tree — is worth building. A parentActionId field and a React tree component. One day of work. Weeks of saved debugging time.
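The data-model half of that one-day build can be sketched in a few lines: take the flat action log, link each action to its parent via parentActionId, and hand the resulting tree to whatever renders it. The field names here follow the AgentAction interface earlier in the post, plus the hypothetical parentActionId.

```typescript
// Sketch: rebuild the execution tree from a flat action list.
// parentActionId is the hypothetical linking field discussed above.

interface FlatAction {
  actionId: string;
  parentActionId: string | null; // null for the root action
  toolName: string;
}

interface ActionNode extends FlatAction {
  children: ActionNode[];
}

function buildTree(actions: FlatAction[]): ActionNode[] {
  const nodes = new Map<string, ActionNode>();
  actions.forEach((a) => nodes.set(a.actionId, { ...a, children: [] }));

  const roots: ActionNode[] = [];
  nodes.forEach((node) => {
    const parent =
      node.parentActionId !== null ? nodes.get(node.parentActionId) : undefined;
    if (parent) {
      parent.children.push(node);
    } else {
      roots.push(node); // the root (and any orphans) render at top level
    }
  });
  return roots;
}

const tree = buildTree([
  { actionId: "1", parentActionId: null, toolName: "plan" },
  { actionId: "2", parentActionId: "1", toolName: "search" },
  { actionId: "3", parentActionId: "1", toolName: "search" }, // the retry branch
]);
```

A retry now shows up as two sibling children under the same decision point, instead of two identical-looking lines in a flat log.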

What to Actually Monitor

After running agents in production for a while, I've found these are the metrics that actually matter:

  • Reasoning divergence rate — How often the agent's stated plan doesn't match what it executed. High rate means confusing prompts.
  • Backtrack frequency — How often the agent undoes or retries a step. Some is healthy. Too much means thrashing.
  • Tool call success rate per tool — Not overall. Per tool. If readFile fails 30% of the time, that's a config problem, not an agent problem.
  • Total tokens per task — Not just for cost. As a proxy for efficiency. If the same task keeps using more tokens over time, something is drifting.

None of these exist in standard APM tools.

You have to build them. But once you have them, you can reason about agent performance the same way you reason about service performance.
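Building these yourself mostly means aggregating over the recorded actions. As one example, here is a sketch of the per-tool success rate; the RecordedCall shape is an assumption, reduced to just the fields this metric needs.

```typescript
// Sketch: per-tool success rate from recorded tool calls.
// RecordedCall is a hypothetical, minimal record shape.

interface RecordedCall {
  toolName: string;
  ok: boolean; // did the tool call succeed?
}

function successRatePerTool(calls: RecordedCall[]): Map<string, number> {
  const totals = new Map<string, { ok: number; all: number }>();
  for (const c of calls) {
    const t = totals.get(c.toolName) ?? { ok: 0, all: 0 };
    t.all += 1;
    if (c.ok) t.ok += 1;
    totals.set(c.toolName, t);
  }
  const rates = new Map<string, number>();
  totals.forEach((t, tool) => rates.set(tool, t.ok / t.all));
  return rates;
}

const rates = successRatePerTool([
  { toolName: "readFile", ok: true },
  { toolName: "readFile", ok: false },
  { toolName: "deleteTable", ok: true },
]);
```

Grouping by tool rather than overall is the whole point: a 30% failure rate on one tool is invisible inside a healthy-looking aggregate.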


How to cite
Pokhrel, N. (2026). "Observability in Agentic Flows - Debugging the 'Black Box'". Native Agents. https://nativeagents.dev/posts/internals/observability-in-agentic-flows