The token economy is a reality, and while discussing its details is beyond the scope of this article, most companies see their AI spending budgets being surpassed faster than planned.

So let's discuss what happens when you (or your agent) relies on LLMs to execute actions and generate code. The examples here will come from a code generation standpoint, but they largely apply to most contexts.

LLM Knowledge, Prompts, and Context Windows

If you have been using LLM as your coding agent, it is possible that you never stopped to understand a bit more about how this interaction takes place beyond the visible part (the prompt you type and send).

Any LLM has encoded its training data (knowledge) and uses the input you provide (the prompt) to suggest what will be the next entry.

None
Figure 1. How LLMs generate the responses.

So this cycle feeds an ever-growing prompt to the LLM, and based on its attention mechanism, it picks the most likely candidate to fill the gap.

From this simplistic definition, let's explore some ramifications:

1 — Hallucinations due to Knowledge Cutoff Date

The LLM only knows what it has been trained on or what you provided as part of the context. This means that a new version of the software that uses different options or an SDK released after the cutoff date will not be part of the training data.

The result: hallucination. Unless specifically told, the most common reaction is that it will provide something to answer your question, even if it is fabricated or invalid to the actual context of your problem.

2 — Forgetting Details due to Context Rot

When we interact with an LLM, the tool we use (Claude Code, Visual Studio Code+Github Copilot, Codex CLI, etc) defines and injects data alongside what you type.

None
Figure 2. Visual Studio Code Chat Debug with system prompt.

Each agent has a different definition, and they constantly evolve those definitions, but you can recognize some common parts:

  • System prompt
  • User environment
  • Tools definitions
  • User prompt

Remember the loop we talked about? As the LLM returns its output to use, and as we continue interacting with it (automatically or manually), we keep sending the entire conversation back as the new context to the LLM.

So all models will inform their context window size, which is the maximum number of tokens it can hold. It is not uncommon for frontier models to have 200K upwards of 1M tokens.

While these numbers look big, let's look at the following data:

None
Figure 3. Context window usage after a hello world message

This was the result of a single message being exchanged with a model. No lengthy specification or multi-step feature implementation.

So, before I sent any meaningful piece of information, the context was already at 10% used. To make matters worse, as the context usage grows, the quality of the responses provided can deteriorate rapidly.

None
Figure 4. Performance degradation as the number of tokens increase. Source

This image, while using some older models, indicates that there is a sharp decline at 40% usage already. So all of a sudden, you may enter the dumb area of the context without even realizing. Newer models claim to be better at this, but no new study has been done so far to back this up independently.

So, how to get a peek at what is happening behind the scenes, how many tokens are being used, what is the system prompt, what tools are used, etc.? OpenTelemetry to the rescue.

OpenTelemetry

OpenTelemetry is a well-defined and adopted standard that enables systems to send logs, traces, and metrics. It has been used for a long time and is available for most languages and supported by standalone and SaaS applications.

Visual Studio Code recently added support to expose Copilot Chat information using this format. The full documentation can be seen here, but the simplest way to enable it is to select this setting.

None
Figure 5. Enabling OTEL in Visual Studio Code.

After a restart, it will automatically start emitting to the HTTP endpoint http://localhost:4318. You have other settings that enable you to customize this endpoint, use gRPC instead, and even specify what data you want to be available as part of it. By default, only system data is emitted to avoid privacy concerns.

To collect and visualize the information, you can tweak the above location and point it to your OTEL software of choice. I will use Aspire as it offers both native Linux and Docker images that can be installed easily.

None
Figure 6. Installing Aspire locally for testing purposes.

You can then start it

None
Figure 7. Starting Aspire to collect OTEL data.

What is nice about this solution is that it offers many features, including a dashboard.

To see it in action, let's ask about creating an application that will use OTEL for that purpose (very meta, I know)

None
Figure 8. Using the coding agent to generate information.

As it is executed, the dashboard starts being populated

None
Figure 9. OTEL logs showing the activity.

We can see some other calls that Visual Studio Code is doing even before I pressed enter, like the ones labeled "GenAI inference: gpt-4o-mini-2024–07–18". By clicking on one of those, I can see details of the logs and the associated tracing information

None
Figure 10. Seeing the trace details of one chat.

You can see many important aspects in this trace alone: how many times, which model, and the time taken by those calls, which tools were executed, and which ones happened in parallel.

By selecting one of these calls, in this case the first one, we have access to what was sent to the agent in that call, alongside other important pieces of information.

None
Figure 11. Detailed information of a single call in a trace. We can see messages, tokens, model, instructions, etc.

For example, you can already see that we had ~13K input tokens for that first interaction. You can even expand the information about the system instructions and tool definitions to know more about what was used.

None
Figure 12. Expanding the system instructions used for that call.

We even have a handy copy to clipboard to help you analyze that further outside this dashboard.

If you are using subagents, you can see their traces as well, and they are grouped to facilitate exploration.

None
Figure 13. Subagents are also tracked, with their calls grouped to facilitate.

Besides the traces and logs, we also have a set of metrics that GitHub Copilot emits. You can filter based on the resource and choose what metric to see the graph/data.

None
Figure 14. Accessing metrics (duration)

The token usage is also available to give you an idea of how much that session costs you.

None
Figure 15. Token usage metric

We can have the same when using Claude Code from within Visual Studio Code and the command line version.

If you want to obtain telemetry from Claude Code itself, you can do so by setting some environment variables.

To enable the telemetry feature

None

Then you configure what you will be emitting for each category, which in our case, we want metrics, logs, and traces via Open Telemetry.

None

And finally, which protocol and where those should be sent to

None

Full documentation can be found here, where you can set authentication, intervals, and many more options, including whether you want the actual prompts to be included as well.

Once you set the variables and start a new Claude code execution, you will be able to see those traces flowing as well.

None
Figure 16. Claude code logs on Aspire.

Wrapping Up

Having a deeper understanding of how your coding agent interacts with the LLMs is key to identifying points where you can miss opportunities for improvement or hurt the quality of your results by bloating the context with unnecessary data.

Establishing a standard way that can be used to gather and analyze your interactions with those coding agents can reveal those opportunities and give you more insights.

Both GitHub Copilot and Claude Code offer you Open Telemetry support and can be used individually or at a team level to know more and take action.

The examples I showed used a locally hosted version of the OTEL that can be enough for a solo developer, but can be expanded and hosted elsewhere, so the entire team can trace their progress.