When you instrument every model call in production — not a sample, every call — you start to see patterns that no evaluation benchmark would have surfaced. After analyzing over 12 million traces from AI systems across a range of industries, what we found challenges some assumptions that engineering teams routinely make about their deployed models.
Most teams look at median and p95 inference latency during load testing and declare themselves satisfied. The production reality is considerably messier. Across the traces we've analyzed, the p99 latency for LLM inference is typically 3-5x the p95 value — a gap that often goes unnoticed until a specific user pattern triggers it consistently.
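If you export per-request durations from your tracing backend, surfacing that gap is only a few lines of work. A minimal sketch, assuming each trace record carries a `duration_ms` field (the field name is illustrative):

```python
import numpy as np

def latency_percentiles(traces):
    """Compute tail-latency percentiles from exported trace records.

    `traces` is assumed to be a list of dicts with a `duration_ms` field,
    in whatever shape your tracing backend exports.
    """
    durations = np.array([t["duration_ms"] for t in traces])
    p50, p95, p99 = np.percentile(durations, [50, 95, 99])
    return {
        "p50_ms": p50,
        "p95_ms": p95,
        "p99_ms": p99,
        "p99_to_p95_ratio": p99 / p95,  # the gap load tests tend to miss
    }
```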
The outliers are informative. High-latency calls cluster around specific input characteristics: unusually long context windows, requests that trigger certain reasoning pathways, inputs that arrive during model warm-up after cold starts, and requests that happen to land on an instance experiencing transient resource contention. None of these show up reliably in pre-deployment testing because they require real traffic patterns to trigger.
The practical implication: latency SLAs defined from evaluation data tend to be optimistic. If you're promising users sub-200ms responses and your production p99 sits at 800ms, one request in every hundred is breaking that promise. That's a product problem, and it's invisible without full-fidelity production tracing.
A model that performs consistently across your evaluation benchmark does not necessarily perform consistently across weeks of production traffic. Output distributions drift for reasons that have nothing to do with model updates: seasonal shifts in user language, changes in upstream system prompts that get cached incorrectly, gradual shifts in the user population as a product grows, and model provider-side changes that don't always come with release notes.
In our trace analysis, we found that the output length distribution for a given use case typically shifts 15-25% over a two-month period even when the model version is pinned. The semantic content shifts more subtly — harder to measure, but detectable using embedding-space comparisons against a baseline distribution established in the weeks after deployment.
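One way to make the embedding-space comparison concrete: embed a sample of recent outputs, embed the baseline sample, and track how far the centroids have moved. The sketch below illustrates the idea rather than reproducing the exact method behind the numbers above; the embedding step and the alert threshold are assumptions.

```python
import numpy as np

def centroid_drift(baseline_embeddings, current_embeddings):
    """Rough drift signal: cosine distance between the centroid of a
    baseline output sample and the centroid of a recent output sample.

    Both arguments are (n_samples, dim) arrays produced by whatever
    embedding model you already use.
    """
    b = np.mean(baseline_embeddings, axis=0)
    c = np.mean(current_embeddings, axis=0)
    cosine_sim = np.dot(b, c) / (np.linalg.norm(b) * np.linalg.norm(c))
    return 1.0 - cosine_sim  # 0 = no shift; larger values = more drift

# Example usage: flag the week for review if the centroid has moved noticeably.
# drift = centroid_drift(baseline_embs, this_week_embs)
# if drift > 0.05:  # hypothetical threshold, calibrate against baseline variance
#     flag_for_review("output distribution drifting from baseline")
```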
These shifts are not always problems. Sometimes they reflect the model learning to handle a more mature user base. But without instrumentation, you can't tell the difference between healthy evolution and quiet degradation. Both look the same from a traditional monitoring perspective: requests come in, responses go out, no errors thrown.
Teams tend to think about AI failures in terms of the failures they've seen: the model says something factually wrong, refuses a legitimate request, or produces off-policy content. What full trace coverage reveals is that these categories aren't distributed the way intuition suggests.
The most common failure mode we observe is not dramatic refusals or egregious outputs — it's subtle inconsistency. The same user asking materially identical questions on different days gets materially different answers, with no obvious explanation. This is the failure mode most likely to erode user trust over time, and it's nearly impossible to detect without comparing traces across sessions.
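Detecting it mostly comes down to joining traces across sessions. A rough sketch, assuming each trace carries a user ID plus unit-normalized embeddings of the question and the answer (all field names and thresholds are illustrative):

```python
import numpy as np
from itertools import combinations

def inconsistency_report(traces, question_sim_threshold=0.95, answer_sim_floor=0.80):
    """Find pairs of traces where the question was nearly identical but the
    answer was not.

    Each trace is assumed to be a dict with `user_id`, `question_emb`, and
    `answer_emb` fields holding unit-normalized embedding vectors.
    """
    def cos(a, b):
        return float(np.dot(a, b))  # cosine similarity, since vectors are unit-normalized

    by_user = {}
    for t in traces:
        by_user.setdefault(t["user_id"], []).append(t)

    flagged = []
    for user, items in by_user.items():
        for a, b in combinations(items, 2):
            same_question = cos(a["question_emb"], b["question_emb"]) >= question_sim_threshold
            divergent_answer = cos(a["answer_emb"], b["answer_emb"]) < answer_sim_floor
            if same_question and divergent_answer:
                flagged.append((user, a, b))
    return flagged
```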
The second most common pattern is context window boundary effects: behavior changes significantly when conversation history crosses certain token thresholds, often in ways that suggest the model is not gracefully handling the truncation of earlier context. Teams that set context limits without testing boundary behavior tend to discover this one from user complaints rather than monitoring.
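If you want monitoring to catch it before users do, one approach is to bucket traces by prompt token count on either side of the truncation threshold and compare whatever per-request quality signal you already collect. Everything below, from the field names to the threshold, is an assumption to adapt:

```python
from statistics import mean

def boundary_effect(traces, truncation_threshold=8000, window=500):
    """Compare a quality signal just below and just above the point where
    older context starts getting truncated.

    Assumes each trace dict has `prompt_tokens` and `quality_score` fields;
    the threshold and window are placeholders for your own limits.
    """
    below = [t["quality_score"] for t in traces
             if truncation_threshold - window <= t["prompt_tokens"] < truncation_threshold]
    above = [t["quality_score"] for t in traces
             if truncation_threshold <= t["prompt_tokens"] < truncation_threshold + window]
    if not below or not above:
        return None  # not enough traffic near the boundary yet
    return {"below_avg": mean(below), "above_avg": mean(above),
            "drop": mean(below) - mean(above)}
```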
The teams that get the most value from trace data tend to share a few practices. First, they instrument the full request-response cycle, not just the model call itself — including prompt construction, any RAG retrieval steps, tool calls in agentic systems, and the post-processing applied to model outputs before they reach users. Tracing only the inference call gives you an incomplete picture.
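As a sketch of what instrumenting the full cycle can look like, here is the shape of it using OpenTelemetry spans; the stage names and pipeline functions are placeholders for whatever your system actually does.

```python
from opentelemetry import trace

tracer = trace.get_tracer("ai-request-pipeline")

# Stand-ins for your own pipeline; replace with real implementations.
def retrieve_documents(query): ...
def build_prompt(query, documents): ...
def call_model(prompt): ...
def post_process(raw_output): ...

def handle_request(user_query: str):
    # One parent span for the whole request, with a child span per stage,
    # so slow or misbehaving stages are attributable rather than lumped
    # into a single "model latency" number.
    with tracer.start_as_current_span("handle_request"):
        with tracer.start_as_current_span("retrieval"):
            documents = retrieve_documents(user_query)
        with tracer.start_as_current_span("prompt_construction"):
            prompt = build_prompt(user_query, documents)
        with tracer.start_as_current_span("model_call"):
            raw_output = call_model(prompt)
        with tracer.start_as_current_span("post_processing"):
            return post_process(raw_output)
```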
Second, they attach rich metadata to every trace: user segment, feature flag state, model version, system prompt hash, and any other contextual signals that might explain behavioral variation. Traces without context are hard to reason about in aggregate; traces with context let you slice behavior by the factors that actually matter.
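Continuing the OpenTelemetry sketch above, attaching that context is mostly a matter of setting span attributes. The attribute keys here are illustrative, and the system prompt is hashed rather than stored verbatim:

```python
import hashlib
from opentelemetry import trace

def annotate_current_span(user_segment: str, model_version: str,
                          system_prompt: str, feature_flags: dict) -> None:
    """Attach the contextual signals that make traces sliceable in aggregate."""
    span = trace.get_current_span()
    span.set_attribute("app.user_segment", user_segment)
    span.set_attribute("app.model_version", model_version)
    # Hash rather than store the prompt itself: enough to detect when it changed.
    span.set_attribute("app.system_prompt_hash",
                       hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:16])
    for flag, value in feature_flags.items():
        span.set_attribute(f"app.flag.{flag}", str(value))
```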
Third, they establish baselines. The first few weeks of production data for any AI deployment should be treated as a calibration period — not because the model isn't working, but because you need a reference distribution to detect future changes against. Drift is only meaningful relative to a baseline, and a baseline requires data you don't have until you've been running for a while.
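In practice, the baseline can be as simple as a frozen sample of a per-trace statistic collected during the calibration period. A minimal sketch, using output length and a two-sample Kolmogorov-Smirnov test (one reasonable choice among several):

```python
from scipy.stats import ks_2samp

def has_drifted(baseline_lengths, recent_lengths, p_threshold=0.01):
    """Compare a recent window of output lengths against the frozen baseline.

    `baseline_lengths` is collected once during the calibration period and
    stored; `recent_lengths` comes from the current monitoring window. The
    p-value threshold is illustrative and worth tuning for your alert volume.
    """
    result = ks_2samp(baseline_lengths, recent_lengths)
    return result.pvalue < p_threshold, result.statistic
```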
The data from production AI systems is consistently more interesting than the data from evaluation suites. It surfaces failure modes that no one anticipated, behavioral patterns that evolve over time, and the specific conditions under which models underperform. None of this is visible without instrumentation. The investment in full-fidelity tracing pays back quickly — often within the first week, when the first unexpected pattern surfaces and the team finally has a way to investigate it.
Want to start collecting production traces for your AI models? Talk to the Starseer team — setup takes under an hour and you'll have your first insights the same day.