Context Engineering at the Gateway Layer

LLM gateways and routers make context engineering decisions before the application even sees the request. Model selection, context compression, cache routing, and cost optimization all happen at this layer, yet most teams don't treat them as context engineering.

The Infrastructure Perspective

Most context engineering writing focuses on what goes into the prompt: how to structure the system message, which documents to retrieve, how to manage conversation history. But there’s a layer below the application that makes context engineering decisions too: the LLM gateway or router that sits between your application and the model provider.

This layer decides which model processes the request, whether the context can be served from cache, how much the request costs, and whether the context should be compressed before it reaches the model. These are context engineering decisions with real impact on quality and cost, but they’re usually treated as infrastructure concerns separate from the prompt engineering work happening in the application.

Model Routing as Context Engineering

Different models handle context differently. A 32k-token context that works well on Claude Sonnet might degrade on a smaller model. A simple classification task with 500 tokens of context doesn’t need a frontier model at all. The routing decision (which model gets this request) is implicitly a context engineering decision: you’re choosing how your context will be processed.

Context-complexity routing: Route requests based on context characteristics as well as task type. A request with 2k tokens of context and a simple question can go to a cheaper, faster model. The same question with 30k tokens of context and cross-referencing requirements needs a frontier model. The routing logic examines the context itself to make this decision.

def route_request(messages, tools):
    """Pick a model from the context's size and complexity (simplified)."""
    context_tokens = estimate_tokens(messages)
    has_tools = len(tools) > 0
    requires_reasoning = detect_complexity(messages)

    # Small context, no reasoning, no tool use: a cheap model is enough.
    if context_tokens < 4000 and not (requires_reasoning or has_tools):
        return "claude-haiku"
    elif context_tokens < 32000:
        return "claude-sonnet"
    else:
        return "claude-opus"

Production routing logic is more nuanced, but the principle holds: the context’s size and complexity determine which model processes it, and that determination is a context engineering decision even though it happens at the infrastructure layer.

Fallback routing: When a model returns a context-length error or times out on a long context, the gateway can automatically retry with a model that handles longer contexts, or compress the context and retry on the same model. This is context management happening outside the application’s awareness.
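A fallback chain can be sketched as below. `call_model` is a stub standing in for a real provider client (the model names, limits, and error type are illustrative, not any provider's actual API):

```python
# Sketch: retry context-length failures on models with larger windows,
# then compress and retry as a last resort. call_model is a stub.

class ContextLengthError(Exception):
    pass

MODEL_LIMITS = {"claude-haiku": 4_000, "claude-sonnet": 32_000, "claude-opus": 200_000}

def call_model(model, messages):
    tokens = sum(len(m["content"].split()) for m in messages)  # crude token count
    if tokens > MODEL_LIMITS[model]:
        raise ContextLengthError(f"{tokens} tokens exceeds {model}'s window")
    return {"model": model, "ok": True}

def trim_history(messages, keep_last=10):
    # Keep the system message plus the most recent turns.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]

def call_with_fallback(messages, chain=("claude-sonnet", "claude-opus")):
    for model in chain:
        try:
            return call_model(model, messages)
        except ContextLengthError:
            continue  # next model in the chain has a larger window
    # Every model rejected the context: compress and retry on the largest.
    # (This can still fail if a single message exceeds the largest window.)
    return call_model(chain[-1], trim_history(messages))
```

The application never sees the retries; it sent one request and got one response, possibly from a different model than the router's first choice.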

Gateway-Level Caching

The Context Caching and Anchor Turn patterns describe how to structure prompts for cache efficiency. The gateway layer is where cache routing decisions actually happen, and the gateway can make or break a caching strategy that the application designed carefully.

Prefix-aware caching: Anthropic and Google both offer prompt caching that works on shared prefixes. If your system prompt is identical across requests, the gateway can ensure those requests hit the cache by keeping the prefix stable and varying only the suffix. But a gateway that reformats messages, reorders tool definitions, or injects its own instrumentation into the prompt can break the prefix match and kill your cache hit rate.
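One way a gateway can guard its own cache behavior is to hash the cacheable prefix and verify it stays byte-stable across requests. This is a minimal sketch with illustrative names, not a provider API:

```python
# Sketch: serialize the cacheable prefix deterministically and hash it.
# Any gateway mutation (reordered tools, injected instrumentation) changes
# the hash -- and would likewise miss a provider's prefix cache.

import hashlib
import json

def cache_prefix(system_prompt, tools):
    # Deterministic serialization: sorted keys, fixed separators, tool order preserved.
    blob = json.dumps({"system": system_prompt, "tools": tools},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()

def build_request(system_prompt, tools, user_messages):
    return {
        "prefix_hash": cache_prefix(system_prompt, tools),  # stable across requests
        "system": system_prompt,
        "tools": tools,              # never reordered past this point
        "messages": user_messages,   # only this part varies per request
    }
```

A gateway that logs `prefix_hash` per request can spot cache-breaking mutations immediately: if the hash churns across requests that should share a system prompt, something in the pipeline is rewriting the prefix.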

Semantic caching: Some gateways cache responses for semantically similar queries, serving a cached response when a new query is close enough to a previous one. This is a context engineering decision made entirely at the gateway: the application sends a fresh query, and the gateway decides the context is similar enough to a previous request that the cached response applies. The quality impact depends entirely on how “similar enough” is defined and whether the context differences that the gateway ignores actually matter for the response.

The risk follows directly: semantic caching treats context as fungible, returning the same answer for different-but-similar contexts. For some use cases (FAQ responses, documentation lookups) this is fine. For others (personalized recommendations, code generation with project-specific context) it silently degrades quality by ignoring context differences that matter.
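The shape of a semantic cache, and where the "similar enough" decision lives, can be sketched as follows. A real gateway would compare embedding vectors; a bag-of-words cosine similarity stands in here, and the threshold value is illustrative:

```python
# Sketch of a semantic cache. The threshold is the whole quality story:
# set it too low and different-but-similar contexts share one answer.

import math
from collections import Counter

def similarity(a, b):
    # Bag-of-words cosine similarity; a stand-in for embedding distance.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (query, response) pairs

    def get(self, query):
        for cached_query, response in self.entries:
            if similarity(query, cached_query) >= self.threshold:
                return response  # "close enough" -- the risky decision
        return None

    def put(self, query, response):
        self.entries.append((query, response))
```

Note what the similarity check never sees: user identity, order history, project files. Everything outside the compared text is treated as irrelevant, which is exactly the failure mode described above.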

Cost Optimization Through Context

The Context Tax paper documented that Llama-3.1-70B showed a 719% latency increase at 15k-word context, while accuracy only dropped from 98.5% to 98%. The operational cost of long context is dominated by memory bandwidth and compute time. Accuracy loss is comparatively small, which means cost optimization at the gateway layer can be significant without meaningful quality tradeoffs.

Context compression at the gateway: Before sending a request to the model, the gateway can compress the context by removing redundant content, trimming conversation history to the most recent N turns, or summarizing earlier portions. This reduces token costs and latency. The trade-off: you’re applying Compress & Restart at the infrastructure layer, with the advantage of doing it transparently and the disadvantage of doing it without task-specific knowledge.
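The trim-to-recent-turns variant can be sketched as below. The marker text and turn count are illustrative; a real gateway might instead summarize the dropped turns with a cheap model:

```python
# Sketch: keep the system message and the last N turns, replacing older
# turns with a one-line marker so the model knows history was elided.

def compress_history(messages, keep_last=6):
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    if len(turns) <= keep_last:
        return messages  # already within budget, pass through untouched
    dropped = len(turns) - keep_last
    marker = {"role": "user",
              "content": f"[{dropped} earlier turns removed by gateway]"}
    return system + [marker] + turns[-keep_last:]
```

Because the gateway has no task-specific knowledge, it cannot tell whether turn 3 contained a constraint the final answer depends on; that is the cost of doing this transparently.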

Budget enforcement: The gateway can enforce a maximum context size per request, rejecting or truncating requests that exceed the budget. This is a blunt instrument, but it prevents runaway costs from applications that assemble context without budget awareness. The Context Budget pattern works better when enforced at the infrastructure layer because individual applications can’t accidentally exceed it.
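A minimal enforcement hook might look like this. The limit, the reject-vs-truncate modes, and the 4-characters-per-token estimate are all assumptions standing in for a real tokenizer and real policy:

```python
# Sketch of gateway-side budget enforcement: reject over-budget requests,
# or drop the oldest non-system turns until the request fits.

MAX_CONTEXT_TOKENS = 50_000  # illustrative gateway-wide limit

def estimate_tokens(messages):
    # Rough estimate: ~4 characters per token. Use a real tokenizer in practice.
    return sum(len(m["content"]) // 4 for m in messages)

def enforce_budget(messages, limit=MAX_CONTEXT_TOKENS, mode="reject"):
    if estimate_tokens(messages) <= limit:
        return messages
    if mode == "reject":
        raise ValueError(f"request exceeds {limit}-token context budget")
    # mode == "truncate": shed oldest non-system turns until under budget.
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    while turns and estimate_tokens(system + turns) > limit:
        turns.pop(0)
    return system + turns
```

Reject mode surfaces the misconfigured pipeline loudly; truncate mode keeps the request flowing at the cost of silently dropping context. Which failure mode is acceptable is a per-application policy decision.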

Token accounting: Gateways that track token usage per request, per user, and per application provide the data needed to optimize context engineering at the application level. If you can see that 60% of your tokens go to conversation history and 10% to the system prompt, you know where compression will have the most impact. This is observability for context engineering.
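Per-category accounting can be sketched as below. The category rules and the `is_retrieved` flag are illustrative assumptions about how a RAG layer might tag messages:

```python
# Sketch of per-category token accounting, producing the kind of breakdown
# that tells application teams where compression would pay off most.

from collections import defaultdict

def account_tokens(messages):
    usage = defaultdict(int)
    for m in messages:
        if m["role"] == "system":
            category = "system_prompt"
        elif m.get("is_retrieved"):  # hypothetical flag set by the RAG layer
            category = "retrieved_docs"
        else:
            category = "conversation_history"
        usage[category] += len(m["content"]) // 4  # rough token estimate
    total = sum(usage.values()) or 1
    return {k: {"tokens": v, "share": round(v / total, 2)}
            for k, v in usage.items()}
```

Aggregated per user and per application over time, this breakdown is what turns "our token bill doubled" into "retrieved documents doubled per request," which is an actionable finding.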

Common Mistakes

Routing only on task type: Classifying requests as “simple” or “complex” based on the user’s question while ignoring the context that accompanies it. A simple question with 50k tokens of RAG context is not a simple request.

Breaking cache prefixes: Gateway middleware that modifies the prompt (adding tracking headers, rewriting system messages, reordering tools) breaks prompt caching without realizing it. Cache-aware gateways need to treat the prompt prefix as immutable.

Aggressive semantic caching: Caching responses for queries that look similar but have different context. Two requests about “return policy” from different users with different order histories should not get the same cached response.

No budget enforcement: Relying on application developers to manage context size. Without gateway-level limits, one misconfigured RAG pipeline can generate requests with 100k+ tokens that cost 10x what they should.