Incident Report: Encrypted Content Failures in Multi-Region Responses API Load Balancing
Date: Feb 24, 2026
Duration: Feb 24 – Mar 3, 2026 (initial fix Feb 24; streaming follow-up fix Mar 3)
Severity: High (for users load balancing Responses API across different API keys)
Status: Resolved
Summary
When load balancing OpenAI's Responses API across deployments with different API keys (e.g., different Azure regions or OpenAI organizations), follow-up requests containing encrypted content items (like rs_... reasoning items) would fail with:
```json
{
  "error": {
    "message": "The encrypted content for item rs_0d09d6e56879e76500699d6feee41c8197bd268aae76141f87 could not be verified. Reason: Encrypted content organization_id did not match the target organization.",
    "type": "invalid_request_error",
    "code": "invalid_encrypted_content"
  }
}
```
Encrypted content items are cryptographically tied to the organization of the API key that created them. When the router load balanced a follow-up request to a deployment using a different API key, decryption failed.
Impact:
- Responses API calls with encrypted content: Complete failure when routed to wrong deployment
- Initial requests: Unaffected — only follow-up requests containing encrypted items failed
- Other API endpoints: No impact — chat completions, embeddings, etc. functioned normally
Background
OpenAI's Responses API can return encrypted "reasoning items" (with IDs like rs_...) that contain intermediate reasoning steps. These items are encrypted with the organization's key and can only be decrypted by the same organization's API key.
When load balancing across deployments with different API keys, the existing affinity mechanisms were insufficient:
- `responses_api_deployment_check`: Requires `previous_response_id`, which some clients (like Codex) don't provide
- `deployment_affinity`: Too broad; pins all requests from a user to one deployment, reducing effective quota by the number of users
- `session_affinity`: Requires explicit session IDs and still reduces quota
Root Cause
LiteLLM's router had no mechanism to track which deployment created specific encrypted content items and route follow-up requests accordingly. The router treated all deployments as interchangeable, leading to decryption failures when encrypted content crossed organizational boundaries.
The Problem Flow:
1. User calls `router.aresponses()` with model `gpt-5.1-codex`
2. Router load balances to Deployment A (Azure East US, API Key 1)
3. Response contains encrypted reasoning item `rs_abc123` (encrypted with Org 1's key)
4. User makes a follow-up request with `rs_abc123` in the input
5. Router load balances to Deployment B (Azure West Europe, API Key 2)
6. Deployment B tries to decrypt `rs_abc123` with Org 2's key → fails
Why Existing Solutions Didn't Work:
- `previous_response_id`: Not provided by all clients (e.g., Codex)
- `deployment_affinity`: Pins all user requests to one deployment, reducing quota to 1/N where N = number of deployments
- `session_affinity`: Requires explicit session management and still reduces quota
Timeline:
- Users configured multi-region Responses API load balancing with different API keys
- Initial requests succeeded, but follow-up requests with encrypted content failed intermittently
- Error rate correlated with number of deployments (more deployments = higher chance of routing to wrong one)
- Investigation revealed encrypted content was organization-bound
- Existing affinity mechanisms deemed unsuitable (quota reduction, missing `previous_response_id`)
- New solution designed and implemented: `encrypted_content_affinity`
The Fix
Implemented a new `encrypted_content_affinity` pre-call check that tracks which deployment produced encrypted content and pins follow-up requests only when necessary.
Implementation
1. Encoding model_id into output items (responses/utils.py)
This is the same approach used for `previous_response_id` affinity, so no cache is needed. When a response contains output items with `encrypted_content`, LiteLLM encodes the originating deployment's `model_id` in two places for redundancy:
- Into the item ID (if present): `rs_abc123` → `encitem_{base64("litellm:model_id:{model_id};item_id:rs_abc123")}`
- Into the `encrypted_content` itself: wraps the content as `litellm_enc:{base64("model_id:{model_id}")};{original_encrypted_content}`
```python
import base64

# Encoding item IDs (when present)
def _build_encrypted_item_id(model_id: str, item_id: str) -> str:
    assembled = f"litellm:model_id:{model_id};item_id:{item_id}"
    encoded = base64.b64encode(assembled.encode("utf-8")).decode("utf-8")
    return f"encitem_{encoded}"

# Wrapping encrypted_content (always, for redundancy)
def _wrap_encrypted_content_with_model_id(encrypted_content: str, model_id: str) -> str:
    metadata = f"model_id:{model_id}"
    encoded_metadata = base64.b64encode(metadata.encode("utf-8")).decode("utf-8")
    return f"litellm_enc:{encoded_metadata};{encrypted_content}"
```
Why wrap encrypted_content directly? Some clients (like Codex) don't consistently send item IDs in follow-up requests, but they always send the encrypted_content itself. By embedding model_id into the content, affinity works even when IDs are missing.
Streaming responses: The wrapping logic is applied to both:
- Final response objects (non-streaming)
- Individual streaming events (`response.output_item.added`, `response.output_item.done`)
This ensures clients receiving streaming responses get wrapped content they can send back.
Before forwarding to the upstream provider, LiteLLM restores the original item IDs and unwraps encrypted_content so the provider never sees the encoded form:
```python
# In responses/main.py — before calling the handler
input = ResponsesAPIRequestUtils._restore_encrypted_content_item_ids_in_input(input)
```
2. EncryptedContentAffinityCheck — routing only (encrypted_content_affinity_check.py)
There is no `async_log_success_event` and no cache lookup; the `model_id` is decoded directly from the item ID or the `encrypted_content`:
```python
from typing import Optional

class EncryptedContentAffinityCheck(CustomLogger):
    async def async_filter_deployments(self, model, healthy_deployments, request_kwargs, ...):
        """Extract model_id from input items (ID or encrypted_content) and pin to that deployment."""
        for item in request_kwargs.get("input", []):
            # Try to extract model_id from two sources:
            model_id = self._extract_model_id_from_input(item)
            if model_id:
                deployment = self._find_deployment_by_model_id(
                    healthy_deployments, model_id
                )
                if deployment:
                    request_kwargs["_encrypted_content_affinity_pinned"] = True
                    return [deployment]
        return healthy_deployments

    def _extract_model_id_from_input(self, item: dict) -> Optional[str]:
        """Extract model_id from either the encoded ID or the wrapped encrypted_content."""
        # 1. Try decoding from the item ID (if present)
        item_id = item.get("id", "")
        if item_id:
            decoded = ResponsesAPIRequestUtils._decode_encrypted_item_id(item_id)
            if decoded:
                return decoded["model_id"]

        # 2. Try unwrapping from encrypted_content (fallback for clients that omit IDs)
        encrypted_content = item.get("encrypted_content", "")
        if encrypted_content and encrypted_content.startswith("litellm_enc:"):
            model_id, _ = ResponsesAPIRequestUtils._unwrap_encrypted_content_with_model_id(
                encrypted_content
            )
            return model_id
        return None
```
3. Rate Limit Bypass (router.py)
When encrypted content requires a specific deployment, RPM/TPM limits are bypassed (the request would fail on any other deployment anyway):
```python
# In async_get_available_deployment, after filtering healthy deployments:
if (
    request_kwargs.get("_encrypted_content_affinity_pinned")
    and len(healthy_deployments) == 1
):
    return healthy_deployments[0]  # Bypass routing strategy (RPM/TPM checks)
```
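Taken together, the wrap-then-pin flow can be simulated with plain dicts. This is a self-contained sketch: `_wrap` and `pick_deployment` are hypothetical stand-ins for the real utils and check, and the deployment shapes are simplified:

```python
import base64

def _wrap(content: str, model_id: str) -> str:
    """Mimic the litellm_enc: wrapping applied to response output items."""
    meta = base64.b64encode(f"model_id:{model_id}".encode("utf-8")).decode("utf-8")
    return f"litellm_enc:{meta};{content}"

def pick_deployment(input_items: list, healthy_deployments: list) -> list:
    """Pin to the deployment that produced the encrypted content, else keep all."""
    for item in input_items:
        content = item.get("encrypted_content", "")
        if content.startswith("litellm_enc:"):
            meta_b64, _, _ciphertext = content[len("litellm_enc:"):].partition(";")
            model_id = base64.b64decode(meta_b64).decode("utf-8")[len("model_id:"):]
            pinned = [d for d in healthy_deployments if d["model_info"]["id"] == model_id]
            if pinned:
                # A single deployment is returned → router bypasses RPM/TPM strategy
                return pinned
    return healthy_deployments

deployments = [
    {"model_info": {"id": "azure-east"}},
    {"model_info": {"id": "azure-west"}},
]
# Follow-up request carrying content that the azure-east deployment wrapped earlier:
follow_up = [{"type": "reasoning", "encrypted_content": _wrap("opaque-ciphertext", "azure-east")}]
pick_deployment(follow_up, deployments)   # pins to azure-east only
pick_deployment([{"type": "message", "content": "hi"}], deployments)  # all deployments
```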
4. Configuration
```yaml
router_settings:
  routing_strategy: usage-based-routing-v2
  enable_pre_call_checks: true
  optional_pre_call_checks:
    - encrypted_content_affinity
  deployment_affinity_ttl_seconds: 86400  # 24 hours
```
Key Benefits
✅ No quota reduction: Only pins requests containing encrypted items
✅ Bypasses rate limits: When encrypted content requires a specific deployment, RPM/TPM limits don't block it
✅ No previous_response_id required: Works by encoding model_id into both the item ID and the encrypted_content itself
✅ No cache required: model_id is decoded on the fly from the item ID or encrypted_content; no Redis, no TTL
✅ Globally safe: Can be enabled for all models; non-Responses-API calls are unaffected
✅ Surgical precision: Normal requests continue to load balance freely
Remediation
| # | Action | Status | Code |
|---|---|---|---|
| 1 | Encode model_id into encrypted-content item IDs on response | ✅ Done | responses/utils.py |
| 2 | Restore original item IDs before forwarding to upstream provider | ✅ Done | responses/main.py |
| 3 | EncryptedContentAffinityCheck: decode item IDs to route (no cache) | ✅ Done | encrypted_content_affinity_check.py |
| 4 | Add encrypted_content_affinity to OptionalPreCallChecks type | ✅ Done | types/router.py |
| 5 | Implement rate limit bypass for affinity-pinned requests | ✅ Done | router.py |
| 6 | Unit tests: encoding/decoding utilities, routing, RPM bypass | ✅ Done | test_encrypted_content_affinity_check.py |
| 7 | Documentation: Responses API guide, load balancing guide, config reference | ✅ Done | Docs |
| 8 | [Mar 3] Fix streaming events to wrap encrypted_content | ✅ Done | responses/streaming_iterator.py |
Follow-up Fix: Streaming Responses (Mar 3, 2026)
The Issue
After the initial fix was deployed, users reported that the invalid_encrypted_content error still occurred when using streaming responses with clients like Codex. Investigation revealed:
- ✅ Non-streaming responses: `encrypted_content` was correctly wrapped with the `litellm_enc:` prefix
- ❌ Streaming responses: individual `response.output_item.added` and `response.output_item.done` events contained raw, unwrapped `encrypted_content`
Since Codex and other clients consume responses as streams, they received unwrapped content in these events and sent it back in follow-up requests, causing the affinity check to fail.
The Root Cause
The _update_encrypted_content_item_ids_in_response function only modified the final response object, which is used for non-streaming responses. For streaming responses, individual chunks are processed by ResponsesAPIStreamingIterator._process_chunk, which was not applying the wrapping logic to streaming events.
The Fix
Modified litellm/litellm/responses/streaming_iterator.py to wrap encrypted_content in streaming events:
```python
# In ResponsesAPIStreamingIterator._process_chunk
if (
    self.litellm_metadata
    and self.litellm_metadata.get("encrypted_content_affinity_enabled")
):
    event_type = getattr(openai_responses_api_chunk, "type", None)
    if event_type in (
        ResponsesAPIStreamEvents.OUTPUT_ITEM_ADDED,
        ResponsesAPIStreamEvents.OUTPUT_ITEM_DONE,
    ):
        item = getattr(openai_responses_api_chunk, "item", None)
        if item:
            encrypted_content = getattr(item, "encrypted_content", None)
            if encrypted_content and isinstance(encrypted_content, str):
                model_id = self.litellm_metadata.get("model_info", {}).get("id")
                if model_id:
                    wrapped_content = ResponsesAPIRequestUtils._wrap_encrypted_content_with_model_id(
                        encrypted_content, model_id
                    )
                    setattr(item, "encrypted_content", wrapped_content)
```
This ensures that all encrypted_content sent to clients (streaming or non-streaming) is wrapped with model_id metadata, enabling consistent affinity routing.
Migration Guide
Before (Using deployment_affinity)
```yaml
router_settings:
  optional_pre_call_checks:
    - deployment_affinity  # ❌ Reduces quota by number of users
```
Problem: All requests from a user pin to one deployment, reducing effective quota to 1/N.
After (Using encrypted_content_affinity)
```yaml
router_settings:
  optional_pre_call_checks:
    - encrypted_content_affinity  # ✅ Only pins requests with encrypted content
```
Benefit: Normal requests load balance freely, only encrypted content requests pin when necessary.
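For reference, a complete multi-region setup combining the pieces above might look like the following. The endpoints, environment-variable names, and deployment entries are illustrative; the `os.environ/` prefix is LiteLLM's convention for referencing environment variables in config:

```yaml
model_list:
  - model_name: gpt-5.1-codex
    litellm_params:
      model: azure/gpt-5.1-codex
      api_base: https://east-us-resource.openai.azure.com
      api_key: os.environ/AZURE_EAST_US_API_KEY
  - model_name: gpt-5.1-codex
    litellm_params:
      model: azure/gpt-5.1-codex
      api_base: https://west-europe-resource.openai.azure.com
      api_key: os.environ/AZURE_WEST_EUROPE_API_KEY

router_settings:
  routing_strategy: usage-based-routing-v2
  enable_pre_call_checks: true
  optional_pre_call_checks:
    - encrypted_content_affinity
```

With this config, initial requests spread across both regions; only follow-up requests that carry wrapped encrypted content pin to the region that produced it.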


