## Problem
Context overflow is a silent killer of agent reliability. When accumulated conversation history exceeds the model's context window:
- API errors: Requests fail with `context_length_exceeded` or similar errors.
- Manual intervention: Operators must truncate transcripts, losing valuable context.
- Retry complexity: Detecting overflow and retrying with compaction is error-prone.
Agents need automatic compaction that preserves essential information while staying within token limits, with model-specific validation and reserve token floors to prevent immediate re-overflow.
## Solution
Automatic session compaction triggered by context overflow errors, with smart reserve tokens and lane-aware retry. The system detects overflow, compacts the session transcript, validates the result, and retries the request—all transparently to the user.
Core concepts:
- Overflow detection: Catches API errors indicating context length exceeded (`context_length_exceeded`, `prompt is too long`, etc.).
- Auto-retry with compaction: On overflow, the session is compacted and the request is retried automatically.
- Reserve token floor: Post-compaction, ensures a minimum number of tokens (default 20k) remain available to prevent immediate re-overflow.
- Lane-aware compaction: Uses hierarchical lane queuing (session → global) to prevent deadlocks during compaction.
- Post-compaction verification: Estimates token count after compaction and verifies it's less than the pre-compaction count.
- Model-specific validation: Anthropic models require strict turn ordering; Gemini models have different transcript requirements.
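The overflow detection concept can be sketched as substring matching over provider error messages. The patterns below are illustrative assumptions; exact error strings vary by provider and API version:

```typescript
// Illustrative overflow patterns -- real providers vary, so treat this
// list as configuration, not as an exhaustive catalog.
const OVERFLOW_PATTERNS = [
  "context_length_exceeded", // OpenAI-style error code
  "prompt is too long",      // Anthropic-style message
];

function isContextOverflowError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  const lower = message.toLowerCase();
  return OVERFLOW_PATTERNS.some((p) => lower.includes(p));
}
```

Matching case-insensitively keeps the detector robust to minor wording changes, at the cost of occasionally misclassifying unrelated errors that happen to contain a pattern.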
Implementation sketch:

```typescript
async function compactEmbeddedPiSession(params: {
  sessionFile: string;
  config?: Config;
}): Promise<CompactResult> {
  // 1. Load session and configure reserve tokens
  // (workspaceDir, agentDir, session, model, provider, and customInstructions
  // are resolved earlier in the real implementation; elided here)
  const sessionManager = SessionManager.open(params.sessionFile);
  const settingsManager = SettingsManager.create(workspaceDir, agentDir);

  // Ensure minimum reserve tokens (default 20k)
  ensurePiCompactionReserveTokens({
    settingsManager,
    minReserveTokens: resolveCompactionReserveTokensFloor(params.config),
  });

  // 2. Sanitize session history for model API
  const prior = sanitizeSessionHistory({
    messages: session.messages,
    modelApi: model.api,
    modelId,
    provider,
    sessionManager,
  });

  // 3. Model-specific validation
  const validated = provider === "anthropic"
    ? validateAnthropicTurns(prior)
    : validateGeminiTurns(prior);

  // 4. Compact the session
  const result = await session.compact(customInstructions);

  // 5. Estimate tokens after compaction
  let tokensAfter: number | undefined;
  try {
    tokensAfter = 0;
    for (const message of session.messages) {
      tokensAfter += estimateTokens(message);
    }
    // Sanity check: compaction should have reduced the token count
    if (tokensAfter > result.tokensBefore) {
      tokensAfter = undefined; // Don't trust the estimate
    }
  } catch {
    tokensAfter = undefined;
  }

  return {
    ok: true,
    compacted: true,
    result: {
      summary: result.summary,
      tokensBefore: result.tokensBefore,
      tokensAfter,
    },
  };
}
```
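The sketch above calls `estimateTokens`, which isn't shown in the source. A minimal stand-in, hypothetical rather than the real implementation, is the common ~4-characters-per-token heuristic; as the sketch itself does, treat its output as a sanity check only:

```typescript
interface Message {
  role: string;
  content: string;
}

// Hypothetical estimateTokens: a cheap heuristic, not a real tokenizer.
// ~4 characters per token is a rough rule of thumb for English text;
// actual token counts come from the provider's tokenizer or API.
function estimateTokens(message: Message): number {
  const text = `${message.role}: ${message.content}`;
  return Math.ceil(text.length / 4);
}
```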
Reserve token enforcement:

```typescript
const DEFAULT_PI_COMPACTION_RESERVE_TOKENS_FLOOR = 20_000;

function ensurePiCompactionReserveTokens(params: {
  settingsManager: SettingsManager;
  minReserveTokens?: number;
}): { didOverride: boolean; reserveTokens: number } {
  const minReserveTokens =
    params.minReserveTokens ?? DEFAULT_PI_COMPACTION_RESERVE_TOKENS_FLOOR;
  const current = params.settingsManager.getCompactionReserveTokens();
  if (current >= minReserveTokens) {
    return { didOverride: false, reserveTokens: current };
  }
  // Override to ensure the minimum floor
  params.settingsManager.applyOverrides({
    compaction: { reserveTokens: minReserveTokens },
  });
  return { didOverride: true, reserveTokens: minReserveTokens };
}
```
API-based compaction (OpenAI Responses API):

Some providers offer dedicated compaction endpoints that are more efficient than manual summarization:

```typescript
// OpenAI's /responses/compact endpoint
const compacted = await responsesAPI.compact({
  messages: currentMessages,
});

// Returns a list of items that includes:
// - A special type=compaction item with encrypted_content
//   that preserves the model's latent understanding
// - Condensed conversation items
currentMessages = compacted.items;
```
This approach has advantages:
- Preserves latent understanding: The `encrypted_content` maintains the model's compressed representation of the original conversation.
- More efficient: Server-side compaction is faster than client-side summarization.
- Auto-compaction: Can trigger automatically when `auto_compact_limit` is exceeded.
Two complementary approaches:
This pattern describes reactive compaction (detect overflow, compact, retry). An alternative approach is preventive filtering (reduce context at ingestion), used by systems like HyperAgent for browser accessibility tree extraction. Preventive filtering can delay or eliminate the need for reactive compaction by keeping context leaner from the start.
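A preventive filter can be as simple as capping each tool result before it enters the transcript. The budget and helper below are hypothetical, not taken from any of the systems named above:

```typescript
// Hypothetical per-result budget; tune to the model's context window.
const MAX_TOOL_RESULT_CHARS = 8_000;

// Keep the head and tail of an oversized tool result; the middle of large
// outputs (DOM dumps, logs, accessibility trees) is often the most repetitive.
function trimToolResult(raw: string): string {
  if (raw.length <= MAX_TOOL_RESULT_CHARS) return raw;
  const keep = MAX_TOOL_RESULT_CHARS / 2;
  return raw.slice(0, keep) + "\n...[truncated]...\n" + raw.slice(-keep);
}
```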
Lane-aware retry to prevent deadlocks:

```typescript
// Compaction runs through the session lane, then the global lane
async function compactEmbeddedPiSession(params: CompactParams): Promise<CompactResult> {
  const sessionLane = resolveSessionLane(params.sessionKey);
  const globalLane = resolveGlobalLane(params.lane);
  return enqueueCommandInLane(sessionLane, () =>
    enqueueCommandInLane(globalLane, () =>
      compactEmbeddedPiSessionDirect(params) // Core compaction logic
    )
  );
}
```
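`enqueueCommandInLane` itself isn't shown in the source; one minimal way to implement it (an assumption, not the real code) is a promise chain per lane. The deadlock prevention comes from the fixed acquisition order: every caller takes the session lane before the global lane, so two tasks can never hold the lanes in opposite orders:

```typescript
// Minimal lane queue sketch: each lane serializes its tasks via a promise chain.
const laneTails = new Map<string, Promise<unknown>>();

function enqueueCommandInLane<T>(lane: string, task: () => Promise<T>): Promise<T> {
  const tail = laneTails.get(lane) ?? Promise.resolve();
  // Run after the current tail, whether it resolved or rejected,
  // so one failed task cannot wedge the whole lane.
  const next = tail.then(task, task);
  laneTails.set(lane, next);
  return next;
}
```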
## How to use it
- Configure reserve floor: Set `compaction.reserveTokensFloor` to ensure headroom after compaction (default 20k).
- Handle overflow errors: Catch API errors, detect overflow via error-message matching, then trigger compaction.
- Validate transcripts: Apply model-specific validation (Anthropic turns, Gemini ordering) before retry.
- Estimate post-compaction tokens: Verify that compaction actually reduced token count before retrying.
- Use lane queuing: Run compaction through hierarchical lanes to avoid deadlocks with concurrent operations.
Pitfalls to avoid:
- Aggressive floor setting: Setting the reserve floor too high may leave insufficient room for actual conversation content.
- Missing model validation: Skipping model-specific transcript validation can cause API errors on retry.
- Token estimation drift: Estimation heuristics may diverge from actual token counts; treat estimates as sanity checks only.
- Infinite compaction loops: If compaction fails to reduce tokens, stop rather than retrying indefinitely; cap compaction at 1-2 attempts.
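The steps and pitfalls above can be combined into a small retry wrapper. This is a hedged sketch: `send`, `compact`, and `isOverflow` are placeholders for the pieces described earlier, not real APIs:

```typescript
// Bounded compact-and-retry: compact at most maxCompactions times, and bail
// out early if compaction verifiably failed to shrink the transcript.
async function sendWithCompaction<T>(params: {
  send: () => Promise<T>; // the model request
  compact: () => Promise<{ tokensBefore: number; tokensAfter?: number }>;
  isOverflow: (err: unknown) => boolean; // e.g. error-message matching
  maxCompactions?: number; // default 2 to avoid infinite loops
}): Promise<T> {
  const max = params.maxCompactions ?? 2;
  for (let attempt = 0; ; attempt++) {
    try {
      return await params.send();
    } catch (err) {
      if (!params.isOverflow(err) || attempt >= max) throw err;
      const { tokensBefore, tokensAfter } = await params.compact();
      // If the estimate shows no reduction, stop retrying.
      if (tokensAfter !== undefined && tokensAfter >= tokensBefore) throw err;
    }
  }
}
```

Note the asymmetry: an undefined `tokensAfter` (untrusted estimate) still permits a retry, while a trusted estimate that shows no reduction aborts immediately.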
## Trade-offs
Pros:
- Transparent recovery: Overflow errors are handled automatically without user intervention.
- Preserves essential context: Compaction generates summaries rather than truncating arbitrarily.
- Prevents re-overflow: The reserve token floor makes immediate re-overflow unlikely.
- Model-aware: Different validation rules per provider ensure API compatibility.
Cons/Considerations:
- Summary quality: Auto-generated summaries may lose nuanced details that manual curation would preserve.
- Latency penalty: Compaction and retry add overhead (seconds to minutes depending on context size).
- Token estimation errors: Heuristics may misestimate actual token counts, leading to failed retries.
- Complexity: Lane-aware queuing and model-specific validation increase implementation complexity.
## References
- Clawdbot compact.ts - Compaction orchestration
- Clawdbot pi-settings.ts - Reserve token configuration
- Clawdbot context-window-guard.ts - Context evaluation
- Pi Coding Agent SessionManager - Core compaction logic
- Unrolling the Codex agent loop | OpenAI Blog - API-based `/responses/compact` endpoint approach
- Efficient Transformer-Based Long-Form Dialogue Summarization - Liu & Lapata (ACL 2022): 60-80% token reduction via extractive-then-abstractive summarization
- Related: Context Window Anxiety Management for proactive management
- Related: Prompt Caching via Exact Prefix Preservation