## Problem
Context overflow is a silent killer of agent reliability. When accumulated conversation history exceeds the model's context window:
- API errors: Requests fail with `context_length_exceeded` or similar errors.
- Manual intervention: Operators must truncate transcripts, losing valuable context.
- Retry complexity: Detecting overflow and retrying with compaction is error-prone.
Agents need automatic compaction that preserves essential information while staying within token limits, with model-specific validation and reserve token floors to prevent immediate re-overflow.
## Solution
Automatic session compaction triggered by context overflow errors, with smart reserve tokens and lane-aware retry. The system detects overflow, compacts the session transcript, validates the result, and retries the request—all transparently to the user.
Core concepts:
- Overflow detection: Catches API errors indicating context length exceeded (`context_length_exceeded`, `prompt is too long`, etc.).
- Auto-retry with compaction: On overflow, the session is compacted and the request is retried automatically.
- Reserve token floor: Post-compaction, ensures a minimum number of tokens (default 20k) remain available to prevent immediate re-overflow.
- Lane-aware compaction: Uses hierarchical lane queuing (session → global) to prevent deadlocks during compaction.
- Post-compaction verification: Estimates token count after compaction and verifies it's less than the pre-compaction count.
- Model-specific validation: Anthropic models require strict turn ordering; Gemini models have different transcript requirements.
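The overflow detection concept can be sketched as substring matching over provider error messages. The patterns below are illustrative assumptions; exact error strings vary by provider and API version:

```typescript
// Illustrative overflow patterns -- real providers vary, so treat this
// list as configuration, not as an exhaustive catalog.
const OVERFLOW_PATTERNS = [
  "context_length_exceeded", // OpenAI-style error code
  "prompt is too long",      // Anthropic-style message
];

function isContextOverflowError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  const lower = message.toLowerCase();
  return OVERFLOW_PATTERNS.some((p) => lower.includes(p));
}
```

Matching case-insensitively keeps the detector robust to minor wording changes, at the cost of occasionally misclassifying unrelated errors that happen to contain a pattern.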
Implementation sketch:

```typescript
async function compactEmbeddedPiSession(params: {
  sessionFile: string;
  config?: Config;
}): Promise<CompactResult> {
  // 1. Load session and configure reserve tokens
  // (workspaceDir, agentDir, session, model, provider, and customInstructions
  // are resolved earlier in the real implementation; elided here)
  const sessionManager = SessionManager.open(params.sessionFile);
  const settingsManager = SettingsManager.create(workspaceDir, agentDir);

  // Ensure minimum reserve tokens (default 20k)
  ensurePiCompactionReserveTokens({
    settingsManager,
    minReserveTokens: resolveCompactionReserveTokensFloor(params.config),
  });

  // 2. Sanitize session history for model API
  const prior = sanitizeSessionHistory({
    messages: session.messages,
    modelApi: model.api,
    modelId,
    provider,
    sessionManager,
  });

  // 3. Model-specific validation
  const validated = provider === "anthropic"
    ? validateAnthropicTurns(prior)
    : validateGeminiTurns(prior);

  // 4. Compact the session
  const result = await session.compact(customInstructions);

  // 5. Estimate tokens after compaction
  let tokensAfter: number | undefined;
  try {
    tokensAfter = 0;
    for (const message of session.messages) {
      tokensAfter += estimateTokens(message);
    }
    // Sanity check: compaction should have reduced the token count
    if (tokensAfter > result.tokensBefore) {
      tokensAfter = undefined; // Don't trust the estimate
    }
  } catch {
    tokensAfter = undefined;
  }

  return {
    ok: true,
    compacted: true,
    result: {
      summary: result.summary,
      tokensBefore: result.tokensBefore,
      tokensAfter,
    },
  };
}
```
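The sketch above calls `estimateTokens`, which isn't shown in the source. A minimal stand-in, hypothetical rather than the real implementation, is the common ~4-characters-per-token heuristic; as the sketch itself does, treat its output as a sanity check only:

```typescript
interface Message {
  role: string;
  content: string;
}

// Hypothetical estimateTokens: a cheap heuristic, not a real tokenizer.
// ~4 characters per token is a rough rule of thumb for English text;
// actual token counts come from the provider's tokenizer or API.
function estimateTokens(message: Message): number {
  const text = `${message.role}: ${message.content}`;
  return Math.ceil(text.length / 4);
}
```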
Reserve token enforcement:

```typescript
const DEFAULT_PI_COMPACTION_RESERVE_TOKENS_FLOOR = 20_000;

function ensurePiCompactionReserveTokens(params: {
  settingsManager: SettingsManager;
  minReserveTokens?: number;
}): { didOverride: boolean; reserveTokens: number } {
  const minReserveTokens =
    params.minReserveTokens ?? DEFAULT_PI_COMPACTION_RESERVE_TOKENS_FLOOR;
  const current = params.settingsManager.getCompactionReserveTokens();
  if (current >= minReserveTokens) {
    return { didOverride: false, reserveTokens: current };
  }
  // Override to ensure the minimum floor
  params.settingsManager.applyOverrides({
    compaction: { reserveTokens: minReserveTokens },
  });
  return { didOverride: true, reserveTokens: minReserveTokens };
}
```
API-based compaction (OpenAI Responses API):

Some providers offer dedicated compaction endpoints that are more efficient than manual summarization:

```typescript
// OpenAI's /responses/compact endpoint
const compacted = await responsesAPI.compact({
  messages: currentMessages,
});

// Returns a list of items that includes:
// - A special type=compaction item with encrypted_content
//   that preserves the model's latent understanding
// - Condensed conversation items
currentMessages = compacted.items;
```
This approach has advantages:
- Preserves latent understanding: The `encrypted_content` maintains the model's compressed representation of the original conversation.
- More efficient: Server-side compaction is faster than client-side summarization.
- Auto-compaction: Can trigger automatically when `auto_compact_limit` is exceeded.
Two complementary approaches:
This pattern describes reactive compaction (detect overflow, compact, retry). An alternative approach is preventive filtering (reduce context at ingestion), used by systems like HyperAgent for browser accessibility tree extraction. Preventive filtering can delay or eliminate the need for reactive compaction by keeping context leaner from the start.
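A preventive filter can be as simple as capping each tool result before it enters the transcript. The budget and helper below are hypothetical, not taken from any of the systems named above:

```typescript
// Hypothetical per-result budget; tune to the model's context window.
const MAX_TOOL_RESULT_CHARS = 8_000;

// Keep the head and tail of an oversized tool result; the middle of large
// outputs (DOM dumps, logs, accessibility trees) is often the most repetitive.
function trimToolResult(raw: string): string {
  if (raw.length <= MAX_TOOL_RESULT_CHARS) return raw;
  const keep = MAX_TOOL_RESULT_CHARS / 2;
  return raw.slice(0, keep) + "\n...[truncated]...\n" + raw.slice(-keep);
}
```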
Lane-aware retry to prevent deadlocks:

```typescript
// Compaction runs through the session lane, then the global lane
async function compactEmbeddedPiSession(params: CompactParams): Promise<CompactResult> {
  const sessionLane = resolveSessionLane(params.sessionKey);
  const globalLane = resolveGlobalLane(params.lane);
  return enqueueCommandInLane(sessionLane, () =>
    enqueueCommandInLane(globalLane, () =>
      compactEmbeddedPiSessionDirect(params) // Core compaction logic
    )
  );
}
```
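`enqueueCommandInLane` itself isn't shown in the source; one minimal way to implement it (an assumption, not the real code) is a promise chain per lane. The deadlock prevention comes from the fixed acquisition order: every caller takes the session lane before the global lane, so two tasks can never hold the lanes in opposite orders:

```typescript
// Minimal lane queue sketch: each lane serializes its tasks via a promise chain.
const laneTails = new Map<string, Promise<unknown>>();

function enqueueCommandInLane<T>(lane: string, task: () => Promise<T>): Promise<T> {
  const tail = laneTails.get(lane) ?? Promise.resolve();
  // Run after the current tail, whether it resolved or rejected,
  // so one failed task cannot wedge the whole lane.
  const next = tail.then(task, task);
  laneTails.set(lane, next);
  return next;
}
```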
## How to use it
- Configure reserve floor: Set `compaction.reserveTokensFloor` to ensure headroom after compaction (default 20k).
- Handle overflow errors: Catch API errors, detect overflow via error-message matching, then trigger compaction.
- Validate transcripts: Apply model-specific validation (Anthropic turns, Gemini ordering) before retry.
- Estimate post-compaction tokens: Verify that compaction actually reduced token count before retrying.
- Use lane queuing: Run compaction through hierarchical lanes to avoid deadlocks with concurrent operations.
Pitfalls to avoid:
- Aggressive floor setting: Setting the reserve floor too high may leave insufficient room for actual conversation content.
- Missing model validation: Skipping model-specific transcript validation can cause API errors on retry.
- Token estimation drift: Estimation heuristics may diverge from actual token counts; treat estimates as sanity checks only.
- Infinite compaction loops: If compaction fails to reduce tokens, stop rather than retrying indefinitely; cap compaction at 1-2 attempts.
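The steps and pitfalls above can be combined into a small retry wrapper. This is a hedged sketch: `send`, `compact`, and `isOverflow` are placeholders for the pieces described earlier, not real APIs:

```typescript
// Bounded compact-and-retry: compact at most maxCompactions times, and bail
// out early if compaction verifiably failed to shrink the transcript.
async function sendWithCompaction<T>(params: {
  send: () => Promise<T>; // the model request
  compact: () => Promise<{ tokensBefore: number; tokensAfter?: number }>;
  isOverflow: (err: unknown) => boolean; // e.g. error-message matching
  maxCompactions?: number; // default 2 to avoid infinite loops
}): Promise<T> {
  const max = params.maxCompactions ?? 2;
  for (let attempt = 0; ; attempt++) {
    try {
      return await params.send();
    } catch (err) {
      if (!params.isOverflow(err) || attempt >= max) throw err;
      const { tokensBefore, tokensAfter } = await params.compact();
      // If the estimate shows no reduction, stop retrying.
      if (tokensAfter !== undefined && tokensAfter >= tokensBefore) throw err;
    }
  }
}
```

Note the asymmetry: an undefined `tokensAfter` (untrusted estimate) still permits a retry, while a trusted estimate that shows no reduction aborts immediately.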
## Trade-offs
Pros:
- Transparent recovery: Overflow errors are handled automatically without user intervention.
- Preserves essential context: Compaction generates summaries rather than truncating arbitrarily.
- Prevents re-overflow: The reserve token floor makes immediate re-overflow unlikely.
- Model-aware: Different validation rules per provider ensure API compatibility.
Cons/Considerations:
- Summary quality: Auto-generated summaries may lose nuanced details that manual curation would preserve.
- Latency penalty: Compaction and retry add overhead (seconds to minutes depending on context size).
- Token estimation errors: Heuristics may misestimate actual token counts, leading to failed retries.
- Complexity: Lane-aware queuing and model-specific validation increase implementation complexity.
## References
- Clawdbot compact.ts - Compaction orchestration
- Clawdbot pi-settings.ts - Reserve token configuration
- Clawdbot context-window-guard.ts - Context evaluation
- Pi Coding Agent SessionManager - Core compaction logic
- Unrolling the Codex agent loop | OpenAI Blog - API-based `/responses/compact` endpoint approach
- Efficient Transformer-Based Long-Form Dialogue Summarization - Liu & Lapata (ACL 2022): 60-80% token reduction via extractive-then-abstractive summarization
- Related: Context Window Anxiety Management for proactive management
- Related: Prompt Caching via Exact Prefix Preservation