Context Management

PatchPal automatically manages the context window to prevent "input too long" errors during long coding sessions.

Features:

  • Automatic token tracking: Monitors context usage in real time
  • Smart pruning: Removes old tool outputs (keeps the last 40k tokens) before resorting to full compaction
  • Auto-compaction: Summarizes conversation history when approaching 75% capacity
  • Manual control: Check status with /status, compact with /compact, prune with /prune
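
A minimal sketch of how this kind of threshold tracking could work (illustrative only; the function names and the rough 4-characters-per-token estimate are assumptions, not PatchPal's actual internals):

# Illustrative sketch -- not PatchPal's real implementation
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token)."""
    return len(text) // 4

def context_usage(messages, system_prompt, output_reserve=4096, limit=200_000):
    """Fraction of the context window currently in use."""
    used = estimate_tokens(system_prompt) + output_reserve
    used += sum(estimate_tokens(m["content"]) for m in messages)
    return used / limit

def should_compact(messages, system_prompt, threshold=0.75):
    """Mirrors the 75% auto-compaction trigger described above."""
    return context_usage(messages, system_prompt) >= threshold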

Commands:

# Check context window usage
You: /status

# Output shows:
# - Messages in history
# - Token usage breakdown
# - Visual progress bar
# - Auto-compaction status
# - Session statistics:
#   - Total LLM calls made
#   - Cumulative input tokens (all requests combined)
#   - Cumulative output tokens (all responses combined)
#   - Total tokens (helps estimate API costs)

# Manually trigger compaction
You: /compact

# Useful when:
# - You want to free up context space before a large operation
# - Testing compaction behavior
# - Context is getting full but hasn't auto-compacted yet
# Note: Requires at least 5 messages; most effective when context >50% full

# Manually prune old tool outputs
You: /prune

# Useful when:
# - Large tool outputs (e.g., from grep, file reads) are filling context
# - You want to reclaim space without full compaction
# - Testing pruning behavior
# Note: Keeps last 2 conversational turns; prunes all older tool outputs
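
Conceptually, /prune does something like the following (a sketch assuming a simple role/content message list; the names and the "tool" role are hypothetical, not PatchPal's actual code):

# Conceptual sketch of /prune-style pruning (hypothetical message format)
def prune_tool_outputs(messages, protect_last_turns=2):
    """Blank out tool outputs that fall outside the last N conversational turns."""
    user_indices = [i for i, m in enumerate(messages) if m["role"] == "user"]
    # Everything from the Nth-to-last user message onward is protected
    cutoff = user_indices[-protect_last_turns] if len(user_indices) >= protect_last_turns else 0

    pruned = []
    for i, msg in enumerate(messages):
        if i < cutoff and msg["role"] == "tool":
            # Keep the message itself so tool call/result pairs stay matched,
            # but drop the bulky output
            pruned.append({**msg, "content": "[output pruned to save context]"})
        else:
            pruned.append(msg)
    return pruned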

Understanding Session Statistics:

The /status command shows cumulative token usage:

  • Cumulative input tokens: Total tokens sent to the LLM across all calls
    • Each LLM call resends the entire conversation history
    • Note on Anthropic models: PatchPal uses prompt caching
      • System prompt and last 2 messages are cached
      • Cached tokens cost much less than regular input tokens
      • The displayed token counts show raw totals, not cache-adjusted costs
  • Cumulative output tokens: Total tokens generated by the LLM
    • Usually much smaller than input (just the generated responses)
    • Typically costs more per token than input

Important: The token counts shown are raw totals and don't reflect prompt caching discounts. For accurate cost information, check your provider's usage dashboard, which shows cache hits and actual billing.
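
As a back-of-the-envelope check against those raw totals, you can turn the cumulative counts into a rough upper bound on spend (the per-million-token rates below are placeholders, not real pricing, and cache discounts are ignored):

# Rough cost ceiling from the cumulative counts shown by /status.
# Rates are placeholders -- substitute your provider's actual pricing.
def estimate_cost(input_tokens, output_tokens,
                  usd_per_m_input=3.00,    # placeholder rate
                  usd_per_m_output=15.00): # placeholder rate
    return (input_tokens / 1_000_000 * usd_per_m_input
            + output_tokens / 1_000_000 * usd_per_m_output)

# e.g. 2.5M cumulative input tokens and 60k output tokens
print(f"~${estimate_cost(2_500_000, 60_000):.2f} (ignores prompt-cache discounts)")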

Configuration:

See the Configuration section for context management settings, including:

  • PATCHPAL_DISABLE_AUTOCOMPACT - Disable auto-compaction
  • PATCHPAL_COMPACT_THRESHOLD - Adjust compaction threshold
  • PATCHPAL_CONTEXT_LIMIT - Override context limit for testing
  • PATCHPAL_PROACTIVE_PRUNING - Prune tool outputs proactively after calls (default: true, uses smart summarization)
  • PATCHPAL_PRUNE_PROTECT / PATCHPAL_PRUNE_MINIMUM - Pruning controls
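
Since these are ordinary environment variables, reading them boils down to something like this (a sketch; the defaults marked as assumptions are not documented on this page):

import os

# Sketch of reading the settings above; defaults are assumptions unless noted
DISABLE_AUTOCOMPACT = os.environ.get("PATCHPAL_DISABLE_AUTOCOMPACT", "").lower() in ("1", "true")
COMPACT_THRESHOLD   = float(os.environ.get("PATCHPAL_COMPACT_THRESHOLD", "0.75"))  # documented default
CONTEXT_LIMIT       = int(os.environ.get("PATCHPAL_CONTEXT_LIMIT", "200000"))      # assumed; model-dependent
PROACTIVE_PRUNING   = os.environ.get("PATCHPAL_PROACTIVE_PRUNING", "true").lower() != "false"  # default: true
PRUNE_PROTECT       = int(os.environ.get("PATCHPAL_PRUNE_PROTECT", "40000"))       # 40k per "How It Works"
PRUNE_MINIMUM       = int(os.environ.get("PATCHPAL_PRUNE_MINIMUM", "20000"))       # 20k per "How It Works"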

Testing Context Management:

You can test the context management system with small values to trigger compaction quickly:

# Set up small context window for testing
export PATCHPAL_CONTEXT_LIMIT=10000      # Force 10k token limit (instead of 200k for Claude)
export PATCHPAL_COMPACT_THRESHOLD=0.75   # Trigger at 75% (default, but shown for clarity)
                                         # Note: System prompt + output reserve = ~6.4k tokens baseline
                                         # So 75% of 10k = 7.5k, leaving ~1k for conversation
export PATCHPAL_PRUNE_PROTECT=500        # Keep only last 500 tokens of tool outputs
export PATCHPAL_PRUNE_MINIMUM=100        # Prune if we can save 100+ tokens

# Start PatchPal and watch it compact quickly
patchpal

# Generate context with tool calls (tool outputs consume tokens)
You: list all python files
You: read patchpal/agent.py
You: read patchpal/cli.py

# Check status - should show compaction happening
You: /status

# Continue - should see pruning messages
You: search for "context" in all files
# You should see:
# ⚠️  Context window at 75% capacity. Compacting...
#    Pruned old tool outputs (saved ~400 tokens)
# ✓ Compaction complete. Saved 850 tokens (75% → 58%)

How It Works:

  1. Phase 1 - Pruning: When context fills up, old tool outputs are pruned first
     • Keeps the last 40k tokens of tool outputs protected (only tool outputs, not conversation)
     • Only prunes if it saves >20k tokens
     • Pruning is transparent and fast
     • Requires at least 5 messages in history
  2. Phase 2 - Compaction: If pruning isn't enough, full compaction occurs
     • Requires at least 5 messages to be effective
     • LLM summarizes the entire conversation
     • Summary replaces old messages, keeping the last 2 complete conversation turns
     • Work continues seamlessly from the summary
     • Preserves complete tool call/result pairs (important for Bedrock compatibility)
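
A minimal sketch of that two-phase decision, using the documented thresholds (all names are illustrative, not PatchPal's actual functions):

# Sketch of the prune-first, compact-second flow; numbers mirror the defaults above
MIN_MESSAGES = 5
PRUNE_MINIMUM = 20_000      # only prune if it frees more than this
COMPACT_THRESHOLD = 0.75    # auto-compaction trigger

def plan_context_actions(used_tokens, limit, message_count, prunable_tool_tokens):
    """Return the steps to take, in order, as the context window fills up.

    prunable_tool_tokens: tokens held by tool outputs older than the
    protected last 40k (i.e. what pruning could reclaim).
    """
    actions = []
    if used_tokens / limit < COMPACT_THRESHOLD or message_count < MIN_MESSAGES:
        return actions
    if prunable_tool_tokens > PRUNE_MINIMUM:
        actions.append("prune")                 # Phase 1: fast, no LLM call
        used_tokens -= prunable_tool_tokens     # rough estimate of the savings
    if used_tokens / limit >= COMPACT_THRESHOLD:
        actions.append("compact")               # Phase 2: LLM summarizes the history
    return actions

# e.g. 161,897 of 200,000 tokens used, 47 messages, 30k of old tool output
print(plan_context_actions(161_897, 200_000, 47, 30_000))  # -> ['prune']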

Example:

Context Window Status
======================================================================
  Model: anthropic/claude-sonnet-4-5
  Messages in history: 47
  System prompt: 15,234 tokens
  Conversation: 142,567 tokens
  Output reserve: 4,096 tokens
  Total: 161,897 / 200,000 tokens
  Usage: 80%
  [████████████████████████████████████████░░░░░░░░░]

  Auto-compaction: Enabled (triggers at 75%)
======================================================================
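
The usage figure and progress bar follow directly from the listed components; reproducing the arithmetic:

# Reproducing the arithmetic behind the example /status output above
system_prompt  = 15_234
conversation   = 142_567
output_reserve = 4_096
limit          = 200_000

total = system_prompt + conversation + output_reserve   # 161,897
usage = total / limit                                    # 0.809... -> shown as 80%

bar_width = 50
filled = round(usage * bar_width)                        # 40 of 50 cells filled
print(f"Total: {total:,} / {limit:,} tokens   Usage: {int(usage * 100)}%")
print("[" + "█" * filled + "░" * (bar_width - filled) + "]")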

The system ensures you can work for extended periods without hitting context limits.