Context Management
PatchPal automatically manages the context window to prevent "input too long" errors during long coding sessions.
Features:
- Automatic token tracking: Monitors context usage in real-time
- Smart pruning: Removes old tool outputs (keeping the most recent 40k tokens of tool output) before resorting to full compaction
- Auto-compaction: Summarizes conversation history when usage reaches 75% of the context window
- Manual control: Check status with /status, compact with /compact, prune with /prune
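Conceptually, the flow is: track usage after each call, prune old tool outputs first, and fall back to full compaction once usage crosses the threshold. The following is a minimal sketch of that decision logic, assuming hypothetical helpers (estimate_tokens, prune_tool_outputs, compact_history) rather than PatchPal's actual internals:

# Illustrative only - the helper functions are assumptions, not PatchPal's real API.
COMPACT_THRESHOLD = 0.75   # trigger full compaction at 75% of the window
PRUNE_PROTECT = 40_000     # protect the most recent 40k tokens of tool outputs
PRUNE_MINIMUM = 20_000     # only prune when it frees at least 20k tokens

def manage_context(history, limit, estimate_tokens, prune_tool_outputs, compact_history):
    used = estimate_tokens(history)
    if used / limit < COMPACT_THRESHOLD:
        return history                                   # plenty of headroom
    # Phase 1: drop old tool outputs, keeping the most recent ones intact
    pruned = prune_tool_outputs(history, protect_tokens=PRUNE_PROTECT)
    if used - estimate_tokens(pruned) >= PRUNE_MINIMUM:
        history = pruned
    # Phase 2: if still over the threshold, summarize the conversation
    if estimate_tokens(history) / limit >= COMPACT_THRESHOLD:
        history = compact_history(history, keep_last_turns=2)
    return history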
Commands:
# Check context window usage
You: /status
# Output shows:
# - Messages in history
# - Token usage breakdown
# - Visual progress bar
# - Auto-compaction status
# - Session statistics:
#     - Total LLM calls made
#     - Cumulative input tokens (all requests combined)
#     - Cumulative output tokens (all responses combined)
#     - Total tokens (helps estimate API costs)
# Manually trigger compaction
You: /compact
# Useful when:
# - You want to free up context space before a large operation
# - Testing compaction behavior
# - Context is getting full but hasn't auto-compacted yet
# Note: Requires at least 5 messages; most effective when context >50% full
# Manually prune old tool outputs
You: /prune
# Useful when:
# - Large tool outputs (e.g., from grep, file reads) are filling context
# - You want to reclaim space without full compaction
# - Testing pruning behavior
# Note: Keeps last 2 conversational turns; prunes all older tool outputs
Understanding Session Statistics:
The /status command shows cumulative token usage:
- Cumulative input tokens: Total tokens sent to the LLM across all calls
  - Each LLM call resends the entire conversation history
  - Note on Anthropic models: PatchPal uses prompt caching
    - The system prompt and last 2 messages are cached
    - Cached tokens cost much less than regular input tokens
    - The displayed token counts show raw totals, not cache-adjusted costs
- Cumulative output tokens: Total tokens generated by the LLM
  - Usually much smaller than input (just the generated responses)
  - Typically costs more per token than input tokens
Important: The token counts shown are raw totals and don't reflect prompt caching discounts. For accurate cost information, check your provider's usage dashboard which shows cache hits and actual billing.
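Because every call resends the whole history, cumulative input tokens grow much faster than the conversation itself. A rough illustration, using made-up per-turn sizes:

# Rough illustration of why cumulative input grows quickly - the numbers are made up.
history_tokens = 2_000          # system prompt + initial request
per_turn_tokens = 1_500         # each new user turn plus the model's reply
cumulative_input = 0
cumulative_output = 0

for _ in range(10):
    cumulative_input += history_tokens        # the whole history is resent each call
    cumulative_output += 500                  # assumed average response size
    history_tokens += per_turn_tokens         # history keeps growing turn over turn

print(f"History after 10 turns: {history_tokens:,} tokens")
print(f"Cumulative input sent:  {cumulative_input:,} tokens")
print(f"Cumulative output:      {cumulative_output:,} tokens")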
Configuration:
See the Configuration section for context management settings including:
- PATCHPAL_DISABLE_AUTOCOMPACT - Disable auto-compaction
- PATCHPAL_COMPACT_THRESHOLD - Adjust compaction threshold
- PATCHPAL_CONTEXT_LIMIT - Override context limit for testing
- PATCHPAL_PROACTIVE_PRUNING - Prune tool outputs proactively after calls (default: true, uses smart summarization)
- PATCHPAL_PRUNE_PROTECT / PATCHPAL_PRUNE_MINIMUM - Pruning controls
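As a sketch of how these settings might be read at startup (assumed parsing, with defaults taken from the values quoted on this page; not PatchPal's actual startup code):

# Sketch of reading the context-management settings - not PatchPal's real code.
import os

def env_flag(name, default):
    return os.environ.get(name, str(default)).lower() in ("1", "true", "yes")

DISABLE_AUTOCOMPACT = env_flag("PATCHPAL_DISABLE_AUTOCOMPACT", False)
PROACTIVE_PRUNING   = env_flag("PATCHPAL_PROACTIVE_PRUNING", True)
COMPACT_THRESHOLD   = float(os.environ.get("PATCHPAL_COMPACT_THRESHOLD", "0.75"))
CONTEXT_LIMIT       = int(os.environ.get("PATCHPAL_CONTEXT_LIMIT", "200000"))  # model-dependent; 200k for Claude
PRUNE_PROTECT       = int(os.environ.get("PATCHPAL_PRUNE_PROTECT", "40000"))
PRUNE_MINIMUM       = int(os.environ.get("PATCHPAL_PRUNE_MINIMUM", "20000"))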
Testing Context Management:
You can test the context management system with small values to trigger compaction quickly:
# Set up small context window for testing
export PATCHPAL_CONTEXT_LIMIT=10000 # Force 10k token limit (instead of 200k for Claude)
export PATCHPAL_COMPACT_THRESHOLD=0.75 # Trigger at 75% (default, but shown for clarity)
# Note: System prompt + output reserve = ~6.4k tokens baseline
# So 75% of 10k = 7.5k, leaving ~1k for conversation
export PATCHPAL_PRUNE_PROTECT=500 # Keep only last 500 tokens of tool outputs
export PATCHPAL_PRUNE_MINIMUM=100 # Prune if we can save 100+ tokens
# Start PatchPal and watch it compact quickly
patchpal
# Generate context with tool calls (tool outputs consume tokens)
You: list all python files
You: read patchpal/agent.py
You: read patchpal/cli.py
# Check status - should show compaction happening
You: /status
# Continue - should see pruning messages
You: search for "context" in all files
# You should see:
# ⚠️ Context window at 75% capacity. Compacting...
# Pruned old tool outputs (saved ~400 tokens)
# ✓ Compaction complete. Saved 850 tokens (75% → 58%)
How It Works:
- Phase 1 - Pruning: When context fills up, old tool outputs are pruned first
  - Keeps the last 40k tokens of tool outputs protected (only tool outputs, not conversation)
  - Only prunes if it saves >20k tokens
  - Pruning is transparent and fast
  - Requires at least 5 messages in history
- Phase 2 - Compaction: If pruning isn't enough, full compaction occurs
  - Requires at least 5 messages to be effective
  - The LLM summarizes the entire conversation
  - The summary replaces old messages, keeping the last 2 complete conversation turns
  - Work continues seamlessly from the summary
  - Preserves complete tool call/result pairs (important for Bedrock compatibility)
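The following is a simplified sketch of the Phase 1 policy, assuming a flat message list where tool results carry a role of "tool" and a rough token count. It illustrates the behavior described above, not PatchPal's implementation:

# Illustrative Phase 1 pruning - the message shape and token counts are assumptions.
PLACEHOLDER = "[tool output pruned to save context]"

def prune_tool_outputs(messages, protect_tokens=40_000):
    """Replace old tool outputs with a placeholder, protecting the newest ones first."""
    protected = 0
    result = []
    for msg in reversed(messages):                 # walk from newest to oldest
        if msg["role"] == "tool":
            if protected < protect_tokens:
                protected += msg.get("tokens", 0)  # keep the most recent 40k tokens
                result.append(msg)
            else:
                # Keep the message itself so tool call/result pairs stay intact
                # (important for Bedrock-style providers), but drop its bulky content.
                result.append({**msg, "content": PLACEHOLDER, "tokens": 10})
        else:
            result.append(msg)                     # never prune conversation turns
    return list(reversed(result))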
Example:
Context Window Status
======================================================================
Model: anthropic/claude-sonnet-4-5
Messages in history: 47
System prompt: 15,234 tokens
Conversation: 142,567 tokens
Output reserve: 4,096 tokens
Total: 161,897 / 200,000 tokens
Usage: 80%
[████████████████████████████████████████░░░░░░░░░]
Auto-compaction: Enabled (triggers at 75%)
======================================================================
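The Total line is simply the sum of the three components above, compared against the model's context limit:

# Reproducing the example arithmetic (the breakdown itself comes from /status).
system_prompt  = 15_234
conversation   = 142_567
output_reserve = 4_096
context_limit  = 200_000

total = system_prompt + conversation + output_reserve
print(f"Total: {total:,} / {context_limit:,} tokens")   # 161,897 / 200,000 (~80% used)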
The system ensures you can work for extended periods without hitting context limits.