Managing Token Limits in AI Conversations

When working with large language model (LLM) APIs such as OpenAI's GPT-4, each API call is subject to a maximum token limit (e.g., 8,192 or 128,000 tokens, depending on the model). As conversations grow longer, especially when history is retained for context, you can quickly approach or exceed that limit.

This guide explains why token limits are important and provides practical strategies for maintaining context while staying within token constraints.


Why Token Limits Matter

  • All messages in the conversation history (system, user, and assistant) consume tokens.
  • If the total token count exceeds the model’s limit, the API call will fail.
  • The max_tokens parameter determines how much space is reserved for the model’s response.

Example:
If you use GPT-4 (8,192-token limit) and your prompt already uses 7,800 tokens, only 392 tokens remain for the response: at best you receive a very short reply, and the call fails outright if max_tokens exceeds what is left.
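In code, this budgeting is simple arithmetic (the names below are illustrative, not part of any API):

MODEL_CONTEXT_LIMIT = 8192   # GPT-4's total context window
RESERVED_FOR_RESPONSE = 500  # tokens we want to keep free for the reply
prompt_budget = MODEL_CONTEXT_LIMIT - RESERVED_FOR_RESPONSE
# A 7,800-token prompt blows through a 7,692-token budget, so trim first.
print(prompt_budget)  # 7692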


Strategies for Handling Token Limits

1. Truncate Old Messages

Remove the earliest messages from the conversation history (except the system message) to make room for new content.

# Drop the oldest turns until the prompt fits within the reserved budget.
while token_count(messages + [new_prompt]) > MAX_TOKENS - reserved_for_response:
    # Index 0 holds the system message, so pop the next-oldest message instead.
    messages.pop(1)
  • Pros: Simple to implement
  • Cons: Early context is lost
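The token_count helper above is not defined in the snippet; a minimal sketch using the tiktoken library could look like the following (the ~4-token per-message overhead is an approximation from OpenAI's published guidance and varies by model):

import tiktoken

def token_count(messages, model="gpt-4"):
    """Approximate the total token footprint of a list of chat messages."""
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        # Each message adds a few formatting tokens on top of its content;
        # ~4 per message is the commonly cited figure for gpt-4-style models.
        total += 4 + len(enc.encode(msg["content"]))
    return total + 2  # small allowance for the assistant's reply priming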

2. Summarize Old Messages

Before removing messages, ask the model to summarize them. Replace the removed chunk with a summary message.

[
  { "role": "system", "content": "You are David Goggins..." },
  { "role": "assistant", "content": "Summary: user talked about weakness, pain, and you told him to use pain as fuel." },
  { "role": "user", "content": "I'm struggling again. What should I do?" }
]
  • Pros: Retains intent and memory
  • Cons: Requires summarization logic or an extra API call
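One way to produce that summary is an extra call to the same model. Here is a minimal sketch with the openai Python client (the function name, prompt wording, and keep_last default are illustrative assumptions):

from openai import OpenAI

client = OpenAI()

def compact_history(messages, keep_last=10, model="gpt-4"):
    """Replace all but the system message and the last keep_last turns with a summary."""
    system, old, recent = messages[0], messages[1:-keep_last], messages[-keep_last:]
    if not old:
        return messages  # nothing to compact yet
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize this conversation concisely, keeping key facts and commitments."},
            {"role": "user", "content": "\n".join(m["content"] for m in old)},
        ],
    )
    summary = {"role": "assistant", "content": "Summary: " + resp.choices[0].message.content}
    return [system, summary] + recent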

3. Hybrid Approach (Raw + Summary)

  • Keep the most recent 5–10 messages in full
  • Summarize older parts into a single message
  • Use both raw and summarized context

Best for: Long-form chats, character roleplay, coaching bots
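Note that the compact_history sketch above already follows this hybrid shape, with keep_last controlling how many recent messages stay raw. Usage might look like this (the trigger threshold is illustrative):

if token_count(messages) > MAX_TOKENS - reserved_for_response:
    # Keep the last 8 turns verbatim; everything older collapses into a summary.
    messages = compact_history(messages, keep_last=8)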


4. Retrieval-Augmented Memory (Advanced)

  • Store all past conversations externally (e.g., in a vector database)
  • Retrieve the most relevant parts for each new request
  • Inject only those snippets into the prompt

Tools: FAISS, Chroma, LangChain, LlamaIndex

  • Pros: Best for long-term scaling
  • Cons: Requires backend and indexing logic
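As a rough sketch of the pattern using Chroma (the collection name, IDs, and snippet count are arbitrary; a real deployment would use a persistent client and richer metadata):

import chromadb

client = chromadb.Client()
memory = client.create_collection("chat_memory")

def remember(turn_id, text):
    # Chroma embeds the document with its default embedding function.
    memory.add(ids=[turn_id], documents=[text])

def recall(query, k=3):
    # Return the k most similar stored snippets for injection into the prompt.
    results = memory.query(query_texts=[query], n_results=k)
    return results["documents"][0]

remember("turn-1", "User committed to a sub-4-hour marathon goal.")
snippets = recall("How is my marathon training going?")
memory_msg = {"role": "system", "content": "Relevant memory:\n" + "\n".join(snippets)}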

Choosing the Right Approach

Strategy                  Context Retention   Simplicity   Scalability   Use Case
Delete old messages       Low                 Simple       Limited       Short tasks, transactional queries
Summarize & replace       High                Medium       Good          Coaching, journaling, persona-based chats
Hybrid (raw + summary)    Excellent           Moderate     Best          Long-term assistants, memory-enhanced bots
Retrieval-Augmented       Excellent           Complex      Best          Scalable apps with memory or knowledge lookup

Example: Summarize Then Continue

When conversation history grows too long:

  1. Summarize past messages
  2. Insert the summary as an assistant message
  3. Continue with new prompts

messages = [
    { "role": "system", "content": "You are David Goggins..." },
    { "role": "assistant", "content": "Summary of previous 50 messages..." },
    { "role": "user", "content": "I'm slipping again. Help me get back." }
]
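
The compacted list can then be sent like any other request (a sketch with the openai Python client; the max_tokens value is illustrative):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    max_tokens=500,  # keep room reserved for the reply
)
messages.append({"role": "assistant", "content": response.choices[0].message.content})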

Final Tips

  • Always reserve room for the assistant's reply when budgeting the prompt (e.g., 300–1,000 tokens via max_tokens)
  • Use libraries like tiktoken to count tokens before sending
  • Refresh summaries every 20–30 messages to maintain context quality