Managing Token Limits in AI Conversations

When working with large language model (LLM) APIs such as OpenAI's GPT-4, each API call is subject to a maximum token limit (e.g., 8,192 or 128,000 tokens, depending on the model). As conversations grow longer, especially when history is retained for context, you can quickly approach or exceed that limit.

This guide explains why token limits are important and provides practical strategies for maintaining context while staying within token constraints.


Why Token Limits Matter

  • All messages in the conversation history (system, user, and assistant) consume tokens.
  • If the total token count exceeds the model’s limit, the API call will fail.
  • The max_tokens parameter determines how much space is reserved for the model’s response.

Example:
If you use GPT-4 (8,192-token limit) and your prompt already uses 7,800 tokens, only 392 tokens remain for the response: at best you receive a very short reply, and the call fails outright if max_tokens exceeds what is left.
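In code, this budgeting is simple arithmetic (the names below are illustrative, not part of any API):

MODEL_CONTEXT_LIMIT = 8192   # GPT-4's total context window
RESERVED_FOR_RESPONSE = 500  # tokens we want to keep free for the reply
prompt_budget = MODEL_CONTEXT_LIMIT - RESERVED_FOR_RESPONSE
# A 7,800-token prompt blows through a 7,692-token budget, so trim first.
print(prompt_budget)  # 7692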


Strategies for Handling Token Limits

1. Truncate Old Messages

Remove the earliest messages from the conversation history (except the system message) to make room for new content.

# Drop the oldest turns until the prompt fits within the reserved budget.
while token_count(messages + [new_prompt]) > MAX_TOKENS - reserved_for_response:
    # Index 0 holds the system message, so pop the next-oldest message instead.
    messages.pop(1)
  • Pros: Simple to implement
  • Cons: Early context is lost
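The token_count helper above is not defined in the snippet; a minimal sketch using the tiktoken library could look like the following (the ~4-token per-message overhead is an approximation from OpenAI's published guidance and varies by model):

import tiktoken

def token_count(messages, model="gpt-4"):
    """Approximate the total token footprint of a list of chat messages."""
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        # Each message adds a few formatting tokens on top of its content;
        # ~4 per message is the commonly cited figure for gpt-4-style models.
        total += 4 + len(enc.encode(msg["content"]))
    return total + 2  # small allowance for the assistant's reply priming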

2. Summarize Old Messages

Before removing messages, ask the model to summarize them. Replace the removed chunk with a summary message.

[
  { "role": "system", "content": "You are David Goggins..." },
  { "role": "assistant", "content": "Summary: user talked about weakness, pain, and you told him to use pain as fuel." },
  { "role": "user", "content": "I'm struggling again. What should I do?" }
]
  • Pros: Retains intent and memory
  • Cons: Requires summarization logic or an extra API call
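One way to produce that summary is an extra call to the same model. Here is a minimal sketch with the openai Python client (the function name, prompt wording, and keep_last default are illustrative assumptions):

from openai import OpenAI

client = OpenAI()

def compact_history(messages, keep_last=10, model="gpt-4"):
    """Replace all but the system message and the last keep_last turns with a summary."""
    system, old, recent = messages[0], messages[1:-keep_last], messages[-keep_last:]
    if not old:
        return messages  # nothing to compact yet
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize this conversation concisely, keeping key facts and commitments."},
            {"role": "user", "content": "\n".join(m["content"] for m in old)},
        ],
    )
    summary = {"role": "assistant", "content": "Summary: " + resp.choices[0].message.content}
    return [system, summary] + recent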

3. Hybrid Approach (Raw + Summary)

  • Keep the most recent 5–10 messages in full
  • Summarize older parts into a single message
  • Use both raw and summarized context

Best for: Long-form chats, character roleplay, coaching bots
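Note that the compact_history sketch above already follows this hybrid shape, with keep_last controlling how many recent messages stay raw. Usage might look like this (the trigger threshold is illustrative):

if token_count(messages) > MAX_TOKENS - reserved_for_response:
    # Keep the last 8 turns verbatim; everything older collapses into a summary.
    messages = compact_history(messages, keep_last=8)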


4. Retrieval-Augmented Memory (Advanced)

  • Store all past conversations externally (e.g., in a vector database)
  • Retrieve the most relevant parts for each new request
  • Inject only those snippets into the prompt

Tools: FAISS, Chroma, LangChain, LlamaIndex

  • Pros: Best for long-term scaling
  • Cons: Requires backend and indexing logic
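As a rough sketch of the pattern using Chroma (the collection name, IDs, and snippet count are arbitrary; a real deployment would use a persistent client and richer metadata):

import chromadb

client = chromadb.Client()
memory = client.create_collection("chat_memory")

def remember(turn_id, text):
    # Chroma embeds the document with its default embedding function.
    memory.add(ids=[turn_id], documents=[text])

def recall(query, k=3):
    # Return the k most similar stored snippets for injection into the prompt.
    results = memory.query(query_texts=[query], n_results=k)
    return results["documents"][0]

remember("turn-1", "User committed to a sub-4-hour marathon goal.")
snippets = recall("How is my marathon training going?")
memory_msg = {"role": "system", "content": "Relevant memory:\n" + "\n".join(snippets)}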

Choosing the Right Approach

Strategy                  Context Retention   Simplicity   Scalability   Use Case
Delete old messages       Low                 Simple       Limited       Short tasks, transactional queries
Summarize & replace       High                Medium       Good          Coaching, journaling, persona-based chats
Hybrid (raw + summary)    Excellent           Moderate     Best          Long-term assistants, memory-enhanced bots
Retrieval-Augmented       Excellent           Complex      Best          Scalable apps with memory or knowledge lookup

Example: Summarize Then Continue

When conversation history grows too long:

  1. Summarize past messages
  2. Insert the summary as an assistant message
  3. Continue with new prompts

messages = [
    { "role": "system", "content": "You are David Goggins..." },
    { "role": "assistant", "content": "Summary of previous 50 messages..." },
    { "role": "user", "content": "I'm slipping again. Help me get back." }
]
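
The compacted list can then be sent like any other request (a sketch with the openai Python client; the max_tokens value is illustrative):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    max_tokens=500,  # keep room reserved for the reply
)
messages.append({"role": "assistant", "content": response.choices[0].message.content})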

Final Tips

  • Always reserve room for the assistant's reply when budgeting the prompt (e.g., 300–1,000 tokens via max_tokens)
  • Use libraries like tiktoken to count tokens before sending
  • Refresh summaries every 20–30 messages to maintain context quality