When Chatbots See Dead Contexts
Solving the Last 10% of Context Simulation Before Users Do
This is my Roman Empire. I’m not even proud of it, but I think about the absurdity of the Air Canada chatbot case more than I should. It keeps coming back to me because it’s a perfect example of a specific technical problem I’ve been trying to solve.
Jake Moffatt’s grandmother died. He went to Air Canada’s site to ask about bereavement fares. The chatbot told him he could book a regular ticket now and get a partial refund later, as long as he applied within 90 days. So that’s what he did. Then he tried to get his refund. Air Canada said no. The chatbot was wrong. He took them to the British Columbia Civil Resolution Tribunal. Air Canada argued the chatbot was a “separate legal entity” responsible for its own actions. They lost. They had to pay $812.02.
This wasn’t a hallucination problem. The chatbot had access to the correct policy documents. The policy required requesting bereavement fares BEFORE traveling, not after. This was a context problem. The correct information existed but the bot couldn’t figure out which piece of context to trust.
I kept thinking about what happens in production when a customer asks about a refund policy. The system has memory from three weeks ago when the customer asked about the old 60-day policy. The current documentation says 30 days. The chat history shows the customer is upset because they read an old blog post saying 60 days. The customer’s account shows they’re a premium member, which used to mean extended returns. All of this context loads into the prompt at once like a messy group chat that refuses to die. Which one should the bot trust?
Most testing frameworks let you add context as text. They don’t let you specify “this context is 3 weeks old with 0.7 confidence from memory, while this context is current with 0.95 confidence from documentation.” That’s the gap I wanted to fill.
I started building a Context Simulator framework and had to reimagine how to leverage Model Context Protocol for test frameworks. MCP formalizes context structure with metadata schemas, confidence scores, and source hierarchies. I built the simulator on a SQLite database that tracks user memories, chat history, and file contexts. You can add conflicting information across these different sources and test how your prompt handles the conflicts. It’s not magic, but it’s a small step in the right direction.
Under the hood, the framework maintains three tables for persistent state. The memory table tracks user preferences and past interactions. The chat history table maintains conversation continuity. The file context table tracks document versions. The metadata fields in the schema store confidence scores (0.8 vs 0.9), sources (previous_conversation vs documentation), and timestamps. This isn’t storing context. This is storing structured, queryable context state.
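For reference, here’s a minimal sketch of what that schema could look like. The table and column names are my own guesses for illustration, not the framework’s actual DDL:

import sqlite3

def init_tables(db_path="context_simulator.db"):
    # Three tables for persistent state: memories, chat history, file contexts.
    # Each row carries metadata: confidence, source, and a timestamp.
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS memories (
            user_id TEXT, session_id TEXT,
            memory_type TEXT, content TEXT,
            confidence REAL, source TEXT,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE IF NOT EXISTS chat_history (
            user_id TEXT, session_id TEXT,
            role TEXT, message TEXT,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE IF NOT EXISTS file_contexts (
            user_id TEXT, filename TEXT,
            content TEXT, file_type TEXT,
            content_hash TEXT,  -- MD5 of content, for change detection
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
    """)
    conn.commit()
    return conn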
Here’s what using it looks like:
# Seed persistent state for one user and session
simulator.add_memory("user123", "session456",
                     "customer_preference",
                     "Customer prefers email communication")
simulator.add_chat_message("user123", "session456",
                           "user", "I need help with my order")
simulator.add_file_context("user123", "refund_policy.pdf",
                           policy_text, "pdf")

# Assemble everything into one structured prompt
mcp = ModelContextProtocol(simulator)
enhanced_prompt = mcp.create_context_prompt(
    base_prompt, "user123", "session456",
    context_types=["memory", "chat_history", "file_context"])
This lets me test whether a prompt can handle memory plus chat history plus files all at the same time. It’s testing context management, not just prompt quality in isolation.
It gets more interesting when you add conflicts. The Air Canada case was almost certainly a conflict between different policy documents or between what the bot remembered telling customers previously and what the current policy actually says. I can test for that:
test_suite.add_memory_conflict_test(
    "policy_change_handling",
    conflicting_memories=[
        {"content": "Customer was told 60-day returns",
         "confidence": 0.8, "source": "previous_conversation"},
        {"content": "Current policy is 30-day returns",
         "confidence": 0.9, "source": "documentation"}
    ],
    expected_behavior="Acknowledge previous information, explain policy change")
This test explicitly checks what happens when old information conflicts with new information. Can the prompt gracefully handle being wrong about something it previously said? The simulator loads all the context for that user and session and passes it to the prompt with the proper structure. Timestamps, confidence scores, sources. I can test what happens when memory contradicts chat history, when files get updated mid-conversation, all the messy production scenarios before they happen in production.
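To make that concrete, here’s a rough sketch of the kind of structured block the simulator might hand to the prompt. build_context_block and the dictionary keys are hypothetical; the point is that every item keeps its metadata instead of being flattened into prose:

import json

def build_context_block(memories, chat_history, file_contexts):
    # Hypothetical assembly step: each context item keeps its metadata so the
    # model can weigh source, confidence, and age instead of seeing flat text.
    lines = ["## Retrieved context (with metadata)"]
    for m in memories:
        lines.append(json.dumps({
            "type": "memory",
            "content": m["content"],
            "confidence": m["confidence"],
            "source": m["source"],
            "timestamp": m["created_at"],
        }))
    for msg in chat_history:
        lines.append(json.dumps({
            "type": "chat", "role": msg["role"], "content": msg["message"]
        }))
    for f in file_contexts:
        lines.append(json.dumps({
            "type": "file", "filename": f["filename"], "content": f["content"]
        }))
    return "\n".join(lines)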
Traditional evals check “is the answer correct?” I needed evaluators that check different things. Did it handle the context conflict gracefully? Did it maintain conversation coherence?
evaluators = [
    {
        "name": "memory_resolution",
        "criteria": "Should resolve conflicting memories appropriately"
    },
    {
        "name": "conversation_continuity",
        "criteria": "Should maintain conversation continuity"
    },
    {
        "name": "context_awareness",
        "criteria": "Should reference previous interactions correctly"
    }
]
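Running them is a simple LLM-as-judge loop. This sketch assumes a judge() callable standing in for whatever model call you use; it’s not the framework’s actual API:

def run_evaluators(response, context_summary, evaluators, judge):
    # judge(prompt) -> str is a placeholder for any LLM call.
    # Each evaluator becomes a pass/fail judgment plus a short rationale.
    results = {}
    for ev in evaluators:
        prompt = (
            f"Context given to the assistant:\n{context_summary}\n\n"
            f"Assistant response:\n{response}\n\n"
            f"Criterion: {ev['criteria']}\n"
            "Answer PASS or FAIL, then one sentence explaining why."
        )
        verdict = judge(prompt)
        results[ev["name"]] = {
            "passed": verdict.strip().upper().startswith("PASS"),
            "rationale": verdict,
        }
    return results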
I’ve been using this framework on my private projects. Conflicts that would have surfaced in production after user complaints show up in testing. This helps catch scenarios like Air Canada before they reach real users. When a bot confidently states outdated information because it’s prioritizing memory over current documentation, I can see this in testing instead of in production.
When prompts fail under realistic context conflicts, there’s a real incentive to write better instructions: explicit hierarchies that say which context to trust, and graceful degradation for contradictions that can’t be resolved. The gap between “it worked in testing” and “it failed for a real user” has gotten smaller. I’m no longer testing in a sterile environment and hoping production looks the same.
🛠️ Behind the Build
I first took the lazy approach, of course, and tried to extend TruLens and Ragas, but they’re built to measure output quality like groundedness and answer relevance rather than context conflict handling. SpellTest tests across different user personas, not context contradictions. Letta and Mem0 handle production state management well, giving agents persistent memory and conversation history. What’s missing is a testing layer for the conflicts that emerge from all that state.

I spent three days building the first working version. The tech stack was surprisingly simple: SQLite for persistent storage, Python 3.9+ for implementation, JSON for data serialization, and hashlib with MD5 for detecting when file content changes. No complex infrastructure needed. The hard parts were crafting conflict scenarios that felt like production rather than textbook examples, and designing the metadata schema.
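The change-detection piece is as small as it sounds. Something along these lines, where the MD5 hash is only a fingerprint for spotting updated documents, not a security measure:

import hashlib

def content_hash(text: str) -> str:
    # MD5 as a cheap fingerprint of file content.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def file_changed(old_hash: str, new_text: str) -> bool:
    # True when the stored hash no longer matches the current content.
    return content_hash(new_text) != old_hash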
The context simulation problem is the last 10% problem for stateful agents. The easy 90% is getting a prompt that works with clean inputs. The hard 10% is getting it to work when production throws conflicting memories, stale chat history, and updated files at it simultaneously. That’s the 10% that got Air Canada sued.
production_readiness = (context_simulation × state_testing) / clean_input_bias
Clean input bias is when you only test against pristine, conflict-free inputs that never happen in production. The kind of perfect test data that exists exclusively in test suites and demo videos.