Benchmarking AI Agent Memory: Is a Filesystem All You Need?
Letta agents achieve 74% accuracy on the LoCoMo memory benchmark using simple filesystem operations instead of specialized memory tools. This challenges assumptions about what makes effective agent memory.
The Memory Tool Problem
AI agents need long-term memory to avoid forgetting information and losing track of objectives during complex tasks. This has spawned numerous specialized memory tools promising better retrieval through knowledge graphs and vector databases.
But evaluating these tools proves difficult. Memory effectiveness depends more on how well agents use tools than on the tools themselves. A theoretically superior search tool fails if the agent cannot call it properly.
Testing Simple Filesystem Operations
Letta recently added filesystem support, allowing agents to attach files and use basic operations:
- `grep` for text matching
- `search_files` for semantic search
- `open` and `close` for file access
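As a rough sketch, these could be exposed to the model as function-calling tool schemas. The layout below follows the OpenAI function-calling convention and is an assumption; Letta's actual tool definitions are not shown here:

```python
def tool(name, desc, params):
    """Build an OpenAI-style function tool schema (layout is an assumption)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": desc,
            "parameters": {
                "type": "object",
                "properties": {p: {"type": "string"} for p in params},
                "required": params,
            },
        },
    }

FILESYSTEM_TOOLS = [
    tool("grep", "Return lines in attached files matching a text pattern.", ["pattern"]),
    tool("search_files", "Semantic search over attached files.", ["query"]),
    tool("open", "Page a file's contents into the context window.", ["path"]),
    tool("close", "Close an open file to free context.", ["path"]),
]
```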
We tested this approach on LoCoMo, a question-answering benchmark using long conversations between fictional speakers. Instead of specialized memory tools, we simply placed conversation histories into files.
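A minimal sketch of that file placement, assuming each conversation is a list of sessions made up of `{"speaker", "text"}` turns (the real LoCoMo schema may differ):

```python
from pathlib import Path

def conversation_to_files(conversation, out_dir="attached_files"):
    """Write each session as a plain-text file the agent can grep/search.

    `conversation` is assumed to be a list of sessions, each a list of
    {"speaker": ..., "text": ...} turns; the benchmark's actual schema
    may differ.
    """
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for i, session in enumerate(conversation, start=1):
        lines = [f'{turn["speaker"]}: {turn["text"]}' for turn in session]
        (out / f"session_{i}.txt").write_text("\n".join(lines))
```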
The agent runs on GPT-4o mini with minimal constraints: it must start with `search_files` and continue searching until it is ready to call `answer_question`. The agent decides what to search for and how many iterations to perform.
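A minimal sketch of that loop, calling the OpenAI Python client directly rather than Letta itself. Here `execute_tool` is a hypothetical callback that runs the filesystem tools against the conversation files, and an `answer_question` tool is assumed to be included in `tools`:

```python
import json
from openai import OpenAI

client = OpenAI()

def run_agent(question, tools, execute_tool, max_steps=10):
    """Search-until-answer loop: keep issuing tool calls until the model
    calls answer_question or the step budget runs out."""
    messages = [
        {"role": "system", "content": (
            "Answer questions about the attached conversation files. "
            "Start with search_files, keep searching as needed, then call "
            "answer_question once you have enough information."
        )},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=tools,
        )
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            break
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            if call.function.name == "answer_question":
                return args.get("answer")  # hypothetical terminal tool
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": execute_tool(call.function.name, args),
            })
    return None
```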
Result: 74.0% accuracy, significantly above the 68.5% reported by the specialized memory tool Mem0.
Why Filesystem Beats Specialized Tools
Agents excel at using familiar tools likely present in their training data. Filesystem operations appear extensively in coding datasets, making them natural for LLMs to understand and execute.
Agents can:
- Generate better search queries than the original questions, for example transforming “How does Calvin stay motivated when faced with setbacks?” into “Calvin motivation setbacks” (see the sketch after this list)
- Continue searching iteratively until finding relevant information
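A toy illustration of why that reformulation matters for grep-style matching. The transcript line and the loose keyword-prefix matching are invented for this example:

```python
import re

transcript_line = "Calvin said setbacks at work only fuel his motivation to train harder."

question = "How does Calvin stay motivated when faced with setbacks?"
keywords = "Calvin motivation setbacks"

# Grepping for the verbatim question finds nothing:
print(re.search(re.escape(question), transcript_line))  # None

# But requiring each keyword (truncated to a crude prefix "stem") matches:
hit = all(re.search(kw[:6], transcript_line, re.IGNORECASE)
          for kw in keywords.split())
print(hit)  # True
```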
Specialized memory tools, while potentially more sophisticated, may be harder for agents to use effectively due to limited training exposure.
Agent Capabilities Matter Most
Effective memory depends on successful information retrieval when needed. This makes agent tool-use capabilities more important than specific retrieval mechanisms.
Simple tools offer advantages:
- Higher likelihood of appearing in training data
- Better agent understanding and execution
- More reliable performance across different models
Complex solutions like knowledge graphs may help in specific domains but often sacrifice usability for theoretical improvements.
Better Memory Evaluation
Current benchmarks like LoCoMo focus on retrieval rather than dynamic memory management. The Letta Memory Benchmark provides better evaluation by:
- Keeping framework and tools constant
- Testing memory interactions dynamically
- Comparing model capabilities directly
Task-based evaluation also proves valuable. Terminal-Bench measures agent performance on long-running tasks requiring memory to track state and progress.
Implementation Guidance
Start with simple filesystem tools before adding complexity:
- Use basic file operations (`grep`, `search`, `open`)
- Let agents generate their own search queries
- Allow iterative searching until information is found
- Add specialized tools only when simple approaches fail
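One way to apply that ordering is a thin retrieval wrapper that escalates only when cheap lexical matching comes up empty. This is a sketch; `semantic_search` is a hypothetical stand-in for whatever specialized tool you might add later:

```python
import re
from pathlib import Path

def retrieve(query, file_dir="attached_files", semantic_search=None):
    """Try grep-style keyword matching first; escalate only on a miss."""
    hits = []
    for path in Path(file_dir).glob("*.txt"):
        for line in path.read_text().splitlines():
            # A line counts as a hit if it contains any query keyword.
            if any(re.search(re.escape(kw), line, re.IGNORECASE)
                   for kw in query.split()):
                hits.append(f"{path.name}: {line}")
    if hits:
        return hits
    # Fall back to a specialized tool only when the simple approach fails.
    return semantic_search(query) if semantic_search else []
```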
Next Steps
Test your agent’s memory capabilities using the Letta Memory Benchmark or build agents with filesystem tools on the Letta Platform.
Simple filesystem operations often outperform specialized memory tools because agent capabilities matter more than retrieval mechanisms.