Benchmarking AI Agent Memory: Is a Filesystem All You Need?
Letta agents achieve 74% accuracy on the LoCoMo memory benchmark using simple filesystem operations instead of specialized memory tools. This challenges assumptions about what makes effective agent memory.
The Memory Tool Problem
AI agents need long-term memory to avoid forgetting information and losing track of objectives during complex tasks. This has spawned numerous specialized memory tools promising better retrieval through knowledge graphs and vector databases.
But evaluating these tools proves difficult. Memory effectiveness depends more on how well agents use tools than on the tools themselves. A theoretically superior search tool fails if the agent cannot call it properly.
Testing Simple Filesystem Operations
Letta recently added filesystem support, allowing agents to attach files and use basic operations:
- `grep` for text matching
- `search_files` for semantic search
- `open` and `close` for file access
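As a rough sketch, these could be exposed to the model as function-calling tool schemas. The layout below follows the OpenAI function-calling convention and is an assumption; Letta's actual tool definitions are not shown here:

```python
def tool(name, desc, params):
    """Build an OpenAI-style function tool schema (layout is an assumption)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": desc,
            "parameters": {
                "type": "object",
                "properties": {p: {"type": "string"} for p in params},
                "required": params,
            },
        },
    }

FILESYSTEM_TOOLS = [
    tool("grep", "Return lines in attached files matching a text pattern.", ["pattern"]),
    tool("search_files", "Semantic search over attached files.", ["query"]),
    tool("open", "Page a file's contents into the context window.", ["path"]),
    tool("close", "Close an open file to free context.", ["path"]),
]
```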
We tested this approach on LoCoMo, a question-answering benchmark using long conversations between fictional speakers. Instead of specialized memory tools, we simply placed conversation histories into files.
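A minimal sketch of that file placement, assuming each conversation is a list of sessions made up of `{"speaker", "text"}` turns (the real LoCoMo schema may differ):

```python
from pathlib import Path

def conversation_to_files(conversation, out_dir="attached_files"):
    """Write each session as a plain-text file the agent can grep/search.

    `conversation` is assumed to be a list of sessions, each a list of
    {"speaker": ..., "text": ...} turns; the benchmark's actual schema
    may differ.
    """
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for i, session in enumerate(conversation, start=1):
        lines = [f'{turn["speaker"]}: {turn["text"]}' for turn in session]
        (out / f"session_{i}.txt").write_text("\n".join(lines))
```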
The agent runs on GPT-4o mini with minimal constraints: it must start with `search_files` and continue searching until it is ready to call `answer_question`. The agent decides what to search for and how many iterations to perform.
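A minimal sketch of that loop, calling the OpenAI Python client directly rather than Letta itself. Here `execute_tool` is a hypothetical callback that runs the filesystem tools against the conversation files, and an `answer_question` tool is assumed to be included in `tools`:

```python
import json
from openai import OpenAI

client = OpenAI()

def run_agent(question, tools, execute_tool, max_steps=10):
    """Search-until-answer loop: keep issuing tool calls until the model
    calls answer_question or the step budget runs out."""
    messages = [
        {"role": "system", "content": (
            "Answer questions about the attached conversation files. "
            "Start with search_files, keep searching as needed, then call "
            "answer_question once you have enough information."
        )},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=tools,
        )
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            break
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            if call.function.name == "answer_question":
                return args.get("answer")  # hypothetical terminal tool
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": execute_tool(call.function.name, args),
            })
    return None
```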
Result: 74.0% accuracy, significantly above the 68.5% reported by the specialized memory tool Mem0.
Why Filesystem Beats Specialized Tools
Agents excel at using familiar tools likely present in their training data. Filesystem operations appear extensively in coding datasets, making them natural for LLMs to understand and execute.
Agents can:
- Generate better search queries than the original questions, for example transforming “How does Calvin stay motivated when faced with setbacks?” into “Calvin motivation setbacks” (see the sketch after this list)
- Continue searching iteratively until finding relevant information
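A toy illustration of why that reformulation matters for grep-style matching. The transcript line and the loose keyword-prefix matching are invented for this example:

```python
import re

transcript_line = "Calvin said setbacks at work only fuel his motivation to train harder."

question = "How does Calvin stay motivated when faced with setbacks?"
keywords = "Calvin motivation setbacks"

# Grepping for the verbatim question finds nothing:
print(re.search(re.escape(question), transcript_line))  # None

# But requiring each keyword (truncated to a crude prefix "stem") matches:
hit = all(re.search(kw[:6], transcript_line, re.IGNORECASE)
          for kw in keywords.split())
print(hit)  # True
```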
Specialized memory tools, while potentially more sophisticated, may be harder for agents to use effectively due to limited training exposure.
Agent Capabilities Matter Most
Effective memory depends on successful information retrieval when needed. This makes agent tool-use capabilities more important than specific retrieval mechanisms.
Simple tools offer advantages:
- Higher likelihood of appearing in training data
- Better agent understanding and execution
- More reliable performance across different models
Complex solutions like knowledge graphs may help in specific domains but often sacrifice usability for theoretical improvements.
Better Memory Evaluation
Current benchmarks like LoCoMo focus on retrieval rather than dynamic memory management. The Letta Memory Benchmark provides better evaluation by:
- Keeping framework and tools constant
- Testing memory interactions dynamically
- Comparing model capabilities directly
Task-based evaluation also proves valuable. Terminal-Bench measures agent performance on long-running tasks requiring memory to track state and progress.
Implementation Guidance
Start with simple filesystem tools before adding complexity:
- Use basic file operations (`grep`, `search`, `open`)
- Let agents generate their own search queries
- Allow iterative searching until information is found
- Add specialized tools only when simple approaches fail
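One way to apply that ordering is a thin retrieval wrapper that escalates only when cheap lexical matching comes up empty. This is a sketch; `semantic_search` is a hypothetical stand-in for whatever specialized tool you might add later:

```python
import re
from pathlib import Path

def retrieve(query, file_dir="attached_files", semantic_search=None):
    """Try grep-style keyword matching first; escalate only on a miss."""
    hits = []
    for path in Path(file_dir).glob("*.txt"):
        for line in path.read_text().splitlines():
            # A line counts as a hit if it contains any query keyword.
            if any(re.search(re.escape(kw), line, re.IGNORECASE)
                   for kw in query.split()):
                hits.append(f"{path.name}: {line}")
    if hits:
        return hits
    # Fall back to a specialized tool only when the simple approach fails.
    return semantic_search(query) if semantic_search else []
```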
Next Steps
Test your agent’s memory capabilities using the Letta Memory Benchmark or build agents with filesystem tools on the Letta Platform.
Simple filesystem operations often outperform specialized memory tools because agent capabilities matter more than retrieval mechanisms.