Spinach: SPARQL-Based Information Navigation for Challenging Real-World Questions
Knowledge Base Question Answering (KBQA) systems face a critical challenge: existing datasets fail to capture the true complexity of real-world queries. Researchers introduce Spinach, a new dataset of challenging real-world questions and an agent that tackles them through dynamic SPARQL exploration of large knowledge bases like Wikidata.
The Problem with Current KBQA Datasets
Most KBQA datasets suffer from fundamental limitations: they either contain overly simple questions or use synthetically generated logical forms that don’t reflect real-world complexity. For example, WikiWebQuestions averages only 2.63 clauses per query, while synthetic datasets like KQA Pro yield artificially high accuracy because models can memorize their limited patterns during training.
This creates a dangerous disconnect: systems that excel on synthetic benchmarks often fail when confronted with genuine user queries. The community needs datasets with both natural questions and naturally complex logical forms.
Introducing the Spinach Dataset
The Spinach dataset addresses this gap by mining real conversations from Wikidata’s “Request a Query” forum. This forum hosts discussions where users seek help writing SPARQL queries for complex information needs.
The dataset contains 320 expert-annotated question-SPARQL pairs derived from actual forum discussions. These queries are significantly more complex than existing datasets, averaging 8.89 clauses per query with 2.50 projections and involving 298 unique Wikidata properties.
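For a sense of what these pairs look like, here is a hypothetical record in the shape of a Spinach entry (our illustration, not an actual dataset example): a single natural question whose SPARQL needs grouping, aggregation, a HAVING filter, ordering, and label resolution.

```python
# A hypothetical record in the shape of a Spinach question-SPARQL pair
# (illustrative only, not an actual dataset entry).
example_record = {
    "question": (
        "Which countries have more than ten UNESCO World Heritage Sites, "
        "and how many does each have?"
    ),
    "sparql": """
SELECT ?country ?countryLabel (COUNT(?site) AS ?numSites) WHERE {
  ?site wdt:P31 wd:Q9259 ;   # instance of: World Heritage Site
        wdt:P17 ?country .   # country
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?country ?countryLabel
HAVING (COUNT(?site) > 10)
ORDER BY DESC(?numSites)
""",
}
```

Even this modest example has three projections and several interacting clauses, which is roughly the regime the dataset's averages describe.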
Dataset Creation Process
Three Wikidata experts manually processed forum conversations spanning from July 2016 to May 2024. They:
- Filtered out Wikimedia-specific optimizations and formatting clauses
- Created decontextualized natural language questions that accurately capture SPARQL meaning
- Disambiguated entities and properties to avoid confusion
- Ensured questions reflect real information needs rather than technical debugging
The resulting dataset represents genuine complexity found in production knowledge base usage.
The Spinach Agent Architecture
The Spinach agent mimics how human experts write SPARQL queries. Rather than exploring the knowledge graph one edge at a time, as most prior agents do, it uses the full expressiveness of SPARQL throughout exploration.
Core Design Principles
The agent follows a human-like approach:
- Start simple: Begin with basic query fragments
- Verify assumptions: Execute intermediate queries to understand knowledge base structure
- Build incrementally: Add complexity one piece at a time
- Learn from failures: Use empty results and syntax errors as learning signals
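The loop can be sketched concretely as follows (a minimal sketch, assuming the public Wikidata SPARQL endpoint; the specific drafts are illustrative, not an actual agent trace):

```python
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def execute_sparql(query: str) -> list:
    """Run a query against the public Wikidata endpoint, returning bindings."""
    resp = requests.get(
        WDQS_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "spinach-sketch/0.1 (example)"},
    )
    resp.raise_for_status()  # a malformed query surfaces as HTTP 400
    return resp.json()["results"]["bindings"]

# Start simple: does the basic pattern match anything at all?
draft = "SELECT ?site WHERE { ?site wdt:P31 wd:Q9259 . } LIMIT 5"
assert execute_sparql(draft), "base pattern is wrong; re-search the class"

# Build incrementally: add one clause (country of each site) and re-check.
draft = """SELECT ?site ?country WHERE {
  ?site wdt:P31 wd:Q9259 ;
        wdt:P17 ?country .
} LIMIT 5"""
rows = execute_sparql(draft)

# Learn from failures: an empty result or an HTTP 400 is a signal to revise
# the most recent clause, not to pile on more complexity.
if not rows:
    print("P17 gave no bindings here; try a different property and re-verify.")
```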
Available Actions
The agent can perform five key actions:
- search_wikidata(string): Find entities and properties matching text queries
- get_wikidata_entry(QID): Retrieve all outgoing edges for a specific entity
- get_property_examples(PID): See usage examples for properties
- execute_sparql(SPARQL): Run queries and analyze results
- stop(): Mark the final query as complete
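A hedged sketch of what these actions might look like as thin wrappers over public Wikidata APIs (the endpoints are real; the function bodies are our assumptions, not the paper's implementation):

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
HEADERS = {"User-Agent": "spinach-sketch/0.1 (example)"}

def search_wikidata(text: str, entity_type: str = "item") -> list:
    """Find entities or properties matching text (wbsearchentities API)."""
    params = {
        "action": "wbsearchentities", "search": text, "type": entity_type,
        "language": "en", "format": "json",
    }
    return requests.get(WIKIDATA_API, params=params, headers=HEADERS).json()["search"]

def get_wikidata_entry(qid: str) -> dict:
    """Retrieve all outgoing edges (claims) of one entity (wbgetentities API)."""
    params = {"action": "wbgetentities", "ids": qid, "format": "json"}
    data = requests.get(WIKIDATA_API, params=params, headers=HEADERS).json()
    return data["entities"][qid]["claims"]

def execute_sparql(query: str) -> list:
    """Run a SPARQL query against the public Wikidata endpoint."""
    params = {"query": query, "format": "json"}
    resp = requests.get(WDQS_ENDPOINT, params=params, headers=HEADERS)
    return resp.json()["results"]["bindings"]

def get_property_examples(pid: str, limit: int = 3) -> list:
    """Sample a few triples that use a property, to see it in context."""
    query = f"SELECT ?s ?o WHERE {{ ?s wdt:{pid} ?o . }} LIMIT {limit}"
    return execute_sparql(query)

def stop() -> None:
    """Mark the current draft query as final; the agent loop exits here."""
    return None
```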
State Management
Unlike previous approaches that maintain an explicit subgraph as their state, Spinach tracks the complete history of actions and observations. This makes it possible to handle questions involving large result sets or complex computations that could not fit in a fixed subgraph representation.
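One way to picture that state (a minimal sketch; the class and method names here are ours, not the paper's):

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    action: str       # e.g. "execute_sparql(SELECT ?site WHERE ...)"
    observation: str  # truncated result table, error message, entity dump

@dataclass
class AgentState:
    """The state is the full trajectory, not a fixed subgraph snapshot."""
    question: str
    history: list = field(default_factory=list)

    def record(self, action: str, observation: str) -> None:
        self.history.append(Turn(action, observation))

    def to_prompt(self) -> str:
        """Serialize the entire trajectory for the next LLM decision."""
        lines = [f"Question: {self.question}"]
        for i, turn in enumerate(self.history, start=1):
            lines.append(f"Action {i}: {turn.action}")
            lines.append(f"Observation {i}: {turn.observation}")
        return "\n".join(lines)
```

Because the state is just a transcript, a result set too large to hold as an explicit subgraph can simply be truncated or summarized inside the observation text.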
Evaluation Results
The Spinach agent achieves state-of-the-art performance across multiple benchmarks:
- QALD-7: 31.0% F1 improvement over the previous state of the art
- QALD-9-Plus: 27.0% F1 improvement
- QALD-10: 10.0% F1 improvement
- WikiWebQuestions: within 1.6% F1 of the fine-tuned state of the art
On the challenging Spinach dataset, the agent outperforms all baselines, including the best GPT-4-based KBQA system, by at least 38.1% F1.
Error Analysis
Analysis of failure cases reveals common challenges:
- Property-related problems (40%): Incorrect property selection or usage
- Complex SPARQL construction (30%): Difficulty with advanced query patterns
- Insufficient exploration (15%): Hitting action limits before finding solutions
- Semantic parsing errors (10%): Adding unnecessary constraints
- Formatting issues (5%): Minor output format problems
Evaluation Methodology
The research introduces row-major generalizations of the Exact Match (EM) and F1 metrics to handle the multi-projection queries common in real-world scenarios. Traditional metrics assume single-field outputs, whereas Spinach queries average 2.50 projections each.
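A minimal sketch of the row-major idea as we read it (assuming each result row is compared as a whole tuple against the gold rows; the paper's exact definition may differ in details such as column ordering):

```python
def row_major_f1(predicted: list, gold: list) -> float:
    """F1 where each multi-projection result row counts as one unit."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set or not gold_set:
        return 1.0 if pred_set == gold_set else 0.0
    tp = len(pred_set & gold_set)  # rows that match the gold exactly
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)

# Example: two-projection rows (country, count); only one row matches.
pred = [("Italy", 59), ("China", 57)]
gold = [("Italy", 59), ("China", 59)]
print(round(row_major_f1(pred, gold), 2))  # 0.5
```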
Implications for KBQA Research
The Spinach dataset and agent demonstrate several key insights:
- Real complexity matters: Synthetic datasets create false confidence in system capabilities
- Dynamic exploration works: Full SPARQL expressiveness outperforms edge-by-edge traversal
- Human-like reasoning helps: Mimicking expert query-writing strategies improves performance
- Scale challenges remain: Even state-of-the-art systems achieve only 45.3% F1 on real-world queries
Future Directions
The relatively modest performance on Spinach queries (16.4% EM, 45.3% F1) indicates substantial room for improvement. The agent’s transparent reasoning process enables users to continue conversations and refine queries interactively.
The researchers have deployed Spinach publicly at spinach.genie.stanford.edu and as SpinachBot on Wikidata, providing a community resource for accessing complex knowledge base information.
This work establishes a new benchmark for evaluating KBQA systems against genuine user needs rather than artificial constraints, pushing the field toward more practical and robust question answering capabilities.