FINTAGGING: Benchmarking LLMs for Extracting and Structuring Financial Information
Large language models excel at many tasks, but their ability to handle financial tagging remains underexplored. XBRL (eXtensible Business Reporting Language) requires mapping thousands of financial facts to over 17,000 standardized US-GAAP concepts—a complex process that existing benchmarks oversimplify.
The Problem with Current Benchmarks
Current XBRL tagging benchmarks reduce the task to flat classification over small concept subsets, ignoring both the hierarchical nature of financial taxonomies and the structure of real filings, where facts are spread across narrative text and tables. They typically cover only a few thousand concepts instead of the full 17,000+ taxonomy, making it impossible to assess performance on rare but important concepts.
Most problematically, these benchmarks formulate tagging as extreme classification, which becomes brittle when the full taxonomy cannot fit in context. This approach places heavy emphasis on recalling label identifiers rather than understanding semantic relationships.
Introducing FINTAGGING
FINTAGGING addresses these limitations by decomposing XBRL tagging into two distinct subtasks:
Financial Numeric Identification (FinNI)
Extracts numerical entities from mixed text and table contexts and classifies them by data type (monetary, percentage, shares, per-share, or integer). This subtask evaluates whether models can locate financial values using semantic cues across different document structures.
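To make the task concrete, here is a minimal sketch of what a FinNI prediction could look like as structured output: the model emits (value, type) pairs as JSON, which are then parsed and validated against the five-way type set. The JSON schema, field names, and parser below are illustrative assumptions, not the benchmark's actual interface.

```python
# A minimal sketch of a FinNI-style output format, assuming a JSON list of
# (value, type) pairs; FINTAGGING's real prompt and schema may differ.
import json
from dataclasses import dataclass

ALLOWED_TYPES = {"monetary", "percentage", "shares", "per-share", "integer"}

@dataclass
class NumericEntity:
    value: str        # the numeral as it appears in the filing, e.g. "1,234.5"
    data_type: str    # one of ALLOWED_TYPES

def parse_finni_output(raw: str) -> list[NumericEntity]:
    """Parse a model's JSON response into typed numeric entities,
    dropping items whose type falls outside the five-way label set."""
    entities = []
    for item in json.loads(raw):
        dtype = item.get("type", "").lower()
        if dtype in ALLOWED_TYPES:
            entities.append(NumericEntity(value=item["value"], data_type=dtype))
    return entities

# Example model response for a sentence like
# "Revenue grew 12% to $4.2 billion, or $1.85 per diluted share."
raw_response = '[{"value": "12", "type": "percentage"},' \
               ' {"value": "4.2 billion", "type": "monetary"},' \
               ' {"value": "1.85", "type": "per-share"}]'
print(parse_finni_output(raw_response))
```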
Financial Concept Linking (FinCL)
Maps extracted entities to the correct US-GAAP concepts through semantic alignment rather than label recall. This reframes tagging as a retrieval-reranking task, making full-taxonomy evaluation feasible for LLMs.
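A hedged sketch of the retrieval half of this formulation: embed concept descriptions once, retrieve the nearest concepts for each extracted fact and its local context, and pass that shortlist to an LLM for reranking. The embedding model, toy concept list, and scoring below are placeholder choices, not the benchmark's actual retriever or taxonomy file.

```python
# Embed concept descriptions once, then retrieve top-k candidates by cosine
# similarity. In practice the index would cover the full 17,688-concept
# US-GAAP taxonomy with each concept's standard label and documentation.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder choice

concepts = {
    "us-gaap:Revenues": "Total revenue recognized during the period.",
    "us-gaap:CashAndCashEquivalentsAtCarryingValue":
        "Cash and short-term, highly liquid investments.",
    "us-gaap:EarningsPerShareDiluted": "Diluted net income per share.",
}
concept_ids = list(concepts)
concept_vecs = encoder.encode(list(concepts.values()), normalize_embeddings=True)

def retrieve_candidates(fact_context: str, k: int = 2) -> list[str]:
    """Return the k concept IDs whose descriptions are closest to the
    extracted fact's surrounding text (dot product of unit vectors)."""
    query_vec = encoder.encode([fact_context], normalize_embeddings=True)[0]
    scores = concept_vecs @ query_vec
    top = np.argsort(-scores)[:k]
    return [concept_ids[i] for i in top]

# The candidate list would then go to an LLM for reranking / final selection,
# which is the stage where most linking errors are reported to occur.
print(retrieve_candidates("diluted earnings per share of $1.85"))
```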
Key Findings
Evaluating 13 state-of-the-art LLMs reveals a consistent extraction-linking gap:
Strong Extraction Performance: Top models like DeepSeek-V3 and GPT-4o reliably identify numerical facts from complex financial documents, achieving F1 scores above 0.69 on FinNI.
Weak Concept Alignment: The same models struggle with fine-grained concept linking: FinCL accuracy remains below 0.19 even with retrieval assistance. Most errors involve near-neighbor concepts that share surface cues but differ in scope or context.
Error Propagation: Pipeline analysis shows cascading failures. End-to-end accuracy improves when gold entities are provided and improves again when correct concepts appear in candidate sets, but significant gaps remain even under ideal conditions.
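One way to make this decomposition explicit is to log, per instance, whether extraction succeeded, whether the gold concept reached the candidate set, and whether the final link was correct, then report stage-conditional accuracies. The record fields and toy numbers below are illustrative, not the paper's evaluation code.

```python
# A minimal sketch of stage-wise error decomposition for an extract-then-link
# pipeline; field names and example records are illustrative only.
from dataclasses import dataclass

@dataclass
class PipelineRecord:
    extracted_ok: bool      # FinNI found the fact with the right value/type
    candidate_hit: bool     # gold concept appeared in the retrieved candidates
    linked_ok: bool         # final concept prediction matched the gold label

def stage_accuracy(records: list[PipelineRecord]) -> dict[str, float]:
    """Accuracy of each stage, conditioned on the previous stage succeeding,
    separating extraction, retrieval, and linking errors."""
    n = len(records)
    extracted = [r for r in records if r.extracted_ok]
    retrieved = [r for r in extracted if r.candidate_hit]
    return {
        "extraction": len(extracted) / n,
        "retrieval_given_extraction": len(retrieved) / max(len(extracted), 1),
        "linking_given_retrieval":
            sum(r.linked_ok for r in retrieved) / max(len(retrieved), 1),
        "end_to_end": sum(r.linked_ok for r in records) / n,
    }

records = [
    PipelineRecord(True, True, True),
    PipelineRecord(True, True, False),   # near-neighbor confusion at linking
    PipelineRecord(True, False, False),  # gold concept missing from candidates
    PipelineRecord(False, False, False), # fact never extracted
]
print(stage_accuracy(records))
```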
Why This Matters
The extraction-linking gap reveals that representations sufficient for identifying financial facts don’t consistently encode the fine-grained semantic distinctions needed to separate closely related concepts. For example, models might correctly extract “restricted cash” values but confuse aggregate concepts with component-specific ones.
This finding has practical implications for financial AI systems. While LLMs show promise for automating parts of the XBRL workflow, they require careful design to handle the semantic precision demanded by regulatory reporting.
Implementation Details
FINTAGGING uses real SEC 10-K filings from 142 companies across 11 industry sectors. The benchmark includes:
- 28,787 instances for FinNI evaluation
- 261,457 query-answer pairs for FinCL evaluation
- Expert validation with 96% agreement (κ = 0.81)
- Full US-GAAP coverage spanning 17,688 unique concepts
The evaluation framework supports both individual subtask assessment and end-to-end pipeline analysis, enabling researchers to isolate specific failure modes.
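As an illustration of subtask-level scoring, the sketch below computes set-based F1 over (value, type) pairs for FinNI; the benchmark's exact matching rules (value normalization, scaling, sign handling) may differ.

```python
# A minimal sketch of set-based F1 over (value, type) pairs for FinNI;
# matching criteria here are an assumption, not the benchmark's exact rules.
def finni_f1(gold: set[tuple[str, str]], pred: set[tuple[str, str]]) -> float:
    """F1 with exact matching on normalized (value, data_type) pairs."""
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)                 # correctly extracted and typed facts
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("4.2 billion", "monetary"), ("12", "percentage"), ("1.85", "per-share")}
pred = {("4.2 billion", "monetary"), ("12", "percentage")}
print(round(finni_f1(gold, pred), 3))  # 0.8
```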
Next Steps
FINTAGGING provides a foundation for developing more reliable financial AI systems. Future work should focus on improving semantic alignment capabilities, perhaps through targeted fine-tuning or hybrid approaches that combine retrieval with structured reasoning.
The benchmark and datasets are available on GitHub and Hugging Face, supporting reproducible research in financial NLP and regulatory technology.