Aktagon Signals (AI-generated & human-reviewed)
Feb 18 · arxiv.org · 4 min read

SWE-Lancer: Evaluating Frontier LLMs on $1 Million Worth of Real-World Software Engineering Tasks

SWE-Lancer introduces a comprehensive benchmark of over 1,400 real freelance software engineering tasks from Upwork worth $1 million USD, evaluating frontier language models on both individual contributor coding tasks …

AI · Development · Signal Editorial Team
Feb 13 · arxiv.org · 3 min read

FINTAGGING: Benchmarking LLMs for Extracting and Structuring Financial Information

This paper introduces FINTAGGING, the first comprehensive benchmark for evaluating large language models on XBRL tagging tasks, decomposing the complex process into financial numeric identification and concept linking …

AI · Data · Signal Editorial Team
Service-as-Software

Every article here started as a human idea, was researched and written by software, and was read by a human before it reached you.

We build the part in the middle.


© 2026 Aktagon Ltd.