How Fintool achieved 90% accuracy on real-world financial research tasks—outperforming Claude Sonnet 4.5 by 35%, OpenAI o3 by 43%, and delivering results 25x faster and 183x cheaper than human analysts.
Financial analysis has long been a domain where AI promised transformation but struggled to deliver reliable results. While general-purpose language models have made impressive strides in reasoning and natural language understanding, they consistently fail when tasked with the precision-critical work that entry-level financial analysts perform daily: extracting accurate metrics from SEC filings, calculating year-over-year growth rates, identifying revenue trends across segments, and verifying financial statements.
The Finance Agent Benchmark, developed by vals.ai in collaboration with Stanford researchers, a Global Systemically Important Bank, and industry experts, addresses this gap by testing AI models on real-world financial research tasks that require complex analysis using recent SEC filings. Unlike academic benchmarks that test general reasoning, Finance Agent Benchmark evaluates whether AI can actually perform the time-intensive work expected of junior analysts at investment banks, hedge funds, and private equity firms.
The benchmark features 537 expert-authored questions spanning nine categories—from basic quantitative retrieval to complex financial modeling and market analysis. Each question was validated through rigorous peer review to ensure accuracy and relevance, with a focus on documents published in 2024 (after most training cutoffs) to test true analytical capability rather than memorization.
The Finance Agent Benchmark isn't just another question-answering dataset. It's a comprehensive evaluation framework that tests whether AI agents can autonomously complete complex financial research tasks from start to finish—exactly as a human analyst would.
The benchmark organizes tasks into nine categories based on complexity and frequency in real investment-banking contexts. Representative examples include:
“What was the quarterly revenue of Salesforce (NYSE:CRM) for the quarter ended December 31, 2024?”
Tests ability to locate and extract specific numerical data from financial statements.
“Describe the product offerings and business model of Microsoft (NASDAQ:MSFT)?”
Tests comprehension and summarization of business descriptions.
“What is % of revenue derived from AWS in each year and the 3 year CAGR from 2021-2024 of Amazon?”
Requires extracting data and performing calculations like growth rates.
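The calculation behind a question like this is mechanical once the figures are extracted. A minimal sketch, using invented revenue figures rather than Amazon's actual reported numbers:

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate over `years` periods."""
    return (end_value / start_value) ** (1 / years) - 1

# Illustrative figures only -- NOT Amazon's actual reported numbers.
aws_revenue = {2021: 62.2, 2024: 107.6}      # $B, hypothetical
total_revenue = {2021: 469.8, 2024: 638.0}   # $B, hypothetical

aws_share = {y: aws_revenue[y] / total_revenue[y] for y in (2021, 2024)}
aws_cagr = cagr(aws_revenue[2021], aws_revenue[2024], years=3)
print(f"AWS share of revenue: {aws_share[2021]:.1%} -> {aws_share[2024]:.1%}")
print(f"AWS 3-year CAGR: {aws_cagr:.1%}")
```

The hard part for an AI agent is not this arithmetic but reliably pulling the right segment figures out of the filings in the first place.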
“How did Lam Research's revenue compare to management projections (at midpoint) on a quarterly basis in 2024?”
Tests ability to compare actual results against guidance across multiple quarters.
“How much M&A firepower does Amazon have as of FY2024 end including balance sheet cash, non-restricted cash and other short term investments, and up to 2x GAAP EBITDA leverage?”
Requires multi-step analysis combining balance sheet data with financial ratios.
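The "firepower" definition embedded in the question (cash plus non-restricted short-term investments plus debt capacity up to 2x GAAP EBITDA) can be sketched as a formula; the inputs below are hypothetical, not Amazon's actual FY2024 figures:

```python
def ma_firepower(cash: float, short_term_investments: float,
                 gaap_ebitda: float, existing_debt: float = 0.0,
                 max_leverage: float = 2.0) -> float:
    """Cash + non-restricted short-term investments + incremental debt
    capacity up to max_leverage x GAAP EBITDA (net of debt already drawn)."""
    debt_capacity = max(max_leverage * gaap_ebitda - existing_debt, 0.0)
    return cash + short_term_investments + debt_capacity

# Hypothetical inputs in $B:
print(f"${ma_firepower(cash=73.9, short_term_investments=12.2, gaap_ebitda=100.0):.1f}B")
```

Each input comes from a different statement (balance sheet cash, investments footnotes, income-statement EBITDA build), which is what makes this a multi-step retrieval task rather than a single lookup.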
“Compare the quarterly revenue growth of FAANG companies between 2022-2024.”
Tests ability to analyze multiple companies simultaneously and identify trends.
What makes Finance Agent Benchmark unique is its agentic evaluation framework. Unlike traditional benchmarks that provide models with pre-selected documents, this benchmark requires AI agents to autonomously navigate the research process using specialized tools:
- **EDGAR search:** direct access to the SEC's EDGAR database containing all public company filings (10-K, 10-Q, 8-K, S-1, etc.)
- **Google search:** general web search for supplementary information
- **HTML parsing:** extract and save content from HTML pages in a key-value database
- **Information retrieval:** retrieve stored document content from the database and manage context windows
This setup mirrors how human analysts actually work: they search for relevant filings, parse through lengthy documents, extract key information, perform calculations, and cross-reference multiple sources. The benchmark measures not just whether models can answer questions when handed the right data, but whether they can autonomously find, parse, and analyze that data themselves.
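The search-parse-store-retrieve cycle described above can be sketched as a simple loop. The tool names mirror the harness description; the implementations here are toy stand-ins, not the benchmark's actual code:

```python
# Key-value store for parsed page content, as the harness describes.
document_store: dict[str, str] = {}

def edgar_search(query: str) -> list[str]:
    """Stand-in: return candidate filing URLs for a query."""
    return [f"https://www.sec.gov/edgar?q={query}"]

def parse_html(url: str) -> str:
    """Stand-in: extract page text and cache it in the store."""
    text = f"<parsed contents of {url}>"
    document_store[url] = text
    return text

def retrieve(key: str) -> str:
    """Stand-in: pull previously stored content back into context."""
    return document_store[key]

def answer(question: str) -> str:
    # 1. find candidate filings, 2. parse them into the store,
    # 3. re-read stored text, 4. synthesize an answer (trivially here).
    urls = edgar_search(question)
    for url in urls:
        parse_html(url)
    context = " ".join(retrieve(u) for u in urls)
    return f"Answer based on: {context[:40]}..."
```

In the real harness each step is a model-chosen tool call, and the benchmark scores whether the chain ends in a correct, grounded answer.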
The benchmark employs a sophisticated LLM-as-judge evaluation system that addresses common pitfalls in automated grading. Rather than evaluating answers holistically (which often produces unreliable results), the system uses rubric-based assessment: each answer is scored against explicit, question-specific criteria, and the criterion-level results are aggregated into a final grade.
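As a hedged illustration of rubric-based grading (the criteria, weights, and keyword "judge" below are invented for this sketch, not the benchmark's actual rubric):

```python
def grade(answer: str, rubric: list[tuple[str, float]], judge) -> float:
    """Weighted share of rubric criteria the answer satisfies.
    `judge(criterion, answer) -> bool` stands in for the LLM-judge call."""
    total = sum(weight for _, weight in rubric)
    earned = sum(weight for criterion, weight in rubric
                 if judge(criterion, answer))
    return earned / total

rubric = [
    ("states the quarterly revenue figure", 0.5),
    ("cites the specific 10-Q filing", 0.3),
    ("uses the correct fiscal period", 0.2),
]

# Toy deterministic judge: a keyword check stands in for the LLM call.
def toy_judge(criterion: str, answer: str) -> bool:
    keyword = {"states the quarterly revenue figure": "$",
               "cites the specific 10-Q filing": "10-Q",
               "uses the correct fiscal period": "quarter ended"}[criterion]
    return keyword in answer

score = grade("Revenue was $9.99B for the quarter ended Jan 31, per the 10-Q.",
              rubric, toy_judge)
```

Decomposing the judgment this way makes each criterion independently checkable, which is exactly why rubric grading is more reliable than a single holistic score.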
When vals.ai evaluated 22 leading AI models on the Finance Agent Benchmark, the results revealed a sobering reality: even the most advanced general-purpose AI models struggle significantly with real-world financial analysis tasks. The best-performing general model, Anthropic's Claude Sonnet 4.5, achieved only 55.3% accuracy—barely better than a coin flip for professional financial work.
| Model | Accuracy | Cost/Query | Latency |
|---|---|---|---|
| Anthropic Claude Sonnet 4.5 | 55.3% | $1.41 | 167s |
| OpenAI o3 | 48.3% | $0.74 | 180s |
| OpenAI GPT-5 | 46.9% | $0.78 | 504s |
| xAI Grok 4 | 40.3% | $1.14 | 516s |
The benchmark revealed a clear logarithmic relationship between cost and accuracy, with diminishing returns beyond $1 per question. No general-purpose model surpassed 56% accuracy, highlighting fundamental limitations in how these systems handle specialized financial workflows. The challenges were particularly acute in complex tasks requiring multi-step reasoning, cross-document analysis, and precise numerical calculations.
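Using just the four (cost, accuracy) pairs from the table above, the shape of that relationship can be illustrated with a least-squares fit of accuracy against log cost. The benchmark's claim rests on all 22 models, so this is only a sketch of the method, not a reproduction of the finding:

```python
import math

# (cost per query in $, accuracy in %) for the four models tabulated above.
points = [(1.41, 55.3), (0.74, 48.3), (0.78, 46.9), (1.14, 40.3)]

xs = [math.log(cost) for cost, _ in points]
ys = [acc for _, acc in points]
n = len(points)
mx, my = sum(xs) / n, sum(ys) / n

# Slope and intercept of accuracy = a + b * ln(cost).
b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / sum((x - mx) ** 2 for x in xs))
a = my - b * mx

# Under a log fit, doubling spend from $1 to $2 buys only b*ln(2) points:
gain = b * math.log(2)
```

The diminishing-returns claim falls straight out of the functional form: each doubling of cost buys the same fixed `b * ln(2)` accuracy increment.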
Moreover, tool usage patterns revealed critical differences between top performers and struggling models. Higher-performing models like Claude Sonnet 4.5 made more exploratory tool calls (averaging 12.1 conversational turns) and demonstrated balanced usage across EDGAR Search, HTML parsing, and information retrieval. In contrast, lower-performing models either made too few tool calls (failing to explore thoroughly) or made numerous erroneous calls without adjusting strategy.
When Fintool was evaluated on the publicly available subset of the Finance Agent Benchmark (50 questions from the public.csv file), the results demonstrated what purpose-built financial AI can achieve: 90% accuracy—dramatically outperforming all general-purpose models.
These results aren't just incremental improvements—they represent a fundamental difference in capabilities between general-purpose AI and purpose-built financial intelligence systems. Fintool's 90% accuracy means it successfully completed 45 out of 50 expert-level financial research tasks, while Claude Sonnet 4.5 managed only 28 and o3 just 24.
Fintool's advantages extend well beyond accuracy. When compared to human analysts performing the same tasks, Fintool delivers:

Average time per question: under 40 seconds for Fintool, versus nearly 17 minutes for a junior analyst—roughly 25x faster.

Average cost per question: roughly 183x cheaper than an entry-level analyst performing the same work.
Note: Human analyst costs and times are calculated based on entry-level financial analyst compensation and exclude time for question creation, answer review, and other overhead. These represent conservative estimates of the actual time and cost savings.
These efficiency gains are transformative for financial teams. A task that would take a junior analyst nearly 17 minutes—pulling up filings, navigating to the right sections, extracting data, performing calculations—is completed by Fintool in under 40 seconds with higher accuracy. At scale, this means hundreds of hours saved per analyst per month, allowing teams to focus on higher-value strategic analysis rather than routine data extraction.
Fintool's 35-percentage-point accuracy advantage over the best general-purpose model isn't the result of slightly better algorithms—it stems from fundamental architectural differences in how Fintool approaches financial analysis. While general AI models rely on web search and generic document parsing, Fintool is built from the ground up for financial workflows.
General AI models access SEC filings through web search and generic HTML parsing—the same way a human might visit the SEC website. This approach introduces multiple failure points: search results may miss relevant filings, HTML parsing can lose table structures, and context windows struggle with lengthy documents.
Fintool maintains direct integration with the EDGAR database and proprietary parsers specifically designed for financial documents. The system understands SEC filing structures (10-K sections, 10-Q formats, 8-K event types), can navigate XBRL tagged data, and preserves the semantic structure of financial tables and statements. This means Fintool doesn't need to “figure out” where to find data—it already knows.
Beyond SEC filings, Fintool incorporates multiple verified financial data sources: structured databases of company fundamentals, historical price data, analyst consensus estimates, and corporate actions. This eliminates the need for web scraping and reduces the risk of encountering stale or incorrect information.
Fintool's underlying models have been specifically trained and fine-tuned on financial text, giving them deep understanding of financial terminology, accounting concepts, and industry conventions. The system knows that “EBITDA” means earnings before interest, taxes, depreciation, and amortization—and how to calculate it from GAAP financials. It understands that “beat or miss” requires comparing actual results to guidance, not just reporting numbers.
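The EBITDA build mentioned above is a standard add-back from GAAP income-statement lines; a minimal sketch with hypothetical figures:

```python
def ebitda(net_income: float, interest_expense: float,
           income_taxes: float, depreciation_amortization: float) -> float:
    """EBITDA rebuilt from GAAP lines: add interest, taxes, and D&A
    back to net income."""
    return net_income + interest_expense + income_taxes + depreciation_amortization

# Hypothetical figures in $M:
print(ebitda(net_income=500, interest_expense=40,
             income_taxes=120, depreciation_amortization=90))  # 750
```

The formula is trivial; the domain knowledge is knowing which filing line items feed each term (D&A, for instance, often lives in the cash flow statement rather than the income statement).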
Every number Fintool reports includes precise citations: the specific filing (10-K 2024), the section (Item 8: Financial Statements), and often the exact table or line item. This isn't just for transparency—it's essential for professional use. Investment decisions require verifiable sources, and Fintool's citation system is built for audit trails and compliance.
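A citation of that shape (filing, section, line item) maps naturally onto a small typed record. This schema is a hypothetical illustration, not Fintool's actual data model:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Citation:
    filing: str     # e.g. "10-K 2024"
    section: str    # e.g. "Item 8: Financial Statements"
    line_item: str  # table row or line item, when available

@dataclass(frozen=True)
class CitedValue:
    value: float
    unit: str
    source: Citation

rev = CitedValue(9.99, "USD billions",
                 Citation("10-Q 2024",
                          "Condensed Consolidated Statements of Operations",
                          "Total revenues"))
```

Making the record immutable (`frozen=True`) means a reported number can never drift away from the source it was extracted from, which is the property an audit trail needs.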
The Finance Agent Benchmark evaluation harness provides only basic tools: EDGAR search, Google search, HTML parsing, and information retrieval. Fintool's production system goes far beyond these basics with specialized tools for financial calculations, time-series analysis, peer comparisons, and multi-document synthesis. These tools aren't generic utilities—they're purpose-built for financial workflows, handling edge cases like fiscal year mismatches, segment reclassifications, and non-GAAP adjustments.
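One of those edge cases, fiscal-year mismatch, can be sketched simply: to compare companies with offset fiscal calendars, relabel each fiscal period by the calendar quarter it ends in. (The date below is Apple's reported fiscal Q1 FY2025 period end; the helper itself is an illustration, not Fintool's tooling.)

```python
from datetime import date

def calendar_quarter(period_end: date) -> str:
    """Label a fiscal period by the calendar quarter it ends in, so
    companies with offset fiscal years line up side by side."""
    return f"{period_end.year}Q{(period_end.month - 1) // 3 + 1}"

# Apple's fiscal Q1 FY2025 ended December 28, 2024:
print(calendar_quarter(date(2024, 12, 28)))  # 2024Q4
```

A general-purpose model that compares "Q1" figures across companies without this normalization will silently mix periods that are up to a quarter apart.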
The difference between 55% accuracy (Claude Sonnet 4.5) and 90% accuracy (Fintool) isn't just academic—it determines whether AI can be trusted for professional financial work. At 55% accuracy, analysts must verify every response, effectively doubling their workload. At 90% accuracy, AI becomes a reliable first draft that analysts can quickly review and build upon.
Consider a typical use case: an analyst needs to screen 50 companies for potential investment opportunities, extracting key metrics, growth trends, and competitive positioning for each. With a general AI model at 55% accuracy, roughly 22 of the 50 outputs can be expected to contain errors—and because the errors are unpredictable, the analyst must verify all 50.
With Fintool at 90% accuracy, only about 5 outputs would need correction, so a quick review of each response replaces full re-derivation of the underlying data.
This accuracy threshold—around 85-90%—represents the inflection point where AI transitions from “sometimes helpful” to “reliably valuable.” Below this threshold, verification overhead negates efficiency gains. Above it, AI becomes a genuine force multiplier for financial professionals.
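The arithmetic behind this threshold follows directly from the accuracy figures, assuming each task fails independently:

```python
def expected_errors(n_tasks: int, accuracy: float) -> float:
    """Expected number of flawed outputs if each task fails independently."""
    return n_tasks * (1 - accuracy)

print(expected_errors(50, 0.553))  # ~22 flawed outputs at general-model accuracy
print(expected_errors(50, 0.90))   # ~5 flawed outputs at Fintool's accuracy
```

The independence assumption is a simplification (errors often cluster by question type), but it captures why verification effort scales with the error rate rather than the task count.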
The Finance Agent Benchmark results tell a clear story: general-purpose AI, no matter how sophisticated, cannot match the accuracy and reliability of purpose-built financial intelligence systems. As foundation models continue to improve, the gap may narrow—but architectural advantages in domain-specific data access, specialized tools, and financial context will continue to matter.
For financial institutions evaluating AI solutions, these benchmark results provide critical guidance. The difference between 55% and 90% accuracy isn't just about better models—it's about fundamentally different approaches to financial analysis. Institutions that deploy general AI models will struggle with reliability and verification overhead. Those that adopt purpose-built solutions like Fintool can achieve genuine productivity gains while maintaining the accuracy standards professional finance demands.
Fintool's performance on the Finance Agent Benchmark—90% accuracy, 25x faster than analysts, 183x cheaper—demonstrates that AI is ready for professional financial workflows. Not general AI trained on internet text, but specialized AI built from the ground up for financial analysis, with direct data access, domain expertise, and audit-ready citations. This is the future of financial intelligence: AI that investment professionals can actually trust.