In this study, we evaluate how ChatGPT-4 with search capabilities and Fintool handle complex financial questions about public equity securities. The benchmark uses the FinanceBench top 100 questions, an industry-leading standard developed by AI researchers at Patronus and Stanford in collaboration with financial domain experts.
The FinanceBench dataset was built with input from 15 financial industry domain experts. It comprises high-quality questions and answers derived from public financial documents, including SEC filings (10-Ks, 10-Qs, 8-Ks), earnings reports, and earnings call transcripts. While the original benchmark includes document context, we modified the testing methodology to better reflect real-world usage by finance professionals: questions are presented to the AI assistants without any supporting documents, and each assistant must provide an accurate answer in a single interaction, much as professionals would use these tools in practice.
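The closed-book, single-interaction setup described above can be sketched as a small evaluation loop. This is a minimal illustration, not the harness used in the study: `BenchmarkItem`, `run_closed_book`, `ask`, and `is_correct` are hypothetical names, and the grading rule is left as a caller-supplied placeholder.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkItem:
    question: str
    reference_answer: str


def run_closed_book(
    items: list[BenchmarkItem],
    ask: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> float:
    """Ask each question once, with no documents attached, and return accuracy."""
    if not items:
        raise ValueError("benchmark is empty")
    correct = 0
    for item in items:
        # Single interaction: the assistant sees only the question text.
        answer = ask(item.question)
        # Grading is a placeholder supplied by the caller (assumption).
        if is_correct(answer, item.reference_answer):
            correct += 1
    return correct / len(items)
```

With 100 questions, an accuracy of 0.31 corresponds to 31 correct answers and 0.98 to 98.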
What drove operating margin change as of FY2022 for 3M?
Roughly how many times has AES Corporation sold its inventory in FY2022?
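The second sample question asks for an inventory turnover ratio. As a rough sketch, assuming the common textbook definition (cost of goods sold divided by average inventory), the calculation looks like the following. The benchmark's reference answer may use a different variant, such as year-end inventory, and the inputs below are made up for illustration; they are not AES figures.

```python
def inventory_turnover(
    cost_of_goods_sold: float,
    beginning_inventory: float,
    ending_inventory: float,
) -> float:
    """Return COGS divided by the average of beginning and ending inventory."""
    average_inventory = (beginning_inventory + ending_inventory) / 2
    if average_inventory <= 0:
        raise ValueError("average inventory must be positive")
    return cost_of_goods_sold / average_inventory


# Invented inputs for illustration only (not taken from any filing).
print(inventory_turnover(1_000.0, 90.0, 110.0))  # 10.0
```

Answering the real question correctly depends on pulling each input from the company's filed financial statements, which is where retrieval errors and hallucinated numbers do the damage.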
Our testing revealed that even when ChatGPT-4 successfully locates and searches through the correct company websites and financial documents, it frequently hallucinates numbers and provides incorrect answers. In our benchmark tests, we observed numerous instances where ChatGPT-4 would confidently state financial metrics that were completely inaccurate, despite having access to the correct source material. This issue is further compounded by ChatGPT-4's inability to provide proper citations or references for its answers. Without clear citations linking back to specific financial documents, pages, or sections, users have no way to verify the accuracy of the information provided or cross-reference the data with original sources.
The most significant issue is ChatGPT-4's reliance on retail investor websites and unreliable third-party sources. Instead of accessing authoritative SEC filings directly, it often takes shortcuts by pulling data from sites like SeekingAlpha, StockAnalysis.com, and other retail-focused platforms that can contain outdated or incorrect information. As a result, it retrieves wrong data and presents it to users with confidence. Even when it does find information, there is no built-in way to verify the sources or to confirm that the data came from official financial documents rather than unreliable web scrapes. This creates a dangerous situation in which users may make financial decisions based on unverified or incorrect information pulled from questionable sources.
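A basic source check of the kind missing here can be sketched by classifying each cited URL by its domain. This is an illustrative assumption, not a description of how either tool works: the function name and the domain lists are hypothetical, and a real allow-list would need to cover company investor-relations sites as well.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately short lists for illustration.
AUTHORITATIVE_DOMAINS = {"sec.gov"}
RETAIL_DOMAINS = {"seekingalpha.com", "stockanalysis.com"}


def _matches(host: str, domains: set[str]) -> bool:
    # Match the domain itself or any subdomain, e.g. "www.sec.gov".
    return any(host == d or host.endswith("." + d) for d in domains)


def classify_source(url: str) -> str:
    """Label a URL as 'primary', 'retail', or 'unknown' based on its host."""
    host = (urlparse(url).hostname or "").lower()
    if _matches(host, AUTHORITATIVE_DOMAINS):
        return "primary"
    if _matches(host, RETAIL_DOMAINS):
        return "retail"
    return "unknown"
```

A check like this only says where a number came from. It does not confirm that the number was copied correctly, so it complements citation-level verification rather than replacing it.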
In contrast to ChatGPT-4's limitations, Fintool demonstrates significant advantages by exclusively searching authoritative SEC filings, earnings call transcripts, and investor presentations. Every answer is traced back to the exact source - whether it's a specific line in a 10-K filing, a cell in a financial statement table, or a quote from an earnings call. When computing financial metrics, Fintool shows its step-by-step calculations and the precise data points used, providing complete auditability of its analysis.
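To make the idea of an auditable answer concrete, here is one way such a result could be represented. This is a hypothetical data structure written for illustration, not Fintool's actual schema or API; every class and field name is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Citation:
    document: str  # e.g. a filing type and fiscal year
    location: str  # e.g. a page, table cell, or transcript line
    quote: str     # the exact text or figure as it appears in the source


@dataclass
class CitedValue:
    label: str
    value: float
    citation: Citation


@dataclass
class AuditableAnswer:
    text: str
    inputs: list[CitedValue] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)  # human-readable calculation steps

    def is_auditable(self) -> bool:
        """True only if there is at least one input and every input quotes its source."""
        return bool(self.inputs) and all(
            v.citation.quote.strip() for v in self.inputs
        )
```

The design choice is that each number carries its own citation, so a reviewer can check every input against the filing before trusting the computed metric.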
The comparison image shows how ChatGPT-4 can appear confident and authoritative while providing completely incorrect numbers and faulty calculations. Without clear citations or a quick way to verify accuracy, even professional investors can be misled. In contrast, Fintool (shown on the right) provides full auditability by sourcing directly from SEC filings and transcripts, allowing users to verify every data point and calculation.
Our analysis of ChatGPT-4 with search and Fintool on the FinanceBench top 100 questions reveals a significant performance gap between the two platforms. ChatGPT-4 answered only 31% of the questions correctly, while Fintool reached 98% accuracy on the same test set.
The key differentiator lies in their fundamental approaches. ChatGPT-4's reliance on web searches and retail investor websites, while convenient, leads to significant accuracy issues through hallucinations and unreliable data sources. These limitations become particularly problematic when dealing with complex financial metrics, where precision and verifiability are crucial.
In contrast, Fintool's direct engagement with authoritative SEC filings and earnings reports provides a more robust and reliable solution for financial analysis. The ability to verify every data point against original source documents, combined with proper citations and comprehensive coverage, makes it a more suitable tool for professional financial analysis where accuracy and auditability are paramount.
This study underscores the fundamental limitations of using general-purpose AI models for specialized financial analysis. While ChatGPT-4 represents a remarkable achievement in general AI capabilities, the complexity and precision required in financial analysis demand purpose-built solutions that prioritize accuracy, verifiability, and direct access to authoritative sources.