In this comprehensive study, we evaluate the performance of three cutting-edge AI applications and models in the realm of financial analysis. Our benchmark includes GPT-4o, GPT-4o enhanced with OpenAI's latest search capabilities, and Fintool, a specialized AI equity research analyst designed to process SEC filings and earnings call transcripts. This comparison aims to shed light on the current state of AI in financial question-answering and analysis.
Source: FinanceBench, the benchmark for financial LLM questions
To conduct our benchmark, we utilize the FinanceBench top 100 questions, widely recognized as the industry-leading standard for assessing LLM performance on financial queries. This dataset, developed collaboratively by AI researchers at Patronus and Stanford, along with 15 financial industry domain experts, offers a high-quality collection of questions and answers derived from publicly available financial documents such as SEC filings (10-Ks, 10-Qs, 8-Ks), earnings reports, and earnings call transcripts. Note that we changed the methodology to bring it closer to how finance professionals actually work: we put each question to the LLM without providing any context or documents and expect a good answer in one shot.
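As a rough illustration of this zero-context, one-shot setup, here is a minimal Python sketch. It is not the harness we used: `run_zero_context_eval`, `ask_model`, and `contains_reference` are hypothetical names, the substring grader is an assumption made for the example, and the two toy questions are invented and are not from FinanceBench.

```python
from typing import Callable, Iterable


def run_zero_context_eval(
    questions: Iterable[tuple[str, str]],
    ask_model: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> float:
    """Ask each question once, with no documents attached, and return accuracy."""
    total = 0
    correct = 0
    for question, reference in questions:
        answer = ask_model(question)  # one attempt, question text only
        total += 1
        if is_correct(answer, reference):
            correct += 1
    return correct / total if total else 0.0


def contains_reference(answer: str, reference: str) -> bool:
    """Hypothetical grader: counts an answer as correct if it contains the reference."""
    return reference.strip().lower() in answer.lower()


# Toy data invented for illustration (not FinanceBench questions).
sample = [("What is 2 + 2?", "4"), ("What colour is the sky?", "blue")]


def stub_model(question: str) -> str:
    """Stand-in for a real model call (assumption for the example)."""
    return "The answer is 4." if "2 + 2" in question else "I am not sure."


accuracy = run_zero_context_eval(sample, stub_model, contains_reference)
print(f"accuracy: {accuracy:.0%}")  # prints "accuracy: 50%"
```

In practice, a substring check is too crude for financial answers, where units, rounding, and fiscal periods matter, so a human reviewer or a stricter numeric comparison would be a more reasonable grader.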
Here are some examples of questions from the FinanceBench dataset:
Getting a correct financial answer from an LLM is hard. LLMs aren't trained on domain-specific knowledge, which limits their ability to provide accurate financial answers. Additionally, models need up-to-date financial information to offer precise insights. Financial questions often involve numerical reasoning too, adding an extra layer of complexity for text models. Answering these questions requires models to handle both unstructured inputs, like qualitative questions in free-text form, and structured inputs, such as tabular financial metrics. The model also needs to parse long passages of text, which is more challenging than reasoning about short strings from a single source. For these reasons, an advanced model like GPT-4o cannot, on its own, get any of these financial questions right.
These factors collectively make providing accurate financial answers difficult for large language models. However, leveraging Retrieval-Augmented Generation (RAG) techniques can significantly enhance their performance. Using a longer context window to incorporate relevant evidence like SEC filings, models can draw on more extensive data, including historical trends, recent financial news, and detailed company reports. This not only improves the precision and relevance of their answers but also ensures coherence and consistency when processing extended passages of text, allowing them to reason more effectively about intricate financial scenarios.
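To make the RAG idea concrete, here is a toy Python sketch of the two steps involved: retrieve the passages most relevant to a question, then place them in the prompt as evidence. Everything in it is an assumption made for illustration. The word-overlap retriever stands in for a real search index, the function names `retrieve` and `build_prompt` are hypothetical, and "Example Corp" and its figures are invented. It does not describe how Fintool or ChatGPT retrieve documents.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank passages by how many words they share with the question."""
    question_tokens = tokenize(question)
    ranked = sorted(
        passages,
        key=lambda passage: len(question_tokens & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, evidence: list[str]) -> str:
    """Number the evidence passages and ask the model to cite them."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(evidence, start=1))
    return (
        "Answer using only the evidence below and cite passage numbers.\n\n"
        f"Evidence:\n{numbered}\n\nQuestion: {question}"
    )


# Invented passages for a fictitious company (illustration only).
passages = [
    "Example Corp reported total revenue of 120 for fiscal 2023.",
    "The company leases office space in three cities.",
    "Example Corp cost of goods sold was 70 for fiscal 2023.",
]
question = "What was Example Corp revenue in fiscal 2023?"

evidence = retrieve(question, passages, k=2)
prompt = build_prompt(question, evidence)
print(prompt)
```

Numbering the passages in the prompt is what lets an answer point back to a specific piece of evidence, which is the property the rest of this comparison focuses on.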
Feature | Fintool | ChatGPT-4o |
---|---|---|
Search Earnings call | ||
Search Investor conference | ||
Search SEC Filings | ||
Source all answers | ||
Source linked to a precise sentence | ||
Real-time financial data | ||
No Material, Non Public Information (MNPI) | ||
Suggests Followup questions | ||
Ask questions on thousands of companies | ||
ChatGPT-4o with access to the internet got 31% of answers right and 69% wrong. Its main weakness is that it doesn't know when to search the internet or which websites to consult. It managed to get some answers right by looking at company websites or specialized publications like SeekingAlpha or WallStreetMine.
The main drawback is that ChatGPT, while linking to a web page, doesn't indicate where on that page it found the information. Because right and wrong answers look exactly the same, it's impossible to check the accuracy of the data without opening the filings on the side as a source of truth.
In this example, ChatGPT-4o searched the right company website but retrieved the wrong number. It looks almost right but the answer is wrong!
Almost all of Fintool's mistakes happened when calculating numbers such as "$AMD quick ratio" or "$AMZN days payable outstanding." We are confident Fintool can reach 100% before the end of the year.
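For readers unfamiliar with these two metrics, the sketch below shows one common way to compute them in Python. The definitions are textbook ones, and variants exist: some analysts build the quick ratio from cash, marketable securities, and receivables, and some compute days payable outstanding from ending rather than average payables. The function names and input figures are invented for illustration and do not come from AMD's or Amazon's filings.

```python
def quick_ratio(
    current_assets: float, inventories: float, current_liabilities: float
) -> float:
    """Quick ratio = (current assets - inventories) / current liabilities."""
    return (current_assets - inventories) / current_liabilities


def days_payable_outstanding(
    payables_start: float,
    payables_end: float,
    cost_of_goods_sold: float,
    days: int = 365,
) -> float:
    """DPO = average accounts payable * days in period / cost of goods sold."""
    average_payables = (payables_start + payables_end) / 2
    return average_payables * days / cost_of_goods_sold


# Invented figures, in arbitrary currency units (illustration only).
qr = quick_ratio(current_assets=500, inventories=100, current_liabilities=200)
dpo = days_payable_outstanding(
    payables_start=90, payables_end=110, cost_of_goods_sold=730
)
print(f"quick ratio: {qr:.2f}")  # prints "quick ratio: 2.00"
print(f"days payable outstanding: {dpo:.1f}")  # prints "days payable outstanding: 50.0"
```

Each metric combines several line items, often from different statements, so a single wrong input or a different definition is enough to produce an answer that looks plausible but does not match the reference.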
The contrast with ChatGPT-4o is remarkable. Fintool not only retrieves accurate numbers but also explains the calculations in detail and provides precise citations from SEC filings. These citations are crucial for determining the reliability of the information presented.
The cherry on the cake is that Fintool cites the exact paragraphs and numbers in filings, making it the only solution in this comparison that does so. The difference in answer quality is also night and day: Fintool provides more context and creates tables that you can export as CSV.
The results from the FinanceBench evaluation reveal that Fintool significantly outperforms both ChatGPT-4o with RAG and GPT-4o without RAG. Fintool achieves a notably higher percentage of correct answers on the top 100 financial questions, demonstrating its superior accuracy and reliability in providing detailed, well-sourced, and context-rich responses. This clearly establishes Fintool as the industry leader in leveraging AI for financial analysis.