Benchmarking LLMs on Financial Questions

In this comprehensive study, we evaluate the performance of three cutting-edge AI applications and models in the realm of financial analysis. Our benchmark includes GPT-4o, GPT-4o enhanced with OpenAI's latest search capabilities, and Fintool, a specialized AI equity research analyst designed to process SEC filings and earnings call transcripts. This comparison aims to shed light on the current state of AI in financial question-answering and analysis.

Source: FinanceBench, the benchmark for financial LLM questions

Methodology: FinanceBench Top 100 Questions

To conduct our benchmark, we utilize the FinanceBench top 100 questions, widely recognized as the industry-leading standard for assessing LLM performance on financial queries. This dataset, developed collaboratively by AI researchers at Patronus and Stanford, along with 15 financial industry domain experts, offers a high-quality collection of questions and answers derived from publicly available financial documents such as SEC filings (10-Ks, 10-Qs, 8-Ks), earnings reports, and earnings call transcripts.

Here are some examples of questions from the FinanceBench dataset:

  • What is the FY2018 capital expenditure amount (in USD millions) for 3M?
  • Is 3M a capital-intensive business based on FY2022 data?
  • Does Adobe have an improving Free cash flow conversion as of FY2022?
  • What is Amazon's FY2017 days payable outstanding (DPO)?
  • What was the key agenda of the AMCOR's 8k filing dated 1st July 2022?
  • How much was the Real change in Sales for AMCOR in FY 2023 vs FY 2022, if we exclude the impact of FX movement, passthrough costs and one-off items?

GPT4o without search

GPT4o with search

Fintool

Challenges in AI-Powered Financial Analysis

Getting a correct financial answer from an LLM is hard. General-purpose LLMs aren't trained on domain-specific knowledge, which limits their ability to provide accurate financial answers. Models also need up-to-date financial information to offer precise insights. Financial questions often involve numerical reasoning, which adds an extra layer of complexity for text models. Answering them requires handling both unstructured inputs, like qualitative questions in free-text form, and structured inputs, such as tabular financial metrics. The model also has to parse long passages of text, which is more challenging than reasoning about short strings from a single source. For these reasons, an advanced model like GPT-4o, on its own, can't get any financial questions right.

These factors collectively make providing accurate financial answers difficult for large language models. However, leveraging Retrieval-Augmented Generation (RAG) techniques can significantly enhance their performance. Using a longer context window to incorporate relevant evidence like SEC filings, models can draw on more extensive data, including historical trends, recent financial news, and detailed company reports. This not only improves the precision and relevance of their answers but also ensures coherence and consistency when processing extended passages of text, allowing them to reason more effectively about intricate financial scenarios.
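As a rough illustration of the retrieval step described above, here is a minimal sketch that ranks passages by word overlap with the question and places the best ones in the prompt as evidence. It is not how Fintool or ChatGPT retrieves documents: the scoring rule, the function names, and the "ExampleCo" passages are all assumptions made for the example, and production systems typically rely on embeddings or a search index instead.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(question, passage):
    """Count question tokens that also occur in the passage."""
    q = Counter(tokenize(question))
    p = Counter(tokenize(passage))
    return sum(min(q[t], p[t]) for t in q)


def retrieve(question, passages, k=2):
    """Return the k passages with the highest overlap score."""
    ranked = sorted(passages, key=lambda p: overlap_score(question, p), reverse=True)
    return ranked[:k]


def build_prompt(question, passages, k=2):
    """Assemble a prompt that puts the retrieved evidence before the question."""
    evidence = retrieve(question, passages, k)
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(evidence))
    return (
        "Answer using only the evidence below and cite it by number.\n\n"
        f"Evidence:\n{numbered}\n\nQuestion: {question}"
    )


# Hypothetical passages; the company and figures are made up.
PASSAGES = [
    "ExampleCo capital expenditures were 100 million in fiscal 2018.",
    "The company opened a new office in Paris.",
    "Fiscal 2018 revenue grew due to pricing.",
]
QUESTION = "What were capital expenditures in fiscal 2018?"

if __name__ == "__main__":
    print(build_prompt(QUESTION, PASSAGES))
```

Numbering the evidence lets the model refer back to a specific passage, which is what makes an answer checkable afterwards.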

Feature | Fintool | ChatGPT-4o
Search Earnings call
Search Investor conference
Search SEC Filings
Source all answers
Source linked to a precise sentence
Real-time financial data
No material non-public information (MNPI)
Suggests follow-up questions
Ask questions on thousands of companies

Performance Analysis: ChatGPT-4o vs. Fintool

ChatGPT-4o with access to the internet got 31% of answers right and 69% wrong. Its weakness is that it doesn't reliably know when to search the internet or which websites to consult. It managed to get some answers right by looking at company websites or specialized publications like SeekingAlpha or WallStreetMine.
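The percentages above are plain proportions over graded answers. The sketch below shows one way numeric answers could be graded automatically and turned into an accuracy figure. The 1% relative tolerance and the function names are assumptions for illustration, not the grading procedure used in this benchmark, and qualitative answers would still need human review.

```python
def is_correct(predicted, reference, rel_tol=0.01):
    """Treat a numeric answer as correct when it is within rel_tol of the reference."""
    if reference == 0:
        return abs(predicted) <= rel_tol
    return abs(predicted - reference) / abs(reference) <= rel_tol


def accuracy(graded):
    """Share of correct answers; graded holds one boolean per question."""
    return sum(graded) / len(graded) if graded else 0.0


if __name__ == "__main__":
    # 31 correct answers out of 100 questions, as reported for ChatGPT-4o with search.
    graded = [True] * 31 + [False] * 69
    print(f"{accuracy(graded):.0%}")
```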

The main drawback is that ChatGPT, while linking to a web page, doesn't indicate where on that page it found the information. Because right and wrong answers look exactly the same, it's impossible to check the data's accuracy without opening the filings alongside as a source of truth.

In this example, ChatGPT-4o searched the right company website but retrieved the wrong number. It looks almost right but the answer is wrong!

ChatGPT-4o wrong answer example


ChatGPT-4o hallucination example

Almost all of Fintool's mistakes happened when calculating numbers such as "$AMD quick ratio" or "$AMZN days payable outstanding." We are confident Fintool can reach 100% before the end of the year.
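For reference, the two metrics named above are short formulas once the inputs have been extracted from the filings. The sketch below uses common textbook definitions, which is an assumption: analysts and datasets use variants (for example, ending rather than average payables, or an adjusted cost of goods sold), so these are not necessarily the definitions FinanceBench or Fintool applies. The input figures are made up.

```python
def quick_ratio(current_assets, inventories, current_liabilities):
    """Quick ratio: liquid current assets divided by current liabilities."""
    return (current_assets - inventories) / current_liabilities


def days_payable_outstanding(ap_start, ap_end, cogs, days=365):
    """DPO: average accounts payable relative to cost of goods sold, in days."""
    average_ap = (ap_start + ap_end) / 2
    return days * average_ap / cogs


if __name__ == "__main__":
    # Hypothetical inputs in USD millions, not taken from any filing.
    print(quick_ratio(current_assets=500, inventories=100, current_liabilities=200))
    print(days_payable_outstanding(ap_start=90, ap_end=110, cogs=730))
```

The arithmetic is trivial; most of the difficulty lies in pulling the right line items, for the right fiscal year, out of the filing.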

The contrast with ChatGPT-4o is remarkable. Fintool not only retrieves accurate numbers but also explains the calculations in detail and provides precise citations from SEC filings. These citations are crucial for determining the reliability of the information presented.
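A sentence-level citation makes a simple mechanical check possible: confirm that the quoted sentence really appears in the filing and that it contains the number given in the answer. The sketch below illustrates the idea; the function names and the sample filing text are hypothetical, and it does not describe Fintool's implementation.

```python
import re

NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def normalize(text):
    """Collapse whitespace and lowercase so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def citation_supports(answer_value, quote, filing_text):
    """True when the quote occurs in the filing and contains the answer's number.

    answer_value is a string of digits without thousands separators, e.g. "1234".
    """
    quote_in_filing = normalize(quote) in normalize(filing_text)
    numbers = {n.replace(",", "") for n in NUMBER_RE.findall(quote)}
    return quote_in_filing and answer_value in numbers


# Hypothetical filing excerpt; the figure is made up.
FILING = (
    "Investing activities.\n"
    "Purchases of property, plant and equipment were  $1,234 million in fiscal 2018."
)
QUOTE = "Purchases of property, plant and equipment were $1,234 million"

if __name__ == "__main__":
    print(citation_supports("1234", QUOTE, FILING))
```

An answer that only links to a web page cannot be checked this way, because there is no quoted span to look up.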

Fintool's answers are right

The cherry on the cake is that Fintool cites the exact paragraphs and numbers in the filings, making it the only solution in this comparison whose answers can be traced to a precise source. The quality of the answers is also night and day: Fintool provides more context and creates tables that you can export as CSV.

Fintool's answers are right

Summary

The results from the FinanceBench evaluation show that Fintool significantly outperforms both GPT-4o with search and GPT-4o without search. Fintool achieves a notably higher percentage of correct answers on the top 100 financial questions, demonstrating its superior accuracy and reliability in providing detailed, well-sourced, and context-rich responses. This clearly establishes Fintool as the industry leader in leveraging AI for financial analysis.