ML Topic Modeling with Transformers & LLMs
Investors and financial advisors spend thousands of hours poring over company reports to understand a company's potential vulnerabilities. Using machine learning algorithms to read and interpret reports for them could save most of that time and significantly increase productivity. What was merely a dream a couple of years ago is now becoming a reality.
Natural Language Processing (NLP) - a broad area of tech aimed at allowing computers to understand, interpret and respond to human language - has made major leaps forward in recent years. Specifically, the explosive improvement of Large Language Models (LLMs) like ChatGPT has brought NLP many steps closer to matching a human's ability to read and understand large corpuses of text. So I was very excited to work with an NYC-based fintech company focused on developing AI tools for financial professionals, using NLP technologies to identify and quantify commonly discussed risks in a large corpus of annual company reports.
Programmatically parsing, categorizing and quantifying topics in natural language is a difficult endeavor for many reasons. The challenge of working with language rather than numbers is that phrases, and even individual words, can have more than one meaning and format. That's why each NLP project requires a specialized set of technologies and methodologies.
Indeed, I tried a number of different ML models and methodologies to identify common risk factors in the reports. Traditional topic modeling algorithms that pull thematic topics from natural language text didn't produce great results. Ultimately, I was able to identify and quantify common risk factors discussed in company reports, but the only workable method relied heavily on OpenAI's top-of-the-line GPT-4 model, which was too costly for large-scale implementation.
The good news is that as AI models continue to rapidly advance, their cost will decline, and it will make financial sense to use my methodology. In fact, it may already be worth the cost to larger financial SaaS providers.
The Text Corpus
Publicly traded companies file annual reports called 10-Ks with the U.S. Securities and Exchange Commission (SEC). These reports provide a comprehensive overview of the company's financial performance, operations and risk factors. 10-Ks are written by lawyers, so they tend to be unnecessarily formal and verbose - the priority is avoiding liability (getting sued by shareholders) rather than clearly communicating an idea. I can't imagine any human enjoys reading them, which makes them an even better fit for NLP!
The drawback is that ML models trained on most available natural language datasets don't do as good a job interpreting "legalese" (the language of lawyers, which is still English but much more formal and technical) as they do regular English. Another drawback is that the text makes no mention of the severity of one risk relative to another, which makes quantifying the severity of a risk for a particular company much more difficult.
The formatting of the corpus was also not ideal for this particular project. It is standard practice to remove all formatting from text for NLP. In this case, however, retaining headings and paragraphs would have been helpful because headings tend to summarize the theme of a section and paragraphs demarcate units of meaning as they tend to focus on a single idea.
The fintech company provided a set of roughly 400 10-K reports from the last 3 years pulled from the SEC's EDGAR database. I focused solely on section 1A of the 10-Ks, which discusses significant risks to the company. The risk factors sections can run anywhere from 10 to 30 pages (4,000 - 15,000 words).
I attempted to redownload the corpus with all the original formatting intact, but quickly realized that cleaning the text was a huge endeavor in itself (which is why it was a previous student's capstone project).
As with most natural language, the risk factor sections of company 10-Ks didn't just list risks as bullet points. For example, one text might discuss cyber-security risks, while another might call the same risk technological security or data breaches - or, in legalese: "carrying out activity to disrupt our systems or gain access to confidential or sensitive information." Moreover, much of the text is devoted to the potential consequences of a risk, like share price declines - what lawyers call "material damages."
Methods Overview
Based on scholars' prior approaches and my own knowledge, risk topic modeling can be approached in two major ways: bottom-up and top-down. The bottom-up approach starts from the full corpus text and forms clusters of text with similar terms or meaning. This is the traditional approach for topic modeling algorithms. The top-down approach starts from a pre-made taxonomy of risk types and matches chunks of text to a particular risk type.
These methods were meant to identify risks, but the goal was to also be able to quantify them. As I mentioned, 10-Ks never discuss the severity of one risk vs another, so there is no way to quantify risks based on the language used to describe them. The only way to quantify a risk in a particular 10-K was some count of the number of times it was mentioned.
This led to another complication: defining what constitutes a single unit of discussing a risk. That raises a series of questions: If an entire paragraph discussed a risk, was that one unit of discussion, or was each sentence a unit? What about instances when a risk was briefly mentioned as part of a list of many risks? What about a sentence discussing the consequences of a risk that came after a sentence stating that risk? While I was able to do a decent job identifying risks, I was still far from quantifying them.
Models Overview
I used three basic types of models for risk identification: term clustering models, semantic clustering models, and LLMs. I'll explain these in detail below, but, in case you want to skip those sections, I'll summarize here as well.
Term clustering refers to models that find terms (single or multiple words) that frequently occur together in a document. Each list of frequently co-occurring terms comprises a topic. Semantic clustering follows the same general process as term clustering, but instead of looking at individual terms, semantic models use various methods to understand and represent the meaning of the text so they can cluster together pieces of text with similar meanings into topics.
LLMs come much closer to a human level of semantic understanding that includes broad context. While they also rely on semantic transformers like the models used for semantic clustering, they don't cluster. Instead, they respond to prompts like, "what major risk topics are in this chunk of text?" based on patterns learned during their training on enormous corpuses of text.
Latent Dirichlet Allocation (LDA)
LDA is a topic model meant to discover commonly discussed topics in a corpus of documents by looking for words that tend to co-occur in documents. It represents each document as a mixture of topics, but when documents are as long and varied as a full 10-K risk section, the inferred topics blur together. Running LDA on entire sections didn't work well to find the many distinct risks (topics) discussed in each 10-K.
To make LDA work for the goals of this project, I broke the corpus into much smaller "documents": chunks of 50, 100, and 200 words of text. I created a function to iterate through each chunk size, as well as numerous other LDA hyperparameters, to find the best performing model in terms of coherence and perplexity (metrics used to evaluate LDA). One-hundred-word documents worked far better than larger or smaller chunk sizes.
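As a rough illustration, a sweep like that could look something like the sketch below using gensim; `token_lists` (one token list per 1A section), the helper names, and the exact parameter grid are placeholders, not my original code.

```python
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel


def chunk_tokens(tokens, size):
    """Split one report's token list into fixed-size pseudo-documents."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def sweep_lda(token_lists, chunk_sizes=(50, 100, 200), topic_counts=(10, 20, 30)):
    """Fit LDA for each chunk size / topic count and record coherence and perplexity."""
    results = []
    for size in chunk_sizes:
        docs = [chunk for tokens in token_lists for chunk in chunk_tokens(tokens, size)]
        dictionary = corpora.Dictionary(docs)
        bow = [dictionary.doc2bow(doc) for doc in docs]
        for k in topic_counts:
            lda = LdaModel(bow, num_topics=k, id2word=dictionary, passes=5, random_state=42)
            coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                                       coherence="c_v").get_coherence()
            results.append({"chunk_size": size, "num_topics": k,
                            "coherence": coherence,
                            "log_perplexity": lda.log_perplexity(bow)})
    return results
```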
Topics from my best performing model were much more coherent than the results I got running LDA on full risk factor sections of 10-Ks. Here are a few examples:
The word clouds show the top 20 words most likely to occur in a document about each topic. Font size reflects the probability of each word appearing in a document about that topic.
At least some of the word clouds immediately represented risk topics like credit/liquidity risks or competitor risks. But the models were most successful with only 20 risk topics or categories, meaning the risks were quite broad, which is not particularly useful for financial analysts. I looked at model results with a larger number of topics, but the risks were not as clear there, so they wouldn't work well either.
BERTopic
Google's Bidirectional Encoder Representations from Transformers (BERT) model is an AI model designed to understand the semantic meaning of natural language. (BERT was trained to predict hidden words in text, so it considers groups of words together.) This allows BERT to embed or encode text into numerical representations that capture its meaning, including some degree of context. BERTopic uses BERT to group text chunks together into topics by mathematically comparing their numerical representations.
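For readers unfamiliar with the library, a minimal BERTopic setup looks roughly like this; the sentence-transformer name and the input list of text chunks are placeholders rather than my exact configuration.

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer


def fit_risk_topics(text_chunks, n_topics=30):
    """Embed text chunks with a sentence encoder and cluster them into topics."""
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder
    topic_model = BERTopic(embedding_model=embedding_model, nr_topics=n_topics)
    topics, _ = topic_model.fit_transform(text_chunks)
    return topic_model, topics


# Example usage:
# topic_model, topics = fit_risk_topics(text_chunks)
# print(topic_model.get_topic_info().head(10))  # size and top terms of each topic
```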
By default, BERTopic assigns each document to a single topic by clustering its embedding. The 1As are so large and discuss such a diverse set of topics that a single embedding of a full section blended many themes together, which led to incoherent results.
So, just as I did for LDA, I ran BERTopic with 100-word text chunks as the "document" unit and got much better results. BERTopic had the best results (according to various metrics) with 30 topics, which again meant the risks were quite broad and didn't cover the full array of risks actually discussed in the 10-Ks. Here are some example word clusters representing the topics:
Large Language Models
LLMs have a much more sophisticated "understanding" of text than word or semantic clustering models like LDA and BERTopic. One way to think of this is that they don't weight each word or each word's semantic meaning equally like the simpler topic modeling algorithms. Instead, they have a much deeper understanding of context, so they can do things like distinguish between text discussing a risk and text discussing the causes or consequences of that risk.
The basic process for LLMs to identify risks was to cut the corpus into smaller chunks of text, prompt the LLM through API calls (or locally) with a handful of text chunks and directions, and get back a list of risks discussed in the text. It seems quite simple: just give the model a few chunks of text and ask it to identify the risks. Unfortunately, it isn't.
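In its simplest form, a single prompt/response round through the OpenAI API might look like the sketch below (the model name, prompt wording, and inputs are illustrative only); the rest of this section is about why this naive version isn't enough.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def identify_risks(text_chunks, model="gpt-4"):
    """Ask the model to name the risks discussed in a handful of 1A text chunks."""
    prompt = (
        "Identify the major risk topics discussed in the 10-K excerpts below. "
        "Respond with a comma-separated list of risk categories only.\n\n"
        + "\n\n".join(text_chunks)
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # minimize randomness in the response
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```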
First of all, LLMs have limits on the size of prompts they allow or can make sense of. In this case, the prompt had to include the 1A text for the model to parse, directions on how to identify risks (e.g., by a broad category and more narrow sub-category) and directions on how to format responses to allow them to be easily integrated into the code. It also seemed to help to give the LLMs some risk categories to start with and examples of correct response formats, which further added to prompt size.
Second, LLMs are not great at following directions. This is due in part to their design, which deliberately incorporates some randomness to mimic human response patterns, and in part to their limited attention per prompt. Consequently, the more complex the query or "prompt" you give the LLM, the less likely it is to follow your directions. In my case, the models often returned incorrectly formatted responses - like full sentences instead of just comma-separated lists of the risk categories they found - that were difficult to parse.
Third, LLMs are also not quite up to the level of a human in understanding text. That leads to errors in identifying risks and even to completely overlooking parts of the text.
Iterative Risk Identification Method
Finding the correct ways to divide the corpus into chunks to prompt the LLMs was largely a process of trial and error. I was able to identify the limitations listed above fairly quickly and work around them by using the OpenAI Playground - a UI for querying various OpenAI models. (I also tried Mistral 7B, an open-source LLM built by Mistral AI, but the results were not good.)
"Prompt Engineering" - best practices for wording LLM prompts to get good results - has become a buzzword in computer science of late, for good reason. The wording of prompts was key to getting the models to accurately understand directions and format responses.
But with more complex tasks and large corpuses of text, getting LLMs to work well is not just about a single prompt but rather about setting up a larger algorithm that strings together multiple prompts and integrates the LLM's responses in an iterative process. A common approach to these sorts of multi-prompt LLM algorithms is known as "chain of thought": breaking down a larger problem into smaller logical steps, much like a human would.
My first attempt at a full methodology for getting LLMs to accurately identify risk factors utilized "chain of thought," particularly a process of iterative refinement where the model's responses are fed back to the model in future prompts. I called this method Iterative Risk Identification because it involved multiple levels of iteration.
Note that the goal of this method was simply to find a full taxonomy of every type of risk discussed in the corpus. This process did not include trying to quantify the risks by counting how many times they appeared in the corpus.
The Iterative Risk Identification method began with a risk-identification loop that included two major steps. First, I prompted the model with a new set of section 1A text chunks, a set of broad risk factor categories and a list of risk categories it had identified so far. This prompt included directions to find risks in the text and categorize them using one of the given broad risk factors and a narrower sub-category of that broad risk factor, as well as how to return responses (with examples from the list).
"Here's a starting list of major risk categories, sub-categories and descriptions for risks already identified in the corpus. Take these text chunks and identify the NEW risks they discuss by adding to the given list. Respond in exactly the same format as the entries in the current list."
Unfortunately, no matter how I adjusted hyperparameters to reduce randomness in the LLMs' responses, they were not able to stick exactly to the required formatting. So I had to include a parsing function to take in their responses and correctly format them before adding them to the running list of identified risk factors.
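Putting the loop, the parsing step, and the periodic consolidation described in the next paragraph together, the skeleton looked something like this; `ask_llm` stands in for an API call like the one sketched earlier, and the pipe-delimited response format and list-length cap are illustrative choices, not necessarily the ones I used.

```python
def parse_risks(raw_response):
    """Keep only lines that follow the 'category | sub-category | description' format."""
    return [line.strip() for line in raw_response.splitlines() if line.count("|") == 2]


def consolidate(risk_list):
    """Ask the model to merge near-duplicate risks that are worded slightly differently."""
    prompt = ("Merge duplicate entries in this risk list, keeping the "
              "'category | sub-category | description' format:\n" + "\n".join(risk_list))
    return parse_risks(ask_llm(prompt))


def iterative_risk_identification(chunk_batches, broad_categories, max_list_len=100):
    """Run the risk-identification loop over batches of 1A text chunks."""
    running_list = []
    for batch in chunk_batches:
        prompt = (
            "Broad risk categories: " + "; ".join(broad_categories) + "\n"
            "Risks identified so far:\n" + "\n".join(running_list) + "\n\n"
            "Identify any NEW risks in the text below. Respond with one risk per line "
            "as 'broad category | sub-category | short description'.\n\n"
            + "\n\n".join(batch)
        )
        new_risks = parse_risks(ask_llm(prompt))
        running_list.extend(r for r in new_risks if r not in running_list)
        if len(running_list) > max_list_len:  # keep the prompt under the token limit
            running_list = consolidate(running_list)
    return running_list
```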
As the lists of identified risks grew, so did the total size of the prompt. After about 10 rounds of this loop - perhaps a quarter of the average risk factors section of a single 10-K - the prompts got too big for GPT-4's token limit (tokens are roughly word-sized pieces of text) or for GPT-3.5 to pay enough attention to respond accurately. So I created a consolidation function that prompted GPT-4 with the final list of risks from the loop, asking it to merge repeated risks that often had slightly different wording, and then added the consolidated loop-list to a final list of risks that held all risks identified across all iterations of the loop.
The loop process ran a few thousand times to get through the entire corpus and populate what I called the final loop list. But this final loop list still had duplicate risks because the same risks were identified in different loops. I had to consolidate the final list. But then I faced a second problem: the final list of risks was too large to include in a single prompt to GPT-4 to consolidate.
This is why I included the pre-set group of major risk categories for each prompt in the original loop. After all the loops ran, I could programmatically divide the final list of risks by major risk category to give GPT-4 just the risks under one major category at a time to consolidate. (Of course, this assumed that the model hadn't mistakenly put the same risk under multiple major categories.)
I ran an initial few tests of the Iterative Risk Identification process with a subset of the corpus and found some major problems including:
- Handling Lists: the LLMs could not identify all the risks in sentences that listed many types of risks. This was likely because the models didn't pay enough attention to list sentences that were densely packed with many discrete subjects as compared to more narrative sentences. Lists also tend to be more challenging for LLMs to parse because they require the model to recognize and individually process each item without much context.
- Subject vs Object: the LLMs often misclassified sentences by mistaking the consequences of a risk (object) for the risk itself (subject). This was also likely due to attention limits. The models would do a better job understanding the risk vs its consequences if just one or two sentences were given, though it would not eliminate mistakes altogether even for GPT-3.5 models.
- Randomness: even with the model hyperparameters set to make the models as non-random as possible, the LLMs still sometimes responded with risks in incorrect formats, like full sentences (e.g., "this paragraph discusses the risk of …"), which were impossible to parse programmatically into the proper format to add them to the running list of risks.
- Consolidation Accuracy: despite lots of work on the consolidation prompt and testing various hyperparameters, even GPT-4 failed to properly merge similar risks that were just worded differently in the lists.
Iterative Taxonomy Method
To fix issues with randomness (different wording for the same risk) and consolidation, I next tried a method that gave the LLMs a full, multi-level taxonomy of risk factors. I used Cambridge Business School's Taxonomy of Business Risk, a three-level taxonomy of risks ranging from broad risk classes to narrow risk types.
This was the only thorough taxonomy of risks businesses face that I could find. But the taxonomy itself, which included risk names and definitions (a few sentences) for roughly 200 risk types, was too large to pass to an LLM in a single prompt. I had to design a logical, automated process to give models the taxonomy, text chunks, and directions.
I could have simply removed the definitions of each risk class, family, and type and significantly decreased the prompt sizes. But it was clear that the more room for interpretation I allowed the models, the less accurate they would be. That's why I opted to feed the models the taxonomy in a process of iterative refinement where they would first identify risk classes in a group of text chunks, then identify risk families in the same text chunks and, finally, identify risk types.
The iterative taxonomy process went as follows (a rough code sketch appears after the list):
- Give the LLM a chunk of text, directions, and top-level risk classes with their definitions and ask it to respond with the risk classes discussed in the text chunks.
- Parse the model's response and pull out only the mid-level risk families that fell under the risk classes it had identified.
- Query the model again with the same text chunk, but this time with the risk families and their definitions and directions to respond with the risk families discussed in the text.
- Parse the response and pull out only the low-level risk types under the families the model had chosen.
- Query the model again with the same text chunk, but this time with the risk types and their definitions and directions to find the risk types discussed in the text.
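Here is a simplified sketch of that three-step narrowing, assuming the taxonomy is loaded as nested dictionaries and reusing the hypothetical `ask_llm` wrapper from earlier; the helper names and data layout are placeholders, not the exact code I used.

```python
def parse_names(raw_response):
    """Split a comma-separated response into clean names."""
    return [name.strip() for name in raw_response.split(",") if name.strip()]


def level_prompt(text_chunk, options, level_name):
    """Build a prompt listing the taxonomy entries (with definitions) at one level."""
    lines = [f"{name}: {definition}" for name, definition in options.items()]
    return (f"Which of these {level_name} are discussed in the text below? "
            "Respond with a comma-separated list of names only.\n\n"
            + "\n".join(lines) + "\n\nTEXT:\n" + text_chunk)


def identify_risk_types(text_chunk, taxonomy):
    """Narrow from risk classes to families to types for one chunk of 1A text.

    Assumes taxonomy = {class: {"definition": str,
                                "families": {family: {"definition": str,
                                                      "types": {type: definition}}}}}.
    """
    # Level 1: broad risk classes.
    class_defs = {c: v["definition"] for c, v in taxonomy.items()}
    classes = parse_names(ask_llm(level_prompt(text_chunk, class_defs, "risk classes")))

    # Level 2: only offer the families under the chosen classes.
    family_defs = {f: fv["definition"] for c in classes if c in taxonomy
                   for f, fv in taxonomy[c]["families"].items()}
    families = parse_names(ask_llm(level_prompt(text_chunk, family_defs, "risk families")))

    # Level 3: only offer the types under the chosen families.
    type_defs = {t: d for c in classes if c in taxonomy
                 for f, fv in taxonomy[c]["families"].items() if f in families
                 for t, d in fv["types"].items()}
    return parse_names(ask_llm(level_prompt(text_chunk, type_defs, "risk types")))
```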
This method worked fairly well with GPT-4 - well enough that it could have been further refined and used as an end product for financial analysts. The model was able to accurately divide the text into parts that discussed discrete risk types, though it did skip over some sentences.
Unfortunately, this method was quite expensive to use on the entire corpus since it required passing the same text to an expensive model (GPT-4) 3 times over. Model costs, however, are likely to fall in the future, so this may soon be a viable option.
Keyword List Method
My last attempt was to lower costs by trying to utilize free models like BERT alongside more expensive models like GPT-4. In short, I created lists of terms (keywords) that would likely be used in discussing particular risks and then used BERT to score each sentence in the corpus on how similar it was to each keyword list.
As with the other methods, the actual process for a computer to execute this simple idea was quite complex. Since there were nearly 200 risk types in the Cambridge taxonomy, I first created a function to query GPT-4 with each risk type and its definition and have the model return a list of words and phrases it thought would be used in 10-Ks to discuss that risk.
Next I tried a variety of transformer models to embed the keyword lists and every individual sentence in the corpus. I tried transformers specifically trained on financial texts, as well as other general models, but the basic BERT transformer seemed to work best.
Then I used BERT to calculate the semantic similarity between each sentence in the corpus and each keyword list and organized all of this information into a dataframe for later use.
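Roughly, that scoring step could be done with sentence-transformers as sketched below; the specific encoder name and the input variables are placeholders.

```python
import pandas as pd
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("bert-base-nli-mean-tokens")  # placeholder BERT-based encoder


def score_sentences(sentences, keyword_lists):
    """Return a DataFrame of cosine similarities: one row per sentence, one column per risk."""
    risk_names = list(keyword_lists)
    # Represent each risk's keyword list as a single string and embed it once.
    risk_embeddings = encoder.encode([", ".join(keyword_lists[r]) for r in risk_names],
                                     convert_to_tensor=True)
    sentence_embeddings = encoder.encode(sentences, convert_to_tensor=True)

    similarities = util.cos_sim(sentence_embeddings, risk_embeddings)  # (n_sentences, n_risks)
    return pd.DataFrame(similarities.cpu().numpy(), index=sentences, columns=risk_names)
```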
After this initial risk identification, I attempted to improve the keyword lists by giving GPT-4 real examples of sentences from the corpus with the highest semantic similarity to each keyword list and asking the model to use those examples to improve each list. Then I reran the semantic similarity process with the improved risk keyword lists.
The Cambridge taxonomy was very granular, and the 10-Ks did not always discuss risks in such a specific way. To address that problem, I came up with a process to merge risks that were discussed together (or as one) in the reports. The Cambridge taxonomy, for example, differentiated between conventional war (between equally strong nations) and asymmetric war (between a strong and weak nation), while the 10-Ks just discussed the risk of war in general.
The risk-merging process involved finding risk keyword lists whose semantic similarity scores across sentences in the corpus were highly correlated. I merged these highly correlated lists using the regular GPT-4 user interface so I could provide input and help the model create a final set of human-approved risk keyword lists.
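Finding the merge candidates only takes a correlation pass over the score dataframe from the previous sketch; the 0.9 cutoff here is just an example.

```python
def correlated_risk_pairs(score_df, threshold=0.9):
    """List pairs of risk keyword lists whose sentence-similarity profiles move together."""
    corr = score_df.corr()  # correlation between risk columns across all sentences
    pairs = []
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if corr.loc[a, b] >= threshold:
                pairs.append((a, b, round(float(corr.loc[a, b]), 3)))
    return sorted(pairs, key=lambda p: -p[2])
```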
I repeated the semantic scoring process with the finalized risk keyword lists (part 4 of the flowchart, which I'll leave out because it repeats processes I've already shown). The end result was a dataframe with each sentence in the corpus scored for semantic similarity against each risk keyword list.
In theory, I now had something akin to a numerical likelihood that each sentence discussed each particular risk. For example, the sentence "We may be vulnerable to cyber-attacks from phishing" would have a high semantic similarity score for the cyber-attacks risk keyword list, a low semantic similarity score for the corporate governance keyword list, and so on. But there was no predefined threshold semantic similarity score above which I could be sure a sentence discussed a particular risk.
In order to quantify risks by counting the number of sentences discussing each risk, I needed a way to confidently identify which risk(s) each sentence discussed. In other words, for each risk type, I had to find the threshold similarity score above which I was confident the sentences did in fact discuss that risk.
I'd envisioned a sort of binary search process to find the threshold similarity scores, where I would give an LLM a few example sentences at a particular score and ask it to tell me how many of the sentences actually discussed the particular risk. If all the sentences discussed that risk, then I would move down in semantic similarity score by a preset amount to see if lower-scoring sentences also definitely discussed the risk. If the sentences did not definitely discuss the risk, I'd move up to try higher-scoring sentences. This process would quickly home in on a threshold score.
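As a sketch, that binary search over similarity scores might have looked like this; `llm_says_all_discuss_risk` is a placeholder for the LLM (or human) check of a few sampled sentences, and the bounds and tolerance are illustrative.

```python
def find_threshold(score_df, risk, lo=0.3, hi=0.9, tol=0.02, n_samples=5):
    """Binary-search for the lowest similarity score at which sampled sentences
    still clearly discuss the given risk."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # Sample a few sentences scoring right around the midpoint.
        band = score_df[(score_df[risk] >= mid) & (score_df[risk] < mid + tol)]
        samples = list(band.index[:n_samples])
        if samples and llm_says_all_discuss_risk(samples, risk):
            hi = mid  # sentences at this score are still on-topic: try lower
        else:
            lo = mid  # off-topic (or no examples): the threshold must be higher
    return hi
```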
I had to alter the process and involve some of my own oversight to make it work, but the gist was the same. Sadly, it turned out that there was no clear threshold score for many of the risk keyword lists.
I found that there were many false positives - sentences that did not discuss a risk but had a very high semantic similarity score to that risk's keyword list. This happened for the same reason BERTopic and LDA failed in earlier attempts: the model (BERT) couldn't tell the difference between text that discussed risks and text that didn't, either because it couldn't distinguish subject from object or because it wasn't intelligent enough to understand the meaning of the text in context.
If the binary search process to identify threshold similarity scores had worked, I had a plan to go beyond simply counting the above-threshold sentences for each 10-K because discussion of a single risk often comprised multiple sentences. The idea was to calculate semantic similarity for an expanding window of sentences surrounding the initial above-threshold sentence. My hope was that shifts in the semantic similarity scores would show demarcations between groups of sentences that discussed different risks.
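A sketch of that expanding-window idea, reusing the encoder from the scoring step; the drop-off rule and window cap are placeholders and are exactly the parts that would need refinement.

```python
def expand_risk_span(sentences, seed_idx, risk_keywords, drop=0.15, max_extra=5):
    """Grow a window around a high-scoring seed sentence until the window's similarity
    to the risk keyword list falls off, suggesting a boundary between risks."""
    risk_emb = encoder.encode(", ".join(risk_keywords), convert_to_tensor=True)

    def window_score(start, end):
        window_emb = encoder.encode(" ".join(sentences[start:end]), convert_to_tensor=True)
        return float(util.cos_sim(window_emb, risk_emb))

    start, end = seed_idx, seed_idx + 1
    best = window_score(start, end)
    for _ in range(max_extra):
        grown = (max(start - 1, 0), min(end + 1, len(sentences)))
        new = window_score(*grown)
        if new < best - drop:  # a sharp drop suggests the window crossed into a new risk
            break
        start, end = grown
        best = max(best, new)
    return start, end
```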
This was also probably wishful thinking. In practice, this methodology would require extensive refinement to work. But some way of trying to identify groups of sentences discussing a single risk is important for accurately quantifying the risks discussed in 10-Ks, regardless of the method used to identify them. Judging from my experience thus far, this would most likely have to be built into the iterative taxonomy method with an additional query asking GPT-4 to consider multiple contiguous sentences discussing a single risk.
Future Work
Besides getting the risk quantification part of the process working well, there are many other efforts that could improve model outcomes for this particular project.
Given more time, the first thing I'd want to do is re-extract the section 1A text of the 10-K reports, keeping section headers and paragraph markers intact. Highlighting section headers would help LLMs more easily understand the context of each text chunk. Paragraphs are the ideal way to divide up the text since they tend to discuss a single theme or at least a coherent group of risks.
The next most helpful effort would be to improve the risk taxonomy. While the Cambridge taxonomy was adequate, it was also overkill. A more succinct taxonomy focused only on risks discussed in 10-Ks or corporate financial reports would ease the burden of cost and be less taxing on LLMsβ limited attention.
I'd also be interested to try new models, including LLMs like Gemini or Llama 3 that have much larger token limits than GPT-4. Grammar models that can identify the subject of a sentence and transformer models that paraphrase text might also be helpful.
Lastly, there are many ways to tweak these methods that are worth a shot. It may help to weight certain terms in keyword lists, e.g. terms that point directly to the risk vs. terms that are more about the consequences of the risk. Semantic similarity scoring might also be a useful tool for saving costs when consolidating similar risks.