Decoding Corporate Conversations: Analyzing Q&A Text and 8K Filings
Understanding 8-K Filings: A Dive into Corporate Events
In the dynamic realm of finance, staying abreast of corporate events is paramount for investors aiming to make informed decisions. Each significant corporate occurrence triggers a flurry of disclosures that must be meticulously documented in what are known as 8-K filings. These filings, mandated by the US Securities and Exchange Commission (SEC), serve as a window into the inner workings of companies, offering crucial insights into pivotal moments such as acquisitions, changes in leadership, or financial distress.
Unlike annual 10-K filings, which provide a comprehensive overview of a company's financial health, 8-K filings are event-driven and far more frequent. From mergers and acquisitions to bankruptcy filings, these documents encapsulate pivotal moments in a company's journey, often accompanied by exhibits that further elucidate the disclosed information.
For my capstone project, I scraped 8-K filings from the SEC's Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system for later analysis.
I initially created a structured dataset comprising ticker symbols, Central Index Key (CIK) IDs, and company names sourced directly from the SEC. However, navigating the vast expanse of EDGAR posed its challenges. Scraping it within the restrictions of fair access necessitated prudent handling, which I addressed by throttling requests to 5 per second, within EDGAR's fair-access limits. After the filings were scraped, they were sorted by key attributes, including the filing date, accession number, and the filing's URL for further analysis.
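As a rough illustration, here is a minimal sketch of a throttled EDGAR request loop. The SEC asks clients to identify themselves with a descriptive User-Agent header; the endpoint shown is EDGAR's public company-submissions API, and the delay value is just one way to stay under the rate limit:

```python
import time
import requests

# Illustrative header: the SEC asks for a descriptive User-Agent
HEADERS = {"User-Agent": "Capstone Project your.email@example.com"}
REQUEST_DELAY = 0.2  # seconds between requests (~5 requests/second)

def fetch_filing_index(cik: str) -> dict:
    """Fetch the JSON filing index for one company from EDGAR."""
    url = f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    time.sleep(REQUEST_DELAY)  # throttle before the next call
    return response.json()
```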
Each file required preprocessing to reduce extraneous noise while preserving essential semantics. I combed through each filing and removed repetitive boilerplate phrases devoid of substantive information, such as "forward-looking statements" or "check the appropriate box." I then standardized the textual data by lowercasing it; removing punctuation, numbers, and stopwords; and lemmatizing the remaining words, transforming the text into a format amenable to computational analysis. As a separate preprocessing step, I kept only nouns for later analysis. Finally, all 8-K filing texts were concatenated per company.
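A minimal sketch of such a cleaning pipeline, assuming NLTK with its standard resources downloaded (the function name and the nouns-only flag are my own illustrative choices, not the project's exact code):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads, if not already present:
# nltk.download("stopwords"); nltk.download("wordnet")
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_filing(text: str, nouns_only: bool = False) -> str:
    """Lowercase, strip punctuation/numbers, drop stopwords, lemmatize."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # keep letters only
    tokens = nltk.word_tokenize(text)
    tokens = [t for t in tokens if t not in STOPWORDS]
    if nouns_only:  # separate preprocessing variant: keep nouns only
        tokens = [t for t, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)
```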
Next I used embeddings, mathematical representations that encapsulate the semantic essence of words in a high-dimensional numeric space. Through techniques like Term Frequency-Inverse Document Frequency (TF-IDF), I encoded the textual data into numerical vectors. TF-IDF scores each word by how often it appears in a document (term frequency), discounted by how common it is across all documents (inverse document frequency): a word that appears frequently in one filing but rarely elsewhere receives a high score, while words common to every filing are down-weighted.
Once the texts were vectorized, cosine similarity was calculated by comparing the embeddings generated from one company's 8-Ks to another's. Cosine similarity measures how closely related two things are based on the angle between their vectors in a multidimensional space. By measuring the cosine similarity between vectors representing different companies, we can compare companies and their 8-K filings via a similarity matrix.
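A short sketch of this vectorize-then-compare step using scikit-learn; the company texts here are placeholder strings standing in for the concatenated, preprocessed filings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical inputs: one concatenated, preprocessed 8-K text per company
company_texts = {
    "EL":   "fragrance skin care net sales quarter growth",
    "COTY": "fragrance cosmetics net revenue quarter growth",
    "CHTR": "broadband subscriber revenue quarter network",
    "TMUS": "wireless subscriber revenue quarter network",
}

tickers = list(company_texts)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(company_texts.values())

# Pairwise cosine similarity between company-level TF-IDF vectors
similarity_matrix = cosine_similarity(tfidf_matrix)
print(dict(zip(tickers, similarity_matrix[0])))  # EL vs. every company
```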
The similarity matrix shows that the ten selected companies' 8-K filings have low-to-moderate similarity overall. Cosmetics giants EL and COTY and communications giants CHTR and TMUS received some of the highest scores. Interestingly, COTY has the highest similarity scores overall, suggesting the language used in its filings is more similar to that of other companies' filings.
Extracting Key Questions from Press Releases Using Text Embeddings
Having scraped and compared 8-K filings, I utilized text embeddings and cosine similarity to extract and compare key questions from press releases. The goal is to automate the identification of question segments within press releases, enabling a deeper understanding of corporate communication dynamics. Question blocks with higher similarity scores suggest questions have been asked and answered previously.
I began by identifying question segments within the press releases:
- Find the start of a Q&A section
- Identify sentences that are questions or question-like
- Navigate through changes in speakers
To begin with, I needed to define what a question is: a sentence or phrase that seeks information through a reply. Typically the sentence contains "who," "what," "where," "when," or "why," and ends with a "?" (e.g., "How do you plan to solve this issue?"). However, some sentences that are not phrased as questions but may be described as "question-like" should also be included. A statement like "I'm curious about what you think of the new policy" is really a way of asking for the person's thoughts on the policy.
Traditionally, text analysis relied on time-consuming manual categorization and keyword-based searches. However, the advent of Artificial Intelligence (AI), particularly through machine learning and natural language processing (NLP) technologies, has revolutionized text analysis. Large Language Models (LLMs) like BERT and GPT-3 have enhanced capabilities for question-answering and content generation.
I used the "Instructor" instruction-based text embedding model, capable of generating embeddings tailored to various tasks and domains without additional fine-tuning. These embeddings capture the semantic properties of data and map them to vectors of real numbers while reducing most of the preprocessing needed for other large language models.
I created text embeddings with the instruction "identify the question segments in a Q&A" for the text in each document and used cosine similarity to compare them to the embeddings of known questions. An arbitrary threshold was set at 0.85. Raising the threshold reduced, but did not eliminate, false positives; lowering it flagged too many non-question statements; and setting it very high flagged no questions at all.
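A sketch of this thresholding step, reusing the `model` from the previous snippet; the known questions and transcript sentences below are placeholders of my own:

```python
from sklearn.metrics.pairwise import cosine_similarity

THRESHOLD = 0.85  # the arbitrary cutoff discussed above

instruction = "identify the question segments in a Q&A"
known_questions = [
    "How do you plan to solve this issue?",
    "What is your outlook for the next quarter?",
]
# Hypothetical transcript sentences to classify
sentences = [
    "Thanks for taking my question.",
    "Could you walk us through the margin guidance?",
]

known_emb = model.encode([[instruction, q] for q in known_questions])
sent_emb = model.encode([[instruction, s] for s in sentences])

# Flag a sentence as a question if it is close enough to any known question
scores = cosine_similarity(sent_emb, known_emb).max(axis=1)
flagged = [s for s, score in zip(sentences, scores) if score >= THRESHOLD]
```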
The threshold did not accurately flag questions until I readjusted my approach. I defined a simpler function that determines whether a sentence is a question by the presence of a question mark, and used changes in speaker, along with the operator's hand-offs, to determine the start of a new question block, as sketched below. This methodology works well only for financial earnings calls where there is an operator.
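A minimal sketch of that simpler rule and the speaker-based segmentation; the `(speaker, text)` tuple format is an assumption about how the transcript is structured:

```python
def is_question(sentence: str) -> bool:
    """The simpler rule: treat any sentence ending in '?' as a question."""
    return sentence.strip().endswith("?")

def split_question_blocks(turns):
    """Split an earnings-call transcript into question blocks.

    `turns` is a hypothetical list of (speaker, text) tuples; the operator's
    turns mark hand-offs, so a new block starts after each one.
    """
    blocks, current = [], []
    for speaker, text in turns:
        if speaker.lower() == "operator":
            if current:  # close out the block in progress
                blocks.append(current)
            current = []
        else:
            current.append((speaker, text))
    if current:
        blocks.append(current)
    return blocks
```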
After identifying question blocks, each segment was compared to the others: the average embedding of each block was computed, and the pairwise similarities were plotted in a similarity matrix. Darker colors indicate a lower similarity score, i.e., a more "unique" question.
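The averaging-and-comparison step might look like this; the random arrays stand in for per-block Instructor embeddings, and the 768 dimension is just illustrative:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical: block_embeddings[i] is an (n_sentences, dim) array of
# sentence embeddings for question block i
block_embeddings = [np.random.rand(3, 768), np.random.rand(5, 768)]

# Average each block's sentence embeddings into a single vector
block_means = np.vstack([emb.mean(axis=0) for emb in block_embeddings])

# Pairwise similarity between blocks; darker (lower) cells in the plotted
# matrix mark more "unique" questions
block_similarity = cosine_similarity(block_means)
```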
Some of the assumptions applied here should be noted. Although not always the case, questions are typically not repeated in a Q&A (unless a question went unanswered). Also, the answering speakers take up the bulk of the conversation, so question blocks featuring the same speaker may skew the cosine similarity higher.
Here, we see that question block 7 is most dissimilar to block 5, and the matrix generally darkens across the block indices, indicating that similar questions are not being repeated.
In another example, text from one earnings call was compared to a different earnings call from the same company during the COVID-19 pandemic. As this was a time when public uncertainty and fear dominated, certain themes were repeated, as indicated by the high-similarity (yellow) block segments, especially those pertaining to travel retail. EL block 4 and EL2 block 0 both focused on the company's response to the pandemic's impact on travel retail and the anticipated gradual recovery in the fiscal year.
Future work involves fine-tuning instructor embeddings, exploring other methodologies like zero-shot classification, and improving classification accuracy for unseen data categories.
In conclusion, extracting key questions from press releases represents a significant advancement in company analysis. By leveraging text embeddings and advanced techniques, large bodies of company text can be analyzed quickly and efficiently.
References
https://hobergphillips.tuck.dartmouth.edu/
https://www.sec.gov/edgar/searchedgar/companysearch
https://instructor-embedding.github.io/
Special Thanks
Cole Ingraham
George Ho