Automated Data Extraction and Transformation Using Python, OpenAI, and AWS
Introduction
In this project, we scrape information from assorted websites, store the raw data, use OpenAI to clean and process it, and ultimately store the final result.
Tools Used
- GitHub: For version control and collaboration.
- AWS: Utilizing services like Lambda, CodePipeline, CodeBuild, CloudFormation, S3, Step Functions, EventBridge Scheduler, and Simple Notification Service for deployment and orchestration.
- OpenAI: For advanced data processing and cleaning.
- Python Packages: Including BeautifulSoup4, PyPDF, Pandas, and PDFTables for data extraction and manipulation.
Idea
Initially, when I embarked on the project, I wasn't using OpenAI at all. I was manually writing code for each website and deciding what to extract. This worked for the initial stages, but it wasn't future-proof: a minor change to a PDF or webpage could render the entire extraction code obsolete, forcing me to inspect the source and rewrite the code by hand.
Having previously worked with OpenAI's API on other projects, I knew that feeding it a plethora of data to clean, filter, and subsequently return wouldn't be a significant hassle.
Check out the YouTube Summarization Chatbot I made here.
Automating Data Extraction And Transformation
Before I dive deep into the project, here is a high-level overview of what I intend to do.

The process described above took only a few seconds and a few cents to accomplish. If I had tried to write code to extract all the necessary information, it would have taken hours of my time, and every new PDF with new data would have required another rewrite. Being able to hand raw extracted data to OpenAI and receive a CSV containing all the needed information in return is truly a blessing. The ability to loop this process to generate three CSVs (a topic I will discuss later) and ask OpenAI to select the one that best meets the prompt makes the approach even more robust.
The idea is simple: we extract information from a PDF or webpage, perform optional data filtering, check for any change in the data, send it to OpenAI, and get back a CSV.

Mermaid Syntax
graph TD;
style A fill:#ffffba,stroke:#333
style D fill:#bae1ff,stroke:#333
style E fill:#ffdfba,stroke:#333
style G fill:#bae1ff,stroke:#333
style H fill:#ffb3ba,stroke:#333
style I fill:#bae1ff,stroke:#333
style J fill:#bae1ff,stroke:#333
style K fill:#baffc9,stroke:#333
A([PDF or Webpage]) --> D[Extract Data]
D --> |Optional Data Filtering| E{Check if there is previously\nextracted data in AWS S3}
E -->|No| I
E -->|Yes| G[Compare Extracted Data with Previously Extracted Data]
G -->|Identical?| H[End]
G -->|Different?| I
I[Send to OpenAI for cleaning and creation of 3 CSV files]
I --> J[Utilize OpenAI to determine the optimal CSV among the 3]
J --> K[(Store the selected\nCSV in AWS S3)]
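The change-detection step shown in the flowchart can be sketched roughly as follows, assuming the previously extracted raw text is stored as a plain-text object in S3 (the object key here is hypothetical; the bucket name comes from the event config discussed later):

import boto3
from botocore.exceptions import ClientError

def has_changed(bucket, key, new_text):
    """Return True if the newly extracted text differs from what is stored in S3."""
    s3 = boto3.client('s3')
    try:
        previous = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
    except ClientError as e:
        if e.response['Error']['Code'] == 'NoSuchKey':
            return True  # no previous extraction exists, so treat the data as changed
        raise
    return previous != new_text

# Hypothetical usage: only continue the pipeline if the raw data changed.
# if has_changed('lighthousecpt-schools-source', 'SofiaUniversity/deadline.txt', extracted_text):
#     ...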
Extracting Data From Source
In this project, we will be extracting data from various PDFs and webpages.
From PDFs with PyPDF and PDFTables
In order to extract information from the PDFs, I used either PyPDF or PDFTables.
- PyPDF is a free and open-source, pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs as well.
- PDFTables is a cloud-based service, developed by The Sensible Code Company, that specializes in converting PDFs into editable formats like XLSX, CSV, XML, and HTML. It uses an algorithm to interpret the structures within a PDF, distinguishing between rows and columns to accurately convert data. While it doesn't perform Optical Character Recognition (OCR) for scanned documents or images, it's designed to work reliably with large data quantities. The service also provides an API for developers and supports bulk data conversion for enterprise use.
Both are great options, and I use both. For PDFs with tabular data, PDFTables is optimal; for PDFs composed entirely of text, PyPDF proves to be the most effective solution.
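As a rough sketch (not the project's exact code), extraction with either library might look like the following; the pdftables_api calls follow that package's documented Client interface, but treat them as an assumption, and the file names are hypothetical:

from pypdf import PdfReader
import pdftables_api  # assumed client from the PDFTables Python package

def pdf_to_text(path):
    # PyPDF: concatenate the extracted text of every page.
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def pdf_to_csv(path, out_path, api_key):
    # PDFTables: send the PDF to the cloud service and save the returned CSV.
    client = pdftables_api.Client(api_key)
    client.csv(path, out_path)

# text = pdf_to_text("admission_requirements.pdf")              # text-only PDF (hypothetical file)
# pdf_to_csv("academic_calendar.pdf", "calendar.csv", API_KEY)  # tabular PDF (hypothetical file)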
For instance, consider this admission requirement PDF, which is composed entirely of text. In this case, I applied PyPDF.

However, in this document, we encounter tabular data, hence I utilized PDFTables.

From Webpages Using BeautifulSoup4
Extracting information from webpages is tricky. As of writing this blog, there are 16 webpages with different HTML structures from which to extract data. I realized that since I'll be using OpenAI, all I need is a universal function that takes two parameters: the string where extraction begins and the string where it ends.
Let's look at this webpage for example.

I wish to extract all information ranging from 'Tuition & Fees' to 'Application Deadlines.'
After numerous iterations, I've developed this function that, to date, has successfully extracted all content between two sentences within a webpage:
from bs4 import NavigableString

def extract_content(soup, start_string, end_string, include_end=False):
    content = ''
    start = soup.find(string=start_string)
    end = soup.find(string=end_string)
    if not start:
        raise ValueError('Start marker not found in content.')
    if not end:
        raise ValueError('End marker not found in content.')
    for element in start.parent.find_all_next(string=True):
        if element == end:
            if include_end:
                content += element.strip() + '\n'
            break
        if isinstance(element, NavigableString):
            stripped = element.strip()
            if stripped:
                content += ' ' + stripped
    return content.strip()
The function above is designed to extract content between two specified strings, 'start_string' and 'end_string'. It navigates through the document starting from 'start_string', includes all the textual content in between, and stops before 'end_string' (or includes 'end_string' if 'include_end' is set to True).
If the specified start or end strings cannot be found, the function raises a ValueError.
Revisiting the example, we can apply the function as follows:
extract_content(soup, 'Tuition & Fees', 'Application Deadlines')
The result we obtain is:
Tuition & Fees Estimated Tuition Rates 2022-2023 This chart estimates the cost of attending CIAM for two years as a graduate student. For a significant proportion of our students, these expenses are offset by scholarships and aid. Keep in mind that your total cost may vary depending on your personal expenses and the degree or concentration you choose. Master's of Business Administration in Executive Management Tuition Full-Time Status (2 courses per term) $4,000 per term Part-Time Status (1 course per term) $2,000 per term Tution Costs Per Credit $667 Total Tuition: $24,000 for entire program Student Fees 2022-2023 Category: Fees: Application Fee (Non-Refundable) $75 STRF Fee (Non-Refundable) $60 Estimated Total Textbook Cost No Additional Cost Estimated Parking Fee No Additional Cost INT/CPT Registration Fee $100* Total Estimated Program Associated Fees: $235 *Students are required to successfully complete INT501 & INT599 to graduate, with a $50 registration fee for each course; international students require CPT approval before registering for INT courses. Cost of Attendance Total Tuition: $24,000.00 for entire program Total Estimated Program Associated Fees: $235.00 TOTAL ESTIMATED PROGRAM COSTS (TUITION + FEES): $24,235.00
Excellent! Now, we can conveniently forward this text to our prompt.
Optional Data Filtering
Data extracted from various sources is often cluttered with superfluous characters, excessive white spaces, newlines, and repetitive punctuation such as periods. Let's revisit this PDF from before.

Notice all the "...." between the description and date. Running PyPDF on it will retain all these extra periods and give us:
SUMMER 2023 (23/S1 AND 23/S2) 23/S1 Summer A classes begin ................................................Thurs., May 11 Last day to add ............................................................................ Mon., May 15 Mark of “W” begins for course withdrawals2 ............................... Tues., May 16 Last day to drop with a “W” 2 ...................................................... Wed., June 14 23/S1 MS_MGT Executive Residency .................... Fri. June 23- Sat. June 24 23/S1Summer A classes end .................................................... Wed., June 28 23/S1Summer B classes begin ................................................... Mon., July 10 Last day to add ............................................................................ Wed., July 12 Mark of “W” begins for course withdrawals ................................
Considering the advancements in OpenAI's latest models and their extensive context windows, we could practically skip any cleaning and send this data as is. But with the power of regex, we can clean this data and make it look much better in a few lines of code.
import re

def StFrancisCollege_deadline(text):
    lines = text.split('\n')
    new_lines = [re.sub(r'\.{3,}', '==', line) for line in lines]
    new_lines = [re.sub(r'(\d+)\s+(\d+)', r'\1\2', line) for line in new_lines]
    new_text = '\n'.join(new_lines)
    return new_text
The above Python code splits the input text into lines, substitutes every sequence of three or more consecutive periods with two equal signs, removes spaces between sequential number groups on every line, and then joins all the lines back together.
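For example, applied to a single line of the raw output above, the function produces:

StFrancisCollege_deadline("Last day to add ............ Mon., May 15")
# -> 'Last day to add == Mon., May 15'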
Check out the difference it makes.

We now possess our final, cleaned, raw data, prepared for forwarding to a prompt!
Prompts
Having successfully extracted our raw data from a PDF or webpage, and completed the optional data filtering, we are now ready to evaluate the prompts to which we will be dispatching data. For this project, I used three unique prompts: the first to extract tuition fees and details, the second to extract important requirement information, and the third to extract deadline dates.
Let's look at the 'extract_deadline_to_csv' prompt.
def extract_deadline_to_csv(text, add_ins, programs):
    if add_ins is not None:
        add_ins = ', '.join(add_ins)
    prompt = (
        f"We have extracted academic dates and schedules from a source. "
        f"We need to format this information into a structured CSV format suitable for prospective students interested "
        f"in the '{programs}' program. "
        f"Please discard any undergraduate information. If any information applies to both undergraduate and graduate "
        f"students, include it but rename it as graduate information."
        f"The goal is to ONLY identify the program/class start and end dates, "
        f"as well as application deadlines (if any). "
        f"Depending on the content, you may need to use the following columns: 'Term', 'Date', 'Session', and 'Type'. "
        f"Maintain this column order in the CSV. "
        f"If the dates are given in ranges, for instance, '2024-05-20 - 2024-07-14', split them into two separate "
        f"rows: one for 'Start Date' and one for 'End Date'. If a data item includes commas, enclose it in quotes."
        f"Ensure all dates are in 'YYYY-MM-DD' format. Terms should be expressed fully, like 'Fall 2023' or 'Spring "
        f"2023'. If a term is presented as 'Spring II: April 1, 2024', it represents 'Term' as 'Spring 2024', "
        f"'Session' as '2', 'Date' as '2024-04-01', and 'Type' could be deadline, start date, end date, etc."
        f"If it just says for example 'Summer 2024 Term', then that means the 'Session' is '1'. "
        f"If it says for example 'Spring 2025 Full Session', then that means the 'Session' is 'Full'. "
        f"Additional instructions for this specific data extraction are as follows: {add_ins}."
        f"Avoid any additional bibliographical conversation or content in the response. "
        f"CRITICAL: Only return back the CSV. "
        f"The data for conversion is:\n'{text}'."
    )
    logger.debug(f"Returning prompt: {prompt}")
    return prompt
'text' is simply the extracted information, 'add_ins' is a list of strings that I can optionally include for better responses and specificity, and 'programs' is a list of program names that I pass through the events.json so the generated response only includes information about those programs.
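For illustration, a call to this function might look like the following (the variable cleaned_text and the add_ins value are hypothetical):

prompt = extract_deadline_to_csv(
    text=cleaned_text,  # the filtered raw data from the previous step (hypothetical variable)
    add_ins=["Ignore orientation and residency dates"],  # hypothetical extra instruction
    programs=["MBA", "MS in Computer Science"]
)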
Putting it all together, this is what it looks like.

In the figure above, we observe the inputs ('text', 'add_ins', and 'programs') being fed into 'extract_deadline_to_csv'. The debug console shows these inputs combined into one extensive prompt. The other two functions, 'extract_tuition_to_csv' and 'extract_requirement_to_csv', perform parallel operations. For brevity and focus, we won't delve into them here; for a comprehensive view, feel free to explore the full prompts in the GitHub repository.
OpenAI
Generating Responses in CSV Format
So far, we have extracted data from a PDF or webpage, performed some optional filtering, and thrown it all in a function to get one long string of instructions that we call a prompt. All that is left to do is send that over to OpenAI.
After extensive testing and debugging, here are some strategies for generating accurate responses from OpenAI in CSV format.
Model Selection
At the time of writing this blog, GPT-4-Turbo is the latest model and the one that worked best for my use case. I tried GPT-3.5-Turbo and GPT-4, but both presented issues such as omitting important details and consistently producing invalid CSV formatting. GPT-4-Turbo was the most consistent in following the prompt and was the model I selected.
Temperature Tuning
After testing temperatures from 0.1 to 1, I found that a value of 0.4 provided the best results. For this particular task of generating accurate CSV responses, controlling the model's randomness was crucial, hence a temperature of 0.4 was chosen.
CSV Formatting
Typically, GPT will insert contextual conversation or content before or after the needed response, such as "Sure, here is the CSV," or "The above includes the CSV with..." To counteract this tendency, specific instructions in the prompt become crucial. In my case, these two sentences consistently led GPT to produce a CSV-formatted response exclusively: "Avoid any additional contextual conversation or content in the response. CRITICAL: Return only the CSV." In my tests, the word CRITICAL proved important, as omitting it would sometimes result in text appearing before or after the CSV.
Now let's send that prompt over to OpenAI using GPT-4-Turbo and temperature 0.4 and see what we get back.

To put the above illustration in simpler terms,
- We send the prompt to OpenAI.
- OpenAI generates a response.
- We strip the 'csv' label and the triple backticks from the start and end of the response (sketched below).
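A minimal sketch of that request-and-clean step, written against the OpenAI Python SDK (the project's actual call may differ, and the helper name is mine):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_csv_response(prompt):
    """Send the prompt to GPT-4-Turbo at temperature 0.4 and return the cleaned CSV text."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0.4,
        messages=[{"role": "user", "content": prompt}],
    )
    raw = response.choices[0].message.content.strip()
    # Strip a Markdown fence such as ```csv ... ``` if the model adds one.
    if raw.startswith("```"):
        raw = raw.strip("`")
        raw = raw.removeprefix("csv").strip()
    return raw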
At this stage, we have one final check to do before proceeding with generating the remaining responses: ensuring the response can be rendered into a dataframe, i.e., it is a valid CSV.
CSV Validity Check
To achieve this, I use a 'try' and 'except' block. If the response is invalid, the execution of 'pd.read_csv(x_data, sep=',')' will fail, resulting in a ParserError being raised. By using 'except pd.errors.ParserError', we simply discard the response, generate a new one, and continue this process until no ParserError is raised.
The process iterates until we receive three responses in valid CSV format, all of which are stored in the output_texts list. We then assign these three responses to the variables csv1, csv2, and csv3.
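Here is a minimal sketch of that loop, reusing the hypothetical generate_csv_response helper sketched earlier:

import io
import pandas as pd

def collect_valid_csvs(prompt, target=3):
    """Request responses until `target` of them parse as valid CSV."""
    output_texts = []
    while len(output_texts) < target:
        response = generate_csv_response(prompt)  # hypothetical helper from the sketch above
        try:
            pd.read_csv(io.StringIO(response), sep=',')  # raises ParserError if the CSV is invalid
        except pd.errors.ParserError:
            continue  # discard the response and ask again
        output_texts.append(response)
    return output_texts

csv1, csv2, csv3 = collect_valid_csvs(prompt)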
Here is the process flowchart once more for enhanced visualization.

Mermaid Syntax
graph TD;
style A fill:#ffffba,stroke:#333
style B fill:#bae1ff,stroke:#333
%% style C fill:#bae1ff,stroke:#333
style D fill:#ffdfba,stroke:#333
style E fill:#bae1ff,stroke:#333
style F fill:#ffb3ba,stroke:#333
style G fill:#ffdfba,stroke:#333
style H fill:#baffc9,stroke:#333
A([Send prompt to OpenAI]) -->B[OpenAI generates response]
%% B --> C[Clean response by removing\ncsv and triple backticks]
B --> |Clean response|D{Can response be converted to\nDataFrame in pandas?}
D -->|Yes| E[Store cleaned response in list]
D -->|No, ParserError| F{{Discard response}}
F --> B
E --> G{Have 3 responses?}
G -->|No| B
G -->|Yes| H[Assign responses to csv1, csv2, csv3]
Choosing the Best Response (or CSV)
We now have three potential responses to evaluate and select the one that aligns best with our given prompt. For this task, I've leveraged the capabilities of OpenAI's GPT-4, setting the temperature parameter to 0.2. In my experience, using OpenAI's models, particularly GPT-4, has yielded excellent results for decision-making tasks. With sufficient context, it consistently makes selections akin to human judgment.
To choose the best response, this is the prompt I'm using.
def choose_best_csv(used_prompt, csv1, csv2, csv3):
    prompt = (
        f"Here is the prompt I used:\n\n'{used_prompt}'\n\nUsing it, I obtained three CSVs. "
        "The task is to return ONLY the number (nothing else) - 1, 2, or 3 - of the CSV that adhered most closely to "
        "the prompt. Evaluate each CSV and return only the number of the one that best follows this prompt. "
        "Here are the CSVs:\n\n"
        "CSV 1:\n"
        f"'{csv1}'\n\n"
        "CSV 2:\n"
        f"'{csv2}'\n\n"
        "CSV 3:\n"
        f"'{csv3}'"
    )
    logger.debug(f"Returning prompt: {prompt}")
    return prompt
Seems fairly simple, doesn't it? We provide GPT-4 with the initial prompt and three responses, then instruct it to return either 1, 2, or 3. This is how the entire choose_best_csv function appears.

We keep executing this prompt until GPT-4 selects the same CSV as the best choice twice. Below is a flowchart illustrating this procedure.

Mermaid Syntax
graph TD;
style B fill:#ffffba,stroke:#333
style C fill:#bae1ff,stroke:#333
style E fill:#ffdfba,stroke:#333
style F fill:#baffc9,stroke:#333
B([csv1, csv2, csv3])
C[GPT-4 Selects Best CSV] --> |Append Selected CSV to List| E
E{Any CSV Appear\nTwice in List?} -- Yes --> F[End with Repeated CSV as Result]
E -- No --> C
B --> C
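In code, that repeat-until-two-votes loop might look roughly like this, assuming a hypothetical ask_gpt4 helper that sends a prompt to GPT-4 at temperature 0.2 and returns its text reply:

from collections import Counter

def pick_best_csv(used_prompt, csv1, csv2, csv3):
    """Re-run the selection prompt until one CSV has been chosen twice."""
    csvs = {1: csv1, 2: csv2, 3: csv3}
    votes = Counter()
    while True:
        prompt = choose_best_csv(used_prompt, csv1, csv2, csv3)
        choice = int(ask_gpt4(prompt).strip())  # ask_gpt4 is a hypothetical GPT-4 call (temperature 0.2)
        votes[choice] += 1
        if votes[choice] == 2:
            return csvs[choice]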
At last, we possess a CSV which outshines the other two, encompassing all the necessary information we require.

AWS
Setting all this up on AWS was pretty straightforward. I'll break down each step I took to deploy it to the cloud.
AWS CodePipeline
When there's a change to the GitHub repository, AWS CodePipeline is triggered. It orchestrates the rest of the operations, first initiating AWS CodeBuild.
buildspec.yml
version: 0.2

phases:
  install:
    runtime-versions:
      python: 3.11
  build:
    commands:
      - echo "******** Running SAM Build ********"
      - sam build
      - echo "******** Running SAM Package ********"
      - sam package --output-template-file packaged.yaml --s3-bucket school-scraper-build-bucket

artifacts:
  files:
    - packaged.yaml
  discard-paths: yes
template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >
  School Web Scraping Project

Globals:
  Function:
    Timeout: 900

Resources:
  SchoolScraper:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: SchoolScraperFunction
      CodeUri: SchoolScrapingProject/
      Handler: app.lambda_handler
      Runtime: python3.11
      Role: arn:aws:iam::654654270857:role/School-Scraper-Role
      Architectures:
        - x86_64
AWS CodeBuild
CodeBuild executes the steps laid out in the buildspec.yml file. It begins with the sam build command, which uses the template.yaml file to understand the structure and resources of the serverless application. The template.yaml file defines the serverless application and the resources it uses, such as the lambda function, its properties, and the role it assumes.
After the build, CodeBuild runs the sam package command, which packages the built application and uploads it to an S3 bucket, generating a new packaged.yaml file. This new packaged.yaml replaces any references to local artifacts in the template.yaml with S3 locations where the build artifacts were uploaded.
AWS S3
S3 serves as the storage service where the serverless application artifacts are stored during the build process. It stores the packaged.yaml file, which is an AWS CloudFormation-ready Serverless Application Model (SAM) template.
AWS CloudFormation
Once the packaged.yaml is available, AWS CodePipeline triggers AWS CloudFormation. CloudFormation reads the packaged.yaml file from the S3 bucket and uses it to deploy (or update) the AWS resources (such as our AWS Lambda function) specified in that template.

Mermaid Syntax
graph TD
style A fill:#ffffba,stroke:#333
style B fill:#bae1ff,stroke:#333
style Z fill:#bae1ff,stroke:#333
style D fill:#bae1ff,stroke:#333
style E fill:#bae1ff,stroke:#333
style F fill:#bae1ff,stroke:#333
style G fill:#bae1ff,stroke:#333
style H fill:#baffc9,stroke:#333
A([GitHub Repository])
B[AWS CodePipeline]
Z[AWS CodeBuild]
D[Build Application]
E[Package Application]
F[(AWS S3)]
G[AWS CloudFormation]
H[AWS Lambda]
A -.-> Z
A-->|Change detected|B
B-->.
subgraph .
Z-->|Uses buildspec.yml|D
D-->E
end
E-->|Upload packaged.yaml|F
F <--> |Read packaged.yaml|..
subgraph ..
G-->|Deploy/update resources|H
end
Once the CloudFormation stack has been deployed successfully, the Lambda function is in place. This function will then be wired up to AWS Step Functions. However, before moving forward, it is essential to understand how a Lambda function operates.
AWS Lambda Function
AWS Lambda is a service that runs your code in response to events without requiring server management. An event, which triggers the code, is a JSON document carrying input data such as changes in a file or database. The context is an object containing details about the runtime environment, like function name, execution time, and memory limits.
The event JSON is a very important part of a lambda function.
Event JSON
The JSON for my event is composed of two main sections: config and schools. We'll examine the config section first.
{
  "config": {
    "region": "us-east-2",
    "source_bucket": "lighthousecpt-schools-source",
    "csv_bucket": "lighthousecpt-schools-csv",
    "extra_csv_bucket": "lighthousecpt-schools-csv-extra",
    "openai_api_key": "LIGHTHOUSECPT_OPENAI_API_KEY",
    "pdftables_api_key": "BILAL_PDFTABLES_API_KEY",
    "mode": "all"
  }
}
In my attempt to minimize hardcoding variables, I have implemented a configuration dictionary that includes the following:
- region: The AWS region where the resources are hosted.
- source_bucket: The S3 bucket for storing raw data extracted from PDFs or webpages.
- csv_bucket: The S3 bucket for storing the selected (best) CSV.
- extra_csv_bucket: The S3 bucket for storing the two additional CSV files (the rejects).
- openai_api_key: The parameter name for the OpenAI API key.
- pdftables_api_key: The parameter name for the PDFTables API key. Both pdftables_api_key and openai_api_key are parameter names only; the actual API keys are stored in AWS Systems Manager Parameter Store and retrieved via boto3's get_parameter method (see the sketch below).
- mode: This determines the operational mode of the application, which can either be all to instruct the scraper to run on all schools or a specified list of one or more schools such as TrineUniversity or TrineUniversity, SofiaUniversity.
Maintaining these configurations in a dictionary enhances the flexibility and adaptability of the code, accommodating different environments and use cases. Such an arrangement also bolsters code readability and maintainability.
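As a small sketch of that Parameter Store lookup (the helper name is mine, not the project's):

import boto3

def get_api_key(parameter_name, region):
    """Fetch an API key from AWS Systems Manager Parameter Store by parameter name."""
    ssm = boto3.client('ssm', region_name=region)
    response = ssm.get_parameter(Name=parameter_name, WithDecryption=True)
    return response['Parameter']['Value']

# openai_key = get_api_key(config['openai_api_key'], config['region'])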
Now, let's examine the 'schools' dictionary (showcasing just one of the schools).
{
  "schools": [
    {
      "name": "SofiaUniversity",
      "programs": [
        "MBA",
        "MS in Computer Science",
        "Doctor of Business Administration (D.B.A)"
      ],
      "tuition": "PDF_TXT",
      "requirement": "PDF_CSV",
      "deadline": "https://www.sofia.edu/admissions/"
    }
  ]
}
This 'schools' list is a comprehensive compilation of every school's name, its programs, and the sources to scrape for tuition, requirements, and deadlines. Let's dissect its components.
- name: This represents the school's name. It must be included in the list of schools (derived from reading all the directory names, more on this later).
- programs: This is a list of programs. The inclusion of programs enhances the precision and focus of the CSV returned by OpenAI.
- tuition, requirement, and deadline: These values can be configured as:
- A URL to a webpage, or a dictionary of URLs.
- PDF_TXT implies that the lambda will search for a PDF file in S3, utilize PyPDF to transcribe it into a TXT file, store it in S3, and then transmit that data to OpenAI.
- PDF_CSV indicates that the lambda will hunt for a PDF file in S3, employ PDFTables to convert it into a CSV file, store it in S3, and then relay that data to OpenAI.
- SKIP signifies that we do not need to extract data from certain schools. The "SKIP" option lets us effortlessly circumvent these.
Okay, I hope everything is making sense so far and coming together.
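To make the dispatch concrete, here is a highly simplified sketch of how a handler might read this event JSON; everything beyond the standard lambda_handler signature and the 'config'/'schools' keys is my own illustration, not the project's actual code:

def lambda_handler(event, context):
    config = event['config']
    schools = event['schools']
    # 'mode' is either 'all' or a comma-separated list of school names.
    selected = None if config['mode'] == 'all' else [
        name.strip() for name in config['mode'].split(',')
    ]
    for school in schools:
        if selected is not None and school['name'] not in selected:
            continue
        # ... extract (PDF_TXT / PDF_CSV / URL), optionally filter,
        # prompt OpenAI, pick the best CSV, and store it in S3 ...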
Our Lambda function is prepared and, given the appropriate event JSON, it runs, scrapes, generates CSVs, and stores them. The only three tasks remaining are to set up AWS Step Functions (one step for each school), AWS EventBridge Scheduler (to run the state machine automatically once per week), and AWS SNS (to receive crucial logs via email).
AWS Step Functions and EventBridge Scheduler
First, we create a new state machine. Given that there are eight schools, we utilize the AWS Lambda Invoke action eight times, neatly tucked between the Start and End. Each action is correspondingly assigned the name of a school.

A JSON payload was also entered for each state. For instance, consider SofiaUniversity:
{
  "config": {
    "region": "us-east-2",
    "source_bucket": "lighthousecpt-schools-source",
    "csv_bucket": "lighthousecpt-schools-csv",
    "extra_csv_bucket": "lighthousecpt-schools-csv-extra",
    "openai_api_key": "LIGHTHOUSECPT_OPENAI_API_KEY",
    "pdftables_api_key": "BILAL_PDFTABLES_API_KEY",
    "sns_topic_name": "MyTopic",
    "mode": "SofiaUniversity"
  },
  "schools": [
    {
      "name": "SofiaUniversity",
      "programs": [
        "MBA",
        "MS in Computer Science",
        "Doctor of Business Administration (D.B.A)"
      ],
      "tuition": "PDF_CSV",
      "requirement": "PDF_CSV",
      "deadline": "https://www.sofia.edu/admissions/"
    }
  ]
}

AWS SNS
Receiving notifications about any errors during the execution of the state machine is crucial. To achieve this, I leveraged AWS SNS (Simple Notification Service). I created an ErrorTopic, subscribed myself to it, and then called sns.publish from my Lambda function whenever a ValueError is raised, like so:
except ValueError as e:
    error_message = str(e)
    trace = traceback.format_exc()
    # Construct the message that will be sent on the SNS
    error_log = {
        'error_message': error_message,
        'stack_trace': trace
    }
    # Send the error log to the SNS Topic
    send_email('ErrorTopic', f'Error Logs - {datetime.today().strftime("%Y-%m-%d")}', error_log)
    raise  # re-throwing the exception after sending notification
I crafted the send_email function, which is currently used in two scenarios: first, to send error logs to myself as the developer, and second, to send informational logs to both myself and the client.
def send_email(topic_name, subject, message):
    sns = boto3.client('sns')
    # get the topic ARN
    response = sns.create_topic(
        Name=topic_name
    )
    topic_arn = response['TopicArn']
    # Format each item in the message dictionary, separated by blank lines
    formatted_message_parts = [f"{key}:\n{value}" for key, value in message.items()]
    formatted_message = "\n\n".join(formatted_message_parts)
    sns.publish(
        TopicArn=topic_arn,
        Subject=subject,
        Message=formatted_message
    )
Here is the email I receive when there is an error.

For the informational logs, here is what both I and the client receive—one email for each school.

Conclusion and Future Work
Over the course of three months, I've completed this comprehensive project, and I continue to refine it with minor adjustments. AWS, being a colossal cloud platform, has a steep learning curve, yet I find immense satisfaction and gratitude in the substantial progress I have made within this timeframe.
This project has taught me the importance of future-proofing design, leveraging powerful AI models for data processing, and automating workflows using cloud services. The ability to seamlessly extract, clean, and process data from a variety of sources has numerous applications and can significantly streamline data-driven tasks in various fields.
Looking Ahead
There are several directions and improvements I plan to pursue:
- Expand Supported File Formats: Introduce support for additional document formats like DOCX, XLSX, and others, broadening the range of data sources that can be processed.
- User Interface Development: Develop a user-friendly interface where users can upload files directly instead of going to AWS S3. This could make the tool accessible to users with less technical expertise.
- Model Fine-Tuning and Custom Prompts: Experiment with fine-tuning OpenAI models specifically for the document types and structures I frequently encounter. This could improve the accuracy and relevance of the processed data.
Acknowledgments
I’d like to thank my NYCDSA instructors and peers for their support and guidance throughout this project. Your feedback and encouragement have been invaluable.
Thank you for taking the time to read about my project. If you have any questions or feedback, please feel free to reach out or visit my GitHub repository. Your input is greatly appreciated!