Can scraping data and NLP techniques aid your understanding?

Michael Griffin
Posted on Jan 5, 2020

Spoiler: a little, but you still need to read the research!

Thirty-second summary

  • I used a web scraper to extract publicly available research content from two of the top machine learning conferences (NeurIPS and ICML) over the period 2007-19, generating a rich dataset of ~12,000 texts
  • Unsupervised topic modelling is used to explore clustering of terms across research areas; however, manual topic creation more clearly demonstrates trends over time
  • I also experimented with recent transfer learning techniques to develop a language model that generates "fake abstracts", which non-experts might find quite hard to distinguish from real ones
  • Tools used: Python, Google Colab, Scrapy, gensim, spaCy, regex, fast.ai libraries

Introduction

As a data scientist in training, I needed a fun project to experiment with scraping tools and Natural Language Processing (NLP) techniques. I also wanted a reason to familiarise myself with important machine learning (ML) research - so I designed an end-to-end process looking at progress in the area. This blog post will be structured to cover:

  1. Data collection - scraping the dataset from conference pages
  2. Text understanding - using a variety of techniques to explore language clustering and trends over time
  3. Text generation - experimenting with more recent developments in language models
  4. Summary of key insights

My analysis focuses on content from two of the top annual conferences in the machine learning community: the Conference and Workshop on Neural Information Processing Systems (NeurIPS, formerly NIPS) and the International Conference on Machine Learning (ICML). Both have grown substantially in recent years by various measures (attendees, paper submissions, etc.) - my analysis looks at various types of scheduled content (like poster sessions or workshops), and the growth in volume here is clear.

This post explores techniques taught at NYC Data Science Academy as well as transfer learning models outlined in the fast.ai NLP course (a fantastic top-down resource, available here).

1) Data collection

Core data is sourced from the NeurIPS and ICML schedule pages - conveniently these have very similar structures and formats, but I couldn't find existing datasets collating this information (although some posts suggest people do this to optimise their own schedules at these busy conferences!).

I used the scrapy package to crawl c.12,000 web pages and collate information on each session, including title, author, abstract and links - this proved straightforward with a combination of XPath and regex. The schedule captures a variety of session types, mainly poster sessions on specific papers but also varieties of talks and workshops.
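For illustration, a stripped-down spider along these lines would do the job. The URL pattern and div/class selectors below are assumptions standing in for the real page markup:

```python
import re
import scrapy

class ScheduleSpider(scrapy.Spider):
    """Crawl conference schedule pages and yield one record per session."""
    name = "schedule"
    # Hypothetical listing URLs - the real sites paginate by year.
    start_urls = [f"https://nips.cc/Conferences/{y}/Schedule" for y in range(2006, 2020)]

    def parse(self, response):
        year = int(re.search(r"/(\d{4})/", response.url).group(1))
        # One div per session; these selectors are illustrative, not the real markup.
        for session in response.xpath('//div[contains(@class, "maincard")]'):
            abstract = " ".join(session.xpath('.//div[@class="abstract"]//text()').getall())
            yield {
                "year": year,
                "title": session.xpath('.//div[@class="title"]/text()').get(default="").strip(),
                "authors": session.xpath('.//div[@class="authors"]/text()').get(default="").strip(),
                "abstract": re.sub(r"\s+", " ", abstract).strip(),  # collapse whitespace
            }
```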

I also explored data enrichment with a Kaggle dataset (here) which covers some NeurIPS information from 1987-2017 and helpfully includes the full paper text. Merging this dataset (using shared titles) means I have mixed availability through time - most of my analysis focuses on abstracts and titles in the period 2006-19 to maximise consistency and recency. Note this does create a bias towards more recent years in the language modelling, since there is simply more content, but I consider this acceptable.
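The title-based merge is straightforward in pandas once titles are normalised; the file and column names here are assumptions:

```python
import pandas as pd

scraped = pd.read_json("scraped_sessions.json")    # output of the spider
kaggle = pd.read_csv("nips_papers_1987_2017.csv")  # Kaggle export

def norm(titles):
    """Normalise titles so capitalisation and punctuation don't break the join."""
    return titles.str.lower().str.replace(r"[^a-z0-9 ]", "", regex=True).str.strip()

scraped["key"] = norm(scraped["title"])
kaggle["key"] = norm(kaggle["title"])

# Left join keeps every scraped session; full text exists only where titles match.
merged = scraped.merge(kaggle[["key", "paper_text"]], on="key", how="left")
print(merged["paper_text"].notna().mean())  # share of sessions with full text
```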

2) Text understanding

Automated approach

First off, some simple word-clouds demonstrate how the key language has shifted between 2006 and 2019. To obtain these I processed the data in the standard manner to remove stop words and apply lemmatisation (see appendix).
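The cleaning step is conventional; a minimal sketch with spaCy, folding in the extra domain stop words listed in the appendix, might look like this:

```python
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# Domain-specific additions on top of spaCy's default stop list (see appendix).
for word in ["program", "algorithm", "learning", "problem", "analysis"]:
    nlp.vocab[word].is_stop = True

def preprocess(text):
    """Lower-case, lemmatise, and drop stop words, punctuation and numbers."""
    return [tok.lemma_ for tok in nlp(text.lower()) if tok.is_alpha and not tok.is_stop]

abstracts = ["We present a novel Bayesian inference algorithm for deep networks."]
tokens = [preprocess(a) for a in abstracts]  # one token list per document
```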

It's clear immediately that "deep" networks and "reinforcement" approaches are referenced more now, while "Bayesian" and "inference" techniques seem less prevalent. "Representation" learning and issues relating to "efficiency" or "optimization" are also more frequently discussed.

These sessions are not tagged or labelled in any way so it's difficult to explore trends. Topic modelling via Latent Dirichlet Allocation (LDA) offers an unsupervised approach to examine clustering of words within documents. For a specified number of topics, clusters of co-occurring keywords are created - so topics represent a distribution of words and documents capture a distribution of these topics.

After parameter tweaking (see appendix) the approach does yield some success, generating ten categories which occur with similar frequency and have limited overlap. These are portrayed below via the pyLDAvis library as bubbles where the distance reflects a measure of the separation between topics.
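In gensim this boils down to a few lines. A sketch using the parameters from the appendix (the remaining settings are assumptions, and older pyLDAvis versions expose the module as pyLDAvis.gensim rather than gensim_models):

```python
from gensim import corpora, models
import pyLDAvis
import pyLDAvis.gensim_models

# `tokens` is the list of per-document token lists from the preprocessing step.
dictionary = corpora.Dictionary(tokens)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # drop rare/ubiquitous terms
corpus = [dictionary.doc2bow(doc) for doc in tokens]

# Appendix parameters: 10 topics, decay 0.6; passes and seed are assumed.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10,
                      decay=0.6, passes=10, random_state=42)

# Bubble chart where inter-bubble distance approximates topic separation.
vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")
```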

Some of these topics have intuitive interpretations - I highlight topic 8 which is well-separated and captures a number of terms associated with reinforcement learning (like "agent", "reward", "RL", "environment", "exploration", "policy").

However, interpretation of many of the other clusters is more difficult and highly subjective. Much of the language is common across topics and the outcome appears very sensitive to the modelling configuration, so I'm hesitant to rely on this clustering.

I also applied sentiment analysis using textblob to explore any trends in optimism or pessimism - unsurprisingly the conference content was highly objective with medium-low polarity across time, in line with academic writing styles.
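For reference, textblob reduces this to a property lookup per document:

```python
from textblob import TextBlob

# Polarity runs from -1 (negative) to +1 (positive);
# subjectivity from 0 (objective) to 1 (subjective).
sent = TextBlob("We propose a novel estimator and prove strong guarantees.").sentiment
print(sent.polarity, sent.subjectivity)
```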

Manual approach

Using automated clustering does point a way forward - intuitive groupings can instead be generated by manually defining specific dictionaries. This does require some basic domain knowledge but I was helped by the LDA groupings and wordclouds above.

I considered references to different techniques/model architectures, domains/applications and specific datasets, looking at the % of summaries (abstracts) containing each topic by year. My approach used short dictionaries with highly specific terms to minimise false positives; this does mean that proportions are probably underestimated. Linear trend lines are used, as more complex profiles are difficult to discern given the data volumes.
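A sketch of the counting logic is below; the frame layout and dictionary entries are illustrative stand-ins, not my full term lists:

```python
import pandas as pd

# Toy frame standing in for the scraped data (one row per session).
df = pd.DataFrame({
    "year": [2007, 2013, 2019],
    "abstract": ["We derive a variational Bayesian inference scheme.",
                 "A convolutional neural network for image labelling.",
                 "Training generative adversarial networks with policy gradients."],
})

# Short, highly specific dictionaries to minimise false positives.
topics = {
    "CNNs": ["convolutional neural", "convnet"],
    "GANs": ["generative adversarial"],
    "reinforcement learning": ["reinforcement learning", "policy gradient"],
}

for name, terms in topics.items():
    df[name] = df["abstract"].str.lower().apply(lambda a: any(t in a for t in terms))

trend = df.groupby("year")[list(topics)].mean() * 100  # % of abstracts per year
print(trend.round(1))
```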

Whilst the approach is simplistic and reliant on the choice of dictionary terms, a few trends do jump out - the rise of "deep" neural networks, specifically generative adversarial networks (GANs) and convolutional neural nets (CNNs), and the growing references to reinforcement learning.

This contrasts with static or slowly declining references to classic statistical approaches:

The approach suggests NLP and computer vision are the most commonly referenced domains. There is also some evidence that game-related analysis and ethical considerations are of growing relevance.

I also looked at references to key benchmark datasets in computer vision, hinting at the growing importance of CIFAR, MNIST and ImageNet.

3) Text generation

As a somewhat separate challenge, I also wanted to explore the other side of NLP: text generation. This requires more complex modelling, as context and sentence structure matter, unlike the frequency-based approaches explored above.

For this, I explore language modelling using transfer learning, which requires a different pipeline (a code sketch follows the list):

  1. Switch back to raw text feeds (so no removal of stop words or lemmatisation)
  2. Use a pre-trained language model based on the English text in Wikipedia (see appendix for details)
  3. Train the weights in the final layer using the documents in my NIPS/ICML dataset, to predict the next word on the basis of the prior word sequence
  4. Unfreeze the full set of weights and retrain the full network over c.10 epochs
  5. Seed the language model with an introductory phrase and generate a set of fake abstracts, using some randomness to produce variety (via the temperature metric)
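A minimal sketch of steps 2-5 with the fastai text API is below; I'm assuming the v2 API and a hypothetical column name, so the exact calls in the original work likely differed:

```python
import pandas as pd
from fastai.text.all import *

df = pd.read_csv("abstracts.csv")  # raw abstracts - fastai tokenises internally

# Language-model DataLoaders: the target is simply the next word.
dls = TextDataLoaders.from_df(df, text_col="abstract", is_lm=True, valid_pct=0.1)

# AWD-LSTM pre-trained on English Wikipedia text (wikitext-103).
learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.5)

learn.fit_one_cycle(1, 2e-2)   # first train only the new final layer
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)  # then fine-tune the full network (~10 epochs)

# Seed with a phrase; temperature controls the randomness of sampling.
print(learn.predict("We present", n_words=80, temperature=0.75))
```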

Example

Picking an example seeded with the phrase "We present":

We present a novel learning framework for incrementally learning in the state - space representation of a nearby channel 's learnable distribution . We show that this approach naturally generalizes the canonical exploration model to a wider class of structures than the possible neural network . Demonstrate a generalization of the sparse coding model to low dimensional sparse spaces . We demonstrate the effectiveness of our approach in a large range of source and target domains .

I think the result is quite good. Whilst there is no coherent meaning to this text, the structure of the sentences and overall passage seems reasonable and most of the phrases are plausible. I think this could feasibly pass as a real (yet incomprehensible) abstract for some audiences!

Note that this was just a quick experiment using the abstracts across all years - a more coherent passage might be achievable by training on a narrow topic and year group, using the full text from the papers as well.

4) Summary

This project has been a fun experiment with scraping and NLP which offers a few tentative insights:

  • There do seem to be distinct vocabularies in some areas, especially reinforcement learning. But much of the technical language seems to generalise across approaches and domains
  • Many of the publicised trends - ubiquitous neural networks, "deep" everything, the importance of computer vision and key benchmarks - are visible through simple analysis of the references in conference content
  • Automated clustering is hard; domain knowledge proved more useful for categorising the texts
  • Incoherent but plausible abstracts can be generated using transfer learning

Whilst key research papers and trends are summarised elsewhere, this process has highlighted several niche topics of particular interest to me which I may not have otherwise discovered. More importantly, I now have a broad and categorised dataset covering ML research over the last 12 years - this should serve as a useful resource to accelerate my data science learning.

Further work

There's plenty more work I would like to do in this area - the most obvious extension would look at author and citation information to explore networks and academic vs industry contributions. I did start to explore this but was constrained by request limits via Google Scholar.

I would also consider redirecting this framework to another research area. For instance, could I quickly re-purpose the scraper and NLP code to explore trends in microeconomics research?

Appendix: technical notes

  • Best results were obtained with the longer list of stop-words from spaCy (~330 words) with additional terms common to this source (~25 words like "program", "algorithm", "learning", "problem", "analysis")
  • This analysis uses the fast.ai libraries with associated tokenisation. The language model is an RNN with the default dropout parameters and LSTM modules; underlying model here.
  • For the LDA I explored optimisation across two parameters: the number of topics in the range 2-30 and the decay value in the range 0.5-1. The selected parameters were 10 topics with a decay of 0.6

About Author

Michael Griffin

Mike Griffin is training at NYC Data Science Academy and has several years of experience in strategy/analytics roles in finance. He studied Natural Sciences (Physics) at the University of Cambridge and Management at the Judge Business School. Mike...