Scraping NSF Awards to Create Database of Active STEM Researchers

Posted on Nov 21, 2016

Introduction

There are numerous use cases for a searchable database of active STEM (Science, Technology, Engineering, and Math) researchers. For example, companies could use it for targeted marketing of services and products, and students could use it to select the best research institution and mentors for their interests. It can also provide an additional means of getting an overview of active research areas. Unfortunately, no such database is readily available, even though most of the needed information is readily available on various websites, e.g. faculty profiles and journal sites.

An obvious approach for creating this database is manually searching through the websites mentioned above. However, this would be very time consuming. A more efficient method would be to use web scraping to automate the data extraction process. Unfortunately, a small-scale test indicated that there is too much format variation in the relevant webpages to make scraping them practical. For example, even departments within the same institution often use different formats for faculty profiles. Moreover, since these profiles are infrequently updated, the information they contain is often out of date.

Approach

To overcome the challenges listed above, a much better approach is to "follow the money" by scraping grant award information from research funding agencies. Not only is award data provided in a consistent format but, more importantly, it provides a direct measure of a researcher's activity level. For this project, the awards page of the NSF (National Science Foundation) was scraped. The process, using the Scrapy framework, is outlined in the figure below.

[Figure: outline of the scraping process using the Scrapy framework]

Briefly, a web spider visits links on the awards page and downloads zip files containing award information stored in XML files. After unzipping, the relevant information is extracted by the processing pipeline and then stored in a MongoDB database. Using MongoDB allows for scalable full-text searching.
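
The unzip-and-extract step can be illustrated with a minimal sketch that uses only the Python standard library. This is not the project's actual pipeline code: the element names (AwardTitle, AbstractNarration, AwardAmount, Institution/Name, Investigator) are assumptions about the layout of the award XML files, and the sample record is made up so the example runs without downloading anything.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def text_of(element, path):
    """Return the stripped text at `path` under `element`, or '' if missing."""
    found = element.find(path)
    return (found.text or "").strip() if found is not None else ""


def parse_award(xml_bytes):
    """Turn one award XML document into a flat dict ready for storage."""
    root = ET.fromstring(xml_bytes)
    # Assumed layout: an <Award> element, possibly wrapped in a root tag.
    award = root if root.tag == "Award" else root.find("Award")
    return {
        "title": text_of(award, "AwardTitle"),
        "abstract": text_of(award, "AbstractNarration"),
        "amount": int(text_of(award, "AwardAmount") or 0),
        "institution": text_of(award, "Institution/Name"),
        "investigators": [
            f"{text_of(inv, 'FirstName')} {text_of(inv, 'LastName')}".strip()
            for inv in award.findall("Investigator")
        ],
    }


def awards_from_zip(zip_bytes):
    """Yield one parsed record per XML file inside a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".xml"):
                yield parse_award(archive.read(name))


# Made-up sample award, used only to exercise the functions above.
SAMPLE_XML = b"""<rootTag><Award>
  <AwardTitle>Example AFM Study</AwardTitle>
  <AbstractNarration>Placeholder abstract text.</AbstractNarration>
  <AwardAmount>250000</AwardAmount>
  <Investigator><FirstName>Ada</FirstName><LastName>Example</LastName></Investigator>
  <Institution><Name>Example University</Name></Institution>
</Award></rootTag>"""


def build_sample_zip():
    """Create an in-memory zip holding the sample award file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("0000001.xml", SAMPLE_XML)
    return buffer.getvalue()


if __name__ == "__main__":
    for record in awards_from_zip(build_sample_zip()):
        print(record)
```

Each resulting dict has the shape of a document that could then be inserted into a MongoDB collection by the storage stage of the pipeline.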

Use Case -- Direct Marketing

As mentioned above, one use case for such a database is direct marketing. Imagine a small company, Acme AFMs, that has developed a new instrument and wants to know if it's worth going into full-scale production. The first thing they do is establish that there is active demand by searching for grants related to AFMs (atomic force microscopes) over the last sixteen years. After establishing that there is ample demand based on the number of grants awarded, the next step is to get a list of the most active faculty and institutions who use AFMs. Again, this information is easily obtained from the database. The figures below provide a visualization of these search results.
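
The two searches described above can be sketched in plain Python. This is an illustrative stand-in, not the project's query code: the original system relies on MongoDB's full-text search, whereas this sketch does a simple case-insensitive substring match over a handful of made-up records. The field names (title, abstract, year, institution, investigators) and all sample values are assumptions.

```python
from collections import Counter

# Made-up award records standing in for documents stored in the database.
AWARDS = [
    {"title": "AFM study of polymer films", "abstract": "", "year": 2014,
     "institution": "Univ A", "investigators": ["Ada Example"]},
    {"title": "Atomic force microscopy of membranes", "abstract": "", "year": 2015,
     "institution": "Univ A", "investigators": ["Ada Example", "Bo Sample"]},
    {"title": "Galaxy formation survey", "abstract": "", "year": 2015,
     "institution": "Univ B", "investigators": ["Cy Placeholder"]},
    {"title": "New AFM probe design", "abstract": "", "year": 2015,
     "institution": "Univ B", "investigators": ["Ada Example"]},
]

KEYWORDS = ["afm", "atomic force microscop"]


def matching_awards(awards, keywords):
    """Return awards whose title or abstract contains any keyword."""
    lowered = [keyword.lower() for keyword in keywords]
    hits = []
    for award in awards:
        text = f"{award['title']} {award['abstract']}".lower()
        if any(keyword in text for keyword in lowered):
            hits.append(award)
    return hits


def awards_per_year(awards):
    """Count matching awards per year, as a rough demand signal."""
    return Counter(award["year"] for award in awards)


def most_active(awards, field, top_n=5):
    """Rank the values of `field` (a string or list of strings) by award count."""
    counts = Counter()
    for award in awards:
        value = award[field]
        counts.update(value if isinstance(value, list) else [value])
    return counts.most_common(top_n)


if __name__ == "__main__":
    hits = matching_awards(AWARDS, KEYWORDS)
    print(sorted(awards_per_year(hits).items()))
    print(most_active(hits, "investigators"))
    print(most_active(hits, "institution"))
```

A substring match is cruder than a real text index (a short keyword such as "afm" can match inside unrelated words), so this only shows the shape of the counting and ranking steps.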

[Figures: visualizations of the search results for AFM-related grants]

Conclusion

Web scraping of grant award information to create a searchable database of active STEM researchers was successfully done using the Scrapy framework and MongoDB. Test use cases show the value of this system and its potential. The next steps for this project are to:

  • Develop an interactive web application
  • Use machine learning for keyword tagging of awards
  • Add data from other granting agencies and publications
  • Explore predictive modeling based on this data

About Author

Nathan Stevens

Nathan holds a Ph.D. in Nanotechnology and Materials Science from the City University of New York graduate school, and has worked on numerous software and scientific research projects over the last 10 years. Software projects have ranged from...

Comments

Rion Dooley December 12, 2016
There seems to be some interest both inside and outside of NSF in making this data public. Is this project available as OSS? Is there an opportunity to collaborate and move it forward?
