Scraping NSF Awards to Create Database of Active STEM Researchers

Nathan Stevens
Posted on Nov 21, 2016

Introduction

There are numerous use cases for a searchable database of active STEM (Science, Technology, Engineering, and Math) researchers. Examples include targeted marketing by companies selling services and products, and helping students select the best research institutions and mentors based on their interests. Such a database can also provide an overview of active research areas. Unfortunately, no such database is readily available, even though most of the needed information can be found on various websites, e.g. faculty profiles, journal sites, etc.

An obvious approach to creating this database is to manually search through the above-mentioned websites. However, this would be very time-consuming. A more efficient method would be to use web scraping to automate the data extraction process. Unfortunately, a small-scale test indicated that there is too much format variation in the relevant webpages to make web scraping practical. For example, even departments within the same institution often use different formats for faculty profiles. Moreover, since these profiles are infrequently updated, the information they contain is often out of date.

Approach

To overcome the challenges listed above, a much better approach is to Follow The Money by scraping grant award information from research funding agencies. Not only is award data provided in a consistent format, but more importantly, it provides a direct measure of a researcher's activity level. For this project, the awards page of the NSF was scraped. The process, using the Scrapy framework, is outlined in the figure below.

[Figure: outline of the scraping process]

Briefly, a web spider visits links on the awards page and downloads zip files containing award information stored in XML files. After unzipping, the relevant information is extracted by the processing pipeline and then stored in MongoDB. Using MongoDB allows for scalable full-text searching.
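The unzip-and-extract step can be illustrated with a minimal sketch using only the Python standard library. This is not the project's actual pipeline code: the function names are hypothetical, and the XML element names (Award, AwardTitle, AwardEffectiveDate, AwardAmount, AbstractNarration, Institution, Investigator) are assumptions about the layout of the NSF award files that should be checked against a real download.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def text_of(parent, tag):
    """Return the stripped text of a child element, or None if it is absent."""
    node = parent.find(tag)
    return node.text.strip() if node is not None and node.text else None


def parse_award(xml_bytes):
    """Turn one award XML file into a flat dict ready for a document store.

    Element names are assumed, not taken from the original post.
    """
    award = ET.fromstring(xml_bytes).find("Award")
    if award is None:
        raise ValueError("no <Award> element found")
    investigators = []
    for inv in award.findall("Investigator"):
        parts = [text_of(inv, "FirstName"), text_of(inv, "LastName")]
        investigators.append({
            "name": " ".join(p for p in parts if p),
            "email": text_of(inv, "EmailAddress"),
        })
    institution = award.find("Institution")
    return {
        "title": text_of(award, "AwardTitle"),
        "effective_date": text_of(award, "AwardEffectiveDate"),
        "amount": int(float(text_of(award, "AwardAmount") or 0)),
        "abstract": text_of(award, "AbstractNarration"),
        "institution": text_of(institution, "Name") if institution is not None else None,
        "investigators": investigators,
    }


def awards_from_zip(zip_bytes):
    """Yield one parsed award per XML member of a downloaded zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".xml"):
                yield parse_award(archive.read(name))
```

The resulting dicts could then be written to a MongoDB collection, for example with pymongo's `insert_many`. Keeping the parsing separate from the download and storage steps makes it easy to test against a small in-memory archive.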

Use Case -- Direct Marketing

As mentioned above, one use case for such a database is direct marketing. Imagine a small company, Acme AFMs, that has developed a new instrument and wants to know if it's worth going into full-scale production. The first thing they do is establish that there is active demand by searching for grants related to AFMs over the last sixteen years. After establishing that there is ample demand based on the number of grants awarded, the next step is to get a list of the most active faculty and institutions that use AFMs. Again, this information is easily obtained from the database. The figures below provide a visualization of these search results.

[Figures: visualizations of the AFM grant search results]
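The two searches in this use case could be expressed as MongoDB aggregation pipelines. The sketch below only builds the pipeline definitions as plain Python data, so it runs without a database. The document field names (`year`, `amount`, `institution`, `investigators.name`) and both function names are hypothetical, since the post does not describe the stored schema, and the `$text` stage assumes a text index already exists on the searchable fields.

```python
def awards_per_year_pipeline(search_terms, start_year):
    """Count matching awards per year to gauge demand over time."""
    return [
        # $text must appear in the first stage and requires a text index.
        {"$match": {"$text": {"$search": search_terms},
                    "year": {"$gte": start_year}}},
        {"$group": {"_id": "$year", "awards": {"$sum": 1}}},
        {"$sort": {"_id": 1}},
    ]


def top_investigators_pipeline(search_terms, start_year, limit=10):
    """Rank investigators by the number of matching awards they hold."""
    return [
        {"$match": {"$text": {"$search": search_terms},
                    "year": {"$gte": start_year}}},
        # One output document per investigator listed on each award.
        {"$unwind": "$investigators"},
        {"$group": {
            "_id": {"name": "$investigators.name",
                    "institution": "$institution"},
            "awards": {"$sum": 1},
            "total_amount": {"$sum": "$amount"},
        }},
        {"$sort": {"awards": -1, "total_amount": -1}},
        {"$limit": limit},
    ]
```

With pymongo, a pipeline like this would be passed to `collection.aggregate(...)`, for example `collection.aggregate(top_investigators_pipeline("AFM", 2000))`, after creating the text index with something like `collection.create_index([("title", "text"), ("abstract", "text")])`.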

Conclusion

Web scraping of grant award information to create a searchable database of active STEM researchers was successfully done using the Scrapy framework and MongoDB. Test use cases show the value of this system and its potential. The next steps for this project are to:

  • Develop an interactive web application
  • Use machine learning for keyword tagging of awards
  • Add data from other granting agencies and publications
  • Explore predictive modeling based on this data

About Author

Nathan Stevens

Nathan holds a Ph.D. in Nanotechnology and Materials Science from the City University of New York graduate school, and has worked on numerous software and scientific research projects over the last 10 years. Software projects have ranged from...

