Scraping NSF Awards to Create Database of Active STEM Researchers

Nathan Stevens

Posted on Nov 21, 2016

Introduction

There are numerous use cases for a searchable database of active STEM (Science, Technology, Engineering, and Math) researchers. Examples include targeted marketing by companies interested in selling services and products, and helping students select the best research institutions and mentors based on their interests. Such a database can also provide an additional means of getting an overview of active research areas. Unfortunately, no such database is readily available, even though most of the needed information is readily accessible on various websites, e.g., faculty profiles, journal sites, etc.

An obvious approach to creating this database is, of course, to search the above-mentioned websites manually. However, this would be very time-consuming. A more efficient method would be to use web scraping to automate the data extraction process. Unfortunately, a small-scale test indicated that there is too much format variation in the relevant webpages to make scraping them practical. For example, even departments within the same institution often use different formats for faculty profiles. Moreover, since these profiles are infrequently updated, the information they contain is often out of date.

Approach

To overcome the challenges listed above, a much better approach is to "follow the money" by scraping grant award information from research funding agencies. Not only is award data provided in a consistent format; more importantly, it provides a direct measure of a researcher's activity level. For this project, the awards page of the NSF (National Science Foundation) was scraped. The process, using the Scrapy framework, is outlined in the figure below.

Briefly, a web spider visits links on the awards page and downloads zip files containing award information stored in XML files. After unzipping, the relevant information is extracted by the processing pipeline and then stored in MongoDB. Using MongoDB allows for scalable full-text searching.
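The unzip-and-extract step can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual Scrapy pipeline: the XML tag names (Award, AwardID, AwardTitle, Investigator, Institution, and so on) and the output field names are assumptions about the award file layout, and the helper names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def text_of(node, path):
    """Return the stripped text found at `path` under `node`, or '' if absent."""
    found = node.find(path)
    return (found.text or "").strip() if found is not None else ""


def parse_award(xml_bytes):
    """Flatten one award XML document into a dict (tag names are assumed)."""
    root = ET.fromstring(xml_bytes)
    award = root if root.tag == "Award" else root.find("Award")
    if award is None:
        return None
    investigators = [
        f"{text_of(inv, 'FirstName')} {text_of(inv, 'LastName')}".strip()
        for inv in award.findall("Investigator")
    ]
    amount = text_of(award, "AwardAmount")
    return {
        "award_id": text_of(award, "AwardID"),
        "title": text_of(award, "AwardTitle"),
        "abstract": text_of(award, "AbstractNarration"),
        "start_date": text_of(award, "AwardEffectiveDate"),
        "amount": int(amount) if amount.isdigit() else None,
        "investigators": investigators,
        "institution": text_of(award, "Institution/Name"),
    }


def extract_awards(zip_bytes):
    """Parse every .xml member of a downloaded zip archive into award dicts."""
    records = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip anything that is not an award file
            try:
                record = parse_award(archive.read(name))
            except ET.ParseError:
                continue  # skip malformed files rather than abort the batch
            if record is not None:
                records.append(record)
    return records
```

In a setup like the one described, the resulting dictionaries would then be handed to the database layer, for example inserted into a MongoDB collection with a text index on the title and abstract fields.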

Use Case -- Direct Marketing

As mentioned above, one use case for such a database is direct marketing. Imagine a small company, Acme AFMs, that has developed a new instrument and wants to know whether it is worth going into full-scale production. The first step is to establish that there is active demand by searching for grants related to AFMs (atomic force microscopes) awarded over the last sixteen years. After establishing that there is ample demand based on the number of grants awarded, the next step is to get a list of the most active faculty and institutions that use AFMs. Again, this information is easily obtained from the database. The figures below provide a visualization of these search results.
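The two searches in this use case can be approximated in plain Python over a list of award records. This is only a sketch under stated assumptions: the field names (title, abstract, start_date, investigators, institution) and the MM/DD/YYYY date format are hypothetical, and simple substring matching stands in for the database's full-text search.

```python
from collections import Counter


def matching_awards(awards, keywords):
    """Return awards whose title or abstract contains any keyword (case-insensitive)."""
    terms = [k.lower() for k in keywords]
    hits = []
    for award in awards:
        haystack = f"{award.get('title', '')} {award.get('abstract', '')}".lower()
        if any(term in haystack for term in terms):
            hits.append(award)
    return hits


def awards_per_year(awards):
    """Count awards by start year, assuming dates look like MM/DD/YYYY."""
    return Counter(a["start_date"][-4:] for a in awards if a.get("start_date"))


def most_active(awards, field, top_n=10):
    """Rank the values of `field` (a string or a list of strings) by award count."""
    counts = Counter()
    for award in awards:
        value = award.get(field)
        if isinstance(value, list):
            counts.update(value)
        elif value:
            counts[value] += 1
    return counts.most_common(top_n)
```

For example, `matching_awards(awards, ["atomic force microscop", "afm"])` followed by `awards_per_year(...)` gives the demand trend, and `most_active(..., "investigators")` or `most_active(..., "institution")` gives the ranked lists. Substring matching can produce false positives for short terms such as "afm", which is one reason a proper text index is preferable at scale.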

Conclusion

Web scraping of grant award information to create a searchable database of active STEM researchers was successfully done using the Scrapy framework and MongoDB. Test use cases clearly show the value of this system and its potential. The next steps for this project are to:

Develop an interactive web application
Use machine learning for keyword tagging of awards
Add additional data from other granting agencies and publications
Explore predictive modeling based on this data

About Author

Nathan Stevens

Nathan holds a Ph.D. in Nanotechnology and Materials Science from the City University of New York graduate school, and has worked on numerous software and scientific research projects over the last 10 years. Software projects have ranged from...

Rion Dooley December 12, 2016

There seems to be some interest both inside and outside of NSF in making this data public. Is this project available as OSS? Is there an opportunity to collaborate and move it forward?
