Contributed by Steven Ginsberg. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from April 11th to July 1st, 2016. This post is based on his third class project, Web Scraping, due in the 6th week of the program.
For my first web scraping project, I chose SALARY.COM. My plan was to gather salary data, descriptive information across categories and industries, and compare salary information for various jobs.
From the home page, salary.com has a link to a categories page. Under each category, there are up to 15 pages of jobs with summary descriptions. Each job has a detailed description page, which in turn links to a salary chart; an advertising page sometimes (but not always) pops up in between.
I made a number of attempts to drill down the 4-5 levels, gather the detailed information, and return to the job titles summary page to move on to the next job detail. Having Selenium click a link and then hit “Back” produced inconsistent errors and results. By late Saturday I gave up trying to drill down so many layers. Instead, I downloaded the categories and was able to flip reliably through the Job Summary pages, gathering job titles and summary descriptions across many pages and returning to the starting point consistently. Finally!
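The page-flipping loop can be sketched independently of the browser. This is only a rough illustration, not the code used for the project: `collect_job_summaries`, `fetch_page`, and `fake_fetch_page` are hypothetical names, the sample rows are made up, and the only detail taken from the site is the limit of up to 15 summary pages per category. In a real run, `fetch_page` would wrap whatever Selenium calls load one summary page and read its rows.

```python
from typing import Callable, Dict, List, Tuple

# One scraped row: (job title, summary description).
Job = Tuple[str, str]


def collect_job_summaries(
    fetch_page: Callable[[str, int], List[Job]],
    categories: List[str],
    max_pages: int = 15,
) -> Dict[str, List[Job]]:
    """Gather (title, summary) rows for each category, page by page.

    fetch_page(category, page_number) is a hypothetical callable that loads
    one summary page and returns its rows; an empty list means no more pages.
    """
    results: Dict[str, List[Job]] = {}
    for category in categories:
        jobs: List[Job] = []
        for page_number in range(1, max_pages + 1):
            rows = fetch_page(category, page_number)
            if not rows:
                break  # ran past the last page for this category
            jobs.extend(rows)
        results[category] = jobs
    return results


# Stand-in for a real browser-backed fetcher, using made-up data.
def fake_fetch_page(category: str, page_number: int) -> List[Job]:
    fake_site = {
        ("Accounting", 1): [("Accountant I", "Prepares financial records.")],
        ("Accounting", 2): [("Auditor", "Reviews financial statements.")],
    }
    return fake_site.get((category, page_number), [])


summaries = collect_job_summaries(fake_fetch_page, ["Accounting", "Legal"])
```

Keeping the loop separate from the page-loading code means each page is requested by category and page number, so nothing depends on clicking a link and navigating back.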
None of my attempts to drill deeper and get back to the starting point succeeded. In the end, I pulled 68 categories from the category page and gathered summary information for 2,190 jobs. Since I only had text information, word clouds are the end result.
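A word cloud is driven by word frequencies, so the counting step might look something like the sketch below. It is an assumption about how the text could be prepared, not the code behind the clouds in this post: `word_frequencies` and `STOP_WORDS` are hypothetical names, the stop-word list is deliberately tiny, and the sample descriptions are made up.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# A small, hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "for", "in", "with"}


def word_frequencies(texts: Iterable[str], top_n: int = 50) -> List[Tuple[str, int]]:
    """Count words across all texts and return the top_n most common."""
    counts: Counter = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Made-up summary descriptions standing in for the scraped text.
sample = [
    "Prepares financial records and reports.",
    "Reviews financial statements for the audit.",
]
top_words = word_frequencies(sample, top_n=3)
```

The resulting (word, count) pairs can then be handed to whichever word cloud tool is used to size each word.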