Quantum Computing - A Web Scraping Project

Posted on Oct 2, 2018

Background

Quantum computers are part of an emerging class of technology that exploits the probabilistic nature of quantum mechanical systems. Richard Feynman and David Deutsch developed the theoretical foundations of quantum computing decades ago. These physicists postulated that it would be possible to develop algorithms for quantum computers.

A quantum system can exist in a combination of two distinct states simultaneously. This property, known as superposition, is the fundamental feature of quantum computers, which harness it in the form of quantum bits ("qubits"). The fundamental unit of information in a classical computer, the bit, exists in a binary state of either 0 or 1. A qubit, its quantum counterpart, can exist in a superposition of the states 0 and 1 at the same time. This duality gives quantum computers remarkable advantages: the number of states a group of qubits can represent grows exponentially with the number of qubits strung together. Entanglement, another quantum feature, could allow scientists and quantum programmers to work with qubits whose states are correlated even though they are not physically connected to each other. Such command of the quantum realm would mean programmers could design algorithms that work around the sequential logic of classical computation.

State of Quantum Technology

Until roughly ten years ago, this technology existed mostly in the imagination of science fiction writers. Isolating particles from outside noise and interference still proves to be a daunting task. If these particles (qubits, in this case) are not properly insulated, their superposition collapses from a probabilistic wave function into a determinate state, rendering quantum computation impossible.

D-Wave Systems, a leader in the field, announced headway in isolating qubits in 2015. Since then, many developments have moved the industry forward, and governments, Big Tech, and financial institutions have signed up for the quantum revolution.

The major players in the field fall into two camps: companies building the hardware and companies developing the software applications.

Companies developing the software are working on quantum cloud platforms, optimization, machine learning algorithms, and quantum chemistry problems.


Project objective

My objective was to gain insight into social media awareness of the quantum computing industry and to identify practices that raise public awareness. To do so, I scraped tweets posted between 2012 and 2017. I sought to uncover how the content of tweets about this industry reflected the public's expectations, which media figures were communicating about the topic, and any relationships that might exist between the different Twitter engagement features (Like, Reply, Retweet).


Project timeline


Web Scraping

Selenium and Beautiful Soup

I needed my web scraper to scroll through pages of Twitter News Search results for the keyword "Quantum Computing", all of which load under the same URL. I started with Beautiful Soup (a Python HTML parser) but transitioned to Selenium (a browser-automation tool with a Python API) and its WebDriver for help scrolling through results that retain the same URL. I wrote a "while" loop that scrolls to the end of the page and keeps going, up to a defined limit. My Python script then runs a "for" loop that opens an empty dictionary and tries to populate it with the relevant fields (Tweet Content, Author, Time, Likes, etc.) based on each element's XPath. Running the script in a Linux terminal populates this dictionary and outputs a ".csv" file.
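The scroll-then-extract approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the original script: the function names, the XPath arguments, and the max_scrolls and pause parameters are all hypothetical. The functions accept any Selenium WebDriver instance; the strategy string "xpath" is the value of Selenium's By.XPATH constant.

```python
import csv
import time


def scroll_to_bottom(driver, max_scrolls=50, pause=2.0):
    """Scroll until the page height stops growing or max_scrolls is reached."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more results
        new_height = driver.execute_script("return document.body.scrollHeight")
        scrolls += 1
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return scrolls


def extract_rows(driver, tweet_xpath, field_xpaths):
    """Build one dictionary per tweet element; missing fields become ''."""
    rows = []
    for tweet in driver.find_elements("xpath", tweet_xpath):
        row = {}
        for field, xpath in field_xpaths.items():
            try:
                row[field] = tweet.find_element("xpath", xpath).text
            except Exception:  # with Selenium this is NoSuchElementException
                row[field] = ""
        rows.append(row)
    return rows


def write_csv(rows, fieldnames, path):
    """Write the list of dictionaries to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# With a real browser (requires Selenium and a browser driver) one would do:
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get(search_url)  # search_url is a placeholder
#   scroll_to_bottom(driver)
#   rows = extract_rows(driver, tweet_xpath, field_xpaths)
```

Capping the loop with max_scrolls keeps the scraper from running indefinitely on a feed that never stops loading, and the try/except means a tweet with a missing field still yields a row.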


Data

After producing the .csv file with the web scraper, I processed the data in a Python-based Jupyter notebook.

Data tidying/Data Manipulation

I began making my data manageable by filling NAs with empty strings or '0', depending on whether the variable is qualitative or quantitative. I also had to convert the columns detailing the Twitter features (Like, Reply, Retweet) into numeric type.
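A minimal sketch of this step, assuming the data was loaded into a pandas DataFrame. The column names are hypothetical, since the actual CSV headers are not shown, and treating unparseable counts as 0 is also an assumption.

```python
import pandas as pd

# Hypothetical column names, for illustration only.
TEXT_COLS = ["content", "author"]
COUNT_COLS = ["likes", "replies", "retweets"]


def tidy(df):
    """Fill missing text with '' and turn count columns into integers."""
    df = df.copy()
    df[TEXT_COLS] = df[TEXT_COLS].fillna("")
    for col in COUNT_COLS:
        # Assumption: anything that cannot be parsed as a number counts as 0.
        df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)
    return df


raw = pd.DataFrame({
    "content": ["qubits!", None],
    "author": [None, "Forbes"],
    "likes": ["12", None],
    "replies": ["3", "x"],
    "retweets": [None, "7"],
})
clean = tidy(raw)
```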

I dropped the year 2018 because I did not have a full year's worth of tweets for it and did not want to compare incomplete data. I also had to combine the day/month/year columns into one string and then convert that string to a datetime object in order to have the proper type for time series analysis.
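One way to do this in pandas, shown as a sketch: the column names "day", "month", and "year" are assumed for illustration, and the parts are zero-padded so the combined string matches a fixed date format.

```python
import pandas as pd


def add_date_column(df):
    """Combine day/month/year into a datetime column and drop 2018 rows."""
    df = df.copy()
    date_strings = (
        df["year"].astype(str)
        + "-" + df["month"].astype(str).str.zfill(2)
        + "-" + df["day"].astype(str).str.zfill(2)
    )
    df["date"] = pd.to_datetime(date_strings, format="%Y-%m-%d")
    # 2018 is incomplete, so it is excluded from year-over-year comparisons.
    return df[df["date"].dt.year != 2018]


demo = pd.DataFrame({"day": [5, 17, 2], "month": [3, 11, 1], "year": [2016, 2017, 2018]})
dated = add_date_column(demo)
```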

As part of the preprocessing procedure, I converted all the strings to lowercase, filtered out unnecessary punctuation, and handled a long list of minutiae to make processing of the data seamless. I want to note that this data tidying was an iterative process: I would often find issues in the code as I proceeded with the data visualization part of the project and then come back to edit this code accordingly. After completing this project, I can better anticipate the obstacles lurking in an uncleaned dataset and their solutions.
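The lowercasing and punctuation filtering could look something like the sketch below, which uses only the standard library. The function name is hypothetical, and which punctuation counts as "unnecessary" is an assumption: this version strips every ASCII punctuation character, including "#" and "@".

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


cleaned = normalize("Quantum Computing:  A BREAKTHROUGH, explained!")
```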

A prime example was the word count: superfluous words like "the" and "I" cluttered the results until I removed these stop words from the content of the tweets. To do so, I imported a set of stop words from the nltk.corpus package. I then extended the list of stop words to include other unnecessary strings that were appearing in my word count (e.g. "could" and "co...").
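A sketch of the stop-word filtering, under assumptions: the function name is hypothetical, and a tiny stand-in set replaces the NLTK list so the example runs without downloading the corpus. The comment shows how the NLTK list would normally be loaded.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


# In a notebook the base list would come from NLTK, for example:
#   from nltk.corpus import stopwords
#   base = set(stopwords.words("english"))  # needs nltk.download("stopwords") once
base = {"the", "i", "a", "of"}  # tiny stand-in for illustration
extra = {"could"}  # project-specific additions
stop_words = base | extra

kept = remove_stop_words("the race could change the industry".split(), stop_words)
```

Using a set rather than a list keeps each membership check fast, which matters when filtering every word of every tweet.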


Word Count - Word Cloud & List

Natural Language Processing

After tidying my data, I generated a wordcloud object from the tweet content and plotted the image below. I ran into some issues while constructing a word count list, which I resolved by creating an empty dictionary followed by a "for" loop that increments the count for each word every time that word appears in the content. Those counts let me create a sorted list from the dictionary. I used a similar process when aggregating the author count.
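The dictionary-based count can be sketched like this. The function name and the sample tweets are made up for illustration, and the input is assumed to be already lowercased and stripped of stop words.

```python
def word_counts(texts):
    """Count words across all texts and return (word, count) pairs, most frequent first."""
    counts = {}
    for text in texts:
        for word in text.split():
            counts[word] = counts.get(word, 0) + 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


ranked = word_counts(["quantum computing breakthrough", "quantum research"])
```

The same function works for the author count if each "text" is a single author name with no spaces; the standard library's collections.Counter offers an equivalent shortcut.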

My word count analysis proved insightful. For one, names like Justin Trudeau, Canada, and Toronto popped up with great frequency, which reflects the Canadian government's concerted effort to be at the forefront of this industry. While many U.S. companies appear in the list, the U.S. government and political figures show up less frequently than their Canadian counterparts. This underscores how much room there is for the U.S. military and scientific community to establish and extend their public presence and engagement in this space.

Additionally, words like "research", "explain", "breakthrough", and "launches" popped up often, reflecting the early-stage nature of the industry. Not much of this industry has been commercialized yet; quantum computing technology still exists primarily in physics labs and R&D departments. Companies often cited included not only household names like Microsoft, Google, and IBM, but also newer quantum hardware companies such as D-Wave Systems and Rigetti.

The images also reveal the public's hesitation about this new technology. References to artificial intelligence and an A.I. arms "race" appear in a few instances.

Author Count

According to my analysis, relevant media outlets have begun to pick up on this industry. Engadget (a tech blog), TechCrunch, Forbes, and the Financial Times are among the most prominent publishers in this space. With the help of the plotly Python graphing library, we can examine this visually below.

Tweet Count over time

Finally, I wanted to get a high-level overview of how awareness of this industry has changed over time. As you can see below, the number of tweets on this subject has exploded in recent years.
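The yearly totals behind such a chart could be computed along these lines. This is a sketch assuming pandas; the function name and the sample dates are made up for illustration and are not the scraped data.

```python
import pandas as pd


def tweets_per_year(dates):
    """Count tweets per calendar year, ordered by year."""
    parsed = pd.to_datetime(pd.Series(dates))
    return parsed.dt.year.value_counts().sort_index()


yearly = tweets_per_year(["2015-06-01", "2016-01-15", "2016-09-30"])
```

The resulting Series, indexed by year, can be passed directly to a plotting library to draw the time series.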

About Author

Michael

Michael is currently a Data Science Fellow at the NYC Data Science Academy. He graduated from the Wharton School of the University of Pennsylvania with a B.S. in Economics before securing the position of Investment Banking Analyst at...