Quantum Computing - A Web Scraping Project
Quantum computers are part of an emerging class of technology that exploits the probabilistic nature of quantum mechanical systems. Renowned thinkers Richard Feynman and David Deutsch developed the theoretical foundations of quantum computers decades ago. These physicists postulated that it would be possible to develop algorithms for quantum computers.
A quantum system can exist in a combination of two distinct states simultaneously. This property, known as superposition, serves as the fundamental feature of quantum computers. Quantum computers harness this characteristic in the form of quantum bits ("qubits"). The classical computer's fundamental unit of information, the bit, exists in a binary state of 0 or 1. A qubit, its quantum counterpart, can exist in a superposition of the states 0 and 1 at the same time. This duality endows quantum computers with incredible advantages. Stringing together qubits means exponential growth in the number of states they can represent: n qubits span 2^n basis states. Entanglement, another quantum feature, could allow scientists and quantum programmers to influence qubits that are not even physically connected to each other. Such command of the quantum realm would mean programmers could design algorithms that work around the sequential logic processes of classical computation.
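The exponential growth in states can be illustrated with a few lines of Python. This is only a toy sketch of the bookkeeping, not a quantum simulation: the function name equal_superposition is made up for this example, and it simply lists the amplitudes of a register in which every basis state is equally likely.

```python
import math


def equal_superposition(n_qubits):
    """Return the amplitudes of an equal superposition over all basis states."""
    n_states = 2 ** n_qubits  # n qubits span 2**n basis states
    amplitude = 1 / math.sqrt(n_states)
    return [amplitude] * n_states


for n in (1, 2, 3, 10):
    state = equal_superposition(n)
    # Squared amplitudes are measurement probabilities, so they should sum to 1.
    total_probability = sum(a * a for a in state)
    print(n, "qubits ->", len(state), "amplitudes, total probability",
          round(total_probability, 6))
```

Each added qubit doubles the length of the list, which is why describing even a few dozen qubits classically becomes expensive.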
State of Quantum Technology
This technology existed mostly in the imagination of science fiction writers until roughly ten years ago. Isolating particles from outside noise and interference still proves to be a daunting task. If these particles (qubits in this case) are not properly insulated, their superposition collapses from a probabilistic wave function into a determinate state, rendering quantum computation impossible.
A leader in the field, D-Wave Systems, announced headway in isolating qubits in 2015. Since then, many developments have moved the industry forward. Governments, Big Tech, and financial institutions have since signed up for the quantum revolution.
The major players in the field fall into two camps: companies building the hardware and companies developing the software applications.
Companies developing the software are working on quantum cloud platforms, optimization, machine learning algorithms, and quantum chemistry problems.
My project objective was to find insights about the state of social media awareness of the quantum computing industry and to identify practices that raise public awareness. To do so, I scraped tweets posted between 2012 and 2017. I sought to uncover how the content of tweets about this industry reflected the public's expectations, which media figures were communicating about the topic, and any relationships that might exist between the different Twitter feature functions (Like, Reply, Retweet).
Selenium and Beautiful Soup
I needed my web scraper to scroll through pages of Twitter News Search results for the keyword "Quantum Computing", all of which load under the same URL. I started with Beautiful Soup (a Python HTML parser) but moved to Selenium (a browser automation tool with a Python API) and its WebDriver for help scrolling through results that kept the same URL. I wrote a "while" loop that scrolls to the end of the page and keeps going, up to a defined limit. The script then runs a "for" loop that creates an empty dictionary and tries to fill it with the relevant fields (Tweet Content, Author, Time, Likes, etc.), located by each element's XPath. Running the scraper's Python script in a Linux terminal populates this dictionary and outputs a ".csv" file.
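The scroll-then-extract approach could be sketched roughly as follows. This is an illustrative reconstruction, not the original script: the function names, the field names, and the XPath strings passed in are all assumptions, and the functions take an already-created Selenium driver (for example from webdriver.Chrome()) so that the logic is easy to read on its own.

```python
import csv
import time


def scroll_to_end(driver, max_scrolls=50, pause=1.0):
    """Scroll down until the page stops growing or max_scrolls is reached."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give newly loaded tweets time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        scrolls += 1
        if new_height == last_height:  # nothing new loaded, so stop early
            break
        last_height = new_height
    return scrolls


def extract_row(element, field_xpaths):
    """Build one dictionary per tweet element; missing fields become ''."""
    row = {}
    for name, xpath in field_xpaths.items():
        try:
            # "xpath" is the locator strategy string behind selenium's By.XPATH
            row[name] = element.find_element("xpath", xpath).text
        except Exception:
            row[name] = ""
    return row


def write_rows(rows, path, fieldnames):
    """Write the collected dictionaries to a .csv file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Catching the lookup failure inside extract_row means one tweet with a missing field does not stop the whole run; the gap simply shows up later as an empty value to be handled during tidying.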
After producing the .csv file with the web scraper, I processed the data in a Python-based Jupyter notebook.
Data tidying/Data Manipulation
I began making my data manageable by filling NAs with empty strings or 0, depending on whether the variable is qualitative or quantitative. I also had to convert the columns detailing the Twitter feature functions (Like, Reply, Retweet) into numeric vectors.
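A minimal sketch of this tidying step, using plain dictionaries such as those produced by csv.DictReader rather than whatever data structure the notebook actually used. The column names "likes", "replies", and "retweets" are assumptions made for illustration.

```python
NUMERIC_FIELDS = ("likes", "replies", "retweets")  # hypothetical column names


def tidy_row(row):
    """Fill missing values and convert the count columns to integers."""
    clean = {}
    for key, value in row.items():
        value = (value or "").strip()  # treat None and blanks as missing
        if key in NUMERIC_FIELDS:
            digits = value.replace(",", "")  # allow counts written as "1,204"
            clean[key] = int(digits) if digits.isdigit() else 0
        else:
            clean[key] = value
    return clean


raw = {"content": None, "likes": "12", "replies": "", "retweets": "1,204"}
print(tidy_row(raw))
```

Abbreviated counts such as "1.2K" are not handled here and would fall back to 0, so a real dataset may need an extra conversion rule.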
I dropped the year 2018 because I did not have a full year's worth of tweets for it and I did not want to compare incomplete data. I also had to combine the day/month/year columns into one string and then convert that string to a datetime object in order to have the proper object type for time series analysis.
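The date handling could look something like the sketch below. It assumes, purely for illustration, that each row carries "day", "month", and "year" as numeric strings; the actual column names and date format in the scraped data may differ.

```python
from datetime import datetime


def parse_tweet_date(day, month, year):
    """Join the three date parts into one string and parse it."""
    return datetime.strptime(f"{day} {month} {year}", "%d %m %Y")


def add_dates(rows, drop_year=2018):
    """Attach a datetime to each row and skip rows from the incomplete year."""
    dated = []
    for row in rows:
        when = parse_tweet_date(row["day"], row["month"], row["year"])
        if when.year == drop_year:
            continue
        dated.append({**row, "date": when})
    return dated


rows = [
    {"day": "5", "month": "3", "year": "2017"},
    {"day": "1", "month": "1", "year": "2018"},
]
print(add_dates(rows))
```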
As part of the preprocessing procedure, I converted all the strings to lowercase, filtered out unnecessary punctuation, and worked through a long list of minutiae that made for seamless processing of the data. I want to note that this data tidying was an iterative process: I would often find issues in the code as I proceeded with the data visualization part of the project and then come back to edit this code accordingly. After completing this project, I can better anticipate the obstacles lurking in an uncleaned dataset and the solutions to them.
A prime example was removing stop words (superfluous words like "the" and "I") from the tweet content so that they would not dominate the word count. To do this, I imported a set of stop words from the nltk.corpus package and extended it with other unnecessary strings that kept appearing in my word count (e.g. "could" and "co...").
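The text cleaning described above can be sketched in a few lines. To keep the example self-contained, it uses a small hand-written stop-word set as a stand-in; in the project the set came from nltk.corpus (NLTK exposes it as stopwords.words("english")) plus project-specific additions. The function name clean_tokens is made up for this sketch.

```python
import string

# Stand-in for the NLTK English stop-word list plus custom additions.
STOP_WORDS = {"the", "i", "a", "an", "and", "of", "to", "in", "is", "could"}


def clean_tokens(text, stop_words=STOP_WORDS):
    """Lowercase the text, strip punctuation, and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in stop_words]


print(clean_tokens("The qubit could, in theory, be Entangled!"))
```

Lowercasing before the stop-word check matters: the set holds lowercase entries, so "The" would slip through if the order were reversed.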
Word Count - Word Cloud & List
Natural Language Processing
After tidying my data, I generated a wordcloud object from the tweet content and plotted the image below. I ran into some issues while constructing a word count list, which I resolved by creating an empty dictionary and a "for" loop that increments the count stored under each word every time that word appears in the content. Those counts let me build a sorted list from the dictionary. I used a similar process when aggregating the author count.
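The dictionary-based word count can be sketched as follows. The function names are illustrative and the input is assumed to be a list of already-cleaned tweet strings.

```python
def count_words(tweets):
    """Count how often each word appears across all tweets."""
    counts = {}
    for tweet in tweets:
        for word in tweet.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def sorted_counts(counts):
    """Return (word, count) pairs, most frequent first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


tweets = ["quantum computing breakthrough", "quantum research launches"]
for word, count in sorted_counts(count_words(tweets)):
    print(word, count)
```

The same two functions work for the author count if each "tweet" string is replaced by a single author name.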
My word count analysis proved to be insightful. For one, key names like Justin Trudeau, Canada, and Toronto popped up with great frequency. This reflects the Canadian government's concerted effort to be at the forefront of this industry. While many U.S. companies appear in the list, the U.S. government and political figures show up with less frequency than their Canadian counterparts. This underscores the amount of room there is for the U.S. military and scientific community to establish and extend their public presence and engagement in this space.
Additionally, many words like research, explain, breakthrough, and launches popped up, reflecting the early-stage nature of the industry. Not much of this industry has yet been commercialized; quantum computing technology still exists primarily in physics labs and R&D departments. Companies often cited included not only household names like Microsoft, Google, and IBM, but also the new quantum hardware companies D-Wave Systems and Rigetti.
Additionally, these images reveal the public's hesitation about this new technology. References to artificial intelligence and an A.I. arms "race" appear in a few instances.
According to my analysis, relevant media outlets have begun to pick up on this industry. Engadget (a tech blog), TechCrunch, Forbes, and the Financial Times are among the most prominent publishers in this space. With the help of the Plotly Python graphing library, we can examine this visually below.
Tweet Count over time
Finally, I wanted to get a high-level overview of how awareness of this industry has changed over time. As you can see below, the number of tweets on this subject has exploded in recent years.
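The counts behind a chart like this can be produced with a simple aggregation by year. The sketch below assumes each tweet already has a datetime attached; the function name tweets_per_year and the sample dates are invented for illustration and are not taken from the scraped data.

```python
from collections import Counter
from datetime import datetime


def tweets_per_year(dates):
    """Return (year, tweet_count) pairs in chronological order."""
    counts = Counter(d.year for d in dates)
    return sorted(counts.items())


# Made-up sample dates, only to show the shape of the output.
sample = [datetime(2016, 1, 1), datetime(2017, 5, 2), datetime(2017, 6, 3)]
for year, count in tweets_per_year(sample):
    print(year, count)
```

The resulting pairs can be passed straight to a plotting library as the x and y values of a line or bar chart.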