Computer Crime Intelligence Using Darkweb Data
He took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from September 26 to December 23, 2016. This post is based on his third class project (due at the 6th week of the program).
The deep web is a foreign, mysterious part of the internet where the military, whistleblowers, police, journalists and privacy seekers can work securely and privately. It is completely legal to access these websites, but the privacy they afford enables a criminal underground to thrive on sites known as the "Darknet," where drugs, weapons, child pornography, forged documents and the services of hitmen can be bought. While these topics are best left to the security experts who work to prevent people from being hurt, I feel that drug sales have a direct impact on our economy, including stock prices, medical coverage, digital currency prices and legislation.
Insights gained from analyzing underground drug sales could matter to many groups. What might be the financial benefit of legalizing certain drugs? Are there signs that a market is about to crash? Could pharmaceutical companies raise prices on drugs that may find their way onto these markets because of increased demand? Some drugs sold there are legal; a new company might see enough demand to introduce a generic alternative. Bitcoin investors may be able to better judge the right times to buy or sell. Could interest in these drugs help identify which ones to prioritize for legalization in the future?
I used a dataset from the Gwern.net blog. The author, a freelance writer and researcher, scraped "~89 markets, >37 forums and ~5 other sites, representing <4,438 mirrors of >43,596,420 files in ~49.4GB of 163 compressed files, unpacking to >1548GB." The portion of the data I used consists of near-daily scrapes of these markets in CSV format, from June 9, 2014 to April 17, 2016, collected by a service named "Grams" that indexes darknet markets by scraping them or using their APIs.
The data comprises 5.5 billion lines and takes up 20GB of space. It includes an item description, an extended description, the price, the country the item is shipped from, and the deep web address. The item description contains the product (guns, coupons, drugs, etc.), the quantity and the dosage, written in formats that varied greatly in style.
I also grabbed Bitcoin exchange rates and exchange volumes from Quandl.
The scope of this project was to build and deploy a website for data visualization using RStudio and Shiny Dashboard within a two-week period. A few of the tools used in the project are:
- Amazon Web Services EC2 Server for crunching data and hosting the Shiny application
- Shiny Dashboard for the interface
- ggplot2 for generating plots of the data
- dplyr for finding relations within pieces of data
- Regular expressions for extracting drug names, quantities and dosages
In order to process the data, I first stripped out useless columns to save space and improve overall computation time by not passing massive arrays through memory. I extracted each drug name by matching descriptions against a database of drug names I created, translated it to a common term and stored it in an adjacent column. To extract values from item descriptions, I started by finding where the dosage was listed, stored the corresponding value and unit, and removed it so it would not affect the quantity I needed to grab next. I then stored the quantity and its unit of measure (vial, pills, tablets, blotter, etc.). Next, I converted these measures into grams and divided each listing's price by its total quantity in grams to get a price per gram in Bitcoin. I multiplied that Bitcoin price by the day's Bitcoin exchange rate to get a price per gram in USD. Finally, I translated the country names into common names and used "Unknown" for sellers that purposefully posted ambiguous countries.
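The project itself was done in R, but the dosage-then-quantity extraction and the USD conversion described above can be sketched in Python. This is a minimal illustration, not the author's actual code; the regex patterns, field names and sample listing are all hypothetical.

```python
import re

# Illustrative patterns only: real listings used far messier, varied wording.
DOSE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|mcg|ug|g)\b", re.IGNORECASE)
QTY_RE = re.compile(r"(\d+)\s*(pills|tablets|vials|blotters?)\b", re.IGNORECASE)

def parse_listing(description, price_btc, btc_usd_rate):
    """Extract dosage first, remove it, then extract quantity, then convert price."""
    dose = DOSE_RE.search(description)
    # Strip the dosage text so its number cannot be mistaken for the quantity.
    remainder = DOSE_RE.sub("", description)
    qty = QTY_RE.search(remainder)
    return {
        "dose_value": float(dose.group(1)) if dose else None,
        "dose_unit": dose.group(2).lower() if dose else None,
        "quantity": int(qty.group(1)) if qty else None,
        "unit": qty.group(2).lower() if qty else None,
        # Convert the Bitcoin price using that day's exchange rate.
        "price_usd": price_btc * btc_usd_rate,
    }

listing = parse_listing("10 pills 50mg oxycodone", 0.05, 450.0)
```

Removing the dosage substring before searching for the quantity is the key ordering step: "10 pills 50mg" contains two numbers, and only deleting the matched dosage makes the remaining number unambiguous.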
Personal Learning Outcomes
Before this project, regular expressions were a mysterious black box to me, and I would only use them if I could find the proper answer on StackOverflow. This project turned me into a wizard of text extraction, and I had to get creative with how I found values. One regular expression I created shaved off dozens of lines of code and hours of processing compared with other search methods. I didn't expect to walk away knowing so many alternative country names and so much drug lingo, which let me combine related values. The quality of the final data should make it simple for users to find the exact parameters they want. Implementing granular choices, including post dates, individual markets, dates accessed and more, gives users powerful control over their graphs.
Working through this project also taught me how to set up and link Amazon Web Services EC2 servers and S3 data stores. I can now quickly set up an RStudio EC2 instance for serving my websites and processing data. I appreciate having RStudio available in the cloud because it will enable me to quickly deploy future websites.
I created a working prototype that presents a dashboard of interesting facts randomly generated from the data set (the average drug price on a given day, a country's most common drug, etc.). I also created a tool that lets a user filter by a range of post dates, scrape dates, drug, market and units, and adjust drug quantity outputs. I plot the output of these filters in graphs, including line-graph comparisons of Bitcoin prices and drug prices, and bar graphs showing the popularity of different categories and the density of their occurrences.
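The filtering behind that tool amounts to applying each user-selected criterion to the cleaned records. A minimal sketch of the idea, in Python with hypothetical record fields and sample data (the actual project used dplyr in R):

```python
from datetime import date

# Hypothetical cleaned records; field names and values are illustrative.
records = [
    {"post_date": date(2015, 3, 1), "market": "Agora", "drug": "cannabis", "usd_per_gram": 9.5},
    {"post_date": date(2015, 6, 15), "market": "Evolution", "drug": "mdma", "usd_per_gram": 32.0},
    {"post_date": date(2016, 1, 10), "market": "Agora", "drug": "cannabis", "usd_per_gram": 8.75},
]

def filter_listings(rows, start, end, markets=None, drugs=None):
    """Keep rows inside the post-date range that match the selected markets/drugs."""
    return [
        r for r in rows
        if start <= r["post_date"] <= end
        and (markets is None or r["market"] in markets)
        and (drugs is None or r["drug"] in drugs)
    ]

# Example: 2015 listings on the Agora market only.
subset = filter_listings(records, date(2015, 1, 1), date(2015, 12, 31), markets={"Agora"})
```

Each filter is optional (`None` means "no restriction"), which mirrors how a dashboard leaves a dimension unfiltered until the user picks values for it; the surviving rows then feed the line and bar plots.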
Future Work
- Use machine learning to predict Bitcoin prices, stock prices and future drug prices
- Allow graph customization