Computer Crime Intelligence Using Darkweb Data

Nick
Posted on Nov 6, 2016

He attended the NYC Data Science Academy 12-week full-time Data Science Bootcamp from Sept 26 to Dec 23, 2016. This post is based on his third class project (due in the 6th week of the program).

Market Impact

The deep web is an unfamiliar, mysterious part of the internet where the military, whistleblowers, police, journalists and privacy seekers can work securely and privately. Accessing these websites is completely legal, but the same privacy enables a criminal underground to thrive on websites known as the "Darknet", where drugs, weapons, child pornography, forged documents and the services of hitmen can be bought. While most of these topics are best left to security experts who work to prevent people from being hurt, I feel that drug sales have a direct impact on our economy, including stock prices, medical coverage, digital currency prices and legislation.

Insights gained from analyzing underground drug sales can have a major impact on many groups. What might be the financial benefit of legalizing certain drugs? Are there signs that a market is about to crash? Could pharmaceutical companies raise prices on drugs that find their way onto these markets because of increased demand? Some drugs sold there are legal; perhaps a new company could see enough demand to introduce a generic alternative. Bitcoin investors may be able to better judge the right times to buy or sell. Could interest in these drugs help identify which ones to prioritize for legalization in the future?

Data

I used a dataset from the Gwern.net blog. The author, a freelance writer and researcher, scraped "~89 markets, >37 forums and ~5 other sites, representing <4,438 mirrors of >43,596,420 files in ~49.4GB of 163 compressed files, unpacking to >1548GB." The portion of the data I used consists of near-daily scrapes of these markets in CSV format, from June 9, 2014 to April 17, 2016, collected from a service named "Grams" that indexes darknet markets by scraping them or using their APIs.

This data comprises 5.5 billion lines and takes up 20GB of space. Each record includes an item description, an extended description, the price, the country the item is shipped from, and the deep web address. The item description includes the product (guns, coupons, drugs etc.), the quantity and the dosage, written in a style that varied greatly from listing to listing.

I also grabbed Bitcoin exchange rates and exchange volumes from Quandl.

Gwern Blog

Gwern Data Archives

Quandl Bitcoin Data

Tools

The scope of this project was to design and deploy a data visualization website using RStudio and Shiny Dashboard within a two-week period. A few of the tools I used are:

  • Amazon Web Services EC2 Server for crunching data and hosting the Shiny application
  • Shiny Dashboard for the interface
  • ggplot2 for generating plots of the data
  • dplyr for finding relations within pieces of data
  • Regular expressions for extracting drug names, quantities and dosages

Processing

To make the data manageable, I first stripped out the columns I did not need, which saved space and reduced computation time by avoiding passing massive arrays through memory. I extracted the drug name by matching each listing against a database of drug names I created, translated it to the common term and stored it in an adjacent column. To extract values from the item descriptions, I started by finding where the dosage was listed, stored the corresponding value and unit, and removed it from the text so that it would not affect the quantity I still needed to grab from the same description. Lastly, I stored the quantity and its unit of measure (vial, pills, tablets, blotter etc.). I then converted these measures into common units and divided each price by the total quantity to get a price per gram in Bitcoin. I multiplied that figure by the day's Bitcoin exchange rate to get a price per gram in USD. Finally, I translated the country names into common names and used "Unknown" for sellers who purposefully posted ambiguous countries.
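The extraction order described above (dosage first, then quantity, then price per gram) can be illustrated with a minimal sketch. The original project was written in R and its code is not reproduced here; this Python version is only an illustration. The regular expressions, the unit table and the function names `parse_listing` and `price_per_gram_usd` are all assumptions, and real listings are far messier than these patterns handle.

```python
import re

# Hypothetical conversion table (grams per weight unit); not from the original project.
UNIT_TO_GRAMS = {"ug": 1e-6, "mcg": 1e-6, "mg": 1e-3, "g": 1.0, "kg": 1000.0, "oz": 28.3495}

# Dosage: a number followed by a weight unit, e.g. "3.5g" or "200 mg".
WEIGHT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(ug|mcg|mg|kg|g|oz)\b", re.IGNORECASE)

# Quantity: a number followed by "x" or by a countable unit, e.g. "10 x" or "5 pills".
COUNT_RE = re.compile(
    r"(\d+)\s*(?:x\b|(?:pills?|tablets?|tabs?|vials?|blotters?)\b)", re.IGNORECASE
)


def parse_listing(title):
    """Return (dose_in_grams, count), or None when no dosage is found."""
    weight = WEIGHT_RE.search(title)
    if weight is None:
        return None
    dose_grams = float(weight.group(1)) * UNIT_TO_GRAMS[weight.group(2).lower()]
    # Remove the dosage so its digits cannot be mistaken for the quantity.
    remainder = title[:weight.start()] + " " + title[weight.end():]
    count_match = COUNT_RE.search(remainder)
    count = int(count_match.group(1)) if count_match else 1  # assume one unit if unstated
    return dose_grams, count


def price_per_gram_usd(price_btc, title, usd_per_btc):
    """Convert a listing price in BTC to USD per gram, or None if it cannot be parsed."""
    parsed = parse_listing(title)
    if parsed is None:
        return None
    dose_grams, count = parsed
    total_grams = dose_grams * count
    if total_grams == 0:
        return None
    # Price divided by total weight, then converted with that day's exchange rate.
    return price_btc / total_grams * usd_per_btc


if __name__ == "__main__":
    # Made-up listing: ten 200 mg pills for 0.05 BTC at an assumed 400 USD/BTC.
    print(price_per_gram_usd(0.05, "10 x 200mg MDMA pills", 400.0))
```

Removing the dosage before searching for the quantity is the step that keeps "200" in "200mg" from being read as a pill count; listings without any recognizable weight are simply skipped in this sketch.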

Personal Learning Outcomes

Before, regular expressions were a mysterious black box, and I would only use them if I could find the proper answer on StackOverflow. This project has turned me into a wizard of text extraction, and I had to get creative with how I found values. One regular expression I created saved dozens of lines of code and hours of processing and searching with other methods. I didn't expect to walk away knowing so many alternative country names and so much drug lingo, which let me combine related values. The resulting quality of the final data should make it simple for users to find the exact parameters they want. Implementing granular choices, including date ranges, individual markets, dates accessed and more, gives users powerful control over their graphs.

Working through this project also taught me how to set up and link Amazon Web Services EC2 servers and S3 data stores. I can now quickly set up an RStudio EC2 instance for serving my websites and processing data. I appreciate having RStudio available in the cloud because it will let me deploy future websites quickly.

Conclusions

I created a working prototype that presents a dashboard of interesting facts randomly generated from the data sets (average drug price on a given day, a country's most common drug etc.). I also created a tool that enables a user to filter by a range of post dates, scrape dates, drug, market and units, and to adjust drug quantity outputs. I plot the output of these filters in graphs, including line graphs comparing Bitcoin prices with drug prices, bar graphs of the popularity of different values, and density plots of their occurrences.
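The filtering behind such a tool can be sketched roughly as follows. The actual dashboard was built with Shiny and dplyr in R; this Python sketch is only an illustration, and the `Listing` fields and the functions `filter_listings` and `mean_price_by_day` are hypothetical names, not the project's.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass(frozen=True)
class Listing:
    # Hypothetical record layout; the real data has more columns.
    scraped_on: date
    market: str
    drug: str
    usd_per_gram: float


def filter_listings(listings, start=None, end=None, drug=None, market=None):
    """Keep listings matching every filter that is set; None means 'no filter'."""
    selected = []
    for item in listings:
        if start is not None and item.scraped_on < start:
            continue
        if end is not None and item.scraped_on > end:
            continue
        if drug is not None and item.drug != drug:
            continue
        if market is not None and item.market != market:
            continue
        selected.append(item)
    return selected


def mean_price_by_day(listings):
    """Average USD price per gram for each scrape date, in date order."""
    by_day = {}
    for item in listings:
        by_day.setdefault(item.scraped_on, []).append(item.usd_per_gram)
    return {day: mean(prices) for day, prices in sorted(by_day.items())}


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    rows = [
        Listing(date(2015, 1, 1), "MarketA", "mdma", 10.0),
        Listing(date(2015, 1, 1), "MarketB", "mdma", 14.0),
        Listing(date(2015, 2, 1), "MarketA", "mdma", 20.0),
    ]
    january = filter_listings(rows, start=date(2015, 1, 1), end=date(2015, 1, 31), drug="mdma")
    print(mean_price_by_day(january))
```

The per-day averages returned by `mean_price_by_day` are the kind of series that could be drawn as a line graph next to the Bitcoin exchange rate for the same dates.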

Darknet Market Analyzer Website

Github Code

Future Updates

  • Use machine learning to predict Bitcoin prices, stock prices, and future drug prices
  • Allow graph customization
