Using Machine Learning to aid Journalism at the New York Times
Daeil Kim is currently a data scientist at the Times and is finishing up his Ph.D. at Brown University, where he works on scalable inference algorithms for Bayesian nonparametric models. His work at the Times spans problems related to the company's business interests and audience development, as well as tools to aid journalism.
This talk will focus mostly on how machine learning can help with problems that crop up in journalism. We'll begin by talking about how popular supervised learning algorithms such as regularized logistic regression can assist a journalist in uncovering insights for a story on the recall of Takata airbags in cars. Afterwards, we'll look at using topic modeling to deal with large document dumps generated from FOIA (Freedom of Information Act) requests, and at Refinery, a simple web-based tool that eases the implementation of such tasks. Finally, if there is time, we will go over how topic models have been extended to help design an efficient recommendation engine for text-based content.
- 1. Aiding journalism with machine learning @ NYT. Dae Il Kim
- 2. Overview
  - The Story of Faulty Takata Airbags
    - Using Logistic Regression to predict suspicious comments
  - Dealing with large document corpuses: The FOIA problem
    - What are Topic Models?
      - What are topics and why are they useful?
      - Latent Dirichlet Allocation - A Graphical Model Perspective
      - Scalable Topic Models
    - Refinery: A Locally Deployable Web Platform for Large Document Analysis
      - The Technology Stack for Refinery
      - How does Refinery work?
  - Future Directions
- 3. The Story of Faulty Takata Airbags
- 4. Complaints data from NHTSA. The Data: The data contains 33,204 comments, 2,219 of which were painstakingly labeled as suspicious (by Hiroko Tabuchi). A Machine Learning Approach: Develop an algorithm that predicts whether a comment is suspicious or not. The algorithm learns from the dataset which features are representative of a suspicious comment.
- 5. The Machine Learning Approach. A sample comment, which we preprocess for the algorithm:
  - NEW TOYOTA CAMRY LE PURCHASED JANUARY 2004 - ON FEBRUARY 25TH KEY WOULD NOT TURN (TOOK 10 - 15 MINUTES TO START IT) - LATER WHILE PARKING, THE CAR THE STEERING LOCKED TURNING THE CAR TO THE RIGHT - THE CAR ACCELERATED AND SURGED DESPITE DEPRESSING THE BRAKE (SAME AS ODI PEO4021) - THOUGH THE CAR BROKE A METAL FLAG POLE, DAMAGED A RETAINING WALL, AND FELL SEVEN FEET INTO A MAJOR STREET, THE AIR BAGS DID NOT DEPLOY - CAR IS SEVERELY DAMAGED: WHEELS, TIRES, FRONT END, GAS TANK, FRONT AXLE - DRIVER HAS A SWOLLEN AND SORE KNEE ALONG WITH SIGNIFICANT SOFT TISSUE INJURIES INCLUDING BACK PAIN *SC *JB
  - TOKENIZE
    - (NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Break this into individual words
    - (NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Break this into bigrams (every pair of adjacent words)
  - FILTER
    - (NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Remove tokens that appear in fewer than 5 comments
    - (NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Remove bigrams that appear in fewer than 5 comments
  - The data now consists of 33,204 examples with 56,191 features. DATA IS READY FOR TRAINING!
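The tokenize and filter steps on slide 5 can be sketched in a few lines of Python. This is a minimal illustration, not the code used at the Times: the function names are hypothetical, splitting on whitespace is an assumption (the slide does not say how punctuation was handled), and the toy comments are made up.

```python
from collections import Counter

def tokenize(comment):
    """Split a comment into lowercase words plus adjacent-word bigrams."""
    words = comment.lower().split()  # assumption: plain whitespace splitting
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams

def build_vocabulary(comments, min_df=5):
    """Keep features that appear in at least `min_df` distinct comments."""
    doc_freq = Counter()
    for comment in comments:
        doc_freq.update(set(tokenize(comment)))  # count each feature once per comment
    return sorted(f for f, df in doc_freq.items() if df >= min_df)

def vectorize(comment, vocabulary):
    """Turn a comment into a row of feature counts, one column per vocabulary entry."""
    counts = Counter(tokenize(comment))
    return [counts[f] for f in vocabulary]

# Made-up toy corpus: one phrase repeated 5 times, another only twice.
comments = ["AIR BAGS DID NOT DEPLOY"] * 5 + ["KEY WOULD NOT TURN"] * 2
vocab = build_vocabulary(comments, min_df=5)
row = vectorize("AIR BAGS DID NOT DEPLOY", vocab)
```

Features seen in fewer than five comments (here "key", "would", "turn" and their bigrams) drop out of the vocabulary, which is how a corpus of 33,204 comments ends up with a bounded feature count.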
- 6. Cross-Validation. Each comment ID has a row of features (i.e. word frequencies, such as 0 0 0 3 1 0 2 0...) and a label (S = Suspicious, NS = Not Suspicious). Training set: take a subset of the data for training. Test set: after training, test on this dataset to obtain accuracy measures.
- 7. How did we do? Experiment Setup: We hold out 25% of both the suspicious and not suspicious comments for testing and train on the rest. We do this 5 times, creating random splits and retraining the model with these splits. Performance: We obtain a very high AUC (~0.97) on our test sets. Check what we missed: These comments are potentially worth checking twice.
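The experiment setup on slide 7 (a 25% stratified hold-out, repeated over 5 random splits, scored by AUC) can be sketched as follows. This is an illustrative sketch under stated assumptions: the data is synthetic, the regularization strength, learning rate, and epoch count are arbitrary, and a plain gradient-descent L2-regularized logistic regression stands in for whatever implementation was actually used.

```python
import math
import random

def stratified_split(labels, test_fraction=0.25, rng=random):
    """Hold out `test_fraction` of each class; return (train_idx, test_idx)."""
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(round(len(idx) * test_fraction))
        test += idx[:cut]
        train += idx[cut:]
    return train, test

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, l2=0.1, lr=0.5, epochs=200):
    """L2-regularized logistic regression fitted by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [l2 * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, X):
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

def auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Synthetic stand-in data: 40 "suspicious" (1) and 160 "not suspicious" (0) rows.
rng = random.Random(0)
X = [[rng.gauss(1, 1), rng.gauss(0, 1)] for _ in range(40)]
X += [[rng.gauss(-1, 1), rng.gauss(0, 1)] for _ in range(160)]
y = [1] * 40 + [0] * 160

aucs = []
for _ in range(5):  # 5 random splits, retraining each time
    train, test = stratified_split(y, 0.25, rng)
    w, b = fit_logistic([X[i] for i in train], [y[i] for i in train])
    scores = predict(w, b, [X[i] for i in test])
    aucs.append(auc(scores, [y[i] for i in test]))
mean_auc = sum(aucs) / len(aucs)
```

Splitting each class separately keeps the rare suspicious class at the same proportion in every test set, which matters when only about 7% of the comments are labeled suspicious.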
- 8. The most predictive words / features (two panels: predictive of a suspicious comment; predictive of a normal comment). After training the model, we applied it to the full dataset. We followed up on comments that Hiroko didn’t label as suspicious but the algorithm did (374 / 33K total). Result: 7 new cases where a passenger was injured were discovered among the comments she missed.
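The follow-up step on slide 8 amounts to listing comments the model scores as suspicious but the reporter did not label that way. A minimal sketch, assuming a hypothetical `flag_for_review` helper and a 0.5 probability cut-off (the slide does not state the threshold that produced the 374 candidates):

```python
def flag_for_review(scores, labels, threshold=0.5):
    """Return indices of unlabeled-as-suspicious comments the model flags,
    highest score first. `threshold` is an assumed cut-off."""
    flagged = [i for i, (s, y) in enumerate(zip(scores, labels))
               if y == 0 and s >= threshold]
    return sorted(flagged, key=lambda i: scores[i], reverse=True)

# Made-up scores and labels (1 = labeled suspicious, 0 = not labeled suspicious).
scores = [0.95, 0.10, 0.80, 0.60, 0.30]
labels = [1, 0, 0, 0, 0]
to_check = flag_for_review(scores, labels)
```

Sorting by score puts the comments the model is most confident about at the top of the reporter's reading list.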
- 9. Dealing with large document corpuses (i.e FOIA dumps) We’ll use Topic Models for making sense of these large document collections!
- 10. What are Topic Models? Example document: There are reasons to believe that the genetics of an organism are likely to shift due to the extreme changes in our climate. To protect them, our politicians must pass environmental legislation that can protect our future species from becoming extinct… Decompose documents as a probability distribution over “topic” indices (figure: bar chart on a 0 to 1 scale over the topics “Climate Change”, “Politics”, “Genetics”). Topics in turn represent probability distributions over the unique words in your vocabulary (figure: word distributions for “Politics”, “Climate Change”, “Genetics”).
- 11. Topic Models: A Graphical Model Perspective. LDA: Latent Dirichlet Allocation (Bayesian Topic Model), Blei et al., 2001. (Figure: document-topic proportions on a 0 to 1 scale over “Climate Change”, “Politics”, “Genetics”.) Example word counts: dna: 2, obama: 1, state: 1, gene: 2, climate: 3, government: 1, drug: 2, pollution: 3
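The generative story behind slides 10 and 11 (a document is a mixture over topics, and each topic is a distribution over words) can be simulated directly. This is a sketch of the standard LDA generative process; the vocabulary comes from slide 11, but the topic-word weights and the Dirichlet parameter are made-up values for illustration only.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, vocabulary, alpha, n_words, rng):
    """Sample topic proportions, then a topic and a word for each position."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]  # topic assignment
        words.append(rng.choices(vocabulary, weights=topics[z])[0])  # word from topic z
    return theta, words

vocabulary = ["dna", "obama", "state", "gene", "climate", "government", "drug", "pollution"]
# Hypothetical topic-word weights (rows: Genetics, Politics, Climate Change).
topics = [
    [0.40, 0.01, 0.01, 0.40, 0.02, 0.01, 0.14, 0.01],
    [0.01, 0.40, 0.25, 0.01, 0.02, 0.29, 0.01, 0.01],
    [0.01, 0.02, 0.05, 0.01, 0.45, 0.05, 0.01, 0.40],
]
rng = random.Random(0)
theta, words = generate_document(topics, vocabulary, alpha=[0.5, 0.5, 0.5], n_words=15, rng=rng)
```

Inference in LDA runs this story in reverse: given only the words, recover plausible values for `theta` and the topic-word weights.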
- 12. Bayes Theorem. Prior: belief about the world; in terms of LDA, our modeling assumptions / priors. Normalization constant: makes this problem a lot harder, but we need it for valid probabilities. Likelihood: given our model, how likely is this data? Posterior distribution: probability of our new model given the data.
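The four labels on slide 12 annotate the terms of Bayes' theorem. The equation itself did not survive in the transcript, so the form below is the standard statement, with the symbols chosen here as an assumption: $\theta$ for the model (latent variables and parameters) and $x$ for the observed data.

```latex
\underbrace{p(\theta \mid x)}_{\text{posterior}}
  = \frac{\overbrace{p(x \mid \theta)}^{\text{likelihood}}\;
          \overbrace{p(\theta)}^{\text{prior}}}
         {\underbrace{p(x)}_{\text{normalization constant}}},
\qquad
p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta .
```

The integral in the denominator is what makes the problem hard: it sums over every possible setting of the model.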
- 13. Posterior Inference in LDA. GOAL: Obtain the posterior, which means we need to calculate an intractable term (the normalization constant). For LDA, this is the posterior over the latent variables: how much of topic k a document contains (θ) and the topic-word assignments z. LDA: Latent Dirichlet Allocation (Bayesian Topic Model), Blei et al., 2001
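For reference, the posterior and its normalizer on slide 13 can be written out for a single document. The slide's own equation is missing from the transcript, so this follows the notation of the original LDA paper as an assumption: $w$ are the $N$ observed words, $\alpha$ is the Dirichlet hyperparameter, $\beta$ holds the $K$ topic-word distributions, and $\beta$ is treated as a fixed parameter.

```latex
p(\theta, z \mid w, \alpha, \beta)
  = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)},
\qquad
p(w \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \prod_{n=1}^{N} \sum_{k=1}^{K}
      p(z_n = k \mid \theta)\, p(w_n \mid z_n = k, \beta)\; d\theta .
```

The coupling between $\theta$ and $\beta$ inside the product of sums is what prevents the integral from being evaluated in closed form, which is why approximate inference is used.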
- 14. Scalable Learning & Inference in Topic Models. LDA: Latent Dirichlet Allocation (Bayesian Topic Model), Blei et al., 2001. Analyze a subset (mini-batch) of your total documents before updating. Update θ, z, and β after analyzing each mini-batch of documents.
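One common way to realize the mini-batch update on slide 14 is a stochastic variational update, where the global topic-word statistics are blended with an estimate built from the current mini-batch. The slide does not name the exact scheme, so the sketch below is an assumption in that style: `step_size`, `online_update`, and the constants `tau0`, `kappa`, and `eta` are illustrative, and `batch_stats` stands for expected per-topic word counts from a local inference step that is not shown.

```python
def step_size(t, tau0=1.0, kappa=0.7):
    """Decaying weight for the t-th mini-batch: (tau0 + t) ** (-kappa)."""
    return (tau0 + t) ** (-kappa)

def online_update(topic_stats, batch_stats, batch_size, corpus_size, t, eta=0.01):
    """Blend current topic-word statistics with a mini-batch estimate.

    The mini-batch counts are scaled up as if the whole corpus looked
    like this batch, then mixed in with weight step_size(t).
    """
    rho = step_size(t)
    scale = corpus_size / batch_size
    return [
        [(1 - rho) * old + rho * (eta + scale * new)
         for old, new in zip(old_row, new_row)]
        for old_row, new_row in zip(topic_stats, batch_stats)
    ]

# Made-up numbers: 1 topic, 2 vocabulary words, batch of 10 out of 100 documents.
stats = [[1.0, 1.0]]
stats = online_update(stats, [[2.0, 0.0]], batch_size=10, corpus_size=100, t=0)
```

Because the weight shrinks as more mini-batches are seen, early batches move the topics a lot and later batches only refine them, so the full corpus never has to be in memory at once.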
- 15. Refinery: An open source web-app for large document analyses. Daeil Kim @ New York Times, Founder of Refinery. Ben Swanson @ MIT Media Lab, Co-Founder of Refinery. Refinery is a 2014 Knight Prototype Fund winner. Check it out at: http://docrefinery.org
- 16. Installing Refinery. 3 simple steps to get Refinery running. Install these first! 1) Command → git clone https://github.com/daeilkim/refinery.git 2) Go to the root folder. Command → vagrant up 3) Open a browser and go to → 220.127.116.11:8080
- 17. A Typical Refinery Pipeline Step 1: Upload documents Step 2: Extract Topics from a Topic Model Step 3: Find a subset of documents with topics of interest. Step 4: Discover Interesting Phrases
- 18. A Quick Refinery Demo Extracting NYT articles from keyword “obama” in 2013. What themes / topics defined the Obama administration during 2013?
- 19. Future Directions: Better tools for Investigative Reporting. Pipeline: (1) Collecting & Scraping Data, (2) Filtering & Cleaning Data, (3) Extracting Insights. Refinery focuses on extracting insights from relatively clean data. Great tools like DocumentCloud take care of steps 1 & 2. Enterprise stories might be completed in a fraction of the time.
- 20. Interesting Extensions to Topic Models: combining topic models with recommendation systems. (Figure: the generative process of LDA / topic modeling alongside the generative process of a matrix factorization model.) Benefits:
  - The model thinks of users as mixtures of topics. We are what we read and rate.
  - The ratings in turn help shape the topics that are discovered.
  - Can do in-matrix and out-of-matrix predictions.
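The in-matrix versus out-of-matrix distinction on slide 20 can be made concrete with a small sketch. It assumes a collaborative-topic-regression-style model, which the slide does not name explicitly: each item vector is its topic proportions plus an offset learned from ratings, and a rating is predicted as a dot product with the user vector. All function names and numbers are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def predict_in_matrix(user, theta, offset):
    """Item already has ratings: use topic proportions plus the learned offset."""
    item = [t + e for t, e in zip(theta, offset)]
    return dot(user, item)

def predict_out_of_matrix(user, theta):
    """New item with no ratings yet: fall back on its topic proportions alone."""
    return dot(user, theta)

# Made-up example with 3 topics.
user = [1.0, 0.0, 2.0]      # how much this user cares about each topic
theta = [0.2, 0.3, 0.5]     # topic proportions of an article (from the topic model)
offset = [0.1, 0.0, -0.1]   # adjustment learned from other users' ratings

seen_score = predict_in_matrix(user, theta, offset)
new_score = predict_out_of_matrix(user, theta)
```

The second function is what lets a text-based recommender score a freshly published article: its topics are available from the text even before anyone has rated it.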