Data Visualizing Alberta's Oil and Gas Activity

Posted on Mar 19, 2017
The skills the author demonstrated here can be learned through NYC Data Science Academy's Data Science with Machine Learning bootcamp.


Alberta is one of the wealthiest and most resource-rich provinces of Canada, and a large part of this wealth flows from its burgeoning oil and gas industry. From a nascent start in the early 20th century, Alberta has built a massive oil and gas infrastructure to tap into reserves that are third in size only to those of Saudi Arabia and Venezuela. Calgary, Alberta's commercial capital, boasts more corporate head offices than any Canadian city other than Toronto.

Oil and gas investment in the province topped out at $32.7 billion in 2013. Keeping tabs on the oil patch numbers can be a difficult task even for the most zealous analysts, at least without the right tools. I felt this myself working as an analyst at a firm with significant exposure to the oil and gas services sector. So, as part of my first project at NYC Data Science Academy, I decided to do a proof of concept for a tool for visualizing Alberta's oil and gas activity.

The main questions I was trying to answer through visualization were: How active is the local market? Which companies are drilling, and where? Who is applying for oil well licenses? What kinds of wells are they drilling, and for what products? Are they building pipelines too? From where, and for what products? Which license holders and drilling contractors are the busiest? How deep are they drilling? Has there been a marked shift in activity because of the industry downturn?

Check out the app, explore, and find the answers.


Monthly trend for top five drilling contractors in Alberta.

Data Source


Heatmap of abandoned wells in Alberta generated from a shapefile.

The energy industry in Alberta is regulated by the Alberta Energy Regulator (AER). Most of the activities mentioned above require permits prior to the commencement of work, and companies must also notify the AER when work begins. For most activities, companies can log into their AER portal and make the required disclosures.

The AER collects this data and makes some of it available on its data portal. Although most of it is available free of cost, specific skills are required to slice and dice it and extract meaning from it.

The AER provides activity reports on drilling, pipeline construction, well licensing, abandoned wells, etc. Most of this information is published as text files broken down by daily activity, kind of like this. Information about abandoned wells is available in shapefile format. What is a shapefile? Go figure.


The first step was to download the files and unzip them. Although the files can be downloaded and unzipped as a batch, the daily activity report for each category is contained in a single text file. I narrowed the project down to new wells drilled, new well licenses issued, and new pipeline approvals from January 2012 to December 2016. This meant 3 categories × 5 years × 365 days, plus leap days: roughly 5,480 text files. The general outline of the code for reading and cleaning the data was as follows -

  1. Recursively read the names of the files contained in all the folders into a list
  2. Iterate through the list of file names and read each text file into memory
  3. Remove unnecessary headers and footers
  4. Get the widths of the various columns from the runs of dashes (---) at the top of each table, or hard-code them
  5. Go through every row and break the text into columns based on those widths
  6. Save this text in a dataframe and perform cleaning operations such as separating date and time, splitting Dominion Land Survey fields, etc.
  7. Write the dataframe to a CSV file when processing is done and repeat for the next file
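The project's scripts were written in R, but steps 4 and 5 can be illustrated with a minimal Python sketch. The report fragment below is invented for the example (the well names, licensees, and column layout are hypothetical), and the column widths are derived from the dash row rather than hard-coded:

```python
import io

import pandas as pd

# A made-up fragment of an AER-style daily activity report: a fixed-width
# table whose column widths are marked by the row of dashes.
raw = """\
WELL NAME        LICENSEE     TOTAL DEPTH
---------------- ------------ -----------
HUSKY RAINBOW 1  HUSKY OIL         1200.0
CNRL KAYBOB 7-2  CNRL              2350.5
"""

lines = raw.splitlines()
# Derive each column's width from the length of its dash run (+1 for the
# separating space), instead of hard-coding the widths.
widths = [len(run) + 1 for run in lines[1].split(" ")]
# Parse the fixed-width table, skipping the dash row under the header.
df = pd.read_fwf(io.StringIO(raw), widths=widths, skiprows=[1])
```

The resulting dataframe can then be cleaned (dates split, DLS fields separated) and written out to CSV, as in steps 6 and 7.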

Township grid system of Alberta. Source.

Outline of Script

AER publishes location data in DLS (Dominion Land Survey) format. Mapping using this system is not supported by any of the freely available mapping libraries in R. So, I had to write a short script to convert each DLS coordinate into an approximate lat/long location. You can read more about DLS here. This part of the code can be outlined as follows -

  1. Find the north-south distance of the point from the 49th parallel
  2. Find the east-west distance of the point from the nearest meridian to the east
  3. Use these distances to calculate the bearing of the location from the intersection of the 49th parallel and that meridian
  4. Calculate the latitude and longitude from the distances calculated in the first two steps and the bearing
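As a rough illustration of the conversion, here is a simplified Python sketch for a township/range centre. It is an assumption-laden approximation: it ignores road allowances, sections, and LSDs, treats the earth as a sphere, and collapses steps 3-4 into direct degree offsets instead of an explicit bearing calculation (the project's R script used the geosphere package for that math); the function name and parameters are invented for the example:

```python
import math

EARTH_RADIUS_KM = 6371.0
MILES_TO_KM = 1.609344
TOWNSHIP_MILES = 6  # each township/range block is roughly 6 miles square

def dls_to_latlong(township, rng, meridian_long):
    """Very rough DLS -> lat/long conversion for a township/range centre."""
    # Step 1: north-south distance from the 49th parallel
    # (township 1 starts at the parallel)
    north_km = (township - 0.5) * TOWNSHIP_MILES * MILES_TO_KM
    # Step 2: east-west distance from the governing meridian to the east
    west_km = (rng - 0.5) * TOWNSHIP_MILES * MILES_TO_KM
    # Steps 3-4 (simplified): convert the offsets to degrees; east-west
    # degrees shrink with latitude, so scale by cos(latitude)
    lat = 49.0 + math.degrees(north_km / EARTH_RADIUS_KM)
    lon = meridian_long - math.degrees(
        west_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat)))
    )
    return lat, lon
```

For example, `dls_to_latlong(1, 1, -110.0)` lands just northwest of the intersection of the 49th parallel and the 4th meridian.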

This method handles the quirks introduced by road allowances, the earth's curvature, and other nuances of the DLS system.

Outline of Analysis

Visualizing the data is usually simpler than extracting and pre-processing it. A simple outline of a typical analysis operation is as follows -

  1. Decide the metric and group data accordingly
  2. Mutate, filter, and apply operations on rows or on individual columns
  3. Orient the data so that it is ideal for visualization
  4. Convert date information into time-series data (if required)
  5. Group data as needed
  6. Visualize using the relevant tools (graphs or maps)
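In the project this pipeline was built with dplyr, tidyr, and xts in R; the same flow can be sketched in Python with pandas (the wells, contractors, and depths below are invented sample data, not AER figures):

```python
import pandas as pd

# Invented sample of cleaned activity data (the real data came from the
# parsed AER daily reports).
wells = pd.DataFrame({
    "licence_date": pd.to_datetime(
        ["2015-01-05", "2015-01-20", "2015-02-03", "2015-02-17"]
    ),
    "contractor": ["Precision", "Ensign", "Precision", "Savanna"],
    "depth_m": [1800, 2400, 2100, 1500],
})

# Steps 1-2: choose a metric (well count) and filter the rows of interest.
active = wells[wells["depth_m"] > 1000]

# Steps 3-5: orient the data as a monthly time series, grouped by contractor.
monthly = (
    active.set_index("licence_date")
    .groupby("contractor")
    .resample("MS")["depth_m"]
    .count()
    .unstack(level=0)
    .fillna(0)
)
# Step 6: `monthly` now has one column per contractor and one row per month,
# ready to plot as one line per contractor.
```

The wide month-by-contractor shape is what charting libraries such as dygraphs expect for multi-series time plots.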

The last part was to combine all the data processed in the previous steps into a clean Shiny app. The code for the project can be found here.


The project is currently parked on GitHub. It can be expanded to include other Canadian provinces and the US. Incorporating other regulators could let companies keep tabs on the pulse of the O&G industry. This would enable monitoring of expansion or contraction of the industry in near real time and could help in predicting potential increases or decreases in oil supply. Further, formations with over- or under-investment, transportation (pipeline) bottlenecks, and areas that might need increased environmental monitoring could also be identified.

The project is hosted on a Shiny Dashboard server.

Tools Used

Data extraction and manipulation – R, RStudio, dplyr, tidyr, sp, geosphere, lubridate.

Mapping - Leaflet.

Graphing – xts (time-series preprocessing), dygraphs.

Knitted together with Shiny and Shiny Dashboard. Hosted on a Shiny Server with masked forwarding from GoDaddy.

About Author

Kamal Sandhu

Kamal Sandhu is a finance professional keenly interested in the potential of data science in combination with financial and management theory. He is working towards the Chartered Financial Analyst (CFA) program and the Financial Risk Manager (FRM) program....
