Visualizing Alberta's Oil and Gas Activity
The skills the author demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Introduction
Alberta is one of the wealthiest and most resource-rich provinces of Canada, and a large part of this wealth flows from its burgeoning oil and gas industry. From its beginnings in the early 20th century, the industry has grown into a massive infrastructure built to tap oil reserves that are the third largest in the world, behind only Saudi Arabia and Venezuela. Alberta's commercial capital, Calgary, boasts more corporate head offices than any Canadian city outside of Toronto.
Oil and gas investment in the province topped out at $32.7 billion in 2013. Keeping tabs on the oil patch numbers can be a difficult task even for the most zealous analysts, at least without the right tools. I felt the same way working as an analyst at a firm with significant exposure to the oil and gas services sector. So, as my first project at NYC Data Science Academy, I decided to do a proof of concept for a tool for visualizing Alberta's oil and gas activity.
The main questions I was trying to answer through visualization were: How active is the local market? Which companies are drilling, and where? Who is applying for oil well licenses? What kinds of wells are they drilling, and for what products? Are they building pipelines too? From where, and for what products? Which license holders and drilling contractors are the busiest? How deep are they drilling? Has there been a marked shift in activity because of the industry downturn?
Check out the app, explore, and find answers at ab.oilandgas.io.
Data Source
The energy industry in Alberta is regulated by the Alberta Energy Regulator (AER). Most of the activities mentioned above require permits prior to the commencement of work, and companies must also notify the AER when work begins. For most activities, companies can log into their AER portal and make the required disclosures.
The AER collects this data and makes some of it available on its data portal. Although most of it is available free of cost, specific skills are required to slice and dice it and extract meaning from it.
The AER provides activity reports on drilling, pipeline construction, well licensing, abandoned wells, and more. Most of this information is published as text files broken down by daily activity. Kind of like this. Information about abandoned wells is available in the shapefile format. What is a shapefile? Go figure.
Methodology
The first step was to download the files and unzip them. Although the files can be downloaded and unzipped as a batch, the daily activity report for each category is contained in a single text file. I narrowed the project down to new wells drilled, new well licenses issued, and new pipeline approvals from January 2012 to December 2016. This meant the number of text files was 3 categories × 5 years × 365 days = 5,475, plus a few extras for the two leap years. The general outline of the code for reading and cleaning the data was as follows (a sketch of the parsing step appears after the list):
- Read the names of the files contained in all the folders recursively into a list
- Iterate through the list of file names and read each text file into memory
- Remove unnecessary headers and footers
- Get the widths of the columns from the rows of dashes (---) at the top of each table, or hard-code them
- Slice each line of text into columns based on those widths
- Save this text in a dataframe and perform cleaning operations like separating date and time, splitting Dominion Land Survey (DLS) fields, etc.
- Write the dataframe to a csv file when processing is done and repeat for the next file
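To make the parsing step concrete, here is a minimal sketch in R. It assumes each report carries a ruler line of dash runs under its header; the folder layout, the `parse_aer_report()` helper, and its column names are hypothetical, not the project's actual code.

```r
# A minimal sketch of the fixed-width parsing step, assuming a ruler
# line of dash runs ("-----  ---  ...") under each table header.
library(stringr)

parse_aer_report <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # Locate the ruler line: dashes and spaces only, with at least one dash
  ruler_idx <- which(str_detect(lines, "^[- ]+$") & str_detect(lines, "-"))[1]
  runs <- str_locate_all(lines[ruler_idx], "-+")[[1]]  # column start/end positions
  body <- lines[(ruler_idx + 1):length(lines)]
  body <- body[nzchar(trimws(body))]                   # drop blank and footer lines
  # Slice every body line into columns at those positions
  cols <- lapply(seq_len(nrow(runs)), function(i) {
    str_trim(str_sub(body, runs[i, "start"], runs[i, "end"]))
  })
  names(cols) <- paste0("col", seq_along(cols))        # placeholder column names
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Read every daily report recursively and combine the results
files <- list.files("data", pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)
reports <- do.call(rbind, lapply(files, parse_aer_report))
```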
Township Grid System of Alberta. Source.
Outline of Script
The AER publishes location data in DLS (Dominion Land Survey) format. Mapping with this system is not supported by any of the freely available mapping libraries in R, so I had to write a short script to convert each DLS coordinate into an approximate lat/long location. You can read more about DLS here. This part of the code can be outlined as follows (a sketch using geosphere follows the list):
- Find the north-south distance of the point from the 49th parallel
- Find the east-west distance of the point from the nearest meridian to the east
- Use these distances to calculate the bearing of the location from the intersection of the 49th parallel and that meridian
- Calculate the latitude and longitude from the combined distance and bearing
This method correctly handles the inconsistencies introduced by road allowances, the earth's curvature, and other nuances of the DLS system.
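For illustration, here is a minimal sketch of the final two steps using geosphere. It assumes the north-south and east-west offsets (in metres) have already been derived from the township/range/section values; `dls_to_latlong()` and the example numbers are hypothetical.

```r
# A minimal sketch of the bearing and lat/long calculation, assuming
# the metre offsets from the 49th parallel and the governing meridian
# have already been computed from the DLS fields.
library(geosphere)

dls_to_latlong <- function(dist_north_m, dist_west_m, meridian_long) {
  origin <- c(meridian_long, 49)                  # lon/lat of the reference corner
  dist <- sqrt(dist_north_m^2 + dist_west_m^2)    # straight-line distance
  bearing <- -atan2(dist_west_m, dist_north_m) * 180 / pi  # degrees west of north
  destPoint(origin, bearing, dist)                # great-circle destination point
}

# Hypothetical point ~58 km north of the border, ~31 km west of the
# Fourth Meridian (110 degrees W)
dls_to_latlong(dist_north_m = 58000, dist_west_m = 31000, meridian_long = -110)
```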
Outline of Analysis
Visualizing the data is usually simpler than extracting and pre-processing it. A typical analysis operation follows this outline (a sketch appears after the list):
- Decide the metric and group data accordingly
- Mutate, filter, and apply operations across rows or on individual columns
- Orient the data so that it is ideal for visualization
- Convert date information into time-series data (if required)
- Group data as needed
- Visualize using the relevant tools (graphs or maps)
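As an example, here is a minimal sketch of one such operation in R. The data frame `wells` and its `licence_date` column are hypothetical stand-ins for the cleaned licensing data.

```r
# A minimal sketch of a typical analysis operation: group licences by
# month, convert to a time series, and plot an interactive graph.
library(dplyr)
library(xts)
library(dygraphs)

monthly <- wells %>%
  mutate(month = as.Date(cut(licence_date, "month"))) %>%  # bucket by month
  count(month, name = "licences")                          # licences per month

licences_ts <- xts(monthly$licences, order.by = monthly$month)
dygraph(licences_ts, main = "New Well Licences per Month")
```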
The last part was to combine all the data processed in the previous parts into a clean Shiny app (a skeleton sketch follows). Code for the project can be found here.
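For reference, this is a minimal skeleton of how such an app can be wired together with shinydashboard; the tab names and outputs here are hypothetical, not the app's actual layout.

```r
# A minimal shinydashboard skeleton with hypothetical tabs and outputs.
library(shiny)
library(shinydashboard)
library(leaflet)

ui <- dashboardPage(
  dashboardHeader(title = "Alberta Oil & Gas"),
  dashboardSidebar(sidebarMenu(
    menuItem("Map", tabName = "map"),
    menuItem("Trends", tabName = "trends")
  )),
  dashboardBody(tabItems(
    tabItem("map", leafletOutput("activity_map")),
    tabItem("trends", dygraphs::dygraphOutput("licence_trend"))
  ))
)

server <- function(input, output) {
  output$activity_map <- renderLeaflet({
    leaflet() %>% addTiles()  # markers from the cleaned data would be added here
  })
  # output$licence_trend would render the dygraph built in the analysis step
}

shinyApp(ui, server)
```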
Future
The project is currently parked on GitHub, and the Shiny app can be found at ab.oilandgas.io. It can be expanded to include other Canadian provinces and the US. Incorporating other regulators could enable companies to keep a finger on the pulse of the O&G industry, monitoring its expansion or contraction in near real time and helping predict potential increases or decreases in oil supply. Further, formations with over- or under-investment, transportation (pipeline) bottlenecks, and areas that might need increased environmental watch could also be identified.
Tools Used
Data extraction and manipulation – R, RStudio, dplyr, tidyr, sp, geosphere, lubridate.
Mapping – leaflet.
Graphing – xts (time-series pre-processing), dygraphs.
Knitted together with Shiny and shinydashboard. Hosted on Shiny Server with masked forwarding from GoDaddy.