Mastering SQL: Drawing on My Learning Journey for a Practical Guide
We’re not born with our skills. We have to work at attaining them and look to others who have already gained mastery for guidance. That’s why I’m sharing my own personal struggle to master SQL and what I discovered along the way as a way to help others who have to face the same challenges.
When I started as Experimental Data Scientist at the National High Magnetic Field Laboratory, I had no prior experience with SQL. My background was primarily in Fortran, Matlab, and Python. I was new to Design of Experiments (DoE), data mining, and database management, and quickly realized that handling and analyzing large datasets from multiple sources was a vital aspect of my role.
One particular challenge involved merging data from multiple data warehouses to extract all the desired features needed for feature engineering and predictive modeling. The available tools were Origin Software and Matlab in the organization, but I had a different idea. This is when I discovered the power of SQL.
Getting a grasp on SQL
Initially, learning SQL felt overwhelming. The concepts were scattered, and I struggled to find a coherent learning path. Some colleagues used PostgreSQL, others preferred MySQL, and each had its own set of functionalities and quirks.
To gain understanding of which database management system is widely used, I scanned through hundreds of data science and analyst jobs on LinkedIn, Glassdoor, Indeed, ZipRecruiter, Monster, SimplyHire, WellFound, Dice.com, and other sites that required SQL. I found that the demand was mostly for MySQL and PostgreSQL. I had to decide which one to focus on, and this decision was not easy.
Upon doing further research on and comparing syntaxes, I realized learning MySQL is the best way forward due to most of the basic syntaxes working in the same or similar ways. Big data analytics platforms or cloud data warehouses like Snowflake, Google BigQuery, Amazon Redshift, Microsoft Synapse Azure Analytics, Databricks, Teradata Vantage, Vertica, etc. allow more powerful queries.
Clauses in context
For example, in big data warehouses like Snowflake, the QUALIFY clause can be used to filter the results of window functions directly within a SELECT statement. This is similar to how the HAVING clause works with aggregate functions and GROUP BY clauses. The QUALIFY clause is evaluated after the window functions are computed, allowing you to filter based on the results of these functions.
SELECT
employee_id,
department_id,
salary,
RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
FROM
employees
QUALIFY
salary_rank = 1;
In contrast, in databases like MySQL and PostgreSQL, which do not support QUALIFY, you would need to use a Common Table Expression (CTE) to achieve the same result:
WITH RankedSalaries AS (
SELECT
employee_id,
department_id,
salary,
RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
FROM
employees
)
SELECT
employee_id,
department_id,
salary,
salary_rank
FROM
RankedSalaries
WHERE
salary_rank = 1;
Another example is Amazon Redshift. While it is based on PostgreSQL, there are some differences between PostgreSQL and Amazon Redshift SQL. For more details, please refer to AWS Redshift and PostgreSQL Documentation.
The point I’m trying to make is that once you have a solid foundation in SQL, the specific SQL data warehouse or database management system becomes less important. You’ll be able to navigate their documentation and solve real world problems effectively.
Your SQL guide
As a Data Science Analyst at Iontra Inc. and a Machine Learning Fellow at NYC Data Science Academy, I often met colleagues who, like myself once, had no prior experience with SQL. They often felt overwhelmed by the material and struggled to grasp the concepts. This motivated me to compile this document , aiming to provide a comprehensive SQL guide with real-world problems.
This guide is intended to help users with both MySQL and PostgreSQL. There are only some differences between them; usually, PostgreSQL has more functionalities available, but MySQL is more flexible. This guide aims to bridge those gaps and provide a cohesive learning experience. It includes e-commerce, finance, and tech-focused examples.
The guide also includes LeetCode Top SQL 50 Study Plan problems ranging from Medium to Hard difficulty. Each problem comes with a detailed solution and a breakdown of the problem and query. By understanding and solving these problems, you will advance your SQL knowledge to an advanced or expert level as a data scientist.
Key Five SQL Concepts for Data Science
1. Basic SQL Statements
First, we start with basic SQL concepts such as SELECT, CREATE, INSERT, UPDATE, and WHERE operations. These foundational commands allow you to fetch data from the database, filter results, create new tables, and modify or update existing tables. Mastering these basics is essential for efficiently managing and manipulating data in any relational database.
2. Grouping Data
One of the most important topics in data science is grouping. Understanding how to use the GROUP BY clause was a game changer for me. It allows you to aggregate data and perform calculations on groups of rows, which is essential for summarizing data and gaining insights. For example, calculating the average sales per region or the total revenue per product category becomes straightforward with GROUP BY.
3. Joins
Another crucial concept is JOIN. I often encountered questions about when to use INNER JOIN, LEFT JOIN, RIGHT JOIN, or other types of joins. Joins are fundamental for combining data from multiple tables based on related columns. Understanding the differences and applications of each type of join is vital for merging datasets and ensuring data integrity.
4. Common Table Expressions (CTEs)
Sometimes, SQL queries can become long and complex, especially when dealing with nested subqueries. This can make the code difficult to read and maintain. Learning about Common Table Expressions (CTEs) was a breakthrough for me. CTEs allow you to break down complex queries into simpler, more manageable parts, improving readability and maintainability.
5. Window Functions
In data science, calculating running totals, cumulative sums, and identifying top performers are common tasks. Window functions are incredibly powerful for these purposes. They allow you to perform calculations across a set of table rows that are related to the current row. Functions like ROW_NUMBER(), RANK(), and SUM() over a window partition are invaluable for advanced data analysis.
Investing time in learning these SQL concepts is a pathway to success in data science. Mastering GROUP BY, JOINS, CTEs, and window functions will significantly enhance your ability to manage and analyze data efficiently. These skills not only make your work easier but also enable you to derive deeper insights and make more informed decisions.
This guide is intended to share the knowledge that I have particularly in an orderly manner so it's easy for readers to follow along and continue to learn with me with everything consolidated in one place. Please follow along with the how_to_sql.md guide. This page will be updated continuously.
I expect to add more real world problems and align it with the data engineering path as well. I am a lifelong learner who loves sharing my knowledge with others and serves as a mentor. Please reach out to me on LinkedIn should you have any questions. Feel free to reach out to me with any questions on . Enjoy your learning journey and best of luck!