Kelly Adams

Nov 15, 20245 min read

Building Better SQL Queries Through Understanding Data Structure

In data analysis, the focus is often on mastering tools or methods. But understanding the structure and flow of data is just as crucial, especially when filtering for specific rows or identifying patterns. In this post, I’ll show how understanding data structure and purpose can lead to efficient, effective queries, sharing two real-life examples of how this knowledge has helped me write more targeted SQL queries.

Note: Before analyzing data flow, I always review the entire table and any related documentation to understand the data context fully.

Why Understanding Data Workflow Matters

Before writing any query, I make sure to understand how data flows into each table and how each column is populated. Knowing this data workflow gives me insights into several key areas:

Data Completeness and Reliability: By knowing when and how data is logged, I can determine whether certain actions or outcomes might be missing, duplicated, or delayed.
Effective Filtering: Understanding which fields capture specific actions allows me to filter more accurately, making sure I’m focusing on the right data.
Efficient Query Design: With a clear idea of how data flows into each column, I can use CASE WHEN statements or aggregations effectively, keeping it ⁠⁠simple and reducing query run time.
Data Integrity Checks: When I understand the expected data flow, I can spot inconsistencies, like unexpected timestamps or missing records, that could signal issues in data collection.

Knowing these aspects of the data helps me design effective, accurate queries and avoid misinterpretations. Below, I’ll show some examples of how I apply these principles using CASE WHEN statements and CTEs.

Examples

Note: In the following examples, I’ve limited the query results to show just a few rows from a sample dataset that captures gameplay actions occurring within a few minutes on a single day. However, the query would get all of the data from 2024-11-03, so in practice, it would return a lot more data. This is also written in PostgreSQL syntax.

The Table: Tracking Gameplay Actions and Outcomes

Let’s say we have a table, gameplay_logs, that tracks player actions in a game. Every action is recorded in player_moves , and certain significant outcomes (e.g. bonuses or high scores) are recorded in special_outcomes. This lets us distinguish regular actions from significant outcomes like bonuses and high scores. The id is a unique id given to every game, player_id the unique player id, and play_timestamp column is the UTC timestamp the recorded action.

id	player_id	player_moves	special_outcomes	play_timestamp
1	1	Move A	Bonus	2024-11-03 15:23:45 UTC
2	1	Move B	High Score	2024-11-03 15:23:50 UTC
3	2	Move A		2024-11-03 15:24:35 UTC
4	2	Move B	High Score	2024-11-03 15:25:00 UTC
5	2	Move C	High Score	2024-11-03 15:25:25 UTC
6	3	Move A		2024-11-03 15:25:50 UTC
7	3	Move B	Bonus	2024-11-03 15:26:15 UTC
8	3	Move C	Bonus	2024-11-03 15:26:40 UTC
9	4	Move A		2024-11-03 15:27:05 UTC
10	4	Move C	High Score	2024-11-03 15:27:30 UTC
11	5	Move B		2024-11-03 15:27:55 UTC

Example 1: Writing a Query to Filter for Special Outcomes

How Knowing Data Flow Helps

Understanding that special_outcomes only logs significant actions like bonuses and high scores allows me to write more accurate filters. By using CASE WHEN statements only on columns that contain the specific actions I’m interested in, I avoid double-counting or misinterpreting regular moves. This ensures I’m correctly calculating the frequency of meaningful actions, leading to cleaner, more focused results.

Query

To identify certain player behaviors and patterns, I could track the number of each action type (e.g., regular moves, bonus moves, and high scores) for each player over a time period. So I’m getting for each player their count of regular moves, bonuses, and high scores. Using CASE WHEN with SUM, I can calculate how often each player achieves specific outcomes.

SELECT 
    player_id,
    SUM(CASE WHEN player_moves IS NOT NULL THEN 1 ELSE 0 END) AS regular_moves,
    SUM(CASE WHEN special_outcomes = 'Bonus' THEN 1 ELSE 0 END) AS num_bonus_moves,
    SUM(CASE WHEN special_outcomes = 'High Score' THEN 1 ELSE 0 END) AS num_high_scores,
FROM 
    gameplay_logs
WHERE
    play_timestamp::date = '2024-11-03'
GROUP BY
    player_id

Results

player_id	regular_moves	num_bonus_moves	num_high_score
1	2	1	1
2	3	0	2
3	3	2	0
4	2	0	1
5	1	0	0

How This Helps

By tracking these actions, we can quickly identify potential outliers or areas for improvement:

Game Exploits: High counts for bonuses or high scores by certain players could signal potential exploits, prompting adjustments to game mechanics.
Player Engagement: Monitoring these counts lets us gauge feature success. High interaction with outcomes can highlight engagement features, while low interactions show less used ones.

Example 2: Flagging Players with Bonus or High Score Actions

How Knowing Data Flow Helps

Knowing the difference between player_moves (general actions) and special_outcomes (significant events) allows me to simplify the query logic with CTEs that create binary flags for high-impact actions. By structuring these CTEs based on a clear data flow, I can avoid extra joins or filters that would slow down the query. This helps me efficiently flag and aggregate specific actions without sifting through irrelevant data.

Query

Here I’m trying identify players who had at least one bonus and one high score. I used two CTEs to organize my query:

bonus_or_high_score CTE: Flags each action as a bonus or high score with binary columns.
player_bonus_or_high_score CTE: Aggregates these flags to indicate whether each player had at least one bonus or high score.

-- CTE to flag each action as either a bonus or high score within a specific date range
WITH bonus_or_high_score AS (
    SELECT 
        id,
        player_id,
        CASE WHEN special_outcomes = 'Bonus' THEN 1 ELSE 0 END AS bonus_move,
        CASE WHEN special_outcomes = 'High Score' THEN 1 ELSE 0 END AS high_score
    FROM 
        gameplay_logs
    WHERE
        play_timestamp::date = '2024-11-03'
),

-- CTE to check if each player had at least one bonus or high score
player_bonus_or_high_score AS (
    SELECT
        player_id,
        MAX(bonus_move) AS had_bonus,
        MAX(high_score) AS had_high_score
    FROM
        bonus_or_high_score
    GROUP BY
        player_id
)

-- Final query to select players who had both a bonus and a high score
SELECT 
    player_id
FROM
    player_bonus_or_high_score
WHERE
    had_bonus = 1 
    AND had_high_score = 1;

Results

CTE bonus_or_high_score returns:

id	player_id	bonus_move	high_score
1	1	1	0
2	1	0	1
3	2	0	0
4	2	0	1
5	2	0	1
6	3	0	0
7	3	1	0
8	3	1	0
9	4	0	0
10	4	0	1
11	5	0	0

The player_bonus_or_high_score returns:

player_id	had_bonus	had_high_score
1	1	1
2	0	1
3	1	0
4	0	1
5	0	0

The final query would return:

player_id

How This Helps

By identifying players who achieved both a bonus and a high score, we can make targeted product decisions:

Reward Programs: We could offer special rewards or recognition to players who frequently get both outcomes, to improve engagement and loyalty.
Bug Detection: If only certain players achieve both outcomes regularly, we might investigate to ensure there aren’t bugs favoring specific outcomes or player segments.
Feature Adjustments: Understanding which players reach specific milestones can inform product decisions, such as adjusting difficulty levels or designing new challenges that appeal to engaged players.

Conclusion

Understanding how data flows into tables, what each column represents, and the timing of data logging has been essential for writing effective queries. This knowledge allows me to filter precisely and build logic for each dataset’s unique structure. By combining CASE WHEN statements and CTEs, I can segment actions and outcomes efficiently, leading to more actionable insights. Having a good understanding of the data’s structure and purpose helps me create queries that are not only correct but also efficient.

Building Better SQL Queries Through Understanding Data Structure

Why Understanding Data Workflow Matters

Examples

The Table: Tracking Gameplay Actions and Outcomes

Example 1: Writing a Query to Filter for Special Outcomes

How Knowing Data Flow Helps

Query

Results

How This Helps

Example 2: Flagging Players with Bonus or High Score Actions

How Knowing Data Flow Helps

Query

Results

How This Helps

Conclusion

Recent Posts