In data analysis, the focus is often on mastering tools or methods. But understanding the structure and flow of data is just as crucial, especially when filtering for specific rows or identifying patterns. In this post, I’ll show how understanding data structure and purpose can lead to efficient, effective queries, sharing two real-life examples of how this knowledge has helped me write more targeted SQL queries.
Note: Before analyzing data flow, I always review the entire table and any related documentation to understand the data context fully.
Why Understanding Data Workflow Matters
Before writing any query, I make sure to understand how data flows into each table and how each column is populated. Knowing this data workflow gives me insights into several key areas:
Data Completeness and Reliability: By knowing when and how data is logged, I can determine whether certain actions or outcomes might be missing, duplicated, or delayed.
Effective Filtering: Understanding which fields capture specific actions allows me to filter more accurately, making sure I’m focusing on the right data.
Efficient Query Design: With a clear idea of how data flows into each column, I can use CASE WHEN statements or aggregations effectively, keeping it simple and reducing query run time.
Data Integrity Checks: When I understand the expected data flow, I can spot inconsistencies, like unexpected timestamps or missing records, that could signal issues in data collection.
Knowing these aspects of the data helps me design effective, accurate queries and avoid misinterpretations. Below, I’ll show some examples of how I apply these principles using CASE WHEN statements and CTEs.
Examples
Note: In the following examples, I’ve limited the query results to show just a few rows from a sample dataset that captures gameplay actions occurring within a few minutes on a single day. However, the query would get all of the data from 2024-11-03, so in practice, it would return a lot more data. This is also written in PostgreSQL syntax.
The Table: Tracking Gameplay Actions and Outcomes
Let’s say we have a table, gameplay_logs, that tracks player actions in a game. Every action is recorded in player_moves , and certain significant outcomes (e.g. bonuses or high scores) are recorded in special_outcomes. This lets us distinguish regular actions from significant outcomes like bonuses and high scores. The id is a unique id given to every game, player_id the unique player id, and play_timestamp column is the UTC timestamp the recorded action.
id | player_id | player_moves | special_outcomes | play_timestamp |
1 | 1 | Move A | Bonus | 2024-11-03 15:23:45 UTC |
2 | 1 | Move B | High Score | 2024-11-03 15:23:50 UTC |
3 | 2 | Move A | 2024-11-03 15:24:35 UTC | |
4 | 2 | Move B | High Score | 2024-11-03 15:25:00 UTC |
5 | 2 | Move C | High Score | 2024-11-03 15:25:25 UTC |
6 | 3 | Move A | 2024-11-03 15:25:50 UTC | |
7 | 3 | Move B | Bonus | 2024-11-03 15:26:15 UTC |
8 | 3 | Move C | Bonus | 2024-11-03 15:26:40 UTC |
9 | 4 | Move A | 2024-11-03 15:27:05 UTC | |
10 | 4 | Move C | High Score | 2024-11-03 15:27:30 UTC |
11 | 5 | Move B | 2024-11-03 15:27:55 UTC |
Example 1: Writing a Query to Filter for Special Outcomes
How Knowing Data Flow Helps
Understanding that special_outcomes only logs significant actions like bonuses and high scores allows me to write more accurate filters. By using CASE WHEN statements only on columns that contain the specific actions I’m interested in, I avoid double-counting or misinterpreting regular moves. This ensures I’m correctly calculating the frequency of meaningful actions, leading to cleaner, more focused results.
Query
To identify certain player behaviors and patterns, I could track the number of each action type (e.g., regular moves, bonus moves, and high scores) for each player over a time period. So I’m getting for each player their count of regular moves, bonuses, and high scores. Using CASE WHEN with SUM, I can calculate how often each player achieves specific outcomes.
SELECT
player_id,
SUM(CASE WHEN player_moves IS NOT NULL THEN 1 ELSE 0 END) AS regular_moves,
SUM(CASE WHEN special_outcomes = 'Bonus' THEN 1 ELSE 0 END) AS num_bonus_moves,
SUM(CASE WHEN special_outcomes = 'High Score' THEN 1 ELSE 0 END) AS num_high_scores,
FROM
gameplay_logs
WHERE
play_timestamp::date = '2024-11-03'
GROUP BY
player_id
Results
player_id | regular_moves | num_bonus_moves | num_high_score |
1 | 2 | 1 | 1 |
2 | 3 | 0 | 2 |
3 | 3 | 2 | 0 |
4 | 2 | 0 | 1 |
5 | 1 | 0 | 0 |
How This Helps
By tracking these actions, we can quickly identify potential outliers or areas for improvement:
Game Exploits: High counts for bonuses or high scores by certain players could signal potential exploits, prompting adjustments to game mechanics.
Player Engagement: Monitoring these counts lets us gauge feature success. High interaction with outcomes can highlight engagement features, while low interactions show less used ones.
Example 2: Flagging Players with Bonus or High Score Actions
How Knowing Data Flow Helps
Knowing the difference between player_moves (general actions) and special_outcomes (significant events) allows me to simplify the query logic with CTEs that create binary flags for high-impact actions. By structuring these CTEs based on a clear data flow, I can avoid extra joins or filters that would slow down the query. This helps me efficiently flag and aggregate specific actions without sifting through irrelevant data.
Query
Here I’m trying identify players who had at least one bonus and one high score. I used two CTEs to organize my query:
bonus_or_high_score CTE: Flags each action as a bonus or high score with binary columns.
player_bonus_or_high_score CTE: Aggregates these flags to indicate whether each player had at least one bonus or high score.
-- CTE to flag each action as either a bonus or high score within a specific date range
WITH bonus_or_high_score AS (
SELECT
id,
player_id,
CASE WHEN special_outcomes = 'Bonus' THEN 1 ELSE 0 END AS bonus_move,
CASE WHEN special_outcomes = 'High Score' THEN 1 ELSE 0 END AS high_score
FROM
gameplay_logs
WHERE
play_timestamp::date = '2024-11-03'
),
-- CTE to check if each player had at least one bonus or high score
player_bonus_or_high_score AS (
SELECT
player_id,
MAX(bonus_move) AS had_bonus,
MAX(high_score) AS had_high_score
FROM
bonus_or_high_score
GROUP BY
player_id
)
-- Final query to select players who had both a bonus and a high score
SELECT
player_id
FROM
player_bonus_or_high_score
WHERE
had_bonus = 1
AND had_high_score = 1;
Results
CTE bonus_or_high_score returns:
id | player_id | bonus_move | high_score |
1 | 1 | 1 | 0 |
2 | 1 | 0 | 1 |
3 | 2 | 0 | 0 |
4 | 2 | 0 | 1 |
5 | 2 | 0 | 1 |
6 | 3 | 0 | 0 |
7 | 3 | 1 | 0 |
8 | 3 | 1 | 0 |
9 | 4 | 0 | 0 |
10 | 4 | 0 | 1 |
11 | 5 | 0 | 0 |
The player_bonus_or_high_score returns:
player_id | had_bonus | had_high_score |
1 | 1 | 1 |
2 | 0 | 1 |
3 | 1 | 0 |
4 | 0 | 1 |
5 | 0 | 0 |
The final query would return:
player_id |
1 |
How This Helps
By identifying players who achieved both a bonus and a high score, we can make targeted product decisions:
Reward Programs: We could offer special rewards or recognition to players who frequently get both outcomes, to improve engagement and loyalty.
Bug Detection: If only certain players achieve both outcomes regularly, we might investigate to ensure there aren’t bugs favoring specific outcomes or player segments.
Feature Adjustments: Understanding which players reach specific milestones can inform product decisions, such as adjusting difficulty levels or designing new challenges that appeal to engaged players.
Conclusion
Understanding how data flows into tables, what each column represents, and the timing of data logging has been essential for writing effective queries. This knowledge allows me to filter precisely and build logic for each dataset’s unique structure. By combining CASE WHEN statements and CTEs, I can segment actions and outcomes efficiently, leading to more actionable insights. Having a good understanding of the data’s structure and purpose helps me create queries that are not only correct but also efficient.