SQL Anti-Patterns to Avoid on a Modern Analytics Warehouse

What is an anti-pattern?

An anti-pattern is a piece of code that looks reasonable but leads to negative outcomes, most often degraded query execution and higher compute cost.

Modern analytics warehouses, unlike their legacy counterparts, leverage powerful architectural advancements like columnar storage, partitioning, and clustering to handle massive datasets with unprecedented efficiency. These platforms also boast sophisticated query engines with an advanced optimization stage designed to analyze and enhance your SQL, striving for the most efficient execution plan possible.

Despite these intelligent optimizations, how you write your SQL still profoundly impacts performance and cost. Certain coding patterns – known as anti-patterns – can inadvertently undermine these efficiencies, leading to slower queries and wasted compute. This post will highlight common SQL anti-patterns found in modern analytics warehouses and demonstrate how to refactor your queries for optimal performance.

Anti-patterns to look out for

Selecting all columns with SELECT *

SELECT
	*
FROM database.schema.table1;

Problem: Selecting all columns from a warehouse that uses columnar storage is inefficient: the engine must read every column, including data the query never uses.

If a required field is removed or renamed upstream, a SELECT * may not break immediately. Instead, it passes the breaking change to a downstream model, which then errors, wasting compute that could have been saved by catching the problem earlier.

Refactor:

SELECT
	col1, col2, col3
FROM database.schema.table1;

Benefit: By naming only the columns you need, you reduce the amount of data scanned.

Semi-Join Subqueries Without DISTINCT

SELECT
	t1.col1
FROM database.schema.table1 AS t1
WHERE
	t1.col2 NOT IN (SELECT col2 FROM database.schema.table2);

Problem: Using an IN or NOT IN filter without a DISTINCT on the subquery can be inefficient when the subquery contains duplicate values, as the engine processes the same values repeatedly.

Refactor:

SELECT
	t1.col1
FROM database.schema.table1 AS t1
WHERE
	t1.col2 NOT IN (SELECT DISTINCT col2 FROM database.schema.table2);

Benefit: Adding DISTINCT removes duplicates from the subquery, so the filter compares each row against a smaller set of values and avoids redundant computation.
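Note: a related pitfall worth knowing about is that if the subquery ever returns a NULL, NOT IN returns no rows at all, which is a common source of silent bugs. A sketch of an alternative using NOT EXISTS (with a hypothetical alias t2) avoids both the duplicate work and the NULL trap:

SELECT
	t1.col1
FROM database.schema.table1 AS t1
WHERE NOT EXISTS (
	SELECT 1
	FROM database.schema.table2 AS t2
	WHERE t2.col2 = t1.col2
);

Most optimizers deduplicate an EXISTS probe automatically, so no DISTINCT is needed.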

A CTE Referenced More Than Once

WITH
	a AS (
		SELECT col1, col2 FROM test WHERE col1 = 'abc'
		),
	b AS (
		SELECT col2 FROM a
		),
	c AS (
		SELECT col1 FROM a
		)
SELECT
	b.col2,
	c.col1
FROM b, c;

Problem: The CTE a is referenced more than once, which can cause the engine to recompute it for each reference, leading to redundant calculations and degraded performance.

Refactor:

WITH
	a AS (
		SELECT col1, col2 FROM test WHERE col1 = 'abc'
		)
SELECT
	a.col2,
	a.col1
FROM a;

Benefit: The CTE is now referenced only once, avoiding redundant data processing.

Using ORDER BY without LIMIT

SELECT
	t.dim1,
	t.dim2,
	t.metric1
FROM database.schema.table AS t
ORDER BY
t.metric1 DESC;

Problem: An ORDER BY without a LIMIT forces the engine to sort the entire result set, which is expensive on large tables and increases compute.

Refactor:

SELECT
	t.dim1,
	t.dim2,
	t.metric1
FROM database.schema.table AS t
ORDER BY
	t.metric1 DESC
LIMIT 100;

Benefit: Adding a LIMIT lets the engine use a top-N sort, keeping only the requested rows rather than fully sorting the dataset.

Using REGEXP_CONTAINS When LIKE Would Suffice

SELECT
	dim1
FROM database.schema.table1
WHERE
	REGEXP_CONTAINS(dim1, '.*test.*');

Problem: Using REGEXP_CONTAINS for simple wildcard matching is inefficient when LIKE can express the same filter at lower cost.

Refactor:

SELECT
	dim1
FROM database.schema.table1
WHERE
	dim1 LIKE '%test%';

Benefit: Using LIKE reduces query complexity and is less resource intensive than regular expression parsing.

Finding Latest Records Effectively

Determining the "latest" or "top N" record within groups (e.g., the highest fare per taxi) is a common analytical task. Modern analytics warehouses offer powerful tools, and choosing the most efficient method often depends on data characteristics and how the query optimizer handles specific operations.

1. Leveraging Window Functions for General Efficiency (and large datasets)

On large datasets with high cardinality, window functions are often the most efficient way to solve "top N per group" problems, particularly when combined with a QUALIFY clause.

SELECT
	taxi_id, trip_seconds, fare
FROM database.schema.table
QUALIFY ROW_NUMBER() OVER (PARTITION BY taxi_id ORDER BY fare DESC, trip_seconds DESC) = 1;

Modern analytics warehouses have tuned these functions heavily over time to exploit partitioning and reduce the amount of data scanned. They process the data in a single, highly optimized pass, identifying the desired records within each group without multiple full table scans or expensive intermediate joins. Crucially, for very large partitioned tables, always include a partition filter (e.g., WHERE date_column = '2025-07-20') on your base table when applicable. This lets the query engine prune unnecessary partitions, avoiding a full table scan and dramatically reducing the amount of data processed. This makes window functions the go-to solution for performance and code clarity in most "top N per group" scenarios.
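As a sketch, assuming the table is partitioned on a hypothetical trip_date column, the window-function query with a partition filter would look like this:

SELECT
	taxi_id, trip_seconds, fare
FROM database.schema.table
WHERE trip_date = '2025-07-20' -- partition filter: prunes all other partitions
QUALIFY ROW_NUMBER() OVER (PARTITION BY taxi_id ORDER BY fare DESC, trip_seconds DESC) = 1;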

2. When Aggregation and Join Might Be Considered (Small Aggregated Results)

In contrast, for specific scenarios, particularly when the aggregated result set (e.g., SELECT taxi_id, MAX(fare)) is very small in comparison to the original table, an aggregate-and-join approach might offer competitive performance.

SELECT
	taxi_id, trip_seconds, fare
FROM database.schema.table
WHERE
	(taxi_id, fare) IN (
		SELECT taxi_id, MAX(fare)
		FROM database.schema.table
		GROUP BY 1
	);

On platforms like Snowflake and Google BigQuery, if the subquery's result set is truly small, the optimizer might execute a broadcast join. This involves copying the small aggregated data (the taxi_id, MAX(fare) pairs) to all worker nodes and joining it locally with the distributed full table. This can sometimes be very fast for the join phase. However, this method still requires an initial full scan and aggregation of the entire table to produce that small result set, which can be inefficient for large base tables if partition filtering isn't also applied.
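If the base table is partitioned, the same pruning applies here: filter both the outer scan and the subquery. A sketch, again assuming a hypothetical trip_date partition column:

SELECT
	taxi_id, trip_seconds, fare
FROM database.schema.table
WHERE
	trip_date = '2025-07-20'
	AND (taxi_id, fare) IN (
		SELECT taxi_id, MAX(fare)
		FROM database.schema.table
		WHERE trip_date = '2025-07-20' -- filter the inner scan too
		GROUP BY 1
	);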

Considerations:

While the aggregate-and-join approach can benefit from broadcast joins on the second step, the window function generally remains the more robust and consistently performant solution for "top N per group" in modern analytics warehouses, as it often optimizes the entire process into fewer passes. Always check your query execution plan to understand how your chosen database is processing your specific query on your data.
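The syntax for inspecting a plan varies by platform. On Snowflake, for example, you can prepend EXPLAIN to a query to see its plan without executing it; BigQuery exposes execution details in the console's query plan view after a run.

EXPLAIN
SELECT taxi_id, MAX(fare)
FROM database.schema.table
GROUP BY 1;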

Using Dynamic WHERE Conditions

SELECT
	*
FROM
	comments AS c
INNER JOIN users AS u ON c.user_id = u.id
WHERE
	u.id IN (
		SELECT
			id
		FROM users
		WHERE location LIKE '%New York'
		GROUP BY 1
		ORDER BY SUM(up_votes) DESC
		LIMIT 10
);

Problem: Depending on the engine, the subquery may be re-evaluated for each row of the outer query, causing significant performance issues.

Refactor:

WITH top_users AS (
	SELECT
		id
	FROM users
	WHERE location LIKE '%New York'
	GROUP BY 1
	ORDER BY SUM(up_votes) DESC
	LIMIT 10
)

SELECT
	*
FROM comments AS c
INNER JOIN users AS u ON c.user_id = u.id
WHERE
	u.id IN (SELECT id FROM top_users);

Benefit: By materializing the subquery results into a CTE, we avoid re-executing the subquery for each row.

WHERE Clause Ordering: Most Selective Expression First

SELECT
	repo_name,
	id,
	ref
FROM database.schema.id
WHERE
	ref LIKE '%master%'
	AND repo_name = 'cdnjs/cdnjs';

Problem: The LIKE condition is less selective than the equality condition. Reorder the filters so the more selective one is applied first.

Refactor:

SELECT
	repo_name,
	id,
	ref
FROM database.schema.id
WHERE
	repo_name = 'cdnjs/cdnjs'
	AND ref LIKE '%master%';

Benefit: Applying the more selective filter first reduces the number of rows that need to be checked by the LIKE condition improving performance.

Join Order

SELECT
	t1.station_id,
	COUNT(1) AS num_trips_started
FROM `bigquery-public-data.austin_bikeshare.bikeshare_stations` AS t1
JOIN `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS t2 ON t1.station_id = t2.station_id
GROUP BY
	t1.station_id;

Problem: Starting from the small table and joining the large one to it can produce an inefficient plan. On platforms like BigQuery, the guidance is to list the largest table first and join the smaller tables to it.

Refactor:

SELECT
	t1.station_id,
	COUNT(1) AS num_trips_started
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS t2 
INNER JOIN `bigquery-public-data.austin_bikeshare.bikeshare_stations` AS t1 ON t1.station_id = t2.station_id
GROUP BY
	t1.station_id;

Benefit: By listing the largest table first and joining the smaller table to it, the query follows the optimizer's preferred order, reducing the data shuffled during the join and improving performance.

Author:
Edward Hayter
Powered by The Information Lab
1st Floor, 25 Watling Street, London, EC4M 9BR
© 2025 The Information Lab