From CSV to Star Schema: Modelling Sample Superstore using dbt & Snowflake

Here, I discuss a project in which I normalised Sample Superstore's CSV and then denormalised it into a star schema, using dbt and Snowflake.

I started modelling in PostgreSQL, which quickly began to feel limiting for what I wanted to do. I wanted to build a basic yet modern analytics engineering workflow, making use of cloud data warehousing, modularised SQL, version control and reproducible pipelines.

The aim was to learn how and why the same source data can and should be modelled differently depending on the use case.

In particular, I wanted to better understand two ideas:

1. OLTP-like normalised modelling
2. OLAP-like fact/dimensional modelling

Why Snowflake and dbt?

Snowflake and dbt overcame many limitations that I found when just using PostgreSQL.

Snowflake

Snowflake gave the project a cloud warehouse, rather than a database running locally on my device. Data teams usually do not build reports from local databases on a person's machine.

For the project, the raw Superstore CSV was loaded into Snowflake as a table. The table then became the starting point for the dbt-portion of the project.

dbt

PostgreSQL let me write SQL and store data, but the workflow relied on separate scripts, manual execution ordering, hard-coded table names and DDL (data definition language).

In dbt, each SQL script is a model. dbt can build all the models with a single command, in the correct order, based on the dependencies between them and without having to manually write any DDL. Models can reference one another by name, meaning that schema and table names do not need to be assigned in the warehouse beforehand.

Project-level configurations live in YAML, .yml, files. This allows for project-wide decisions to be made once, in one place, including which schema to build model layers into or deciding whether a model layer should be built as a set of tables or views.

Project Overview

The project has four transformation layers: raw, staging, intermediate and marts.

The raw layer contains the original Sample Superstore flat file. I manually loaded the CSV into Snowflake as a table. The raw Snowflake table was referenced within dbt using source().

The staging layer was the first wholly dbt layer. A staging layer's general purpose is to clean up raw data, i.e. by renaming columns or changing datatypes, making the source easier to work with.

The intermediate layer is where most of the modelling happened. The staging model was split into separate entities such as products, categories, customers, locations, dates, order items. The intermediate models reference the staging model using ref(), telling dbt that the intermediate models depend on the staging model, and allowing it to refer to the output of the staging layer by name.

The marts layer is the final, analytics-ready layer. The normalised, intermediate tables were partly denormalised and reshaped into a star schema.

Normalisation, denormalisation, OLTP and OLAP

The original flat file has a lot of repeat information. The same product name, segment, customer name, segment, etc., repeats across many rows.

Normalisation is the process of organising data into separate tables, representing entities, to reduce data duplication and improve consistency. Instead of storing all information in one big table, repeated entities are separated and connected through keys.

A primary key uniquely identifies a row in a table. A foreign key points to the primary key of another table. If a product belongs to a sub-category, the product table can store sub_category_id as a foreign key, which would point to the primary key in the sub-category table.

This avoids needing to store the same descriptive information repeatedly. Rather than storing category_name on every row, a normalised model can store all category names once in a category table, and then reference them through keys.

This sort of modelling is associated with OLTP (Online Transaction Processing) systems. OLTP systems form the backbone of day-to-day processes that allow businesses to function, like recording transactions as soon as they happen. They are designed to support accurate inserts, updates and deletes.

When modelling in an OLTP-way, questions like the following ought to be asked:

How can repeated data be reduced?
What separate entity types exist?
How can they be related to one another?

In contrast, OLAP (Online Analytical Processing) systems are designed for analysis and reporting. The stress is not on accurate and efficient write queries, but on easy-to-use and quick read queries. Star and snowflake schemas are dimensional modelling structures commonly used in OLAP systems. My project partly denormalises the intermediate/normalised layer in order to make the tables more conducive to analytics.

When modelling in an OLAP-way, questions like these should be asked:

What are we measuring?
What (dimensions) do we want to analyse it by?

To illustrate why denormalisation is preferred for OLAP, let's answer the same question using both the intermediate (normalised) and marts (star schema): which customer bought the highest quantity of technology products in the West region?

See below for more information on the tables for both layers and their schemas. Using the OLTP-like, normalised layer, the query is the following:

  SELECT
    c.customer_id,
    c.customer_name,
    r.region_name,
    cat.category_name,
    SUM(oi.quantity) AS total_quantity_bought,
    SUM(oi.sales) AS total_sales,
    SUM(oi.profit) AS total_profit

FROM SUPERSTORE_DB.DBT_SWADHIA_INTERMEDIATE.INT_ORDER_ITEMS AS oi

LEFT JOIN SUPERSTORE_DB.DBT_SWADHIA_INTERMEDIATE.INT_ORDERS AS o
    ON oi.order_id = o.order_id

LEFT JOIN SUPERSTORE_DB.DBT_SWADHIA_INTERMEDIATE.INT_CUSTOMERS AS c
    ON o.customer_id = c.customer_id

LEFT JOIN SUPERSTORE_DB.DBT_SWADHIA_INTERMEDIATE.INT_LOCATIONS AS l
    ON o.location_id = l.location_id

LEFT JOIN SUPERSTORE_DB.DBT_SWADHIA_INTERMEDIATE.INT_STATES AS st
    ON l.state_id = st.state_id

LEFT JOIN SUPERSTORE_DB.DBT_SWADHIA_INTERMEDIATE.INT_REGIONS AS r
    ON st.region_id = r.region_id

LEFT JOIN SUPERSTORE_DB.DBT_SWADHIA_INTERMEDIATE.INT_PRODUCTS AS p
    ON oi.product_id = p.product_id

LEFT JOIN SUPERSTORE_DB.DBT_SWADHIA_INTERMEDIATE.INT_SUB_CATEGORIES AS sc
    ON p.sub_category_id = sc.sub_category_id

LEFT JOIN SUPERSTORE_DB.DBT_SWADHIA_INTERMEDIATE.INT_CATEGORIES AS cat
    ON sc.category_id = cat.category_id

WHERE cat.category_name = 'Technology'
  AND r.region_name = 'West'

GROUP BY
    c.customer_id,
    c.customer_name,
    r.region_name,
    cat.category_name

ORDER BY
    total_quantity_bought DESC

LIMIT 1;

And for the OLAP-like, marts layer:

  SELECT
    c.customer_id,
    c.customer_name,
    l.region_name,
    p.category_name,
    SUM(f.quantity) AS total_quantity_bought,
    SUM(f.sales) AS total_sales,
    SUM(f.profit) AS total_profit

FROM SUPERSTORE_DB.DBT_SWADHIA_MARTS.FCT_ORDER_ITEMS AS f

LEFT JOIN SUPERSTORE_DB.DBT_SWADHIA_MARTS.DIM_CUSTOMERS AS c
    ON f.customer_id = c.customer_id

LEFT JOIN SUPERSTORE_DB.DBT_SWADHIA_MARTS.DIM_PRODUCTS AS p
    ON f.product_id = p.product_id

LEFT JOIN SUPERSTORE_DB.DBT_SWADHIA_MARTS.DIM_LOCATIONS AS l
    ON f.location_id = l.location_id

WHERE p.category_name = 'Technology'
  AND l.region_name = 'West'

GROUP BY
    c.customer_id,
    c.customer_name,
    l.region_name,
    p.category_name

ORDER BY
    total_quantity_bought DESC

LIMIT 1;

9 tables versus 4.

SQL queries against star schemas generally require fewer joins, are easier to understand, and are often better suited to analytical aggregations.

Building the Normalised Intermediate Layer

The granularity of the raw CSV is one row per order item.

See the schema below of the normalised layer. I discuss how I obtained the tables for the 'product hierarchy': categories, sub_categories, and products.

Starting with the categories table:

  select
    row_number() over (order by category_name) as category_id, -- index or primary key for categories
    category_name

from (
    select distinct
        trim(category) as category_name
    from {{ ref('stg_superstore') }}      -- reference to the output of another sql file, as opposed to hard-coded table
    where category is not null
)

Where the above code is from dbt and creates a row per category. Notice the FROM clause of the subquery:

  from {{ ref('stg_superstore') }}

Which is a reference to a .sql file, or model with the name stg_superstore (the staging model). Tables and schemas do not need to be created beforehand and assigned a fixed name, when using dbt. Models reference the names of other models. The above query is materialising a view within Snowflake and yet notice the absence of DDL.

That query or model is named int_categories and is referenced in the model creating the normalised sub-categories table:

  with sub_categories as (

    select distinct
        trim(category) as category_name,
        trim(sub_category) as sub_category_name

    from {{ ref('stg_superstore') }}  -- reference to staging model

    where category is not null
      and sub_category is not null

),

categories as (

    select *
    from {{ ref('int_categories') }}   -- reference to int_categories model

)

select
    row_number() over (
        order by s.category_name, s.sub_category_name
    ) as sub_category_id,               -- primary key for sub-categories

    c.category_id,
    s.sub_category_name,
    s.category_name

from sub_categories as s

left join categories as c
    on s.category_name = c.category_name

This model, called int_sub_categories, extracts distinct sub-categories and links each one to its parent category. sub_category_id is the primary key of the sub-category table, and category_id is a foreign key pointing back to int_categories.

It gives a table like the following:

sub_category_id	category_id	sub_category_name
1	1	Bookcases
2	1	Chairs
3	1	Furnishings
4	2	Appliances
5	2	Binders

And the products model is defined as:

  select
    product_id,
    product_name,
    sub_category_id

from (
    select
        trim(product_id) as product_id,
        trim(product_name) as product_name,
        trim(category) as category_name,
        trim(sub_category) as sub_category_name,

        row_number() over (
            partition by trim(product_id)
            order by trim(product_name)
        ) as row_num

    from {{ ref('stg_superstore') }}

    where product_id is not null
) as p

left join {{ ref('int_sub_categories') }} as sc
    on p.sub_category_name = sc.sub_category_name

where row_num = 1;

Which gives a table like the following:

product_id	product_name	sub_category_id
FUR-BO-10001798	Bush Somerset Collection Bookcase	1
FUR-CH-10000454	Hon Deluxe Fabric Upholstered Chair	2
OFF-BI-10003910	DXL Angle-View Binders	5

Measuring Redundancy Reduction

Normalisation reduces data redundancy. Repeated values are placed in separate tables and stored once.

To measure the reduction in redundancy, I compared the count of occurrences of an entity in the CSV to the number of rows in the corresponding normalised table. I only show a portion of the query due to its length:

  SELECT
    entity,
    flat_occurrences,
    normalised_rows,
    flat_occurrences - normalised_rows AS redundant_occurrences_removed,
    ROUND(
        100 * (flat_occurrences - normalised_rows)
        / NULLIF(flat_occurrences, 0),
        2
    ) AS redundancy_reduction_pct

FROM redundancy_metrics

ORDER BY
    redundancy_reduction_pct DESC;

The omitted portion of the query creates a CTE called redundancy_metrics, which calculates the number of occurrences of an entity in the flat file and compares this with the number of rows in the entity's normalised table. A series of unions stacks all values vertically. The query segment shown calculates the difference in an entity's occurrence within the CSV to its normalised table and expresses the difference as a percentage, giving this table:

ENTITY	FLAT_OCCURRENCES	NORMALISED_ROWS	REDUNDANT_OCCURRENCES_REMOVED	REDUNDANCY_REDUCTION_PCT
countries	9994	1	9993	99.99
categories	9994	3	9991	99.97
segments	9994	3	9991	99.97
regions	9994	4	9990	99.96
ship_modes	9994	4	9990	99.96
sub_categories	9994	17	9977	99.83
states	9994	49	9945	99.51
locations	9994	632	9362	93.68
dates	19988	1434	18554	92.83
customers	9994	793	9201	92.07
products	9994	1862	8132	81.37
orders	9994	5009	4985	49.88

Building the Star Schema

Again, a normalised schema is not conducive to analytics. To answer a simple question like, 'What were the total sales by category?,' would require joining order items to products, products to sub-categories and sub-categories to categories.

The marts layer of the project makes answering questions like this easier by partly denormalising the intermediate layer to create a star schema, with a central fact table surrounded by dimension tables.

As examples, I show how I obtained the products dimension table and the central fact table. Starting with products:

  select
    p.product_id,
    p.product_name,

    sc.sub_category_id,
    sc.sub_category_name,

    c.category_id,
    c.category_name

from {{ ref('int_products') }} as p

left join {{ ref('int_sub_categories') }} as sc
    on p.sub_category_id = sc.sub_category_id

left join {{ ref('int_categories') }} as c
    on sc.category_id = c.category_id

Wherein all three intermediate 'product hierarchy,' normalised tables have been referenced. This gives analysts one product dimension containing product, sub-category and category information. Instead of joining to three tables, they only need to join to dim_products.

And the fact table is given by:

  select
    oi.row_id,

    o.order_id,
    o.order_date_id,
    o.ship_date_id,

    o.customer_id,
    o.location_id,
    o.ship_mode_id,

    oi.product_id,

    oi.sales,
    oi.quantity,
    oi.discount,
    oi.profit

from {{ ref('int_order_items') }} as oi

left join {{ ref('int_orders') }} as o
    on oi.order_id = o.order_id

Validating the Transformations

There would have been no point in normalising and then denormalising if the transformations modified values. I used several queries to verify that the transformations did not change the underlying totals, one of which follows:

  WITH staging_totals AS (

    SELECT
        COUNT(*) AS row_count,
        SUM(sales) AS total_sales,
        SUM(profit) AS total_profit,
        SUM(quantity) AS total_quantity,
        SUM(discount) AS total_discount

    FROM SUPERSTORE_DB.DBT_SWADHIA_STAGING.STG_SUPERSTORE
),

fact_totals AS (

    SELECT
        COUNT(*) AS row_count,
        SUM(sales) AS total_sales,
        SUM(profit) AS total_profit,
        SUM(quantity) AS total_quantity,
        SUM(discount) AS total_discount

    FROM SUPERSTORE_DB.DBT_SWADHIA_MARTS.FCT_ORDER_ITEMS
)

SELECT
    'Row count' AS metric,
    s.row_count AS staging_value,
    f.row_count AS fact_value,
    s.row_count - f.row_count AS difference

FROM staging_totals AS s
CROSS JOIN fact_totals AS f

UNION ALL

SELECT
    'Total sales',
    s.total_sales,
    f.total_sales,
    s.total_sales - f.total_sales

FROM staging_totals AS s
CROSS JOIN fact_totals AS f

UNION ALL

SELECT
    'Total profit',
    s.total_profit,
    f.total_profit,
    s.total_profit - f.total_profit

FROM staging_totals AS s
CROSS JOIN fact_totals AS f

UNION ALL

SELECT
    'Total quantity',
    s.total_quantity,
    f.total_quantity,
    s.total_quantity - f.total_quantity

FROM staging_totals AS s
CROSS JOIN fact_totals AS f

UNION ALL

SELECT
    'Total discount',
    s.total_discount,
    f.total_discount,
    s.total_discount - f.total_discount

FROM staging_totals AS s
CROSS JOIN fact_totals AS f;

Which yields the following:

METRIC	STAGING_VALUE	FACT_VALUE
Row count	9994.0000	9994.0000
Total sales	2297200.8603	2297200.8603
Total profit	286397.0217	286397.0217
Total quantity	37873.0000	37873.0000
Total discount	1561.0900	1561.0900

Which shows that no differences have been introduced between the staging and marts layer.

Conclusion

The same CSV was modelled in two different ways to serve different purposes. The normalised layer broke apart the flat file into separate entities, thereby reducing redundancy. The star schema, instead, prioritised easier-to-write queries.

The validation queries showed that the change of structure did not change the underlying values.

The project remains basic. Next steps include adding tests and documentation, connecting the marts layer to a BI tool, and exploring whether an MCP layer could make the model queryable through natural language.

Author:

Shivam Wadhia

View Profile