Schema-on-Write vs. Schema-On-Read

In a data engineering pipeline the design of the underlying source files will inevitably change at some point in time. This might be well communicated or it might not be.

Schema evolution is the term used to describe the process of adapting a pipeline to a change in requirements, such as the scenario described above. It could incorporate:

Adding / removing fields
Changing data types

For example, a third party API makes changes to its payload so a field which was previously an integer is now a string. This could then cause errors in the transform stage of an ELT pipeline.

There are 2 key approaches to schema evolution: forward compatibility and backward compatibility.

Forward compatibility means the new data structure is designed to work with older versions of the schema,
Whereas backwards compatibility means the older data is still usable in the new schema design.

Defining a schema

There are two methods to managing schema: schema-on-write or schema-on read.

Feature	Schema on Write	Schema on Read
When Schema is Applied	When writing the data	When reading the data
Flexibility	Inflexible and rigid	Really flexible (think contortionist)
Data Integrity	Enforces data standards, super trustworthy	Not so great, risk of quality issues
Use Case	Traditional relational databases	Data lakes
Performance	Really quick due to predefined structure	Slower as structure is defined on processing
Examples	Oracle, SSMS	Snowflake, Amazon S3

How do I choose which one?

Let's think about what the strengths and weaknesses of each method are.

Schema-on-Write

Pros:

Data quality can definitely be relied on
Loading is quick since structure is defined
If the source data changes nothing will be loaded, so nothing bad can get loaded

Cons:

Resource intensive to make small changes: adding a column could be a long task
You might have to drop all your data to repopulate the new table - if this goes wrong there could be some major issues
We’ve brought everything from the source because its there but might not actually need it

Schema-on-Read

Pros:

Super flexible, great for proof of concept work
Real time data streaming
Allows organizations to work directly with the raw data
Lets analysts look for insights without the initial data structure

Cons:

It can be slow, resource intensive and expensive ( increased cloud platform costs)
Cannot ensure data quality in the same way as with schema on write
Can break a pipeline further down the line than schema on write, potentially damaging trust in the product

Author:

Lydia Wren

View Profile