Schema-on-Write vs. Schema-On-Read

In a data engineering pipeline the design of the underlying source files will inevitably change at some point in time. This might be well communicated or it might not be.

Schema evolution is the term used to describe the process of adapting a pipeline to a change in requirements, such as the scenario described above. It could incorporate:

  • Adding / removing fields
  • Changing data types

For example, a third party API makes changes to its payload so a field which was previously an integer is now a string. This could then cause errors in the transform stage of an ELT pipeline.

There are 2 key approaches to schema evolution: forward compatibility and backward compatibility. 

  • Forward compatibility means the new data structure is designed to work with older versions of the schema,
  • Whereas backwards compatibility means the older data is still usable in the new schema design.

Defining a schema

There are two methods to managing schema: schema-on-write or schema-on read.

Feature Schema on Write Schema on Read
When Schema is Applied When writing the data When reading the data
Flexibility Inflexible and rigid Really flexible (think contortionist)
Data Integrity Enforces data standards, super trustworthy Not so great, risk of quality issues
Use Case Traditional relational databases Data lakes
Performance Really quick due to predefined structure Slower as structure is defined on processing
Examples Oracle, SSMS Snowflake, Amazon S3

How do I choose which one?

Let's think about what the strengths and weaknesses of each method are.

Schema-on-Write

Pros:

  • Data quality can definitely be relied on
  • Loading is quick since structure is defined
  • If the source data changes nothing will be loaded, so nothing bad can get loaded

Cons:

  • Resource intensive to make small changes: adding a column could be a long task
  • You might have to drop all your data to repopulate the new table - if this goes wrong there could be some major issues
  • We’ve brought everything from the source because its there but might not actually need it

Schema-on-Read

Pros:

  • Super flexible, great for proof of concept work
  • Real time data streaming
  • Allows organizations to work directly with the raw data 
  • Lets analysts look for insights without the initial data structure 

Cons:

  • It can be slow, resource intensive and expensive ( increased cloud platform costs)
  • Cannot ensure data quality in the same way as with schema on write
  • Can break a pipeline further down the line than schema on write, potentially damaging trust in the product

Author:
Lydia Wren
Powered by The Information Lab
1st Floor, 25 Watling Street, London, EC4M 9BR
Subscribe
to our Newsletter
Get the lastest news about The Data School and application tips
Subscribe now
© 2025 The Information Lab