In a data engineering pipeline the design of the underlying source files will inevitably change at some point in time. This might be well communicated or it might not be.
Schema evolution is the term used to describe the process of adapting a pipeline to a change in requirements, such as the scenario described above. It could incorporate:
- Adding / removing fields
- Changing data types
For example, a third party API makes changes to its payload so a field which was previously an integer is now a string. This could then cause errors in the transform stage of an ELT pipeline.
There are 2 key approaches to schema evolution: forward compatibility and backward compatibility.
- Forward compatibility means the new data structure is designed to work with older versions of the schema,
- Whereas backwards compatibility means the older data is still usable in the new schema design.
Defining a schema
There are two methods to managing schema: schema-on-write or schema-on read.
Feature | Schema on Write | Schema on Read |
---|---|---|
When Schema is Applied | When writing the data | When reading the data |
Flexibility | Inflexible and rigid | Really flexible (think contortionist) |
Data Integrity | Enforces data standards, super trustworthy | Not so great, risk of quality issues |
Use Case | Traditional relational databases | Data lakes |
Performance | Really quick due to predefined structure | Slower as structure is defined on processing |
Examples | Oracle, SSMS | Snowflake, Amazon S3 |
How do I choose which one?
Let's think about what the strengths and weaknesses of each method are.
Schema-on-Write
Pros:
- Data quality can definitely be relied on
- Loading is quick since structure is defined
- If the source data changes nothing will be loaded, so nothing bad can get loaded
Cons:
- Resource intensive to make small changes: adding a column could be a long task
- You might have to drop all your data to repopulate the new table - if this goes wrong there could be some major issues
- We’ve brought everything from the source because its there but might not actually need it
Schema-on-Read
Pros:
- Super flexible, great for proof of concept work
- Real time data streaming
- Allows organizations to work directly with the raw data
- Lets analysts look for insights without the initial data structure
Cons:
- It can be slow, resource intensive and expensive ( increased cloud platform costs)
- Cannot ensure data quality in the same way as with schema on write
- Can break a pipeline further down the line than schema on write, potentially damaging trust in the product