Delta / Incremental Loads

Delta loading is also known as Incremental loading. Work in progress

 

In the context of ETL, a delta load refers to loading only the changes or updates that have occurred since the last load. Also known as incremental loads. This is in contrast to an initial load, which involves loading the entire dataset from the source. The difference in performance between delta load and initial load depends on factors such as data volume, structure, target, and load type.

An incremental load is the selective movement of data from one system to another. An incremental load pattern will attempt to identify the data that was created or modified since the last time the load process ran. This differs from the conventional full data load, which copies the entire set of data from a given source. The selectivity of the incremental design usually reduces the system overhead required for the ETL process.

WaterMarks

A watermark is a column that has the last updated time stamp or an incrementing key. The delta loading solution loads the changed data between an old watermark and a new watermark.

A Watermark for data synchronization describes an object of a predefined format which provides a point of reference value for two systems/datasets attempting to establish delta/incremental synchronization; any object in the queried data source which was created, modified, or deleted after the watermark's value will be qualified as "above watermark" and should be returned to the client requesting data.

This approach allows the client to retrieve only the objects which have changed since the latest watermark …
https://en.wikipedia.org/wiki/Watermark_(data_synchronization)

Source Change Detection

To implement ETL delta updates, you can use source change detection to retrieve only the new and changed data since the last load process. Related to Change/Modification flags. ETL+ uses row-level insert/update dates, datetime stamps or sequentially incremented columns.

  1. Row-level dates or datetime stamps: Each table has a column to store the date or datetime the data was loaded and/or modified. This method allows each row to self-identify when it was changed. However, this method relies on the source system reliably updating the modification date or datetime column.

 

The selection of data to move is often temporal, based on when the data was created or most recently updated. In some cases, the new or changed data cannot be easily identified solely in the source, so it must be compared to the data already in the destination for the incremental data load to work properly.

 

Conditional update

allows you to selectively update, delete, or insert data based on certain conditions.

update condition

ETL process will load only the data modified after the last successful load. This is called delta load.

Related Pages


Delta data loading from database by using a watermark

 

image-20240530-145459.png