ETL+ Upsert Concepts & Logic
Upsert refers to the process of inserting new data or updating existing data in a target (data warehouse) table based on matching criteria. The term "Upsert" is a combination of "Update" and "Insert". The limitation of Upsert is that it does not Delete old records.
Purpose and Goal of Upsert
With Upsert, you can ensure that your target database contains the newest most current and accurate data, without having to reload the entire data set every time. Upsert updates or inserts rows – Upsert does not delete/remove records. The Upsert operation updates only the new or modified data instead of updating the entire dataset. They are commonly used in ETL operations where large volumes of data are involved. This approach can save time, reduce network bandwidth, and improve processing efficiency.
Upsert ensures that your target database contains the most current and accurate data, without having to reload the entire data set every time. Upsert is a database operation that updates an existing row if a specified value already exists in a table, and inserts a new row if the specified value does not already exist.
Upserts and Delta Updates.
Delta update is a data processing technique that updates only the changes or incremental updates that have occurred since the last data update. Upsert is a limited function type of Delta update that does not remove old records. Upsert requires a source table with a field with a chronological order such as a date, datetime stamp, or a Sequentially Incremented Values .
Logic of Operations
Upsert is a database operation that updates an existing row if a specified value already exists in a table, and inserts a new row if the specified value does not already exist.
The Upsert used in ETL+ assumes that the data in the data source has a mix of new and old records. Some are new or updated records and others are records that already exist in the target data warehouse table.
ETL+'s Upsert begins identifying the changes that have occurred since the last update. The changes are tracked and identified using a timestamp or a sequence number in the Column with Last Modified Value which identifies the records that have been modified since the last update.