SCD Type 2 is used to manage dimension data that changes over time by keeping historical versions of records. When a change is detected, instead of updating or deleting the existing record, a new record is inserted to preserve the history of the dimension. The key difference between SCD Type 2 and other SCD types (e.g., Type 1) is that historical data is never lost, making it ideal for businesses where tracking changes is important.
Thank you for reading this post, don't forget to subscribe!Recently i attended an interview for the IDMC cloud, where the interviewer asked me how to implement the Slowly changing dimension (SCD) TYPE 2. So, in this article i’m going to explain easy way that you can remember and tell if you face the same question in your upcoming ETL interviews.
In ETL processes, managing changes in data is critical for maintaining historical accuracy in a data warehouse. Slowly Changing Dimensions (SCDs) are a way to address this challenge. Among the different types of SCD, SCD Type 2 is a widely used technique for tracking historical changes in a dimension table, ensuring that you can trace how an entity (such as a customer, product, or location) has evolved over time.
This article focuses on how to implement SCD Type 2 in ETL workflows, detailing the steps involved and the role of various ETL tools in streamlining the process.
What is SCD Type 2 in ETL?
SCD Type 2 is used to manage dimension data that changes over time by keeping historical versions of records. When a change is detected, instead of updating or deleting the existing record, a new record is inserted to preserve the history of the dimension. The key difference between SCD Type 2 and other SCD types (e.g., Type 1) is that historical data is never lost, making it ideal for businesses where tracking changes is important.
Steps to Implement SCD Type 2 in ETL
1. Extracting the Data
The ETL process begins by extracting data from the source system. This data could come from transactional databases, flat files, APIs, or other sources. At this stage, both the current dimension table and the new incoming data are extracted to compare and determine if any changes have occurred.
2. Comparing Records in the Transformation Stage
During the transformation phase, the incoming data is compared with the existing dimension table. There are three possible scenarios:
- New Record: If the incoming record is not present in the dimension table (based on the business key, such as Customer ID or Product Code), it is inserted as a new record.
- Unchanged Record: If the incoming record matches the current record in the dimension table, no changes are made.
- Changed Record: If the incoming record contains updated information (e.g., a customer’s address has changed), a new version of the record is inserted, and the old record is marked as inactive (with an expiry date or status flag).
3. Inserting New Versions of Records
When a change is detected, SCD Type 2 requires the creation of a new version of the record. This is done by inserting a new row in the dimension table with the updated data. Along with the new values, additional columns are used to track the history:
- Effective Date: The date when the new record becomes effective.
- Expiry Date: The date when the previous record became obsolete.
- Current Flag: A status flag to indicate whether the record is the most current version (e.g.,
Y
for current,N
for historical).
4. Managing Surrogate Keys
Surrogate keys are unique, system-generated keys used to identify each version of a dimension record. Unlike natural keys (such as Customer ID or Product Code), surrogate keys ensure that each version of a record has its own unique identifier. ETL processes manage these surrogate keys to differentiate between different versions of the same entity.
5. Loading the Data
Once the data is transformed and new records are generated (if needed), the final step is loading the data into the dimension table. For SCD Type 2, this typically involves:
- Inserting new records (new versions of changed records or new entries).
- Updating existing records to set the expiry date or status flag for historical versions.
Example of SCD Type 2 Implementation in ETL
Let’s consider an example where you are maintaining customer information. The ETL process detects a change in a customer’s address and inserts a new record while preserving the old one.
Source System Data (Incoming Data):
Customer ID | Customer Name | Address | City | State | Zip Code |
---|---|---|---|---|---|
101 | John Doe | 123 Main St | New York | NY | 10001 |
Existing Dimension Table:
Surrogate Key | Customer ID | Customer Name | Address | City | State | Zip Code | Effective Date | Expiry Date | Current Flag |
---|---|---|---|---|---|---|---|---|---|
1 | 101 | John Doe | 456 Elm St | New York | NY | 10002 | 2022-01-01 | NULL | Y |
Transformation and Load Process:
- The ETL process compares the incoming record (
123 Main St
) with the existing dimension table and detects that the address has changed. - A new record is inserted with the new address, along with an updated
Effective Date
. The existing record is marked as inactive by setting theExpiry Date
and changing theCurrent Flag
toN
.
Updated Dimension Table After ETL Load:
Surrogate Key | Customer ID | Customer Name | Address | City | State | Zip Code | Effective Date | Expiry Date | Current Flag |
---|---|---|---|---|---|---|---|---|---|
1 | 101 | John Doe | 456 Elm St | New York | NY | 10002 | 2022-01-01 | 2023-03-01 | N |
2 | 101 | John Doe | 123 Main St | New York | NY | 10001 | 2023-03-02 | NULL | Y |
Benefits of SCD Type 2 in ETL
- Historical Data Retention: SCD Type 2 ensures that no historical data is lost, making it ideal for reporting and analysis over time.
- Accurate Reporting: By keeping track of every version of a record, SCD Type 2 allows for precise, point-in-time reporting.
- Data Auditing: Organizations can audit changes in their data, which is particularly important for compliance and regulatory purposes.
Challenges of SCD Type 2 in ETL
- Increased Storage Requirements: As each change results in a new record, SCD Type 2 can significantly increase the storage required for dimension tables.
- Complex ETL Logic: The process of detecting changes, inserting new records, and managing historical data can add complexity to the ETL pipeline.
- Query Complexity: Queries on SCD Type 2 tables may require more complex logic, especially when filtering by effective dates or finding the most current records.
SCD Type 2 is a vital approach in data warehousing and ETL processes for preserving historical data while tracking changes in dimensions. By leveraging ETL tools that support SCD Type 2, businesses can maintain accurate, historical records of their dimensions, which is critical for reporting, analytics, and regulatory compliance.
If you’re working with large datasets and need to maintain historical accuracy, implementing SCD Type 2 is a highly effective strategy, despite the added complexity and storage costs.