Extract: Retrieve data from source systems (OLTP, files, APIs, etc.)
Transform: Cleanse, validate, and restructure data for analysis
Load: Insert transformed data into the data warehouse
ETL is the backbone of data warehousing, ensuring data quality and consistency for business intelligence and analytics.
ETL Workflow Architecture
Source Systems
→
Extract
→
Staging Area
Staging Area
→
Transform
→
Data Warehouse
Source systems remain operational during extraction
Staging area provides temporary storage for raw data
Transformation rules applied in staging environment
Final data loaded into dimensional model
ETL Workflow Steps
1. Extract Phase
Full extraction vs. Incremental extraction
Change Data Capture (CDC) techniques
Timestamp-based extraction
Log-based extraction
2. Transform Phase
Data cleansing and validation
Data type conversions
Business rule application
Deduplication and aggregation
ETL Workflow - Load Phase
3. Load Phase Strategies
Initial Load: First-time population of data warehouse
Incremental Load: Regular updates with new/changed data
Full Refresh: Complete replacement of target data
-- Example: Incremental Load Strategy
INSERT INTO DimCustomer
SELECT * FROM StagingCustomer
WHERE LastModifiedDate >
(SELECT MAX(LoadDate) FROM ETLControl);
Slowly Changing Dimensions (SCD)
Methods to handle changes in dimensional attributes over time
Type 0: Retain Original (No changes allowed)
Type 1: Overwrite (No history maintained)
Type 2: Add New Row (Full history tracking)
Type 3: Add New Column (Limited history)
Type 4: History Table (Separate historical records)
Type 6: Hybrid (Combination of Types 1+2+3)
SCD Type 1: Overwrite
Characteristics
Simplest approach
No history maintained
Old values overwritten
Minimal storage
Use Cases
Correcting data errors
Non-significant changes
Current value analysis only
-- Example: Update customer address
UPDATE DimCustomer
SET City = 'New Delhi',
State = 'Delhi'
WHERE CustomerKey = 12345;
SCD Type 2: Add New Row
Most commonly used approach for maintaining complete history
CustomerKey
CustomerID
City
StartDate
EndDate
IsCurrent
1001
C123
Mumbai
2020-01-01
2023-05-31
N
1002
C123
Bangalore
2023-06-01
9999-12-31
Y
New row inserted for each change
Tracks complete historical changes
Enables point-in-time analysis
SCD Type 3: Add New Column
Maintains limited history using additional columns
CustomerKey
CustomerID
CurrentCity
PreviousCity
EffectiveDate
1001
C123
Bangalore
Mumbai
2023-06-01
Advantages
Simple queries
Limited storage
Easy comparison
Limitations
Fixed history depth
Not scalable
Multiple changes limited
Surrogate Keys
Definition
System-generated unique identifiers that have no business meaning
Why Use Surrogate Keys?
Independence: Isolated from source system changes
Performance: Integer keys provide faster joins
History Tracking: Enable SCD Type 2 implementation
Integration: Merge data from multiple sources
Consistency: Uniform key structure across dimensions
Fact tables reference dimensions using surrogate keys, ensuring data warehouse stability and performance.
Aggregates & Materialized Views
What are Aggregates?
Pre-calculated summary data stored to improve query performance
Purpose: Speed up analytical queries
Trade-off: Storage space vs. Query performance
Maintenance: Must be refreshed when base data changes
Transparency: Often invisible to end users
Queries that might take minutes on raw data can execute in seconds using aggregates
Materialized Views
Definition
Database objects containing query results stored as physical tables
CREATE MATERIALIZED VIEW MV_SalesByMonth AS
SELECT
d.Year,
d.Month,
p.Category,
SUM(f.SalesAmount) AS TotalSales,
COUNT(f.OrderKey) AS OrderCount
FROM FactSales f
JOIN DimDate d ON f.DateKey = d.DateKey
JOIN DimProduct p ON f.ProductKey = p.ProductKey
GROUP BY d.Year, d.Month, p.Category;
Physically stored aggregated results
Automatically refreshed (configurable)
Query rewrite optimization
Hierarchies in Data Warehousing
Definition
Logical structures organizing data from detailed to summarized levels
Common Hierarchies
Time Hierarchy
Year
↓
Quarter
↓
Month
↓
Day
Geography Hierarchy
Country
↓
State
↓
City
↓
Store
Rollups (Drill-Up)
Concept
Aggregating data to higher levels in a hierarchy
Daily Sales → Roll up to → Monthly Sales
Monthly Sales → Roll up to → Quarterly Sales
Quarterly Sales → Roll up to → Yearly Sales
Benefits
Reduces data volume at higher levels
Faster query performance for summary reports
Enables top-down analysis
Drill-Down Operations
Opposite of rollup - navigating from summary to detailed data
-- Example: Drilling down from Year to Month
SELECT
Year, Quarter, Month,
SUM(SalesAmount) AS TotalSales
FROM FactSales f
JOIN DimDate d ON f.DateKey = d.DateKey
WHERE Year = 2024
GROUP BY Year, Quarter, Month
ORDER BY Year, Quarter, Month;
Enables root cause analysis
Provides detailed insights
Interactive business intelligence
Precomputed Aggregates
Strategy
Calculate and store aggregates during ETL process rather than query time
Types of Precomputed Aggregates
Summary Tables: Permanent aggregate tables
Aggregate Fact Tables: Fact tables at higher grain
Maintain Consistency: Synchronize with base tables
-- Example: Monthly Sales Aggregate Table
CREATE TABLE FactSalesMonthly AS
SELECT
d.YearMonth,
p.CategoryKey,
s.RegionKey,
SUM(f.SalesAmount) AS TotalSales,
SUM(f.Quantity) AS TotalQuantity,
COUNT(*) AS OrderCount
FROM FactSales f
JOIN DimDate d ON f.DateKey = d.DateKey
JOIN DimProduct p ON f.ProductKey = p.ProductKey
JOIN DimStore s ON f.StoreKey = s.StoreKey
GROUP BY d.YearMonth, p.CategoryKey, s.RegionKey;
Aggregate Refresh Strategies
1. Complete Refresh
Drop and rebuild entire aggregate
Simple but resource-intensive
Used for small to medium aggregates
2. Incremental Refresh
Update only changed portions
More complex logic required
Efficient for large aggregates
3. On-Demand Refresh
Update when queried (lazy evaluation)
First query pays performance cost
Aggregate Navigation
Concept
Query optimizer automatically routes queries to appropriate aggregate tables
User Query → Query Optimizer → Select Best Aggregate
Transparent to end users - they query base tables, system uses aggregates