ETL Processing & Aggregates

Dimensional Database Design - Unit V

Dr. Mohsin Dar

Assistant Professor

Cloud & Software Operations Cluster | SOCS

University of Petroleum and Energy Studies (UPES)

MTech - Database Systems | First Semester

Today's Agenda

ETL Processing Fundamentals
ETL Workflow Architecture
Slowly Changing Dimensions (SCD Types)
Surrogate Keys and their Importance
Aggregates and Materialized Views
Hierarchies and Rollups
Precomputed Aggregates

What is ETL?

Extract, Transform, Load

Extract: Retrieve data from source systems (OLTP, files, APIs, etc.)
Transform: Cleanse, validate, and restructure data for analysis
Load: Insert transformed data into the data warehouse

ETL is the backbone of data warehousing, ensuring data quality and consistency for business intelligence and analytics.

ETL Workflow Architecture

Source Systems → Extract → Staging Area → Transform → Data Warehouse

Source systems remain operational during extraction
Staging area provides temporary storage for raw data
Transformation rules applied in staging environment
Final data loaded into dimensional model
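As a rough illustration of the flow above, the sketch below moves a few rows from a source list through a staging table into a dimension table, using Python's built-in sqlite3 module. The table layouts, the cleansing rules, and the function name run_pipeline are assumptions made for this example, not part of any particular ETL tool.

```python
import sqlite3


def run_pipeline(source_rows):
    """Extract -> staging -> transform -> load, inside one in-memory database."""
    db = sqlite3.connect(":memory:")
    # Staging keeps the raw values exactly as extracted (hypothetical layout).
    db.execute("CREATE TABLE StagingCustomer (CustomerID TEXT, City TEXT)")
    # The dimension table is the cleaned target with a system-generated key.
    db.execute(
        "CREATE TABLE DimCustomer ("
        "CustomerKey INTEGER PRIMARY KEY, "
        "CustomerID TEXT NOT NULL, "
        "City TEXT NOT NULL)"
    )

    # Extract: copy source rows into staging untouched.
    db.executemany("INSERT INTO StagingCustomer VALUES (?, ?)", source_rows)

    # Transform and load: trim whitespace, normalise case, skip rows without an ID.
    db.execute(
        """
        INSERT INTO DimCustomer (CustomerID, City)
        SELECT TRIM(CustomerID), UPPER(TRIM(City))
        FROM StagingCustomer
        WHERE CustomerID IS NOT NULL AND TRIM(CustomerID) <> ''
        ORDER BY rowid
        """
    )
    db.commit()
    return db.execute(
        "SELECT CustomerKey, CustomerID, City FROM DimCustomer ORDER BY CustomerKey"
    ).fetchall()


loaded = run_pipeline([(" C123 ", "mumbai"), (None, "Pune"), ("C456", " Delhi ")])
```

The row with a missing CustomerID stays in staging and never reaches the dimension table, which is one reason a staging area is useful: rejected input can be inspected without touching the warehouse.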

ETL Workflow Steps

1. Extract Phase

Full extraction vs. Incremental extraction
Change Data Capture (CDC) techniques
Timestamp-based extraction
Log-based extraction

2. Transform Phase

Data cleansing and validation
Data type conversions
Business rule application
Deduplication and aggregation
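The transform steps listed above can be sketched in plain Python. This is a minimal example under assumed field names (order_id, amount, order_date) and an assumed business rule (amounts must not be negative); real pipelines would take these rules from the business requirements.

```python
from datetime import date


def transform(raw_rows):
    """Cleanse, convert types, apply a business rule, and deduplicate."""
    cleaned = {}
    rejected = []
    for row in raw_rows:
        try:
            order_id = row["order_id"].strip()          # cleansing
            amount = float(row["amount"])                # type conversion
            order_date = date.fromisoformat(row["order_date"].strip())
        except (KeyError, ValueError, TypeError, AttributeError):
            rejected.append(row)                         # failed validation
            continue
        if not order_id or amount < 0:                   # assumed business rule
            rejected.append(row)
            continue
        # Deduplicate on order_id: the last occurrence wins.
        cleaned[order_id] = {
            "order_id": order_id,
            "amount": amount,
            "order_date": order_date,
        }
    return list(cleaned.values()), rejected


RAW = [
    {"order_id": " A1 ", "amount": "10.50", "order_date": "2024-01-05"},
    {"order_id": "A1", "amount": "12.00", "order_date": "2024-01-06"},
    {"order_id": "A2", "amount": "abc", "order_date": "2024-01-06"},
    {"order_id": "A3", "amount": "-5", "order_date": "2024-01-07"},
]
cleaned_rows, rejected_rows = transform(RAW)
```

Keeping the rejected rows instead of silently dropping them makes it possible to log and review failed records later.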

ETL Workflow - Load Phase

3. Load Phase Strategies

Initial Load: First-time population of data warehouse
Incremental Load: Regular updates with new/changed data
Full Refresh: Complete replacement of target data

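-- Example: Incremental Load Strategy
-- Load only the staging rows modified since the last recorded load.
-- COALESCE covers the first run, when ETLControl is still empty.
-- Assumes StagingCustomer has the same column order as DimCustomer.
INSERT INTO DimCustomer
SELECT * FROM StagingCustomer
WHERE LastModifiedDate >
    (SELECT COALESCE(MAX(LoadDate), '1900-01-01') FROM ETLControl);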
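The incremental load above inserts every changed staging row, so a customer that already exists in the dimension would get a second row. One possible variant, sketched here with sqlite3, updates existing customers, inserts new ones, and then records the load date in ETLControl. The column layouts, the ISO date strings, and the function name incremental_load are assumptions for this example.

```python
import sqlite3


def incremental_load(db, load_date):
    """Apply staging rows changed since the last load, then advance the watermark."""
    # An empty string sorts before any ISO date, so the first run loads everything.
    (watermark,) = db.execute(
        "SELECT COALESCE(MAX(LoadDate), '') FROM ETLControl"
    ).fetchone()
    changed = db.execute(
        "SELECT CustomerID, City FROM StagingCustomer WHERE LastModifiedDate > ?",
        (watermark,),
    ).fetchall()
    for customer_id, city in changed:
        updated = db.execute(
            "UPDATE DimCustomer SET City = ? WHERE CustomerID = ?",
            (city, customer_id),
        )
        if updated.rowcount == 0:  # customer not in the dimension yet
            db.execute(
                "INSERT INTO DimCustomer (CustomerID, City) VALUES (?, ?)",
                (customer_id, city),
            )
    db.execute("INSERT INTO ETLControl (LoadDate) VALUES (?)", (load_date,))
    db.commit()
    return len(changed)


db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE StagingCustomer (CustomerID TEXT, City TEXT, LastModifiedDate TEXT);
    CREATE TABLE DimCustomer (
        CustomerKey INTEGER PRIMARY KEY, CustomerID TEXT UNIQUE, City TEXT);
    CREATE TABLE ETLControl (LoadDate TEXT);
    INSERT INTO StagingCustomer VALUES
        ('C123', 'Mumbai', '2024-01-10'),
        ('C456', 'Pune', '2024-01-12');
    """
)
first_run = incremental_load(db, "2024-01-15")    # both customers are new
db.execute(
    "UPDATE StagingCustomer SET City = 'Bangalore', LastModifiedDate = '2024-01-20' "
    "WHERE CustomerID = 'C123'"
)
second_run = incremental_load(db, "2024-01-21")   # only C123 changed
dim_rows = db.execute(
    "SELECT CustomerID, City FROM DimCustomer ORDER BY CustomerID"
).fetchall()
```

Because existing customers are updated rather than inserted again, rerunning the load with the same staging data does not create duplicates. Note that this variant overwrites the old city and keeps no history.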
            

Slowly Changing Dimensions (SCD)

Methods to handle changes in dimensional attributes over time

Type 0: Retain Original (No changes allowed)
Type 1: Overwrite (No history maintained)
Type 2: Add New Row (Full history tracking)
Type 3: Add New Column (Limited history)
Type 4: History Table (Separate historical records)
Type 6: Hybrid (Combination of Types 1+2+3)

SCD Type 1: Overwrite

Characteristics

Simplest approach
No history maintained
Old values overwritten
Minimal storage

Use Cases

Correcting data errors
Non-significant changes
Current value analysis only

-- Example: Update customer address
UPDATE DimCustomer
SET City = 'New Delhi', 
    State = 'Delhi'
WHERE CustomerKey = 12345;
            

SCD Type 2: Add New Row

Most commonly used approach for maintaining complete history

CustomerKey	CustomerID	City	StartDate	EndDate	IsCurrent
1001	C123	Mumbai	2020-01-01	2023-05-31	N
1002	C123	Bangalore	2023-06-01	9999-12-31	Y

New row inserted for each change
Tracks complete historical changes
Enables point-in-time analysis
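The Type 2 mechanics shown in the table can be sketched with sqlite3: expire the current row, then insert a new current row. The function names apply_scd2 and city_on are hypothetical, and closing the old row on the day before the change is an assumption chosen to match the dates in the table above.

```python
import sqlite3
from datetime import date, timedelta

HIGH_DATE = "9999-12-31"  # open-ended EndDate for the current row


def apply_scd2(db, customer_id, new_city, change_date):
    """Expire the current row and add a new current row if City has changed."""
    current = db.execute(
        "SELECT CustomerKey, City FROM DimCustomer "
        "WHERE CustomerID = ? AND IsCurrent = 'Y'",
        (customer_id,),
    ).fetchone()
    if current is not None and current[1] == new_city:
        return current[0]  # nothing changed, keep the existing row
    if current is not None:
        end_date = (date.fromisoformat(change_date) - timedelta(days=1)).isoformat()
        db.execute(
            "UPDATE DimCustomer SET EndDate = ?, IsCurrent = 'N' WHERE CustomerKey = ?",
            (end_date, current[0]),
        )
    cur = db.execute(
        "INSERT INTO DimCustomer (CustomerID, City, StartDate, EndDate, IsCurrent) "
        "VALUES (?, ?, ?, ?, 'Y')",
        (customer_id, new_city, change_date, HIGH_DATE),
    )
    db.commit()
    return cur.lastrowid  # surrogate key of the new row


def city_on(db, customer_id, as_of):
    """Point-in-time lookup: which city was valid on the given date?"""
    row = db.execute(
        "SELECT City FROM DimCustomer "
        "WHERE CustomerID = ? AND ? BETWEEN StartDate AND EndDate",
        (customer_id, as_of),
    ).fetchone()
    return row[0] if row else None


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE DimCustomer (CustomerKey INTEGER PRIMARY KEY, CustomerID TEXT, "
    "City TEXT, StartDate TEXT, EndDate TEXT, IsCurrent TEXT)"
)
db.execute(
    "INSERT INTO DimCustomer VALUES (1001, 'C123', 'Mumbai', '2020-01-01', ?, 'Y')",
    (HIGH_DATE,),
)
new_key = apply_scd2(db, "C123", "Bangalore", "2023-06-01")
history = db.execute("SELECT * FROM DimCustomer ORDER BY CustomerKey").fetchall()
```

Each version of the customer gets its own surrogate key, so facts recorded before the move keep pointing at the Mumbai row.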

SCD Type 3: Add New Column

Maintains limited history using additional columns

CustomerKey	CustomerID	CurrentCity	PreviousCity	EffectiveDate
1001	C123	Bangalore	Mumbai	2023-06-01

Advantages

Simple queries
Limited storage
Easy comparison

Limitations

Fixed history depth
Does not scale to many tracked attributes or versions
Only the most recent prior value is kept; earlier changes are lost
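A Type 3 change can be sketched as a single UPDATE that shifts the current value into the previous-value column. The example below uses sqlite3 with the column names from the table above; the function name apply_scd3 is hypothetical. The second change shows the fixed history depth: the oldest city is lost.

```python
import sqlite3


def apply_scd3(db, customer_key, new_city, effective_date):
    """Move CurrentCity into PreviousCity and store the new value."""
    # The right-hand sides see the row as it was before the update,
    # so PreviousCity receives the old CurrentCity.
    db.execute(
        "UPDATE DimCustomer "
        "SET PreviousCity = CurrentCity, CurrentCity = ?, EffectiveDate = ? "
        "WHERE CustomerKey = ? AND CurrentCity <> ?",
        (new_city, effective_date, customer_key, new_city),
    )
    db.commit()
    return db.execute(
        "SELECT CurrentCity, PreviousCity, EffectiveDate "
        "FROM DimCustomer WHERE CustomerKey = ?",
        (customer_key,),
    ).fetchone()


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE DimCustomer (CustomerKey INTEGER PRIMARY KEY, CustomerID TEXT, "
    "CurrentCity TEXT, PreviousCity TEXT, EffectiveDate TEXT)"
)
db.execute("INSERT INTO DimCustomer VALUES (1001, 'C123', 'Mumbai', NULL, '2020-01-01')")

after_first_move = apply_scd3(db, 1001, "Bangalore", "2023-06-01")
# The second move below uses made-up values to show that Mumbai is no longer kept.
after_second_move = apply_scd3(db, 1001, "Chennai", "2024-02-01")
```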

Surrogate Keys

Definition

System-generated unique identifiers that have no business meaning

Why Use Surrogate Keys?

Independence: Isolated from source system changes
Performance: Compact integer keys usually join faster than wide string keys
History Tracking: Enable SCD Type 2 implementation
Integration: Merge data from multiple sources
Consistency: Uniform key structure across dimensions

Surrogate Keys - Example

Natural Key

CustomerID: "CUST-2024-001"
ProductCode: "PROD-ELC-2024"
OrderNumber: "ORD/2024/12345"
                    

Business meaning
Can change
Complex structure

Surrogate Key

CustomerKey: 1001
ProductKey: 5023
OrderKey: 789456
                    

No business meaning
Never changes
Simple integer

Fact tables reference dimensions using surrogate keys, ensuring data warehouse stability and performance.
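One way to picture surrogate key assignment is a lookup that hands out the next integer the first time a natural key is seen and returns the same integer afterwards. The class below is a simplified in-memory sketch; the class name, the source-system labels, and the starting value 1001 are assumptions. In a warehouse this mapping normally lives in the dimension table itself.

```python
from itertools import count


class SurrogateKeyMap:
    """Assigns a stable integer key to each (source system, natural key) pair."""

    def __init__(self, start=1001):
        self._next_key = count(start)
        self._assigned = {}

    def key_for(self, source, natural_key):
        identity = (source, natural_key)
        if identity not in self._assigned:
            # First time this natural key is seen: hand out the next integer.
            self._assigned[identity] = next(self._next_key)
        return self._assigned[identity]


keys = SurrogateKeyMap()
crm_key = keys.key_for("crm", "CUST-2024-001")
crm_key_again = keys.key_for("crm", "CUST-2024-001")
billing_key = keys.key_for("billing", "CUST-2024-001")
```

Including the source system in the lookup keeps two sources that happen to reuse the same natural key from colliding, which is the integration benefit mentioned above.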

Aggregates & Materialized Views

What are Aggregates?

Pre-calculated summary data stored to improve query performance

Purpose: Speed up analytical queries
Trade-off: Storage space vs. Query performance
Maintenance: Must be refreshed when base data changes
Transparency: Often invisible to end users

Queries that might take minutes on raw data can execute in seconds using aggregates

Materialized Views

Definition

Database objects containing query results stored as physical tables

CREATE MATERIALIZED VIEW MV_SalesByMonth AS
SELECT 
    d.Year,
    d.Month,
    p.Category,
    SUM(f.SalesAmount) AS TotalSales,
    COUNT(DISTINCT f.OrderKey) AS OrderCount  -- DISTINCT so an order with several line items is counted once
FROM FactSales f
JOIN DimDate d ON f.DateKey = d.DateKey
JOIN DimProduct p ON f.ProductKey = p.ProductKey
GROUP BY d.Year, d.Month, p.Category;
            

Physically stored aggregated results
Refresh can be automatic or on demand, depending on the platform and configuration
Where supported, the optimizer can rewrite queries to read from the view

Hierarchies in Data Warehousing

Definition

Logical structures organizing data from detailed to summarized levels

Common Hierarchies

Time Hierarchy

Year → Quarter → Month → Day

Geography Hierarchy

Country → State → City → Store

Rollups (Drill-Up)

Concept

Aggregating data to higher levels in a hierarchy

Daily Sales → Roll up to → Monthly Sales

Monthly Sales → Roll up to → Quarterly Sales

Quarterly Sales → Roll up to → Yearly Sales

Benefits

Reduces data volume at higher levels
Faster query performance for summary reports
Enables top-down analysis
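The roll-up chain above can be sketched in a few lines of Python. The function name roll_up and the sample figures are made up for illustration; the point is only that each higher level is a regrouping of the same daily rows under a coarser key.

```python
from collections import defaultdict
from datetime import date


def roll_up(daily_sales, level):
    """Aggregate (day, amount) pairs to month, quarter, or year level."""
    key_functions = {
        "month": lambda d: (d.year, d.month),
        "quarter": lambda d: (d.year, (d.month - 1) // 3 + 1),
        "year": lambda d: (d.year,),
    }
    key_of = key_functions[level]
    totals = defaultdict(float)
    for day, amount in daily_sales:
        totals[key_of(day)] += amount
    return dict(totals)


# Hypothetical daily sales figures.
DAILY = [
    (date(2024, 1, 5), 100.0),
    (date(2024, 1, 20), 50.0),
    (date(2024, 2, 3), 25.0),
    (date(2024, 4, 1), 10.0),
]
monthly = roll_up(DAILY, "month")
quarterly = roll_up(DAILY, "quarter")
yearly = roll_up(DAILY, "year")
```

Four daily rows become three monthly rows, two quarterly rows, and one yearly row, which is the volume reduction listed under Benefits.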

Drill-Down Operations

Opposite of rollup - navigating from summary to detailed data

-- Example: Drilling down from Year to Month
SELECT
    d.Year, d.Quarter, d.Month,
    SUM(f.SalesAmount) AS TotalSales
FROM FactSales f
JOIN DimDate d ON f.DateKey = d.DateKey
WHERE d.Year = 2024
GROUP BY d.Year, d.Quarter, d.Month
ORDER BY d.Year, d.Quarter, d.Month;
            

Enables root cause analysis
Provides detailed insights
Interactive business intelligence

Precomputed Aggregates

Strategy

Calculate and store aggregates during the ETL process rather than at query time

Types of Precomputed Aggregates

Summary Tables: Permanent aggregate tables
Aggregate Fact Tables: Fact tables at higher grain
OLAP Cubes: Multidimensional aggregates
Indexed Views: Database-managed aggregates

Designing Aggregate Tables

Best Practices

Identify Common Queries: Analyze query patterns
Select Appropriate Grain: Choose aggregation level
Balance Storage: Cost vs. performance benefit
Maintain Consistency: Synchronize with base tables

-- Example: Monthly Sales Aggregate Table
CREATE TABLE FactSalesMonthly AS
SELECT 
    d.YearMonth,
    p.CategoryKey,
    s.RegionKey,
    SUM(f.SalesAmount) AS TotalSales,
    SUM(f.Quantity) AS TotalQuantity,
    COUNT(DISTINCT f.OrderKey) AS OrderCount  -- each order counted once, even with several line items
FROM FactSales f
JOIN DimDate d ON f.DateKey = d.DateKey
JOIN DimProduct p ON f.ProductKey = p.ProductKey
JOIN DimStore s ON f.StoreKey = s.StoreKey
GROUP BY d.YearMonth, p.CategoryKey, s.RegionKey;
            

Aggregate Refresh Strategies

1. Complete Refresh

Drop and rebuild entire aggregate
Simple but resource-intensive
Used for small to medium aggregates

2. Incremental Refresh

Update only changed portions
More complex logic required
Efficient for large aggregates

3. On-Demand Refresh

Update when queried (lazy evaluation)
First query pays performance cost
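The difference between the first two strategies can be sketched with sqlite3. The simplified schema (a YearMonth column directly on the fact table) and the function names are assumptions for this example. The complete refresh rebuilds every month, while the incremental refresh recomputes only the months known to have changed. On-demand refresh is not shown.

```python
import sqlite3


def complete_refresh(db):
    """Empty the aggregate and rebuild it from the full fact table."""
    db.execute("DELETE FROM FactSalesMonthly")
    db.execute(
        "INSERT INTO FactSalesMonthly (YearMonth, TotalSales) "
        "SELECT YearMonth, SUM(SalesAmount) FROM FactSales GROUP BY YearMonth"
    )
    db.commit()


def incremental_refresh(db, changed_months):
    """Recompute only the months that received new or changed fact rows."""
    for year_month in changed_months:
        db.execute("DELETE FROM FactSalesMonthly WHERE YearMonth = ?", (year_month,))
        db.execute(
            "INSERT INTO FactSalesMonthly (YearMonth, TotalSales) "
            "SELECT YearMonth, SUM(SalesAmount) FROM FactSales "
            "WHERE YearMonth = ? GROUP BY YearMonth",
            (year_month,),
        )
    db.commit()


def monthly_totals(db):
    return dict(db.execute("SELECT YearMonth, TotalSales FROM FactSalesMonthly"))


db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE FactSales (YearMonth TEXT, SalesAmount REAL);
    CREATE TABLE FactSalesMonthly (YearMonth TEXT PRIMARY KEY, TotalSales REAL);
    INSERT INTO FactSales VALUES ('2024-01', 100), ('2024-01', 50), ('2024-02', 25);
    """
)
complete_refresh(db)
after_complete = monthly_totals(db)

db.execute("INSERT INTO FactSales VALUES ('2024-02', 75)")  # late-arriving sale
incremental_refresh(db, ["2024-02"])
after_incremental = monthly_totals(db)
```

The extra logic in the incremental path is knowing which months changed; here the caller passes that list in, whereas a real ETL job would usually derive it from the batch being loaded.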

Aggregate Navigation

Concept

Query optimizer automatically routes queries to appropriate aggregate tables

User Query → Query Optimizer → Select Best Aggregate

Transparent to end users - they query base tables, system uses aggregates

Benefits

Automatic query optimization
No application code changes needed
Consistent query results
Significant performance improvements
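The routing decision can be illustrated with a small sketch: among the tables that contain every attribute a query needs, pick the one with the fewest rows. The table names other than FactSales and FactSalesMonthly, the attribute sets, and the row counts below are hypothetical, and real optimizers use far richer metadata and cost models.

```python
def choose_table(required_attributes, candidates):
    """Return the name of the smallest table that covers every required attribute."""
    needed = set(required_attributes)
    usable = [table for table in candidates if needed <= table["attributes"]]
    if not usable:
        raise LookupError("no table can answer this query")
    return min(usable, key=lambda table: table["rows"])["name"]


# Hypothetical catalogue of one base table and two aggregates.
CANDIDATES = [
    {
        "name": "FactSales",
        "attributes": {"DateKey", "YearMonth", "Year", "StoreKey", "CategoryKey", "ProductKey"},
        "rows": 3_650_000_000,
    },
    {
        "name": "FactSalesMonthly",
        "attributes": {"YearMonth", "Year", "StoreKey", "CategoryKey"},
        "rows": 600_000,
    },
    {
        "name": "FactSalesYearly",
        "attributes": {"Year", "CategoryKey"},
        "rows": 100,
    },
]

yearly_by_category = choose_table({"Year", "CategoryKey"}, CANDIDATES)
monthly_by_store = choose_table({"YearMonth", "StoreKey"}, CANDIDATES)
by_product = choose_table({"ProductKey"}, CANDIDATES)
```

A query that needs product-level detail falls back to the base table, since no aggregate in this catalogue keeps ProductKey.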

ETL Best Practices

Data Quality: Implement validation at each stage
Error Handling: Log errors and failed records
Incremental Loading: Use CDC for efficiency
Parallel Processing: Leverage multiple threads/processes
Idempotency: Ensure processes can be safely rerun
Monitoring: Track ETL job performance and failures
Documentation: Maintain clear data lineage

Performance Optimization Strategies

ETL Optimization

Bulk loading operations
Disable or drop indexes during load, then rebuild them afterwards
Partition large tables
Use staging areas
Parallel extraction

Query Optimization

Create appropriate indexes
Use aggregate tables
Implement partitioning
Materialized views
Query result caching

Real-World Example: Retail Analytics

Scenario

Large retail chain with 500+ stores analyzing daily sales

-- Base Fact Table: 10 million rows/day
FactSales (OrderKey, DateKey, StoreKey, ProductKey, 
           Quantity, SalesAmount, CostAmount)

-- Monthly Aggregate: Reduces to 50,000 rows/month
FactSalesMonthly (YearMonth, StoreKey, CategoryKey,
                  TotalSales, TotalCost, TotalQuantity)

-- Performance Improvement:
-- Query on base table: 45 seconds
-- Query on aggregate: 0.8 seconds
-- Speed improvement: 56x faster!
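The figures above can be checked with a few lines of arithmetic. The 30-day month used for the row-count comparison is an assumption, and the function names are only for illustration.

```python
def speedup(base_seconds, aggregate_seconds):
    """How many times faster the aggregate query is than the base-table query."""
    return base_seconds / aggregate_seconds


def row_reduction(rows_per_day, days_in_month, aggregate_rows_per_month):
    """How many base rows are summarised by each aggregate row, on average."""
    return rows_per_day * days_in_month / aggregate_rows_per_month


query_speedup = speedup(45, 0.8)                   # about 56 times faster
reduction = row_reduction(10_000_000, 30, 50_000)  # 6000.0, assuming a 30-day month
```

Under these assumptions each monthly aggregate row stands in for about 6,000 base rows, which is where most of the query-time saving comes from.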
            

Aggregate Storage Hierarchy

Detailed Fact Table (Largest) → Daily Aggregates → Weekly Aggregates → Monthly Aggregates → Yearly Aggregates (Smallest)

Each level provides faster access for queries at that granularity

ETL & Data Warehousing Tools

ETL Tools

Informatica PowerCenter
Microsoft SSIS
Talend
Apache NiFi
AWS Glue
Azure Data Factory

Data Warehouse Platforms

Snowflake
Amazon Redshift
Google BigQuery
Microsoft Azure Synapse
Oracle Exadata
Teradata

Common Challenges & Solutions

Challenge	Solution
Data Quality Issues	Implement data profiling and validation rules
Long ETL Windows	Use incremental loads and parallel processing
Aggregate Maintenance	Automate refresh schedules, use incremental updates
Source System Changes	Implement change detection mechanisms
Storage Costs	Balance aggregate levels, implement data archiving

Key Takeaways

ETL is the foundation of data warehousing - Extract, Transform, Load
SCD Types manage dimensional changes over time (Type 2 most common)
Surrogate keys provide stability and enable history tracking
Aggregates dramatically improve query performance
Hierarchies enable drill-down and roll-up analysis
Precomputed aggregates trade storage for query speed
Proper design balances performance, storage, and maintenance

Discussion & Questions

Topics for Further Exploration

How would you design an ETL process for a multi-source environment?
When would you choose SCD Type 1 vs Type 2?
What factors influence aggregate table design?
How do cloud platforms change ETL strategies?
Real-time vs Batch ETL - trade-offs?

Questions?

