ETL Processing & Aggregates

Dimensional Database Design - Unit V

Dr. Mohsin Dar

Assistant Professor

Cloud & Software Operations Cluster | SOCS

University of Petroleum and Energy Studies (UPES)

MTech - Database Systems | First Semester

Today's Agenda

What is ETL?

Extract, Transform, Load

ETL is the backbone of data warehousing, ensuring data quality and consistency for business intelligence and analytics.

ETL Workflow Architecture

Source Systems
Extract
Staging Area
Staging Area
Transform
Data Warehouse

ETL Workflow Steps

1. Extract Phase

2. Transform Phase

ETL Workflow - Load Phase

3. Load Phase Strategies

-- Example: Incremental Load Strategy INSERT INTO DimCustomer SELECT * FROM StagingCustomer WHERE LastModifiedDate > (SELECT MAX(LoadDate) FROM ETLControl);

Slowly Changing Dimensions (SCD)

Methods to handle changes in dimensional attributes over time

SCD Type 1: Overwrite

Characteristics

  • Simplest approach
  • No history maintained
  • Old values overwritten
  • Minimal storage

Use Cases

  • Correcting data errors
  • Non-significant changes
  • Current value analysis only
-- Example: Update customer address UPDATE DimCustomer SET City = 'New Delhi', State = 'Delhi' WHERE CustomerKey = 12345;

SCD Type 2: Add New Row

Most commonly used approach for maintaining complete history

CustomerKey CustomerID City StartDate EndDate IsCurrent
1001 C123 Mumbai 2020-01-01 2023-05-31 N
1002 C123 Bangalore 2023-06-01 9999-12-31 Y

SCD Type 3: Add New Column

Maintains limited history using additional columns

CustomerKey CustomerID CurrentCity PreviousCity EffectiveDate
1001 C123 Bangalore Mumbai 2023-06-01

Advantages

  • Simple queries
  • Limited storage
  • Easy comparison

Limitations

  • Fixed history depth
  • Not scalable
  • Multiple changes limited

Surrogate Keys

Definition

System-generated unique identifiers that have no business meaning

Why Use Surrogate Keys?

Surrogate Keys - Example

Natural Key

CustomerID: "CUST-2024-001" ProductCode: "PROD-ELC-2024" OrderNumber: "ORD/2024/12345"
  • Business meaning
  • Can change
  • Complex structure

Surrogate Key

CustomerKey: 1001 ProductKey: 5023 OrderKey: 789456
  • No business meaning
  • Never changes
  • Simple integer

Fact tables reference dimensions using surrogate keys, ensuring data warehouse stability and performance.

Aggregates & Materialized Views

What are Aggregates?

Pre-calculated summary data stored to improve query performance

Queries that might take minutes on raw data can execute in seconds using aggregates

Materialized Views

Definition

Database objects containing query results stored as physical tables

CREATE MATERIALIZED VIEW MV_SalesByMonth AS SELECT d.Year, d.Month, p.Category, SUM(f.SalesAmount) AS TotalSales, COUNT(f.OrderKey) AS OrderCount FROM FactSales f JOIN DimDate d ON f.DateKey = d.DateKey JOIN DimProduct p ON f.ProductKey = p.ProductKey GROUP BY d.Year, d.Month, p.Category;

Hierarchies in Data Warehousing

Definition

Logical structures organizing data from detailed to summarized levels

Common Hierarchies

Time Hierarchy

Year
Quarter
Month
Day

Geography Hierarchy

Country
State
City
Store

Rollups (Drill-Up)

Concept

Aggregating data to higher levels in a hierarchy

Daily Sales → Roll up to → Monthly Sales

Monthly Sales → Roll up to → Quarterly Sales

Quarterly Sales → Roll up to → Yearly Sales

Benefits

Drill-Down Operations

Opposite of rollup - navigating from summary to detailed data

-- Example: Drilling down from Year to Month SELECT Year, Quarter, Month, SUM(SalesAmount) AS TotalSales FROM FactSales f JOIN DimDate d ON f.DateKey = d.DateKey WHERE Year = 2024 GROUP BY Year, Quarter, Month ORDER BY Year, Quarter, Month;

Precomputed Aggregates

Strategy

Calculate and store aggregates during ETL process rather than query time

Types of Precomputed Aggregates

Designing Aggregate Tables

Best Practices

-- Example: Monthly Sales Aggregate Table CREATE TABLE FactSalesMonthly AS SELECT d.YearMonth, p.CategoryKey, s.RegionKey, SUM(f.SalesAmount) AS TotalSales, SUM(f.Quantity) AS TotalQuantity, COUNT(*) AS OrderCount FROM FactSales f JOIN DimDate d ON f.DateKey = d.DateKey JOIN DimProduct p ON f.ProductKey = p.ProductKey JOIN DimStore s ON f.StoreKey = s.StoreKey GROUP BY d.YearMonth, p.CategoryKey, s.RegionKey;

Aggregate Refresh Strategies

1. Complete Refresh

2. Incremental Refresh

3. On-Demand Refresh

Aggregate Navigation

Concept

Query optimizer automatically routes queries to appropriate aggregate tables

User Query → Query Optimizer → Select Best Aggregate

Transparent to end users - they query base tables, system uses aggregates

Benefits

ETL Best Practices

Performance Optimization Strategies

ETL Optimization

  • Bulk loading operations
  • Disable indexes during load
  • Partition large tables
  • Use staging areas
  • Parallel extraction

Query Optimization

  • Create appropriate indexes
  • Use aggregate tables
  • Implement partitioning
  • Materialized views
  • Query result caching

Real-World Example: Retail Analytics

Scenario

Large retail chain with 500+ stores analyzing daily sales

-- Base Fact Table: 10 million rows/day FactSales (OrderKey, DateKey, StoreKey, ProductKey, Quantity, SalesAmount, CostAmount) -- Monthly Aggregate: Reduces to 50,000 rows/month FactSalesMonthly (YearMonth, StoreKey, CategoryKey, TotalSales, TotalCost, TotalQuantity) -- Performance Improvement: -- Query on base table: 45 seconds -- Query on aggregate: 0.8 seconds -- Speed improvement: 56x faster!

Aggregate Storage Hierarchy

Detailed Fact Table (Largest)
Daily Aggregates
Weekly Aggregates
Monthly Aggregates
Yearly Aggregates (Smallest)

Each level provides faster access for queries at that granularity

ETL & Data Warehousing Tools

ETL Tools

  • Informatica PowerCenter
  • Microsoft SSIS
  • Talend
  • Apache NiFi
  • AWS Glue
  • Azure Data Factory

Data Warehouse Platforms

  • Snowflake
  • Amazon Redshift
  • Google BigQuery
  • Microsoft Azure Synapse
  • Oracle Exadata
  • Teradata

Common Challenges & Solutions

Challenge Solution
Data Quality Issues Implement data profiling and validation rules
Long ETL Windows Use incremental loads and parallel processing
Aggregate Maintenance Automate refresh schedules, use incremental updates
Source System Changes Implement change detection mechanisms
Storage Costs Balance aggregate levels, implement data archiving

Key Takeaways

Discussion & Questions

Topics for Further Exploration

Questions?

Dr. Mohsin Dar | UPES

References & Further Reading

Thank You!

Dr. Mohsin Dar

Cloud & Software Operations Cluster | SOCS | UPES

1 / 30