Introduction to Data Warehousing
Database Systems - MTech First Semester

Topics Covered

  • 📊 OLTP vs OLAP Systems
  • 🎯 Characteristics of Data Warehousing
  • 💡 Need for Data Warehousing
  • 🏗️ Warehouse Architecture (2-tier & 3-tier)
  • 📦 Staging Area & Data Marts
  • 📋 Metadata Management
Dr. Mohsin Dar
Assistant Professor, Cloud & Software Operations Cluster
UPES University
What is Data Warehousing?

Definition

A Data Warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.

Key Characteristics

  • Subject-Oriented: Organized around major subjects (customers, products, sales) rather than applications
  • Integrated: Data from multiple heterogeneous sources is consolidated and made consistent
  • Time-Variant: Historical data is maintained to enable trend analysis and forecasting
  • Non-Volatile: Data is stable and doesn't change once entered; only loaded and accessed

Purpose: Enable business intelligence, reporting, and advanced analytics for strategic decision-making

OLTP vs OLAP
Understanding the Fundamental Difference
Aspect OLTP (Online Transaction Processing) OLAP (Online Analytical Processing)
Purpose Day-to-day operations and transactions Analysis, reporting, and decision support
Data Source Operational data (current) Consolidated data from multiple sources
Query Type Simple, standardized queries Complex queries with aggregations
Operations INSERT, UPDATE, DELETE SELECT (mostly read-only)
Response Time Milliseconds Seconds to minutes
Users Thousands (operational staff) Hundreds (analysts, managers)
OLTP vs OLAP (Continued)
Aspect OLTP OLAP
Data Volume Gigabytes (GB) Terabytes (TB) to Petabytes (PB)
Database Design Normalized (3NF, BCNF) Denormalized (Star/Snowflake schema)
History Current data (60-90 days) Historical data (years)
Examples Banking, E-commerce, Inventory Business Intelligence, Data Mining, Forecasting
Backup Regular backups critical Periodic loading and refreshing
Key Insight: OLTP focuses on transaction speed and data integrity, while OLAP focuses on query performance and analytical capabilities.
Need for Data Warehousing

Why Organizations Need Data Warehouses

Business Challenges

  • Data scattered across multiple systems
  • Inconsistent data formats
  • Poor data quality and redundancy
  • Lack of historical perspective

Decision-Making Needs

  • Strategic planning requires insights
  • Trend analysis and forecasting
  • Customer behavior analysis
  • Performance monitoring

Key Benefits

  • Consolidated View: Single source of truth for the entire organization
  • Improved Query Performance: Optimized for complex analytical queries
  • Historical Intelligence: Maintain years of data for trend analysis
  • Data Quality: Cleansed, validated, and standardized data
  • Enhanced Decision Making: Better insights lead to informed decisions
  • Separation of Concerns: Analytics don't impact operational systems
Data Warehouse Architecture
Building Blocks of a Data Warehouse

Core Components

  • Data Sources: Operational databases, external data, flat files, APIs
  • ETL Process: Extract, Transform, Load pipeline
  • Data Storage: Central repository with optimized schema
  • Metadata Repository: Data about data
  • Access Tools: Reporting, querying, and analytics interfaces

Architecture Types

Data warehouses can be implemented in different architectural patterns:

  • 2-Tier Architecture: Simplified structure with minimal layers
  • 3-Tier Architecture: Standard industry approach with clear separation

Next slides will explore each architecture in detail...

2-Tier Architecture

Tier 1: Data Sources

Operational Databases
External Sources
Flat Files

Tier 2: Data Warehouse

ETL Process
Data Storage
Query & Analysis Tools

Characteristics

  • Direct Connection: Source systems directly connected to warehouse
  • Simplified Structure: Fewer layers and components
  • Faster Implementation: Quicker to set up and deploy

Advantages

  • Lower initial cost and complexity
  • Easier to maintain with fewer moving parts
  • Suitable for small to medium organizations

Disadvantages

  • Limited scalability for large enterprises
  • Performance issues with complex transformations
  • Less flexibility in data processing
3-Tier Architecture
Industry Standard Approach

Bottom Tier

Data Sources
ETL Tools
Data Acquisition

Middle Tier

Data Warehouse Server
OLAP Server
Data Storage

Top Tier

Client Tools
Reporting
Analytics & BI

Data Warehouse 3-Tier Architecture

Figure: 3-Tier Data Warehouse Architecture

Tier Descriptions

  • Bottom Tier (Data Sources): Consists of data acquisition from various sources, ETL processes, and staging area
  • Middle Tier (Data Warehouse Server): Central repository, OLAP engine, and metadata management
  • Top Tier (Client Layer): Front-end tools for querying, reporting, and data visualization
3-Tier Architecture - Benefits

Advantages

  • Scalability: Each tier can be scaled independently based on needs
  • Performance: OLAP server optimizes query performance
  • Flexibility: Can support multiple data sources and client tools
  • Maintainability: Clear separation of concerns makes updates easier
  • Security: Better security implementation at each layer
  • Platform Independence: Different technologies at different tiers

Disadvantages

  • Higher initial cost and complexity
  • Requires more skilled personnel to manage
  • Longer implementation time

Industry Standard: 3-tier architecture is the most commonly used approach in enterprise data warehousing due to its robustness and scalability.

Staging Area

Definition

A Staging Area is an intermediate storage area between data sources and the data warehouse where data is temporarily held for transformation and quality checks.

Purpose and Functions

  • Data Consolidation: Collect data from multiple heterogeneous sources
  • Data Transformation: Clean, validate, and transform data before loading
  • Data Quality Assurance: Identify and fix data quality issues
  • Change Detection: Identify new, modified, or deleted records
  • Error Handling: Isolate problematic records for review

Key Characteristics

  • Temporary storage - not for querying or reporting
  • Can be cleared after successful load to warehouse
  • May contain raw and transformed versions of data
  • Supports incremental and full data loads

Best Practice: A well-designed staging area improves ETL efficiency and data quality significantly.

Data Marts

Definition

A Data Mart is a subset of a data warehouse focused on a specific business line, department, or subject area.

Types of Data Marts

Dependent Data Marts

  • Created from central data warehouse
  • Ensures consistency
  • Top-down approach

Independent Data Marts

  • Standalone systems
  • Direct from source systems
  • Bottom-up approach

Benefits of Data Marts

  • Focused Data: Contains only relevant data for specific users
  • Improved Performance: Smaller size leads to faster queries
  • Cost-Effective: Less expensive to implement than full warehouse
  • User-Specific: Customized to departmental needs
  • Quick Deployment: Faster to implement and deliver value

Common Examples

  • Sales Data Mart, Marketing Data Mart, Finance Data Mart, HR Data Mart
Metadata in Data Warehousing

Definition

Metadata is "data about data" - information that describes the data stored in the warehouse, its structure, meaning, and usage.

Types of Metadata

1. Technical Metadata

  • Database schemas, table structures, data types
  • Source system details and mappings
  • ETL job definitions and schedules
  • Data transformation rules and logic

2. Business Metadata

  • Business definitions and terminology
  • Data ownership and stewardship information
  • Business rules and validation criteria
  • Report definitions and KPI calculations

3. Operational Metadata

  • Data lineage and audit trails
  • Load statistics and job execution history
  • Data quality metrics and error logs
  • Usage statistics and access patterns
Importance of Metadata

Why Metadata is Critical

  • Data Discovery: Helps users find and understand available data
  • Data Lineage: Tracks data from source to destination
  • Impact Analysis: Understand effects of changes to data or processes
  • Data Governance: Supports compliance and regulatory requirements
  • ETL Management: Documents transformation logic and dependencies
  • Quality Assurance: Helps identify and resolve data quality issues

Metadata Management System

A centralized repository that stores, manages, and provides access to all metadata. It serves as a catalog for the data warehouse.

Key Components

  • Metadata repository database
  • Metadata extraction and loading tools
  • Metadata query and reporting interfaces
  • Version control and change management

Best Practice: Invest time in comprehensive metadata management - it pays dividends in warehouse usability and maintenance.

Summary

Key Takeaways

Data Warehousing Fundamentals

  • Data warehouses are designed for analytics, not transactions
  • OLTP handles operations; OLAP handles analysis
  • Subject-oriented, integrated, time-variant, and non-volatile

Architecture Choices

  • 2-tier: Simpler, suitable for smaller implementations
  • 3-tier: Industry standard, scalable and flexible
  • Staging area crucial for data quality and transformation

Data Organization

  • Data marts provide focused, departmental views
  • Metadata is essential for managing and understanding data
  • Proper architecture enables better decision-making

Questions?

References & Further Reading

Recommended Resources

  • Books:
    • "The Data Warehouse Toolkit" by Ralph Kimball
    • "Building the Data Warehouse" by W.H. Inmon
    • "Data Warehouse Design: Modern Principles and Methodologies" by Golfarelli & Rizzi
  • Key Concepts to Explore:
    • Star Schema and Snowflake Schema Design
    • Dimensional Modeling Techniques
    • ETL Best Practices and Tools
    • Data Quality Management
    • Real-time Data Warehousing

Thank You!

Dr. Mohsin Dar
Assistant Professor
Cloud & Software Operations Cluster | SOCS
UPES University

1 / 15