Query Processing in Distributed Databases

Query Decomposition, Data Localization,
Distributed Query Processing,
Transaction Management & Concurrency Control

Database Systems | M.Tech First Semester

Prof. Mohsin F. Dar
Assistant Professor, Cloud & Software Operations Cluster
SOCS | UPES

Introduction to Distributed Databases

What is a Distributed Database?

A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network, appearing as a single database to users.

Key Characteristics:

📊 Data Distribution

Data is stored across multiple sites

🔗 Network Connection

Sites connected via communication network

🎯 Transparency

Users see it as one logical database

⚡ Autonomy

Each site has local processing capability


Query Processing in Distributed Databases

Query Processing Layers

Query Decomposition
Data Localization
Global Optimization
Local Optimization

Key Objectives:

  • Minimize data transfer across the network
  • Maximize parallel execution at different sites
  • Reduce response time for queries
  • Optimize resource utilization across all nodes

Query Decomposition

Definition

Query decomposition transforms a high-level query (SQL) into an algebraic query on global relations, independent of data distribution.

Steps in Query Decomposition:

1 Normalization: Convert query to normalized form (eliminate redundancy)
2 Analysis: Detect and reject incorrect queries (type errors, semantic errors)
3 Simplification: Eliminate redundant predicates
4 Restructuring: Convert to relational algebra and optimize

Query Decomposition Example

Original SQL Query:

SELECT E.Name, D.DeptName
FROM   Employee E, Department D
WHERE  E.DeptID = D.DeptID
  AND  E.Salary > 50000
  AND  D.Location = 'Mumbai'

Relational Algebra Expression:

π Name, DeptName ( σ Salary>50000 AND Location='Mumbai' ( Employee ⋈ DeptID Department ) )

Optimized (Push selections down):

π Name, DeptName ( (σ Salary>50000 (Employee)) ⋈ DeptID (σ Location='Mumbai' (Department)) )
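The effect of pushing selections down can be sketched in plain Python over hypothetical in-memory tables (the rows and names below are made up for illustration): each relation is filtered before the join, so fewer tuples reach the expensive join step.

```python
# Hypothetical in-memory stand-ins for the global relations.
employees = [
    {"Name": "Asha", "DeptID": 1, "Salary": 60000},
    {"Name": "Ravi", "DeptID": 2, "Salary": 45000},
]
departments = [
    {"DeptID": 1, "DeptName": "Sales", "Location": "Mumbai"},
    {"DeptID": 2, "DeptName": "HR", "Location": "Delhi"},
]

# Push selections down: filter BEFORE joining.
emp_sel = [e for e in employees if e["Salary"] > 50000]
dept_sel = [d for d in departments if d["Location"] == "Mumbai"]

# Join on DeptID, then project Name and DeptName.
result = [
    {"Name": e["Name"], "DeptName": d["DeptName"]}
    for e in emp_sel
    for d in dept_sel
    if e["DeptID"] == d["DeptID"]
]
print(result)  # [{'Name': 'Asha', 'DeptName': 'Sales'}]
```

Here the join compares 1 × 1 tuples instead of 2 × 2; in a distributed setting the same reordering also shrinks what must cross the network.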

Data Localization

What is Data Localization?

Data localization transforms global queries into fragment queries by determining which database fragments are involved in the query.

Types of Fragmentation:

Type       | Description                          | Example
Horizontal | Rows distributed based on conditions | Employees by region
Vertical   | Columns distributed across sites     | Employee personal vs. salary data
Hybrid     | Combination of both                  | Regional + departmental split
💡 Key Concept: Localization reduces the search space by identifying only relevant fragments

Data Localization Example

Scenario: Employee Table Horizontally Fragmented

Fragment F1 (Site 1)

Employees where
Location = 'Mumbai'

Fragment F2 (Site 2)

Employees where
Location = 'Delhi'

Query:

SELECT * FROM Employee WHERE Location = 'Mumbai' AND Salary > 50000

Localized Query:

The query only needs to access Fragment F1 at Site 1, eliminating unnecessary network communication with Site 2.

SELECT * FROM F1 WHERE Salary > 50000 -- Only executed at Site 1
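The pruning step can be sketched as a lookup against a fragment catalog. This is a minimal illustration, assuming a hypothetical catalog that records each fragment's defining predicate and site; a real localizer would match arbitrary predicates, not just equality on one column.

```python
# Hypothetical fragment catalog: each fragment's site and defining predicate.
fragments = {
    "F1": {"site": 1, "Location": "Mumbai"},
    "F2": {"site": 2, "Location": "Delhi"},
}

def relevant_fragments(query_location):
    """Keep only fragments whose defining predicate can satisfy the query,
    so the remaining sites are never contacted."""
    return [name for name, meta in fragments.items()
            if meta["Location"] == query_location]

print(relevant_fragments("Mumbai"))  # ['F1']
```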

Distributed Query Processing

Cost Factors in Distributed Queries

  • Communication Cost: Data transfer across network (typically dominates)
  • Local Processing Cost: CPU and I/O at each site
  • Memory Cost: Buffer allocation and management

Query Optimization Strategies:

1 Reduce Data Transfer: Apply selections and projections early
2 Parallel Execution: Execute independent operations simultaneously
3 Replicate vs. Transfer: Decide whether to move data or replicate it
4 Join Strategy: Choose appropriate join method (nested loop, semi-join, bloom join)

Distributed Join Strategies

1. Ship Whole Relations

Transfer one entire relation to the site of the other relation.

Best when: One relation is small, network bandwidth is high

2. Semi-Join

Send only the joining attributes to reduce data transfer.

-- Step 1: Send projection of R1.JoinKey to Site 2
-- Step 2: Filter R2 using received keys
-- Step 3: Send filtered R2 back to Site 1
-- Step 4: Perform final join at Site 1

Advantage: Reduces network traffic significantly
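The four steps above can be sketched in Python. The relations, key names, and values below are hypothetical; each "ship" is simulated by a plain variable assignment.

```python
# Site 1 holds R1, Site 2 holds R2 (hypothetical data).
r1 = [{"JoinKey": 1, "A": "x"}, {"JoinKey": 2, "A": "y"}]
r2 = [{"JoinKey": 1, "B": "p"}, {"JoinKey": 3, "B": "q"}]

# Step 1: ship only the join-key projection of R1 to Site 2 (a small message).
keys = {t["JoinKey"] for t in r1}

# Step 2: at Site 2, keep only matching R2 tuples (the semi-join R2 ⋉ R1).
r2_filtered = [t for t in r2 if t["JoinKey"] in keys]

# Step 3: ship the (smaller) filtered R2 back to Site 1.
# Step 4: perform the final join locally at Site 1.
result = [{**a, **b} for a in r1 for b in r2_filtered
          if a["JoinKey"] == b["JoinKey"]]
print(result)  # [{'JoinKey': 1, 'A': 'x', 'B': 'p'}]
```

Only the key set and the filtered tuples cross the network, instead of all of R2.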

3. Bloom Join

Uses bit vectors (Bloom filters) to filter tuples before transfer.

Advantage: Even less data transfer than semi-join
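A minimal Bloom-filter sketch, assuming a toy 64-bit vector and hash positions derived from SHA-256 (real systems tune the size and hash count to the expected key set). Site 1 ships only the bit vector; because false positives are possible, the final join still re-checks the keys.

```python
import hashlib

M = 64  # bit-vector size (tiny, for illustration only)

def positions(key, k=3):
    """Derive k bit positions for a key from a cryptographic hash."""
    h = hashlib.sha256(str(key).encode()).digest()
    return [h[i] % M for i in range(k)]

def build_filter(keys):
    """Set the bits for every join key; the result is one small integer."""
    bits = 0
    for key in keys:
        for p in positions(key):
            bits |= 1 << p
    return bits

def maybe_member(bits, key):
    """False means definitely absent; True means possibly present."""
    return all(bits >> p & 1 for p in positions(key))

# Site 1 builds the filter over R1's join keys and ships it to Site 2.
bloom = build_filter([1, 2])

# Site 2 discards tuples the filter definitely rules out before shipping.
r2 = [{"JoinKey": 1}, {"JoinKey": 3}]
candidates = [t for t in r2 if maybe_member(bloom, t["JoinKey"])]
```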


Distributed Transaction Management

ACID Properties in Distributed Environment

⚛️ Atomicity

All or nothing execution across all sites

🔒 Consistency

Database integrity maintained globally

🔐 Isolation

Transactions don't interfere across sites

💾 Durability

Committed changes persist despite failures

Challenges in Distributed Transactions:

  • Multiple sites must coordinate commit/abort decisions
  • Network failures can partition the system
  • Site failures require recovery mechanisms
  • Ensuring global serializability is complex

Two-Phase Commit (2PC) Protocol

Purpose:

Ensures atomicity in distributed transactions: either all sites commit, or all sites abort.

Phase 1: Prepare/Voting Phase

1 Coordinator sends PREPARE message to all participants
2 Each participant votes YES (ready to commit) or NO (abort)
3 Participants write to log and wait for decision

Phase 2: Commit/Abort Phase

4 If all votes YES: Coordinator sends COMMIT to all
5 If any vote NO: Coordinator sends ABORT to all
6 Participants execute decision and send acknowledgment
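The coordinator's decision logic can be condensed to a few lines. This sketch abstracts each participant as a callable that returns its vote; logging, the actual message exchange, and acknowledgments (steps participants perform in a real system) are noted in comments but not simulated.

```python
def two_phase_commit(participants):
    """participants: mapping of site name -> callable returning its vote
    (True = YES, ready to commit; False = NO, must abort)."""
    # Phase 1: coordinator sends PREPARE and collects every vote.
    votes = {site: vote() for site, vote in participants.items()}

    # Phase 2: COMMIT only if *all* participants voted YES.
    decision = "COMMIT" if all(votes.values()) else "ABORT"
    # (In a real system the coordinator logs the decision, sends it to
    #  every participant, and waits for acknowledgments.)
    return decision

print(two_phase_commit({"S1": lambda: True, "S2": lambda: True}))   # COMMIT
print(two_phase_commit({"S1": lambda: True, "S2": lambda: False}))  # ABORT
```

A single NO vote, or a participant that never answers, forces a global abort; this is exactly what makes 2PC safe but blocking.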

Distributed Concurrency Control

Why is it Complex?

Multiple transactions can access data at different sites simultaneously, requiring coordination to maintain consistency.

Concurrency Control Techniques:

Technique          | Approach                         | Advantage
Locking            | Use distributed locks            | Prevents conflicts proactively
Timestamp          | Assign global timestamps         | No deadlocks
Optimistic         | Validate at commit time          | High concurrency for read-heavy workloads
Snapshot Isolation | Each transaction sees a snapshot | Readers don't block writers

Distributed Locking Protocols

1. Primary Copy Locking

One site is designated as the primary for each data item and handles all lock requests for that item.

Advantage: Simple implementation, easy to detect deadlocks

Disadvantage: Primary site becomes bottleneck

2. Distributed 2PL (Two-Phase Locking)

Transactions acquire locks in growing phase, release in shrinking phase across all sites.

⚠️ Deadlock Challenge: Distributed deadlocks are harder to detect because the wait-for cycle may span multiple sites

Deadlock Handling Approaches:

  • Prevention: Use timestamp ordering or wound-wait schemes
  • Detection: Maintain global wait-for graph
  • Timeout: Abort transactions that wait too long
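The detection approach reduces to finding a cycle in the global wait-for graph once the per-site edges are merged. A minimal depth-first-search sketch, with a hypothetical two-site deadlock as input:

```python
def has_deadlock(wait_for):
    """Detect a cycle in a global wait-for graph {txn: [txns it waits for]}."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on DFS stack / done
    color = {t: WHITE for t in wait_for}

    def visit(t):
        color[t] = GRAY
        for u in wait_for.get(t, []):
            if color.get(u, WHITE) == GRAY:   # back edge => cycle => deadlock
                return True
            if color.get(u, WHITE) == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in wait_for)

# T1 (at site A) waits for T2 (at site B), which waits back for T1:
print(has_deadlock({"T1": ["T2"], "T2": ["T1"]}))  # True
print(has_deadlock({"T1": ["T2"], "T2": []}))      # False
```

The distributed difficulty lies not in this search but in assembling the graph: each site only sees its own edges, so the cycle T1 → T2 → T1 above is invisible to either site alone.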

Timestamp-Based Concurrency Control

How it Works:

Each transaction T receives a unique timestamp TS(T) when it starts. The system uses these timestamps to order conflicting operations and ensure serializability.

Timestamp Ordering Rules:

Read Rule: If TS(T) < write_timestamp(X), reject the read and roll back T
Write Rule: If TS(T) < read_timestamp(X) or TS(T) < write_timestamp(X), reject the write and roll back T
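The two rules can be sketched directly (a minimal, single-item illustration; a successful operation also advances the item's read/write timestamp, which the rules then check for later transactions):

```python
def try_read(ts_t, item):
    """Basic timestamp-ordering read rule for a txn with timestamp ts_t."""
    if ts_t < item["write_ts"]:
        return "ROLLBACK"                 # a younger txn already wrote X
    item["read_ts"] = max(item["read_ts"], ts_t)
    return "OK"

def try_write(ts_t, item):
    """Basic timestamp-ordering write rule."""
    if ts_t < item["read_ts"] or ts_t < item["write_ts"]:
        return "ROLLBACK"                 # a younger txn already read or wrote X
    item["write_ts"] = ts_t
    return "OK"

x = {"read_ts": 0, "write_ts": 0}
print(try_read(5, x))   # OK        (read_ts of X becomes 5)
print(try_write(3, x))  # ROLLBACK  (txn 3 is older than reader 5)
print(try_write(7, x))  # OK
```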

Generating Global Timestamps:

  • Centralized: Single timestamp authority (simple but bottleneck)
  • Lamport Timestamps: Each site maintains counter with message passing rules
  • Vector Clocks: Track causality relationships between events
✅ Advantage: No deadlocks possible!
⚠️ Disadvantage: May cause more transaction rollbacks
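The Lamport option above can be sketched as a per-site counter with two rules: increment on every local event, and on receiving a message jump past the sender's timestamp. (To make timestamps globally unique, real systems break ties by appending the site ID; that detail is omitted here.)

```python
class LamportClock:
    """Per-site logical clock: tick on local events, merge on receive."""
    def __init__(self):
        self.counter = 0

    def tick(self):
        """Local event, e.g. a transaction starting at this site."""
        self.counter += 1
        return self.counter

    def send(self):
        """Timestamp attached to an outgoing message."""
        return self.tick()

    def receive(self, msg_ts):
        """Advance past the sender's timestamp, then count the receive event."""
        self.counter = max(self.counter, msg_ts) + 1
        return self.counter

site_a, site_b = LamportClock(), LamportClock()
t1 = site_a.tick()        # 1: a local event at site A
msg = site_a.send()       # 2: A sends a message stamped 2
t2 = site_b.receive(msg)  # 3: B jumps to max(0, 2) + 1
```

The merge rule guarantees that if event e happened before event f (causally), then TS(e) < TS(f), which is exactly what timestamp ordering needs across sites.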

Summary & Key Takeaways

Query Processing:

  • Query Decomposition: Transform SQL to optimized relational algebra
  • Data Localization: Identify relevant fragments to minimize data access
  • Distributed Optimization: Minimize network communication, maximize parallelism
  • Join Strategies: Semi-join and bloom join reduce data transfer

Transaction Management:

  • 2PC Protocol: Ensures atomicity across distributed sites
  • Concurrency Control: Multiple approaches (locking, timestamp, optimistic)
  • Challenges: Network failures, deadlocks, global serializability

The Golden Rule

"Minimize data movement, maximize parallel execution"

Thank You!
Questions & Discussion