Distributed Databases

Database Systems Course

M.Tech - First Semester

Mohsin F. Dar

Assistant Professor

Cloud & Software Operations Cluster | SOCS

University of Petroleum and Energy Studies (UPES)

What is a Distributed Database?

A Distributed Database is a collection of multiple, logically interrelated databases distributed over a computer network.

Key Characteristics

Data is physically distributed across multiple sites

Sites are connected via a communication network

Each site can process local applications autonomously

Each site participates in at least one global application

1) Physical distribution of data (multiple sites)

Data is stored at different geographic locations (sites/nodes). A relation may exist fully at one site, or be fragmented and stored across sites, or replicated at multiple sites.

Example: A bank stores Customer data at HQ, while each branch stores local Accounts and Transactions for faster access.

2) Network connectivity (communication links)

Sites communicate over LAN/WAN to exchange data and coordinate queries/transactions. Network quality affects response time and reliability.

Example: An airline reservation system connects airport sites so seat availability updates propagate across regions.

3) Local autonomy (local control + local processing)

Each site can execute local queries/transactions and enforce local policies (e.g., access rules) even while participating in global operations.

Example: A university’s multi-campus setup lets each campus manage lab inventories locally while the university runs centralized analytics.

4) Global applications (single logical database)

Users/applications see the distributed database as one logical database. A global query may access multiple sites, combine partial results, and return one answer.

Example: An e-commerce “track order” query may join warehouse inventory + shipping updates from different sites.

Figure: One logical database, many sites (high-level view)

A distributed database is spread across sites, but presented as one logical database. Query processing and transaction control coordinate work across the network.

📊 Distribution Transparency

Users can access data without knowing its physical location

Types of Transparency:

• Location transparency: user queries a table without knowing which site stores it

• Replication transparency: multiple copies exist, but updates/reads are handled automatically

• Fragmentation transparency: data is split (horizontal/vertical) but appears as one logical table

Example (how transparency feels to the user)

User Query SELECT * FROM Student WHERE RollNo = 101;

Reality Student records may be horizontally fragmented by campus (Dehradun/Delhi/Hyderabad), and a hot subset may be replicated at two sites for availability.

The DDBMS uses the global dictionary to find the right fragment/replica and returns a single answer.

Need for Distributed Database Management System

🌍 Organizational Structure

Match database distribution to organizational structure

Modern organizations are geographically distributed across cities, countries, or continents. A centralized database does not reflect this structure and often leads to inefficiencies. A Distributed DBMS allows data to be stored and managed at locations where business activities actually occur, while still presenting a unified database view.

Example: A retail chain has branches in Delhi, Mumbai, and Bangalore. Each branch maintains local sales and inventory data, while the head office accesses consolidated data for strategic decisions.

Why needed: Aligns IT infrastructure with real-world organizational workflows.

📈 Improved Performance

Data is closer to the point of use

Performance improves when data is stored near the users who access it most frequently. Local access reduces communication delays, network congestion, and response time. Distributed DBMS also enables parallel execution of queries across multiple sites.

Example: Customer records frequently accessed by a regional support team are stored locally, reducing dependence on long-distance network connections.

Key idea: Local queries run locally; global queries execute in parallel.

💰 Economics

Cost-effective compared to centralized systems

Distributed systems are economically attractive because they rely on multiple low-cost servers instead of one high-end centralized machine. Organizations can reuse existing hardware and gradually expand capacity.

Example: Instead of upgrading a single expensive data center server, an organization deploys smaller servers at branch offices.

Advantage: Lower initial cost, incremental investment, and better return on infrastructure.

⚡ Increased Reliability

No single point of failure

In centralized databases, a single failure can bring the entire system down. Distributed DBMS improves reliability by replicating data across multiple sites. Even if one site fails, others can continue to operate.

Example: If a database server in one city fails due to a power outage, users are automatically redirected to another replica.

Important note: Reliability improves availability, but consistency must be carefully managed.

🔄 Scalability

Easy expansion by adding new sites

Distributed DBMS supports horizontal scalability, where capacity is increased by adding new nodes rather than upgrading existing ones. This makes it suitable for growing organizations and applications with increasing data volume.

Example: When a hospital chain opens a new branch, a new database site is added without affecting existing sites.

Result: Storage and processing power grow linearly with demand.

🔐 Local Autonomy

Sites maintain control over local data

Each site in a distributed system can enforce its own security rules, policies, and operational procedures. This is essential when different departments or regions have legal or operational independence.

Example: A university campus controls its student records locally while sharing summary data for university-wide analysis.

Benefit: Independence at local sites with coordination at the global level.

Advantages vs Disadvantages

Advantages	Disadvantages
✓ Reflects organizational structure naturally	✗ Increased complexity in design and management
✓ Improved shareability and local autonomy	✗ Security and integrity control more difficult
✓ Better availability and reliability	✗ Lack of experienced personnel and standards
✓ Enhanced performance through parallelism	✗ Database design more complex
✓ Economics of scale with smaller systems	✗ Higher operational and maintenance costs
✓ Modular growth and scalability	✗ Additional hardware and network costs

Distributed DBMS Architecture - Components

Core Components

Computer Workstations (Sites/Nodes)

Communication Network (LAN/WAN)

Local Database Management System

Global Data Dictionary

Distributed DBMS Software

🖥️ Site/Node (Click to expand)

A location where data is stored and processing occurs. Each site has its own local database and DBMS, capable of autonomous operation while participating in the distributed system.

🌐 Communication Network (Click to expand)

Connects all sites enabling data exchange. Can be LAN (Local Area Network) for nearby sites or WAN (Wide Area Network) for geographically dispersed locations. Network reliability is crucial for system performance.

📚 Global Data Dictionary (Click to expand)

Maintains metadata about the entire distributed database including data location, fragmentation, and replication information. Essential for transparency and query processing.

Types of Distributed DBMS Architecture

1. Client-Server Architecture

Separation of client processes and server processes

Two-Tier: Clients directly connect to database servers
Three-Tier: Application server layer between clients and database
Benefits: Clear separation of concerns, centralized data management

2. Peer-to-Peer Architecture

All nodes have equal status and capabilities

Characteristics: No central coordinator, each node can act as client or server
Use Cases: Highly distributed environments, blockchain systems
Benefits: High autonomy, no single point of failure

3. Multi-DBMS Architecture

Integration of multiple heterogeneous databases

Federated: Loose coupling with local autonomy preserved
Characteristics: Different DBMS products integrated
Challenges: Schema integration, query translation

Data Distribution Strategies

📑 Fragmentation

Dividing database into smaller pieces

Horizontal Fragmentation: Rows distributed across sites (e.g., customers by region)
Vertical Fragmentation: Columns distributed (e.g., separate sensitive data)
Mixed/Hybrid: Combination of horizontal and vertical
Benefits: Improved performance, security, availability

🔄 Replication

Storing copies of data at multiple sites

Full Replication: Complete database copy at each site
Partial Replication: Selected fragments replicated
No Replication: Each fragment at single site
Trade-offs: Availability vs consistency vs storage cost

🎯 Allocation

Deciding where to place data fragments

Factors Considered:
• Access frequency from each site
• Storage capacity and cost
• Network communication costs
• Local processing requirements
• Reliability and availability requirements

Distributed Query Processing

Query Processing Steps

1. Query Decomposition

↓

2. Data Localization

↓

3. Global Query Optimization

↓

4. Local Query Optimization

↓

5. Query Execution

Key Challenges

Cost of data transfer over the network

Lack of global statistics about data

Complex optimization algorithms

Balancing local vs global optimization

Distributed Transaction Management

🔒 Concurrency Control

Managing concurrent access to distributed data

Approaches:
• Distributed Locking (2PL)
• Timestamp Ordering
• Optimistic Concurrency Control
Challenge: Maintaining serializability across sites

✅ Commit Protocols

Ensuring atomicity of distributed transactions

Two-Phase Commit (2PC):
Phase 1: Prepare - All sites ready to commit
Phase 2: Commit/Abort - Final decision applied
Three-Phase Commit (3PC): Non-blocking variant

🔄 Recovery

Handling failures in distributed environment

Types of Failures:
• Site failures
• Communication link failures
• Transaction failures
Solutions: Logging, checkpointing, backup sites

Summary

Key Takeaways

Distributed databases spread data across multiple interconnected sites

Primary benefits: improved performance, reliability, scalability, and alignment with organizational structure

Architecture components: sites, network, local DBMS, global dictionary, distributed DBMS software

Distribution strategies: fragmentation, replication, and allocation

Key challenges: complexity, query optimization, transaction management, and consistency

Real-World Applications

✓ Banking systems across branches

✓ Airline reservation systems

✓ E-commerce platforms

✓ Social media networks

✓ Content delivery networks

Thank You!

Questions and Discussion