Data Fragmentation in Distributed Databases

Understanding Data Distribution Strategies

Instructor: Mohsin F. Dar

Assistant Professor, Cloud & Software Operations Cluster | SOCS | UPES

Course: Database Systems | M.Tech. First Semester

Press Next to Begin →

Data Fragmentation: Core Concept

What is Data Fragmentation?

Data fragmentation is the process of dividing a relation into smaller fragments and storing these fragments at different sites in a distributed database system.

Objectives of Fragmentation:

1. Locality of Reference: Data is stored close to where it's most frequently used

2. Improved Performance: Parallel query execution on fragments

3. Reduced Communication Costs: Minimize data transfer across network

4. Enhanced Availability: Failure of one site doesn't affect all data

5. Load Balancing: Distribute workload across multiple sites

Fundamental Rule:

Fragments must be COMPLETE and RECONSTRUCTIBLE

• Completeness: All data in the original relation must appear in at least one fragment

• Reconstruction: Must be able to recreate the original relation from fragments

• Disjointness: (Optional) Data items should appear in only one fragment

Types of Data Fragmentation

1. Horizontal Fragmentation

Divides relation by rows (tuples)

Each fragment contains a subset of rows

2. Vertical Fragmentation

Divides relation by columns (attributes)

Each fragment contains a subset of columns

3. Mixed (Hybrid) Fragmentation

Combination of horizontal and vertical

Most flexible approach

Choosing the Right Fragmentation Strategy:

Consider:

• Application access patterns

• Query frequency and types

• Data distribution requirements

• Network bandwidth and latency

• Storage capacity at each site

Horizontal vs Vertical vs Hybrid

Compare & Contrast (with examples)

Type	Split By	Reconstruction	Typical Use	Example
Horizontal	Rows (tuples)	UNION (∪)	Geography / campus / region-based access	Employee split by Location
Vertical	Columns (attributes)	JOIN (⋈) on PK	Different apps use different columns; security separation	Employee split into (EmpID,Name,Dept) and (EmpID,Salary)
Hybrid	Rows + Columns	JOIN + UNION (sequence dependent)	Complex workloads: locality + security + performance	Vertical first, then horizontal on a subset

Key idea to remember

Horizontal keep all attributes, split tuples using predicates.

Vertical keep PK in every fragment, split non-key attributes by usage affinity.

Hybrid use when you need both locality (horizontal) and attribute grouping/security (vertical).

Case Study: University Distributed DB

STUDENT table and fragmentation across 3 sites

Sites: Site A (Main Campus), Site B (Engineering Campus), Site C (Medical Campus)

Table: STUDENT(student_id, name, department, gpa, campus_location, enrollment_year, email)

Recommended fragmentation (Primary Horizontal by campus)

STUDENT_A = σ_{campus_location='Main Campus'}(STUDENT) → stored at Site A

STUDENT_B = σ_{campus_location='Engineering Campus'}(STUDENT) → stored at Site B

STUDENT_C = σ_{campus_location='Medical Campus'}(STUDENT) → stored at Site C

Reconstruction: STUDENT = STUDENT_A ∪ STUDENT_B ∪ STUDENT_C

Allocation map

Fragment	Predicate	Stored at	Reason
STUDENT_A	campus_location='Main Campus'	Site A	Max locality for Main Campus operations
STUDENT_B	campus_location='Engineering Campus'	Site B	Engineering admin and apps are local
STUDENT_C	campus_location='Medical Campus'	Site C	Medical campus workloads stay local

Case Study: Reconstruction & Constraint Notes

How queries and constraints behave

Reconstruction (Global View)

STUDENT = STUDENT_A ∪ STUDENT_B ∪ STUDENT_C

A global query like "count students by department" can execute in parallel at each site and then merge results.

Primary Key (student_id) handling

Goal: student_id must be unique across all sites.

Common solutions:

• Central/Global ID generator (or campus-prefixed IDs like A-1001, B-2001, C-3001)

• Distributed unique ID strategies (UUID/ULID) to avoid collisions

Foreign Key idea (derived fragmentation)

If a table ENROLLMENT(student_id, course_id, ...) references STUDENT(student_id), then fragment ENROLLMENT based on STUDENT fragments:

ENROLL_A joins with STUDENT_A, stored at Site A (local FK checks)

ENROLL_B joins with STUDENT_B, stored at Site B

ENROLL_C joins with STUDENT_C, stored at Site C

What to mention in exams

• Use derived horizontal fragmentation to co-locate related tuples (helps FK enforcement)

• Global constraints may require coordination (e.g., distributed commit) if updates span sites

Horizontal Fragmentation

Concept & Mechanism

Horizontal fragmentation divides a relation into subsets of tuples (rows). Each fragment contains all columns but only selected rows based on a predicate.

Example: Employee Table

EmpID	Name	Department	Salary	Location
101	John Smith	Sales	50000	Delhi
102	Sarah Jones	Engineering	70000	Mumbai
103	Mike Brown	Sales	55000	Delhi
104	Lisa White	HR	60000	Mumbai

Fragmentation by Location:

Fragment 1 (Delhi Site): σ_{Location='Delhi'}(Employee)

Fragment 2 (Mumbai Site): σ_{Location='Mumbai'}(Employee)

Reconstruction:

Employee = Fragment1 ∪ Fragment2

Using UNION operation to combine all fragments

Types of Horizontal Fragmentation

1. Primary Horizontal Fragmentation

Based on predicates defined on the relation itself

Example: Fragmenting Employee table by Location attribute

F1 = σ_{Location='Delhi'}(Employee)
F2 = σ_{Location='Mumbai'}(Employee)

2. Derived Horizontal Fragmentation

Based on predicates defined on another relation (partitioning based on a foreign key relationship)

Example: Fragment Project table based on Employee fragmentation

If Employee is fragmented by Location,
Project can be fragmented based on Manager's Location (semi-join)

Advantages of Horizontal Fragmentation:

✓ Supports parallel processing effectively

✓ Simple queries access only relevant fragments

✓ Natural partitioning for geographic distribution

✓ Improved performance for selection queries

Vertical Fragmentation

Concept & Mechanism

Vertical fragmentation divides a relation by columns (attributes). Each fragment contains a subset of attributes, but the primary key must be replicated in all fragments for reconstruction.

Example: Employee Table Vertical Fragmentation

Original Relation: Employee(EmpID, Name, Dept, Salary, Address, Phone)

Fragment 1 (Basic Info)

EmpID	Name	Dept
101	John	Sales
102	Sarah	Eng

Fragment 2 (Financial)

EmpID	Salary
101	50000
102	70000

Fragment 3 (Contact)

EmpID	Address	Phone
101	Delhi	99999
102	Mumbai	88888

Reconstruction:

Employee = Fragment1 ⋈_EmpID Fragment2 ⋈_EmpID Fragment3

Using NATURAL JOIN operation on primary key

Vertical Fragmentation: Deep Dive

Key Principles:

1. Primary Key Replication: Every fragment must include the primary key

2. Completeness: Union of all attributes in fragments = all attributes in original relation

3. Disjointness: Non-key attributes appear in exactly one fragment

Maintaining Integrity Constraints across Vertical Fragments

Primary Key (PK): The PK attribute(s) are present in every vertical fragment, but uniqueness must still be enforced globally.

• Enforce inserts/updates using distributed transactions so all fragments are updated atomically (e.g., 2PC at commit).

• Use a global ID strategy to avoid PK collisions (central ID generator, UUID/ULID, or site-prefixed keys).

Foreign Key (FK): A referencing table typically points to the PK of the referenced table.

Common strategies in a DDBMS:

• Co-location / derived fragmentation: keep referencing tuples at the same site as the referenced PK fragment to allow local FK checks.

• Replicate key columns: keep a small “key directory” (PK values) where needed for validation.

• Distributed referential integrity checks: validate at commit time using distributed messaging/2PC (costly but correct).

Mini Example

STUDENT_PUBLIC(student_id, name, department, campus_location)

STUDENT_PRIVATE(student_id, gpa, email)

ENROLLMENT(student_id, course_id, semester) references STUDENT_PUBLIC(student_id). If STUDENT_PUBLIC is co-located/replicated appropriately, FK checks can be local.

When to Use Vertical Fragmentation?

✓ Different applications access different attributes

✓ Security requirements (separate sensitive data)

✓ Performance optimization (frequently accessed attributes together)

✓ Reduce I/O for queries needing only few attributes

Design Considerations:

Attribute Affinity: Group attributes accessed together

Access Frequency: Consider query patterns and frequency

Join Cost: Balance fragmentation benefits vs reconstruction overhead

Challenges:

✗ Requires join operations for reconstruction

✗ Complex query processing for multi-attribute queries

✗ Increased storage due to key replication

Mixed (Hybrid) Fragmentation

Combining Horizontal and Vertical Fragmentation

Mixed fragmentation applies horizontal and vertical fragmentation in sequence, providing the most flexible data distribution strategy.

Two Approaches:

Vertical-then-Horizontal

1. First apply vertical fragmentation

2. Then apply horizontal fragmentation to vertical fragments

Horizontal-then-Vertical

1. First apply horizontal fragmentation

2. Then apply vertical fragmentation to horizontal fragments

Example Process:

Step 1: Vertical fragmentation of Employee table

E1 = π_{EmpID, Name, Dept}(Employee)

E2 = π_{EmpID, Salary, Location}(Employee)

Step 2: Horizontal fragmentation of E2

E2_Delhi = σ_{Location='Delhi'}(E2)

E2_Mumbai = σ_{Location='Mumbai'}(E2)

Benefits:

✓ Maximum flexibility in data distribution

✓ Optimizes both query performance and data locality

✓ Supports complex application requirements

Interactive Fragmentation Demo

Original Employee Table:

EmpID	Name	Dept	Salary	City
1	Alice	HR	60000	Delhi
2	Bob	IT	75000	Mumbai
3	Charlie	Sales	55000	Delhi
4	Diana	IT	80000	Mumbai

Try Different Fragmentations:

Select a fragmentation type to see the results...

Fragmentation Rules & Correctness

Three Fundamental Rules:

1. Completeness Rule

If a relation R is decomposed into fragments R1, R2, ..., Rn, each data item in R must appear in at least one fragment Ri.

∀ tuple t ∈ R, ∃ i: t ∈ Ri

2. Reconstruction Rule

It must be possible to reconstruct the original relation R from its fragments.

Horizontal: R = R1 ∪ R2 ∪ ... ∪ Rn (UNION)

Vertical: R = R1 ⋈ R2 ⋈ ... ⋈ Rn (JOIN)

3. Disjointness Rule

If a data item appears in fragment Ri, it should not appear in any other fragment Rj (i ≠ j).

Exception: Primary keys in vertical fragmentation (needed for reconstruction)

Horizontal: Ri ∩ Rj = ∅ for all i ≠ j

Vertical: Ri ∩ Rj = Primary Key attributes only

Fragmentation: Pros & Cons

Advantages:

1. Improved Performance: Parallel query execution and reduced data access time

2. Data Locality: Data stored close to where it's used most frequently

3. Reduced Communication Cost: Less data transfer across network

4. Enhanced Availability: System continues despite site failures

5. Scalability: Easy to add new sites and redistribute data

6. Security: Sensitive data can be isolated at secure sites

7. Load Distribution: Better resource utilization across sites

Disadvantages:

1. Complexity: More complex query processing and optimization

2. Semantic Integrity: Harder to enforce constraints across fragments

3. Join Overhead: Reconstruction requires expensive operations

4. Design Difficulty: Requires careful analysis of access patterns

5. Maintenance Cost: Updates may affect multiple fragments

6. Storage Overhead: Key replication in vertical fragmentation

Fragmentation Design Process

Step-by-Step Methodology:

Step 1: Requirements Analysis

• Identify applications and their data access patterns

• Analyze query frequency and types

• Determine update patterns

• Consider geographic distribution of users

Step 2: Quantitative Analysis

• Create attribute usage matrix

• Calculate attribute affinity (for vertical fragmentation)

• Analyze predicate usage (for horizontal fragmentation)

• Estimate access frequencies

Step 3: Fragmentation Schema Design

• Choose fragmentation type(s)

• Define fragmentation predicates

• Define attribute groups (for vertical fragmentation)

• Decide allocation: which site stores which fragment

• Decide replication: which fragments (or key directories) should be copied

• Validate correctness: completeness, reconstruction, disjointness

Step 4: Implementation & Monitoring

• Implement fragments, global schema mappings, and access rules

• Measure query latency, network cost, and update overhead

• Re-tune fragmentation if access patterns change

Summary

Key Takeaways

✓ Fragmentation improves locality, parallelism, and scalability

✓ Horizontal fragmentation reconstructs with UNION

✓ Vertical fragmentation reconstructs with JOIN and requires PK replication

✓ FK enforcement is easier with co-location / derived fragmentation

✓ University example: fragment STUDENT by campus_location across Site A/B/C

Thank You!