Data Fragmentation in Distributed Databases

Understanding Data Distribution Strategies

Instructor: Mohsin F. Dar

Assistant Professor, Cloud & Software Operations Cluster | SOCS | UPES

Course: Database Systems | M.Tech. First Semester

Press Next to Begin →

Data Fragmentation: Core Concept

What is Data Fragmentation?

Data fragmentation is the process of dividing a relation into smaller fragments and storing these fragments at different sites in a distributed database system.

Objectives of Fragmentation:

1. Locality of Reference: Data is stored close to where it's most frequently used

2. Improved Performance: Parallel query execution on fragments

3. Reduced Communication Costs: Minimize data transfer across network

4. Enhanced Availability: Failure of one site doesn't affect all data

5. Load Balancing: Distribute workload across multiple sites

Fundamental Rule:

Fragments must be COMPLETE and RECONSTRUCTIBLE

Completeness: All data in the original relation must appear in at least one fragment

Reconstruction: Must be able to recreate the original relation from fragments

Disjointness: (Optional) Data items should appear in only one fragment

Types of Data Fragmentation

1. Horizontal Fragmentation

Divides relation by rows (tuples)

Each fragment contains a subset of rows

2. Vertical Fragmentation

Divides relation by columns (attributes)

Each fragment contains a subset of columns

3. Mixed (Hybrid) Fragmentation

Combination of horizontal and vertical

Most flexible approach

Choosing the Right Fragmentation Strategy:

Consider:

• Application access patterns

• Query frequency and types

• Data distribution requirements

• Network bandwidth and latency

• Storage capacity at each site

Horizontal vs Vertical vs Hybrid

Compare & Contrast (with examples)

Type Split By Reconstruction Typical Use Example
Horizontal Rows (tuples) UNION (∪) Geography / campus / region-based access Employee split by Location
Vertical Columns (attributes) JOIN (⋈) on PK Different apps use different columns; security separation Employee split into (EmpID,Name,Dept) and (EmpID,Salary)
Hybrid Rows + Columns JOIN + UNION (sequence dependent) Complex workloads: locality + security + performance Vertical first, then horizontal on a subset

Key idea to remember

Horizontal keep all attributes, split tuples using predicates.

Vertical keep PK in every fragment, split non-key attributes by usage affinity.

Hybrid use when you need both locality (horizontal) and attribute grouping/security (vertical).

Case Study: University Distributed DB

STUDENT table and fragmentation across 3 sites

Sites: Site A (Main Campus), Site B (Engineering Campus), Site C (Medical Campus)

Table: STUDENT(student_id, name, department, gpa, campus_location, enrollment_year, email)

Recommended fragmentation (Primary Horizontal by campus)

STUDENT_A = σcampus_location='Main Campus'(STUDENT) → stored at Site A

STUDENT_B = σcampus_location='Engineering Campus'(STUDENT) → stored at Site B

STUDENT_C = σcampus_location='Medical Campus'(STUDENT) → stored at Site C

Reconstruction: STUDENT = STUDENT_A ∪ STUDENT_B ∪ STUDENT_C

Allocation map
Fragment Predicate Stored at Reason
STUDENT_A campus_location='Main Campus' Site A Max locality for Main Campus operations
STUDENT_B campus_location='Engineering Campus' Site B Engineering admin and apps are local
STUDENT_C campus_location='Medical Campus' Site C Medical campus workloads stay local

Case Study: Reconstruction & Constraint Notes

How queries and constraints behave

Reconstruction (Global View)

STUDENT = STUDENT_A ∪ STUDENT_B ∪ STUDENT_C

A global query like "count students by department" can execute in parallel at each site and then merge results.

Primary Key (student_id) handling

Goal: student_id must be unique across all sites.

Common solutions:

• Central/Global ID generator (or campus-prefixed IDs like A-1001, B-2001, C-3001)

• Distributed unique ID strategies (UUID/ULID) to avoid collisions

Foreign Key idea (derived fragmentation)

If a table ENROLLMENT(student_id, course_id, ...) references STUDENT(student_id), then fragment ENROLLMENT based on STUDENT fragments:

ENROLL_A joins with STUDENT_A, stored at Site A (local FK checks)

ENROLL_B joins with STUDENT_B, stored at Site B

ENROLL_C joins with STUDENT_C, stored at Site C

What to mention in exams

• Use derived horizontal fragmentation to co-locate related tuples (helps FK enforcement)

• Global constraints may require coordination (e.g., distributed commit) if updates span sites

Horizontal Fragmentation

Concept & Mechanism

Horizontal fragmentation divides a relation into subsets of tuples (rows). Each fragment contains all columns but only selected rows based on a predicate.

Example: Employee Table

EmpID Name Department Salary Location
101 John Smith Sales 50000 Delhi
102 Sarah Jones Engineering 70000 Mumbai
103 Mike Brown Sales 55000 Delhi
104 Lisa White HR 60000 Mumbai

Fragmentation by Location:

Fragment 1 (Delhi Site): σLocation='Delhi'(Employee)

Fragment 2 (Mumbai Site): σLocation='Mumbai'(Employee)

Reconstruction:

Employee = Fragment1 ∪ Fragment2

Using UNION operation to combine all fragments

Types of Horizontal Fragmentation

1. Primary Horizontal Fragmentation

Based on predicates defined on the relation itself

Example: Fragmenting Employee table by Location attribute

F1 = σLocation='Delhi'(Employee)
F2 = σLocation='Mumbai'(Employee)

2. Derived Horizontal Fragmentation

Based on predicates defined on another relation (partitioning based on a foreign key relationship)

Example: Fragment Project table based on Employee fragmentation

If Employee is fragmented by Location,
Project can be fragmented based on Manager's Location (semi-join)

Advantages of Horizontal Fragmentation:

✓ Supports parallel processing effectively

✓ Simple queries access only relevant fragments

✓ Natural partitioning for geographic distribution

✓ Improved performance for selection queries

Vertical Fragmentation

Concept & Mechanism

Vertical fragmentation divides a relation by columns (attributes). Each fragment contains a subset of attributes, but the primary key must be replicated in all fragments for reconstruction.

Example: Employee Table Vertical Fragmentation

Original Relation: Employee(EmpID, Name, Dept, Salary, Address, Phone)

Fragment 1 (Basic Info)

EmpID Name Dept
101JohnSales
102SarahEng

Fragment 2 (Financial)

EmpID Salary
10150000
10270000

Fragment 3 (Contact)

EmpID Address Phone
101Delhi99999
102Mumbai88888

Reconstruction:

Employee = Fragment1 ⋈EmpID Fragment2 ⋈EmpID Fragment3

Using NATURAL JOIN operation on primary key

Vertical Fragmentation: Deep Dive

Key Principles:

1. Primary Key Replication: Every fragment must include the primary key

2. Completeness: Union of all attributes in fragments = all attributes in original relation

3. Disjointness: Non-key attributes appear in exactly one fragment

Maintaining Integrity Constraints across Vertical Fragments

Primary Key (PK): The PK attribute(s) are present in every vertical fragment, but uniqueness must still be enforced globally.

• Enforce inserts/updates using distributed transactions so all fragments are updated atomically (e.g., 2PC at commit).

• Use a global ID strategy to avoid PK collisions (central ID generator, UUID/ULID, or site-prefixed keys).

Foreign Key (FK): A referencing table typically points to the PK of the referenced table.

Common strategies in a DDBMS:

Co-location / derived fragmentation: keep referencing tuples at the same site as the referenced PK fragment to allow local FK checks.

Replicate key columns: keep a small “key directory” (PK values) where needed for validation.

Distributed referential integrity checks: validate at commit time using distributed messaging/2PC (costly but correct).

Mini Example

STUDENT_PUBLIC(student_id, name, department, campus_location)

STUDENT_PRIVATE(student_id, gpa, email)

ENROLLMENT(student_id, course_id, semester) references STUDENT_PUBLIC(student_id). If STUDENT_PUBLIC is co-located/replicated appropriately, FK checks can be local.

When to Use Vertical Fragmentation?

✓ Different applications access different attributes

✓ Security requirements (separate sensitive data)

✓ Performance optimization (frequently accessed attributes together)

✓ Reduce I/O for queries needing only few attributes

Design Considerations:

Attribute Affinity: Group attributes accessed together

Access Frequency: Consider query patterns and frequency

Join Cost: Balance fragmentation benefits vs reconstruction overhead

Challenges:

✗ Requires join operations for reconstruction

✗ Complex query processing for multi-attribute queries

✗ Increased storage due to key replication

Mixed (Hybrid) Fragmentation

Combining Horizontal and Vertical Fragmentation

Mixed fragmentation applies horizontal and vertical fragmentation in sequence, providing the most flexible data distribution strategy.

Two Approaches:

Vertical-then-Horizontal

1. First apply vertical fragmentation

2. Then apply horizontal fragmentation to vertical fragments

Horizontal-then-Vertical

1. First apply horizontal fragmentation

2. Then apply vertical fragmentation to horizontal fragments

Example Process:

Step 1: Vertical fragmentation of Employee table

E1 = πEmpID, Name, Dept(Employee)

E2 = πEmpID, Salary, Location(Employee)

Step 2: Horizontal fragmentation of E2

E2_Delhi = σLocation='Delhi'(E2)

E2_Mumbai = σLocation='Mumbai'(E2)

Benefits:

✓ Maximum flexibility in data distribution

✓ Optimizes both query performance and data locality

✓ Supports complex application requirements

Interactive Fragmentation Demo

Original Employee Table:

EmpID Name Dept Salary City
1AliceHR60000Delhi
2BobIT75000Mumbai
3CharlieSales55000Delhi
4DianaIT80000Mumbai

Try Different Fragmentations:

Select a fragmentation type to see the results...

Fragmentation Rules & Correctness

Three Fundamental Rules:

1. Completeness Rule

If a relation R is decomposed into fragments R1, R2, ..., Rn, each data item in R must appear in at least one fragment Ri.

∀ tuple t ∈ R, ∃ i: t ∈ Ri

2. Reconstruction Rule

It must be possible to reconstruct the original relation R from its fragments.

Horizontal: R = R1 ∪ R2 ∪ ... ∪ Rn (UNION)

Vertical: R = R1 ⋈ R2 ⋈ ... ⋈ Rn (JOIN)

3. Disjointness Rule

If a data item appears in fragment Ri, it should not appear in any other fragment Rj (i ≠ j).

Exception: Primary keys in vertical fragmentation (needed for reconstruction)

Horizontal: Ri ∩ Rj = ∅ for all i ≠ j

Vertical: Ri ∩ Rj = Primary Key attributes only

Fragmentation: Pros & Cons

Advantages:

1. Improved Performance: Parallel query execution and reduced data access time

2. Data Locality: Data stored close to where it's used most frequently

3. Reduced Communication Cost: Less data transfer across network

4. Enhanced Availability: System continues despite site failures

5. Scalability: Easy to add new sites and redistribute data

6. Security: Sensitive data can be isolated at secure sites

7. Load Distribution: Better resource utilization across sites

Disadvantages:

1. Complexity: More complex query processing and optimization

2. Semantic Integrity: Harder to enforce constraints across fragments

3. Join Overhead: Reconstruction requires expensive operations

4. Design Difficulty: Requires careful analysis of access patterns

5. Maintenance Cost: Updates may affect multiple fragments

6. Storage Overhead: Key replication in vertical fragmentation

Fragmentation Design Process

Step-by-Step Methodology:

Step 1: Requirements Analysis

• Identify applications and their data access patterns

• Analyze query frequency and types

• Determine update patterns

• Consider geographic distribution of users

Step 2: Quantitative Analysis

• Create attribute usage matrix

• Calculate attribute affinity (for vertical fragmentation)

• Analyze predicate usage (for horizontal fragmentation)

• Estimate access frequencies

Step 3: Fragmentation Schema Design

• Choose fragmentation type(s)

• Define fragmentation predicates

• Define attribute groups (for vertical fragmentation)

• Decide allocation: which site stores which fragment

• Decide replication: which fragments (or key directories) should be copied

• Validate correctness: completeness, reconstruction, disjointness

Step 4: Implementation & Monitoring

• Implement fragments, global schema mappings, and access rules

• Measure query latency, network cost, and update overhead

• Re-tune fragmentation if access patterns change

Summary

Key Takeaways

✓ Fragmentation improves locality, parallelism, and scalability

✓ Horizontal fragmentation reconstructs with UNION

✓ Vertical fragmentation reconstructs with JOIN and requires PK replication

✓ FK enforcement is easier with co-location / derived fragmentation

✓ University example: fragment STUDENT by campus_location across Site A/B/C

Thank You!

1 / 1