Understanding Data Distribution Strategies
Assistant Professor, Cloud & Software Operations Cluster | SOCS | UPES
Course: Database Systems | M.Tech. First Semester
Press Next to Begin →
Data fragmentation is the process of dividing a relation into smaller fragments and storing these fragments at different sites in a distributed database system.
1. Locality of Reference: Data is stored close to where it's most frequently used
2. Improved Performance: Parallel query execution on fragments
3. Reduced Communication Costs: Minimize data transfer across network
4. Enhanced Availability: Failure of one site doesn't affect all data
5. Load Balancing: Distribute workload across multiple sites
Fragments must be COMPLETE and RECONSTRUCTIBLE
• Completeness: All data in the original relation must appear in at least one fragment
• Reconstruction: Must be able to recreate the original relation from fragments
• Disjointness: (Optional) Data items should appear in only one fragment
Divides relation by rows (tuples)
Each fragment contains a subset of rows
Divides relation by columns (attributes)
Each fragment contains a subset of columns
Combination of horizontal and vertical
Most flexible approach
Consider:
• Application access patterns
• Query frequency and types
• Data distribution requirements
• Network bandwidth and latency
• Storage capacity at each site
| Type | Split By | Reconstruction | Typical Use | Example |
|---|---|---|---|---|
| Horizontal | Rows (tuples) | UNION (∪) | Geography / campus / region-based access | Employee split by Location |
| Vertical | Columns (attributes) | JOIN (⋈) on PK | Different apps use different columns; security separation | Employee split into (EmpID,Name,Dept) and (EmpID,Salary) |
| Hybrid | Rows + Columns | JOIN + UNION (sequence dependent) | Complex workloads: locality + security + performance | Vertical first, then horizontal on a subset |
Horizontal keep all attributes, split tuples using predicates.
Vertical keep PK in every fragment, split non-key attributes by usage affinity.
Hybrid use when you need both locality (horizontal) and attribute grouping/security (vertical).
Sites: Site A (Main Campus), Site B (Engineering Campus), Site C (Medical Campus)
Table: STUDENT(student_id, name, department, gpa, campus_location, enrollment_year, email)
STUDENT_A = σcampus_location='Main Campus'(STUDENT) → stored at Site A
STUDENT_B = σcampus_location='Engineering Campus'(STUDENT) → stored at Site B
STUDENT_C = σcampus_location='Medical Campus'(STUDENT) → stored at Site C
Reconstruction: STUDENT = STUDENT_A ∪ STUDENT_B ∪ STUDENT_C
| Fragment | Predicate | Stored at | Reason |
|---|---|---|---|
| STUDENT_A | campus_location='Main Campus' | Site A | Max locality for Main Campus operations |
| STUDENT_B | campus_location='Engineering Campus' | Site B | Engineering admin and apps are local |
| STUDENT_C | campus_location='Medical Campus' | Site C | Medical campus workloads stay local |
STUDENT = STUDENT_A ∪ STUDENT_B ∪ STUDENT_C
A global query like "count students by department" can execute in parallel at each site and then merge results.
Goal: student_id must be unique across all sites.
Common solutions:
• Central/Global ID generator (or campus-prefixed IDs like A-1001, B-2001, C-3001)
• Distributed unique ID strategies (UUID/ULID) to avoid collisions
If a table ENROLLMENT(student_id, course_id, ...) references STUDENT(student_id), then fragment ENROLLMENT based on STUDENT fragments:
ENROLL_A joins with STUDENT_A, stored at Site A (local FK checks)
ENROLL_B joins with STUDENT_B, stored at Site B
ENROLL_C joins with STUDENT_C, stored at Site C
• Use derived horizontal fragmentation to co-locate related tuples (helps FK enforcement)
• Global constraints may require coordination (e.g., distributed commit) if updates span sites
Horizontal fragmentation divides a relation into subsets of tuples (rows). Each fragment contains all columns but only selected rows based on a predicate.
| EmpID | Name | Department | Salary | Location |
|---|---|---|---|---|
| 101 | John Smith | Sales | 50000 | Delhi |
| 102 | Sarah Jones | Engineering | 70000 | Mumbai |
| 103 | Mike Brown | Sales | 55000 | Delhi |
| 104 | Lisa White | HR | 60000 | Mumbai |
Fragment 1 (Delhi Site): σLocation='Delhi'(Employee)
Fragment 2 (Mumbai Site): σLocation='Mumbai'(Employee)
Employee = Fragment1 ∪ Fragment2
Using UNION operation to combine all fragments
Based on predicates defined on the relation itself
Example: Fragmenting Employee table by Location attribute
F1 = σLocation='Delhi'(Employee)
F2 = σLocation='Mumbai'(Employee)
Based on predicates defined on another relation (partitioning based on a foreign key relationship)
Example: Fragment Project table based on Employee fragmentation
If Employee is fragmented by Location,
Project can be fragmented based on Manager's Location (semi-join)
✓ Supports parallel processing effectively
✓ Simple queries access only relevant fragments
✓ Natural partitioning for geographic distribution
✓ Improved performance for selection queries
Vertical fragmentation divides a relation by columns (attributes). Each fragment contains a subset of attributes, but the primary key must be replicated in all fragments for reconstruction.
Original Relation: Employee(EmpID, Name, Dept, Salary, Address, Phone)
| EmpID | Name | Dept |
|---|---|---|
| 101 | John | Sales |
| 102 | Sarah | Eng |
| EmpID | Salary |
|---|---|
| 101 | 50000 |
| 102 | 70000 |
| EmpID | Address | Phone |
|---|---|---|
| 101 | Delhi | 99999 |
| 102 | Mumbai | 88888 |
Employee = Fragment1 ⋈EmpID Fragment2 ⋈EmpID Fragment3
Using NATURAL JOIN operation on primary key
1. Primary Key Replication: Every fragment must include the primary key
2. Completeness: Union of all attributes in fragments = all attributes in original relation
3. Disjointness: Non-key attributes appear in exactly one fragment
Primary Key (PK): The PK attribute(s) are present in every vertical fragment, but uniqueness must still be enforced globally.
• Enforce inserts/updates using distributed transactions so all fragments are updated atomically (e.g., 2PC at commit).
• Use a global ID strategy to avoid PK collisions (central ID generator, UUID/ULID, or site-prefixed keys).
Foreign Key (FK): A referencing table typically points to the PK of the referenced table.
Common strategies in a DDBMS:
• Co-location / derived fragmentation: keep referencing tuples at the same site as the referenced PK fragment to allow local FK checks.
• Replicate key columns: keep a small “key directory” (PK values) where needed for validation.
• Distributed referential integrity checks: validate at commit time using distributed messaging/2PC (costly but correct).
STUDENT_PUBLIC(student_id, name, department, campus_location)
STUDENT_PRIVATE(student_id, gpa, email)
ENROLLMENT(student_id, course_id, semester) references STUDENT_PUBLIC(student_id). If STUDENT_PUBLIC is co-located/replicated appropriately, FK checks can be local.
✓ Different applications access different attributes
✓ Security requirements (separate sensitive data)
✓ Performance optimization (frequently accessed attributes together)
✓ Reduce I/O for queries needing only few attributes
Attribute Affinity: Group attributes accessed together
Access Frequency: Consider query patterns and frequency
Join Cost: Balance fragmentation benefits vs reconstruction overhead
✗ Requires join operations for reconstruction
✗ Complex query processing for multi-attribute queries
✗ Increased storage due to key replication
Mixed fragmentation applies horizontal and vertical fragmentation in sequence, providing the most flexible data distribution strategy.
1. First apply vertical fragmentation
2. Then apply horizontal fragmentation to vertical fragments
1. First apply horizontal fragmentation
2. Then apply vertical fragmentation to horizontal fragments
Step 1: Vertical fragmentation of Employee table
E1 = πEmpID, Name, Dept(Employee)
E2 = πEmpID, Salary, Location(Employee)
Step 2: Horizontal fragmentation of E2
E2_Delhi = σLocation='Delhi'(E2)
E2_Mumbai = σLocation='Mumbai'(E2)
✓ Maximum flexibility in data distribution
✓ Optimizes both query performance and data locality
✓ Supports complex application requirements
| EmpID | Name | Dept | Salary | City |
|---|---|---|---|---|
| 1 | Alice | HR | 60000 | Delhi |
| 2 | Bob | IT | 75000 | Mumbai |
| 3 | Charlie | Sales | 55000 | Delhi |
| 4 | Diana | IT | 80000 | Mumbai |
If a relation R is decomposed into fragments R1, R2, ..., Rn, each data item in R must appear in at least one fragment Ri.
∀ tuple t ∈ R, ∃ i: t ∈ Ri
It must be possible to reconstruct the original relation R from its fragments.
Horizontal: R = R1 ∪ R2 ∪ ... ∪ Rn (UNION)
Vertical: R = R1 ⋈ R2 ⋈ ... ⋈ Rn (JOIN)
If a data item appears in fragment Ri, it should not appear in any other fragment Rj (i ≠ j).
Exception: Primary keys in vertical fragmentation (needed for reconstruction)
Horizontal: Ri ∩ Rj = ∅ for all i ≠ j
Vertical: Ri ∩ Rj = Primary Key attributes only
1. Improved Performance: Parallel query execution and reduced data access time
2. Data Locality: Data stored close to where it's used most frequently
3. Reduced Communication Cost: Less data transfer across network
4. Enhanced Availability: System continues despite site failures
5. Scalability: Easy to add new sites and redistribute data
6. Security: Sensitive data can be isolated at secure sites
7. Load Distribution: Better resource utilization across sites
1. Complexity: More complex query processing and optimization
2. Semantic Integrity: Harder to enforce constraints across fragments
3. Join Overhead: Reconstruction requires expensive operations
4. Design Difficulty: Requires careful analysis of access patterns
5. Maintenance Cost: Updates may affect multiple fragments
6. Storage Overhead: Key replication in vertical fragmentation
• Identify applications and their data access patterns
• Analyze query frequency and types
• Determine update patterns
• Consider geographic distribution of users
• Create attribute usage matrix
• Calculate attribute affinity (for vertical fragmentation)
• Analyze predicate usage (for horizontal fragmentation)
• Estimate access frequencies
• Choose fragmentation type(s)
• Define fragmentation predicates
• Define attribute groups (for vertical fragmentation)
• Decide allocation: which site stores which fragment
• Decide replication: which fragments (or key directories) should be copied
• Validate correctness: completeness, reconstruction, disjointness
• Implement fragments, global schema mappings, and access rules
• Measure query latency, network cost, and update overhead
• Re-tune fragmentation if access patterns change
✓ Fragmentation improves locality, parallelism, and scalability
✓ Horizontal fragmentation reconstructs with UNION
✓ Vertical fragmentation reconstructs with JOIN and requires PK replication
✓ FK enforcement is easier with co-location / derived fragmentation
✓ University example: fragment STUDENT by campus_location across Site A/B/C
Thank You!