Understand why tree structures are preferred for database indexing
Explain the structure and operations of B-trees
Differentiate between B-trees and B+ trees
Analyze why modern DBMS prefer B+ trees for indexing
Perform search, insert, and delete operations on tree-based indexes
Why Trees for Indexing?
The Problem with Alternatives
1. Arrays
Search: O(log n) with binary search (sorted array)
Insert/Delete: O(n) - requires shifting elements
Problem: Inefficient for dynamic data with frequent updates
2. Binary Search Trees (BST)
Average case: O(log n) for all operations
Worst case: O(n) - degenerates to linked list
Problem: Not balanced, poor disk I/O performance
Issue: High tree height = more disk accesses
Key Insight: Database indexes are stored on disk. Minimizing disk I/O is critical. We need balanced trees with high fan-out to reduce tree height and disk accesses.
B-Trees: The Solution
What is a B-Tree?
A B-Tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time.
Key Characteristics
✓ Balanced
All leaf nodes are at the same level
✓ High Fan-out
Many children per node (not just 2)
✓ Sorted Order
Keys within nodes are sorted
✓ Disk-Friendly
Optimized for block-based I/O
Order of B-Tree (m): Maximum number of children a node can have
B-Tree Structure
Properties of a B-Tree of order m
Every node has at most m children
Every non-leaf node (except root) has at least ⌈m/2⌉ children
Root has at least 2 children (if not a leaf)
All leaves appear at the same level
A non-leaf node with k children contains k-1 keys
Keys in a node are sorted: K₁ < K₂ < ... < Kₖ₋₁
Example: B-Tree of order 3
50
20 | 30
60 | 70
10 | 15
25
35 | 40
55
65
75 | 80
B-Tree Node Structure
Internal Node Structure
P₀
K₁
P₁
K₂
P₂
...
Kₙ
Pₙ
Kᵢ: Search keys | Pᵢ: Pointers to child nodes or data records
Key Ordering Property
All keys in subtree P₀ < K₁
All keys in subtree P₁ are between K₁ and K₂
All keys in subtree P₂ are between K₂ and K₃
All keys in subtree Pₙ > Kₙ
Data Location: In B-trees, data records (or pointers to data) can be stored in both internal nodes and leaf nodes.
B-Tree Search Operation
Algorithm: Search(key, node)
Start at root node
In current node:
Perform binary search to find position i where Kᵢ₋₁ < key ≤ Kᵢ
If key == Kᵢ, return found (data is here)
If not found and not leaf:
Follow pointer Pᵢ to appropriate child
Repeat step 2
If leaf reached and not found: Key doesn't exist
Time Complexity
O(logm n) where m is the order and n is the number of keys
For m = 100 and n = 1,000,000: Only ~3 disk accesses needed!
Search(50) in B-Tree:
Root: [30, 60] → 50 < 60, go to middle child
Child: [40, 50, 55] → Found at index 1!
Result: 2 disk I/O operations
B-Tree Insert Operation
Algorithm: Insert(key)
Search for the appropriate leaf node where key should be inserted
Insert key in sorted order in the leaf node
If node overflows (more than m-1 keys):
Split the node into two nodes
Move median key to parent
If parent overflows, split recursively
If root splits, create new root (tree height increases)
Example: Insert 25 in B-Tree (order 3)
Before:
10 | 20 | 30
↓ Node is full, insert causes overflow
After Split:
20
↙ ↘
10
25 | 30
B-Tree Delete Operation
Algorithm: Delete(key) - Simplified
Locate the key in the tree
If key is in a leaf node:
Simply remove it
If underflow (< ⌈m/2⌉-1 keys), borrow from sibling or merge
If key is in internal node:
Replace with predecessor or successor
Delete the predecessor/successor from leaf
Handle underflow:
Borrow: If sibling has extra keys, redistribute
Merge: Combine with sibling if borrowing not possible
Important: Delete is the most complex operation in B-Trees due to maintaining balance and minimum key requirements. Underflow propagates upward, possibly reducing tree height.
B+ Trees: An Enhanced Variant
What is a B+ Tree?
A B+ Tree is a variation of B-Tree where all data records reside only at the leaf level, and internal nodes contain only keys for navigation.
Key Differences from B-Tree
📍 Data Location
B-Tree: Data in all nodes B+ Tree: Data only in leaves
🔗 Leaf Structure
B-Tree: Leaves independent B+ Tree: Leaves form linked list
🔑 Key Duplication
B-Tree: Keys appear once B+ Tree: Keys duplicated in internal nodes
🎯 Search Path
B-Tree: Can end anywhere B+ Tree: Always reaches leaf
B+ Tree Structure
Properties
Internal nodes: Store only keys and pointers (no data records)
Leaf nodes: Store keys and data records (or pointers to data)
Leaf nodes are linked: Forms a doubly linked list for sequential access
Key duplication: Keys in internal nodes are duplicated in leaves
All properties of B-Tree apply (balance, order, min/max children)
B+ Tree Structure
50
Internal Node (keys only)
20 | 30
60 | 70
Internal Nodes
10:D1 | 20:D2
30:D3 | 40:D4
50:D5 | 60:D6
70:D7 | 80:D8
Leaf Nodes (with data) ← → ← → ← →
B+ Tree Operations
Search Operation
Start at root, navigate using keys in internal nodes
Always traverse to leaf level (even if key found in internal node)
Search key in leaf node
Return data record if found
Time Complexity: O(logm n)
Range Query
Major Advantage of B+ Trees!
Query: Find all records where 20 ≤ key ≤ 60
Search for key = 20 (reaches leaf)
Follow linked list of leaves from 20 to 60
Collect all records in range
Efficient Sequential Access: No need to traverse tree multiple times!
B+ Tree Insert & Delete
Insert Operation
Search for appropriate leaf node
Insert key-data pair in sorted order in leaf
If leaf overflows:
Split leaf into two leaves
Copy (not move) middle key to parent
Update leaf linked list pointers
Handle parent overflow recursively
Delete Operation
Locate key in leaf node
Remove key-data pair from leaf
If underflow:
Borrow from sibling or merge leaves
Update parent keys accordingly
Update linked list pointers
Note: Keys in internal nodes may remain even after deletion from leaves (not critical as they guide search).
B-Tree vs B+ Tree Comparison
Feature
B-Tree
B+ Tree
Data Location
All nodes (internal & leaf)
Only leaf nodes
Key Duplication
No duplication
Keys duplicated in internal nodes
Leaf Node Structure
Independent leaves
Linked list of leaves
Search Termination
Can end at any level
Always at leaf level
Range Queries
Requires multiple tree traversals
Efficient via leaf linked list
Sequential Access
Difficult (requires tree traversal)
Easy (follow leaf pointers)
Node Utilization
Lower (stores data)
Higher (only keys in internal)
Tree Height
Slightly shorter
Slightly taller (but negligible)
Deletion Complexity
More complex (data in all nodes)
Simpler (data only in leaves)
Why Modern DBMS Prefer B+ Trees
1️⃣ Better Range Queries
Sequential access through linked leaves is much faster than traversing the tree multiple times
2️⃣ Higher Fan-out
Internal nodes can store more keys (no data), reducing tree height and disk I/O
3️⃣ Simpler Deletion
Data only in leaves makes deletion logic cleaner and more predictable
4️⃣ Consistent Performance
All searches go to leaf level, providing predictable I/O cost
5️⃣ Better Caching
Internal nodes are smaller, more fit in memory/cache
6️⃣ Full Scans
Scanning entire table is just a leaf-level linked list traversal
Real-World Usage
MySQL InnoDB: Uses B+ Trees for indexes PostgreSQL: Uses B+ Trees for indexes Oracle: Uses B+ Trees for indexes SQL Server: Uses B+ Trees for indexes
Summary
Key Takeaways
Tree-based indexing is essential for efficient database operations
B-Trees provide balanced, disk-friendly indexing with O(log n) operations
B+ Trees enhance B-Trees by storing data only in leaves
Linked leaf nodes in B+ Trees enable efficient range queries and sequential access
Modern DBMS universally prefer B+ Trees for their superior performance characteristics
What's Next?
Hash-based indexing
Bitmap indexes
Composite indexes
Index selection and optimization strategies
Practice: Try implementing B+ Tree insert and delete operations on paper to solidify your understanding!