Tree-Based Indexing

B-Tree & B+ Tree

Lecture-20 | Database Systems
Dr. Mohsin Dar
Cloud & Software Operations Cluster | UPES
MTech First Semester

Learning Objectives

By the end of this lecture, you will be able to:

  • Understand why tree structures are preferred for database indexing
  • Explain the structure and operations of B-trees
  • Differentiate between B-trees and B+ trees
  • Analyze why modern DBMS prefer B+ trees for indexing
  • Perform search, insert, and delete operations on tree-based indexes

Why Trees for Indexing?

The Problem with Alternatives

1. Arrays

2. Binary Search Trees (BST)

Key Insight: Database indexes are stored on disk. Minimizing disk I/O is critical. We need balanced trees with high fan-out to reduce tree height and disk accesses.

B-Trees: The Solution

What is a B-Tree?

A B-Tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time.

Key Characteristics

✓ Balanced

All leaf nodes are at the same level

✓ High Fan-out

Many children per node (not just 2)

✓ Sorted Order

Keys within nodes are sorted

✓ Disk-Friendly

Optimized for block-based I/O

Order of B-Tree (m): Maximum number of children a node can have

B-Tree Structure

Properties of a B-Tree of order m

Example: B-Tree of order 3

50
20 | 30
60 | 70
10 | 15
25
35 | 40
55
65
75 | 80

B-Tree Node Structure

Internal Node Structure

P₀
K₁
P₁
K₂
P₂
...
Kₙ
Pₙ

Kᵢ: Search keys | Pᵢ: Pointers to child nodes or data records

Key Ordering Property

Data Location: In B-trees, data records (or pointers to data) can be stored in both internal nodes and leaf nodes.

B-Tree Search Operation

Algorithm: Search(key, node)

  1. Start at root node
  2. In current node:
    • Perform binary search to find position i where Kᵢ₋₁ < key ≤ Kᵢ
    • If key == Kᵢ, return found (data is here)
  3. If not found and not leaf:
    • Follow pointer Pᵢ to appropriate child
    • Repeat step 2
  4. If leaf reached and not found: Key doesn't exist

Time Complexity

O(logm n) where m is the order and n is the number of keys

For m = 100 and n = 1,000,000: Only ~3 disk accesses needed!

Search(50) in B-Tree: Root: [30, 60] → 50 < 60, go to middle child Child: [40, 50, 55] → Found at index 1! Result: 2 disk I/O operations

B-Tree Insert Operation

Algorithm: Insert(key)

  1. Search for the appropriate leaf node where key should be inserted
  2. Insert key in sorted order in the leaf node
  3. If node overflows (more than m-1 keys):
    • Split the node into two nodes
    • Move median key to parent
    • If parent overflows, split recursively
    • If root splits, create new root (tree height increases)

Example: Insert 25 in B-Tree (order 3)

Before:

10 | 20 | 30

↓ Node is full, insert causes overflow

After Split:

20

↙ ↘

10
25 | 30

B-Tree Delete Operation

Algorithm: Delete(key) - Simplified

  1. Locate the key in the tree
  2. If key is in a leaf node:
    • Simply remove it
    • If underflow (< ⌈m/2⌉-1 keys), borrow from sibling or merge
  3. If key is in internal node:
    • Replace with predecessor or successor
    • Delete the predecessor/successor from leaf
  4. Handle underflow:
    • Borrow: If sibling has extra keys, redistribute
    • Merge: Combine with sibling if borrowing not possible
Important: Delete is the most complex operation in B-Trees due to maintaining balance and minimum key requirements. Underflow propagates upward, possibly reducing tree height.

B+ Trees: An Enhanced Variant

What is a B+ Tree?

A B+ Tree is a variation of B-Tree where all data records reside only at the leaf level, and internal nodes contain only keys for navigation.

Key Differences from B-Tree

📍 Data Location

B-Tree: Data in all nodes
B+ Tree: Data only in leaves

🔗 Leaf Structure

B-Tree: Leaves independent
B+ Tree: Leaves form linked list

🔑 Key Duplication

B-Tree: Keys appear once
B+ Tree: Keys duplicated in internal nodes

🎯 Search Path

B-Tree: Can end anywhere
B+ Tree: Always reaches leaf

B+ Tree Structure

Properties

B+ Tree Structure

50

Internal Node (keys only)

20 | 30
60 | 70

Internal Nodes

10:D1 | 20:D2
30:D3 | 40:D4
50:D5 | 60:D6
70:D7 | 80:D8

Leaf Nodes (with data) ← → ← → ← →

B+ Tree Operations

Search Operation

  1. Start at root, navigate using keys in internal nodes
  2. Always traverse to leaf level (even if key found in internal node)
  3. Search key in leaf node
  4. Return data record if found

Time Complexity: O(logm n)

Range Query

Major Advantage of B+ Trees!

Query: Find all records where 20 ≤ key ≤ 60

  1. Search for key = 20 (reaches leaf)
  2. Follow linked list of leaves from 20 to 60
  3. Collect all records in range

Efficient Sequential Access: No need to traverse tree multiple times!

B+ Tree Insert & Delete

Insert Operation

  1. Search for appropriate leaf node
  2. Insert key-data pair in sorted order in leaf
  3. If leaf overflows:
    • Split leaf into two leaves
    • Copy (not move) middle key to parent
    • Update leaf linked list pointers
  4. Handle parent overflow recursively

Delete Operation

  1. Locate key in leaf node
  2. Remove key-data pair from leaf
  3. If underflow:
    • Borrow from sibling or merge leaves
    • Update parent keys accordingly
    • Update linked list pointers
Note: Keys in internal nodes may remain even after deletion from leaves (not critical as they guide search).

B-Tree vs B+ Tree Comparison

Feature B-Tree B+ Tree
Data Location All nodes (internal & leaf) Only leaf nodes
Key Duplication No duplication Keys duplicated in internal nodes
Leaf Node Structure Independent leaves Linked list of leaves
Search Termination Can end at any level Always at leaf level
Range Queries Requires multiple tree traversals Efficient via leaf linked list
Sequential Access Difficult (requires tree traversal) Easy (follow leaf pointers)
Node Utilization Lower (stores data) Higher (only keys in internal)
Tree Height Slightly shorter Slightly taller (but negligible)
Deletion Complexity More complex (data in all nodes) Simpler (data only in leaves)

Why Modern DBMS Prefer B+ Trees

1️⃣ Better Range Queries

Sequential access through linked leaves is much faster than traversing the tree multiple times

2️⃣ Higher Fan-out

Internal nodes can store more keys (no data), reducing tree height and disk I/O

3️⃣ Simpler Deletion

Data only in leaves makes deletion logic cleaner and more predictable

4️⃣ Consistent Performance

All searches go to leaf level, providing predictable I/O cost

5️⃣ Better Caching

Internal nodes are smaller, more fit in memory/cache

6️⃣ Full Scans

Scanning entire table is just a leaf-level linked list traversal

Real-World Usage

MySQL InnoDB: Uses B+ Trees for indexes
PostgreSQL: Uses B+ Trees for indexes
Oracle: Uses B+ Trees for indexes
SQL Server: Uses B+ Trees for indexes

Summary

Key Takeaways

  • Tree-based indexing is essential for efficient database operations
  • B-Trees provide balanced, disk-friendly indexing with O(log n) operations
  • B+ Trees enhance B-Trees by storing data only in leaves
  • Linked leaf nodes in B+ Trees enable efficient range queries and sequential access
  • Modern DBMS universally prefer B+ Trees for their superior performance characteristics

What's Next?

Practice: Try implementing B+ Tree insert and delete operations on paper to solidify your understanding!
1 / 16