Hash-Based Indexing

Database Systems - Unit III

Dr. Mohsin Dar

Assistant Professor

Cloud & Software Operations Cluster

UPES - MTech First Semester

Introduction to Hash-Based Indexing

What is Hashing?

Hashing is a technique that uses a hash function to compute the address of a data record based on its search key value
Provides O(1) average time complexity for search, insert, and delete operations
Alternative to tree-based indexing (B-Trees, B+ Trees)

Key Components

Hash Function h(k): Maps search key k to a bucket address
Buckets: Storage units that hold one or more records
Collision: When multiple keys hash to the same bucket

Bucket Address = h(search_key) mod N

Figure: Visual representation of hashing process

Hash Function Visualization

Example: h(k) = k mod 5

Keys: 12, 17, 23, 8, 31, 19

↓ Apply Hash Function ↓

Bucket 0

Bucket 1

31

Bucket 2

12

17

Bucket 3

23

8

Bucket 4

19
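The bucket assignment above can be reproduced with a few lines of Python. This is a minimal sketch: the function names `h` and `assign_buckets` are illustrative choices, not part of the slides.

```python
def h(k: int, n_buckets: int = 5) -> int:
    """Division-method hash from the example: h(k) = k mod 5."""
    return k % n_buckets


def assign_buckets(keys, n_buckets=5):
    """Place each key in the bucket selected by h(k)."""
    buckets = {b: [] for b in range(n_buckets)}
    for k in keys:
        buckets[h(k, n_buckets)].append(k)
    return buckets


buckets = assign_buckets([12, 17, 23, 8, 31, 19])
for b, ks in buckets.items():
    print(f"Bucket {b}: {ks}")
```

Keys 12 and 17 collide in bucket 2, and 23 and 8 collide in bucket 3; bucket 0 stays empty.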

Static Hashing

Definition

Fixed number of buckets allocated at creation time
Hash function maps keys to a fixed range of bucket addresses
Simplest form of hash-based indexing

Characteristics

Fixed Structure: Number of buckets (N) remains constant
Overflow Handling: Uses overflow buckets/chains for collisions
Performance: Degrades when load factor increases significantly

Load Factor (α) = (Number of Records) / (Number of Buckets × Bucket Capacity)
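As a rough illustration of how the load factor relates to overflow chains, the sketch below models a static hash file in memory. The class name `StaticHashFile`, the page-list representation, and the choice of 4 buckets with capacity 2 are assumptions made for the example only.

```python
class StaticHashFile:
    """Toy static hash file: fixed bucket count, overflow pages chained per bucket."""

    def __init__(self, n_buckets=4, capacity=2):
        self.n_buckets = n_buckets
        self.capacity = capacity
        # Each bucket is a list of pages; page 0 is the primary page,
        # later pages model the overflow chain.
        self.buckets = [[[]] for _ in range(n_buckets)]
        self.n_records = 0

    def insert(self, key):
        pages = self.buckets[key % self.n_buckets]
        if len(pages[-1]) == self.capacity:
            pages.append([])  # last page is full: add an overflow page
        pages[-1].append(key)
        self.n_records += 1

    def search(self, key):
        """Return the number of pages read to find key, or None if absent."""
        pages = self.buckets[key % self.n_buckets]
        for i, page in enumerate(pages):
            if key in page:
                return i + 1
        return None

    def load_factor(self):
        return self.n_records / (self.n_buckets * self.capacity)


f = StaticHashFile(n_buckets=4, capacity=2)
for k in range(12):
    f.insert(k)
print(f.load_factor())  # 12 records / (4 buckets * 2 slots) = 1.5
print(f.search(8))      # 8 sits on the first overflow page of bucket 0
```

With α above 1, every bucket here needs at least one overflow page, so some lookups cost two page reads instead of one.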

✓ Advantages:

Simple implementation
Fast access when properly configured
Predictable memory usage

Problems with Static Hashing

Major Limitations:

Database Growth: Performance degrades as more records are added beyond initial capacity
Long Overflow Chains: Multiple collisions create linked lists, degrading to O(n) search time
Fixed Allocation: Cannot dynamically adjust to changing data volumes
Space Wastage: Pre-allocating too many buckets wastes space; too few causes overflow

Solution Needed:

Dynamic hashing techniques that can grow and shrink based on the number of records

↓

Extendable Hashing & Linear Hashing

Extendable Hashing

Concept

Dynamic hashing technique in which the hash structure grows and shrinks with the number of records
Uses a directory (lookup table) that can double in size when needed
Based on binary representation of hash values

Key Components

Global Depth (d): Number of bits used from the hash value for directory indexing
Local Depth (d'): Number of hash bits shared by all keys stored in a specific bucket
Directory: Array of 2^d bucket pointers
Buckets: Store actual data records

Directory Size = 2^{Global Depth}

Relationship

Local Depth ≤ Global Depth for all buckets

Extendable Hashing Structure

Example Structure (Global Depth = 2)

Directory (d=2)

00 → Bucket A

01 → Bucket B

10 → Bucket C

11 → Bucket B

Bucket A (d'=2)

Key: 4 (00)

Key: 12 (00)

Bucket B (d'=1)

Key: 5 (01)

Key: 7 (11)

Bucket C (d'=2)

Key: 10 (10)
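The lookup in this structure can be sketched as follows, assuming integer keys whose hash is the key itself and a directory indexed by the low-order d bits (which is how the bit patterns above are written). The names `directory_index` and `bucket_for` are illustrative.

```python
GLOBAL_DEPTH = 2

# Directory slots 00, 01, 10, 11. Slots 01 and 11 share Bucket B,
# which is why B has local depth 1.
directory = ["A", "B", "C", "B"]


def directory_index(hash_value: int, global_depth: int) -> int:
    """Keep only the low-order `global_depth` bits of the hash value."""
    return hash_value & ((1 << global_depth) - 1)


def bucket_for(key: int) -> str:
    return directory[directory_index(key, GLOBAL_DEPTH)]


for key in (4, 12, 5, 7, 10):
    idx = directory_index(key, GLOBAL_DEPTH)
    print(f"key {key:2d} -> slot {idx:02b} -> Bucket {bucket_for(key)}")
```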

Extendable Hashing: Operations

Search Operation

1. Apply hash function to search key
2. Use d bits of the hash value (d = global depth) to index into the directory; the numeric examples here use the low-order (last) d bits
3. Follow pointer to appropriate bucket
4. Search sequentially within bucket

Insert Operation

1. Hash the key and locate target bucket
2. If bucket has space: Insert record
3. If bucket is full:
a) If local depth < global depth: Split bucket
b) If local depth = global depth: Double directory, then split bucket
4. Redistribute records based on additional bit

Delete Operation

1. Locate and remove the record
2. If bucket becomes empty, merge with sibling bucket
3. Reduce local depth if appropriate
4. Halve directory if possible (all local depths < global depth)
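The search and insert steps above can be sketched in Python. This is a simplified in-memory model under stated assumptions: integer keys hashed to themselves, low-order bits used for indexing, no deletion or merging, and illustrative names (`Bucket`, `ExtendibleHash`).

```python
class Bucket:
    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth
        self.capacity = capacity
        self.keys = []


class ExtendibleHash:
    def __init__(self, capacity=2):
        self.global_depth = 1
        self.capacity = capacity
        self.directory = [Bucket(1, capacity), Bucket(1, capacity)]

    def _index(self, key):
        # Assumption: the hash of an integer key is the key itself.
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        # Note: loops forever if more than `capacity` keys share one hash value.
        while True:
            bucket = self.directory[self._index(key)]
            if len(bucket.keys) < bucket.capacity:
                bucket.keys.append(key)
                return
            if bucket.local_depth == self.global_depth:
                # Double the directory: slot i and slot i + 2^d share low d bits,
                # so the new upper half initially mirrors the lower half.
                self.directory = self.directory + self.directory
                self.global_depth += 1
            self._split(bucket)

    def _split(self, bucket):
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, self.capacity)
        new_bit = 1 << (bucket.local_depth - 1)
        # Slots whose newly examined bit is 1 now point to the sibling.
        for i, b in enumerate(self.directory):
            if b is bucket and i & new_bit:
                self.directory[i] = sibling
        # Redistribute the old records using the additional bit.
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            self.directory[self._index(k)].keys.append(k)


eh = ExtendibleHash(capacity=2)
for k in (4, 12, 20):
    eh.insert(k)
print(eh.global_depth, eh.search(20), eh.search(5))
```

Inserting 4, 12 and 20 forces repeated doubling because the three keys agree on their three lowest bits; the directory ends at global depth 4.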

Extendable Hashing: Insert Example

Scenario: Inserting key 20 causing bucket split

Before Insert (d=2)

Bucket A (d'=2)

4 (100)

12 (1100)

FULL!

Problem: Inserting 20 (10100) → ends with "00"
Bucket A is full!

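After Split (d=4)

Bucket A1 (d'=4)

4 (0100)

20 (0100)

Bucket A2 (d'=4)

12 (1100)

Solution: Global depth = local depth
→ Double directory (d=3) and split bucket using 3rd bit
→ 4, 12 and 20 all end with "100", so they stay together (the "000" sibling is empty) and the bucket is still full
→ Double directory again (d=4) and split bucket using 4th bit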
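How many low-order bits are needed before the keys 4, 12 and 20 fit into buckets of capacity 2 can be checked directly. The helper names `low_bits` and `depth_to_separate` are hypothetical, and a bucket capacity of 2 is assumed from the "FULL!" marker in the example.

```python
from collections import Counter


def low_bits(key: int, depth: int) -> str:
    """Low-order `depth` bits of key as a zero-padded binary string."""
    return format(key & ((1 << depth) - 1), f"0{depth}b")


def depth_to_separate(keys, capacity, start_depth=1, max_depth=32):
    """Smallest depth at which no bit pattern is shared by more than `capacity` keys."""
    for depth in range(start_depth, max_depth + 1):
        counts = Counter(low_bits(k, depth) for k in keys)
        if max(counts.values()) <= capacity:
            return depth
    return None


keys = [4, 12, 20]
for depth in (2, 3, 4):
    print(depth, {k: low_bits(k, depth) for k in keys})
print("depth needed:", depth_to_separate(keys, capacity=2, start_depth=2))
```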

Extendable Hashing: Student Names Example

Student Names & Hashes

Name	First Letter	Hash (binary)
Arun	A	000
Aditi	A	001
Aman	A	010
Bharat	B	100
Bina	B	101

Step 1-2: Initial Setup (GD=1)

Directory (1 bit)

0 → B0

1 → B1

B0 (LD=1)

[ ]

B1 (LD=1)

[ ]

Step 3: After Inserting Arun, Aditi

Directory (1 bit)

0 → B0

1 → B1

B0 (LD=1)

Arun (000)

Aditi (001)

B1 (LD=1)

[ ]

Aditi (001) is added to B0

Step 4-6: Directory Doubling (GD=2)

Directory (2 bits)

00 → B0

01 → B0

10 → B1

11 → B1

B0 (LD=1)

Arun (000)

Aditi (001)

Aman (010)

B1 (LD=1)

[ ]

Aman (010) overflows B0; since LD = GD = 1, the directory doubles first and B0 is then split on the second bit

Final State (After All Insertions)

Directory (GD=2)

00 → B0a

01 → B0b

10 → B1

11 → B1

B0a (LD=2)

Arun (000)

Aditi (001)

B0b (LD=2)

Aman (010)

B1 (LD=1)

Bharat (100)

Bina (101)

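Key: GD = Global Depth (2 in the final state), LD = Local Depth. This example indexes the directory with the leading (high-order) bits of the hash.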
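The grouping in this example can be checked by taking the leading bits of each hash string. The sketch below only groups names by prefix; it does not model bucket capacity or splitting, and `group_by_prefix` is an illustrative name.

```python
hashes = {
    "Arun": "000",
    "Aditi": "001",
    "Aman": "010",
    "Bharat": "100",
    "Bina": "101",
}


def group_by_prefix(hashes, depth):
    """Group names by the first `depth` bits of their hash string."""
    groups = {}
    for name, bits in hashes.items():
        groups.setdefault(bits[:depth], []).append(name)
    return groups


print(group_by_prefix(hashes, 1))  # directory with 1 bit
print(group_by_prefix(hashes, 2))  # directory with 2 bits
```

With one bit, prefix 0 holds three names, which is what forces the split. With two bits, prefixes 10 and 11 together hold only Bharat and Bina, so B1 can stay at local depth 1.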

Linear Hashing

Concept

Dynamic hashing that grows one bucket at a time in a linear fashion
No directory required - more space efficient than extendable hashing
Uses multiple hash functions based on level

Key Components

Level (L): Current round of splitting
Next (N): Pointer to next bucket to split
Hash Functions: h_L(k) and h_{L+1}(k)
Split Threshold: Based on load factor

h_L(k) = k mod (2^L × N₀)
h_{L+1}(k) = k mod (2^{L+1} × N₀)

where N₀ = initial number of buckets

Linear Hashing: How It Works

Split Trigger

Split occurs when load factor exceeds threshold (typically 0.7-0.8)
Bucket that caused overflow may NOT be the one that splits!
Split follows round-robin order from bucket 0

Split Process

1. Identify bucket N (Next pointer) to split
2. Create new bucket at end of file
3. Redistribute records from bucket N using h_{L+1}
4. Increment Next pointer
5. If all buckets in current level split → increment Level (L) and reset Next to 0

Search Strategy

1. Apply h_L(k) to get bucket address B
2. If B < Next: Bucket B has already been split in this round, so recompute the address with h_{L+1}(k)
3. If B ≥ Next: Bucket B has not been split yet, so B is the address
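The split process and search strategy can be sketched together. This is a simplified in-memory model: overflow is represented by letting a bucket list grow past its nominal capacity, and the class name `LinearHashFile`, the parameters, and the keys are assumptions chosen for illustration.

```python
class LinearHashFile:
    def __init__(self, n0=2, capacity=2, threshold=0.75):
        self.n0 = n0
        self.capacity = capacity
        self.threshold = threshold
        self.level = 0
        self.next = 0
        self.buckets = [[] for _ in range(n0)]
        self.n_records = 0

    def _h(self, level, key):
        return key % ((2 ** level) * self.n0)

    def address(self, key):
        b = self._h(self.level, key)
        if b < self.next:  # bucket b was already split in this round
            b = self._h(self.level + 1, key)
        return b

    def insert(self, key):
        self.buckets[self.address(key)].append(key)
        self.n_records += 1
        load = self.n_records / (len(self.buckets) * self.capacity)
        if load > self.threshold:
            self._split()

    def _split(self):
        # Always split the bucket at Next, not necessarily the one that overflowed.
        old = self.buckets[self.next]
        self.buckets[self.next] = []
        self.buckets.append([])  # new bucket at the end of the file
        for k in old:
            self.buckets[self._h(self.level + 1, k)].append(k)
        self.next += 1
        if self.next == (2 ** self.level) * self.n0:
            self.level += 1  # round complete
            self.next = 0


f = LinearHashFile(n0=2, capacity=2, threshold=0.75)
for k in (1, 2, 3, 4, 5):
    f.insert(k)
print(f.level, f.next, f.buckets)
```

Inserting 4 triggers the split of bucket 0 and inserting 5 triggers the split of bucket 1, which completes the round and moves the file to level 1.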

Linear Hashing: Example

Initial State: L=0, N=0, N₀=4

Hash Functions:

h₀(k) = k mod 4

h₁(k) = k mod 8

Bucket 0 ← Next

4, 12, 20

Bucket 1

5, 13

Bucket 2

6, 14

Bucket 3

7, 15

↓ Insert 28 causes split of Bucket 0 ↓

Bucket 0 (split)

(empty)

Bucket 1 ← Next

5, 13

Bucket 2

6, 14

Bucket 3

7, 15

Bucket 4 (new)

4, 12, 20, 28 (h₁ = 4 for each)

Note: every key in Bucket 0 has k mod 8 = 4, so all records move to Bucket 4
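The redistribution step for this example can be checked with a small helper that applies h₁(k) = k mod 8 to the contents of the bucket being split. The function name `split_bucket` is illustrative.

```python
def split_bucket(keys, bucket_no, level, n0=4):
    """Partition keys of `bucket_no` using h_{level+1}(k) = k mod (2^(level+1) * n0)."""
    new_mod = (2 ** (level + 1)) * n0
    stay = [k for k in keys if k % new_mod == bucket_no]
    move = [k for k in keys if k % new_mod != bucket_no]
    return stay, move


# Bucket 0 after key 28 has been added, split at level 0 with N0 = 4.
stay, move = split_bucket([4, 12, 20, 28], bucket_no=0, level=0)
print("stays in bucket 0:", stay)
print("moves to bucket 4:", move)

# A different, hypothetical key set splits more evenly.
print(split_bucket([8, 16, 4], bucket_no=0, level=0))
```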

Extendable vs Linear Hashing

Aspect	Extendable Hashing	Linear Hashing
Directory	Required (can double in size)	Not required
Growth Pattern	Exponential (doubles)	Linear (one bucket at a time)
Split Trigger	Bucket overflow	Load factor threshold
Which Bucket Splits	Overflowing bucket	Next in round-robin order
Space Overhead	Directory space overhead	Minimal overhead
Performance	1-2 disk accesses	1-2 disk accesses + overflow
Complexity	Moderate	Higher (multiple hash functions)

Hash-Based Indexing: Trade-offs

✓ Advantages of Hash-Based Indexing

Fast Access: O(1) average case for equality searches
Simple Implementation: Straightforward hash function logic
Uniform Distribution: Good hash functions distribute data evenly
Dynamic Growth: Extendable and Linear hashing adapt to data size
No Reordering: Records don't need to maintain sorted order

✗ Disadvantages of Hash-Based Indexing

No Range Queries: Cannot efficiently support range searches (e.g., age > 25)
No Ordering: Cannot retrieve records in sorted order
Collision Overhead: Performance degrades with poor hash functions
Space Wastage: Empty buckets in static hashing
Complex Maintenance: Dynamic hashing requires split/merge operations

When to Use: Hash vs Tree Indexing

Use Hash-Based Indexing When:

Queries are primarily equality searches (exact match)
No requirement for ordered retrieval or sorting
No range queries needed
Fast single-record access is critical
Example: SELECT * FROM Students WHERE StudentID = 12345

Use Tree-Based Indexing (B+Tree) When:

Need to support range queries (e.g., salary BETWEEN 50000 AND 80000)
Ordered traversal required (e.g., ORDER BY)
Prefix matching (e.g., name LIKE 'John%')
Multi-attribute indexing with ordering
Example: SELECT * FROM Employees WHERE Age > 30 AND Age < 50

Key Decision: Equality Queries → Hash | Range/Order Queries → B+Tree
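The difference between the two access patterns can be illustrated in memory. In this sketch a Python `dict` stands in for a hash index and a sorted list searched with `bisect` stands in for an ordered (B+Tree-like) index; the student records are made up for the example.

```python
import bisect

# Hypothetical records keyed by StudentID.
records = {12345: "Asha", 12001: "Ravi", 12999: "Meera", 12500: "John"}

# Hash-style index: one probe answers an equality query.
hash_index = dict(records)
print(hash_index[12345])

# Ordered index stand-in: sorted keys plus binary search.
sorted_ids = sorted(records)


def range_lookup(lo, hi):
    """Return all IDs with lo <= id <= hi using binary search on the sorted keys."""
    i = bisect.bisect_left(sorted_ids, lo)
    j = bisect.bisect_right(sorted_ids, hi)
    return sorted_ids[i:j]


print(range_lookup(12400, 13000))

# The hash index has no key order, so the same range query must test every key.
print(sorted(k for k in hash_index if 12400 <= k <= 13000))
```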

Collision Resolution Techniques

1. Chaining (Separate Chaining)

Each bucket maintains a linked list of all records that hash to it
Most common in database systems
No fixed limit on records per bucket (bounded only by available storage)

2. Open Addressing

Linear Probing: Search sequentially for next empty slot
Quadratic Probing: Use quadratic function to find next slot
Double Hashing: Use second hash function for probing
Less common in database systems due to deletion complexity
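The deletion complexity mentioned above shows up even in a small linear-probing table: a deleted slot cannot simply be emptied, or later keys in the same probe sequence become unreachable. The sketch below uses a tombstone marker for this; the class name and table size are assumptions, and duplicate keys are not handled.

```python
EMPTY = object()    # slot never used
DELETED = object()  # tombstone: slot used earlier, record removed


class LinearProbingTable:
    def __init__(self, size=8):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        """Yield slot indexes starting at key mod size, wrapping around once."""
        start = key % len(self.slots)
        for i in range(len(self.slots)):
            yield (start + i) % len(self.slots)

    def insert(self, key):
        for idx in self._probe(key):
            if self.slots[idx] is EMPTY or self.slots[idx] is DELETED:
                self.slots[idx] = key
                return idx
        raise RuntimeError("table full")

    def search(self, key):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is EMPTY:
                return None  # a never-used slot ends the probe sequence
            if slot is not DELETED and slot == key:
                return idx
        return None

    def delete(self, key):
        idx = self.search(key)
        if idx is not None:
            self.slots[idx] = DELETED  # keep the probe chain intact


t = LinearProbingTable(size=8)
t.insert(3)    # lands in slot 3
t.insert(11)   # 11 mod 8 = 3 is taken, so it probes on to slot 4
t.delete(3)
print(t.search(11), t.search(3))
```

Had slot 3 been reset to empty instead of marked as deleted, the search for 11 would stop at slot 3 and miss the record.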

3. Overflow Buckets

Primary bucket points to overflow area when full
Used in static hashing
Can create long overflow chains affecting performance

Good Hash Function Properties: Uniform Distribution + Low Collision Rate

Performance Analysis

Time Complexity Comparison

Operation	Hash Index	B+Tree Index
Equality Search	O(1) average	O(log n)
Range Search	O(n) full scan (no index support)	O(log n + k), k = matching records
Insert	O(1) average	O(log n)
Delete	O(1) average	O(log n)
Ordered Traversal	O(n log n) (requires a separate sort)	O(n)

Space Complexity

Static Hashing: O(N) where N = number of buckets
Extendable Hashing: O(2^d) for directory + O(N) for buckets
Linear Hashing: O(N) where N = current number of buckets
B+Tree: O(N) where N = number of nodes

Real-World Applications

Database Systems Using Hash Indexing

PostgreSQL: Hash indexes for equality comparisons
Oracle: Hash clusters for frequently accessed tables
MySQL: Hash indexes in MEMORY storage engine
NoSQL Databases: DynamoDB, Cassandra use consistent hashing

Common Use Cases

1. Primary Key Lookups: Student records by ID, Employee by SSN

2. Join Operations: Hash joins in query processing

3. Duplicate Detection: Finding duplicate records efficiently

4. Cache Implementation: Database buffer cache, query cache

5. Distributed Systems: Consistent hashing for data partitioning
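As one concrete instance of these use cases, a hash join can be sketched in a few lines: build a hash table on one input, then probe it with the other. The table contents and column names below are hypothetical, and real systems add partitioning and spilling to disk that this sketch omits.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join two lists of dict rows using an in-memory hash table."""
    # Build phase: hash the (usually smaller) input on its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: look up each row of the other input.
    result = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            result.append({**match, **row})
    return result


students = [
    {"student_id": 1, "name": "Arun"},
    {"student_id": 2, "name": "Bina"},
]
enrolments = [
    {"student_id": 1, "course": "DBMS"},
    {"student_id": 1, "course": "OS"},
    {"student_id": 3, "course": "AI"},
]

joined = hash_join(students, enrolments, "student_id", "student_id")
for row in joined:
    print(row)
```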

Best Practices & Design Guidelines

Choosing Hash Function

Select hash function that provides uniform distribution
Avoid functions that create clustering
Common choices: Division method, Multiplication method, Universal hashing
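The three families named above can be sketched for integer keys as follows. The constant A = (√5 − 1)/2 is the value commonly suggested for the multiplication method, and the prime 2³¹ − 1 in the universal family is one convenient choice; both are assumptions of this sketch rather than requirements.

```python
import math
import random


def division_hash(k: int, m: int) -> int:
    """Division method: h(k) = k mod m (m is usually chosen prime)."""
    return k % m


def multiplication_hash(k: int, m: int, a: float = (math.sqrt(5) - 1) / 2) -> int:
    """Multiplication method: h(k) = floor(m * frac(k * A)), with 0 < A < 1."""
    return math.floor(m * ((k * a) % 1.0))


def make_universal_hash(m: int, p: int = 2**31 - 1, rng=random):
    """Pick a random member of the family h(k) = ((a*k + b) mod p) mod m."""
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m


print(division_hash(12345, 101))
print(multiplication_hash(12345, 16))
h_u = make_universal_hash(16)
print(h_u(12345))
```

The universal variant draws its parameters at random, so its output differs between runs; only the bucket range is fixed.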

Configuration Guidelines

Static Hashing: Estimate data growth and allocate 30-40% extra buckets

Extendable Hashing: Choose the initial global depth and bucket capacity based on the expected number of records

Linear Hashing: Set load factor threshold between 0.7-0.8 for optimal performance

When NOT to Use Hashing

Primary access pattern involves range queries
Need sorted output frequently
Partial key searches required
Data has high skew (non-uniform distribution)

Summary: Hash-Based Indexing

Key Takeaways

1. Static Hashing: Simple but suffers from fixed size and overflow chains

2. Extendable Hashing: Uses directory structure, doubles when needed, handles overflow elegantly

3. Linear Hashing: No directory, grows linearly, uses multiple hash functions

4. Performance: O(1) for equality searches but cannot handle range queries

5. Choice: Hash for equality, B+Tree for range and ordering

Hash Indexing = Fast Equality Searches + Simple Implementation − Range Query Support − Ordered Access

Understanding the trade-offs is key to choosing the right indexing technique!

Questions?

SOCS | UPES

Next: Tree-Based Indexing Deep Dive
