Lecture 14

Clustering and Secondary Indexing

Unit III: Storage & Indexing

Database Systems - MTech First Semester

Cloud & Software Operations Cluster | SOCS | UPES

Dr. Mohsin Dar
Assistant Professor

Learning Objectives

What We'll Learn Today

Understand different types of indexing structures
Learn about Clustering Index and its characteristics
Explore Secondary Index and its applications
Compare Dense vs Sparse indexing
Analyze index performance through practical calculations
Understand multilevel indexing concepts

Index Classification

Types of Indexing

INDEX

Single Level

Multilevel

Primary Index

Clustering Index

Secondary Index

Simple Multilevel

B-Tree

B+ Tree

Classification Overview

Single Level: One level of indirection between index and data
Multilevel: Multiple levels of indexes for large files

Recap: Primary Index

PRIMARY INDEXING

Key Characteristics

Main file is always sorted according to primary key.
Indexing is done on Primary Key, therefore called as primary indexing.
Index file has two columns: first primary key and second anchor pointer (base address of block).
It is an example of Sparse Indexing.
Here first record (anchor record) of every block gets an entry in the index file.
No. of entries in the index file = No. of blocks acquired by the main file.

Primary Index Structure

Index File (Sparse)
Primary Key (First Record)	Block Pointer
K1	→ Block 1
K10	→ Block 2
K20	→ Block 3
...	...

Clustering Index

CLUSTERED INDEXING

What is Clustering Index?

A clustering index is created when the data file is ordered on a non-key attribute that may have duplicate values.

Key Characteristics

Main file will be ordered on some non-key attributes.
No. of entries in the index file = No. of unique values of the attribute on which indexing is done.
It is an example of Sparse as well as Dense indexing.
One index entry per distinct value of the clustering field.
Records with the same clustering field value are grouped together.

Clustering Index Example

Clustering Index: Practical Example

Scenario: Employee Database

Consider an employee database ordered by Department_ID (non-key field):

Employee_ID	Name	Department_ID	Salary
101	Alice	10	50000
105	Bob	10	55000
102	Charlie	20	60000
108	David	20	62000
103	Eve	30	58000

Clustering Index on Department_ID

Department_ID	Block Pointer
10	→ Block 1
20	→ Block 2
30	→ Block 3

Note: Only 3 index entries for potentially hundreds of employee records!

Clustering Index Benefits

Advantages and Disadvantages

Advantages:

✓ Efficient for range queries on clustering field
✓ Reduces number of index entries (sparse)
✓ Groups related records together (better locality)
✓ Faster retrieval for queries on clustering attribute

Disadvantages:

✗ Data file must be ordered on non-key field
✗ Insertions/deletions are expensive
✗ Only one clustering index per file
✗ Requires periodic reorganization

Secondary Index

SECONDARY INDEXING

What is Secondary Index?

A secondary index provides an alternate access path to the data file. It can be created on any field (key or non-key) that is different from the primary ordering field.

Key Characteristics

Most common scenario: Suppose we already have a primary indexing on primary key, but there is frequent query on some other attributes, so we may decide to have one more index file with some other attribute.
Main file is unordered according to the attribute on which secondary indexing is done.
Secondary indexing can be done on key or non-key attribute.
No. of entries in the index file is same as the number of entries in the main file.
It is an example of dense indexing.

Secondary Index Example

Secondary Index: Practical Scenario

Scenario: Employee Database with Multiple Access Patterns

Primary Index: On Employee_ID (primary key)

Secondary Index: On Salary (for frequent salary-based queries)

Main Data File (Ordered by Employee_ID)

Emp_ID	Name	Salary
101	Alice	50000
102	Bob	60000
103	Charlie	55000
104	David	50000

Secondary Index on Salary

Salary	Record Pointer
50000	→ Record 1 (Alice)
50000	→ Record 4 (David)
55000	→ Record 3 (Charlie)
60000	→ Record 2 (Bob)

Note: Secondary index has one entry for each record (Dense Index)

Practical Problem

Secondary Index: Problem Statement

Given:

Number of records: r = 30,000
Block size: B = 1,024 bytes
Record length: R = 100 bytes
Ordering key field length: V = 9 bytes
Record pointer size: P = 6 bytes
Records are fixed-size and unspanned
File is unordered

Find:

Design and analyze the secondary index for this file.

Solution - Part 1

Step 1: Calculate Blocking Factor (bfr)

Blocking Factor (bfr): Number of records that can fit in one block

bfr = ⌊Block Size / Record Size⌋ = ⌊B / R⌋

Calculation:

bfr = ⌊1,024 / 100⌋ = ⌊10.24⌋ = 10 records/block

Interpretation: Each data block can store exactly 10 records.

Solution - Part 2

Step 2: Calculate Number of Blocks (b)

Number of Blocks: Total blocks needed to store all records

b = ⌈Total Records / Blocking Factor⌉ = ⌈r / bfr⌉

Calculation:

b = ⌈30,000 / 10⌉ = 3,000 blocks

Storage for Data File:

Total Space = 3,000 blocks × 1,024 bytes = 3,072,000 bytes ≈ 3 MB

Solution - Part 3

Step 3: Calculate Index Entry Size

Secondary Index Entry Components:

Search Key Value: V = 9 bytes
Record Pointer (not block pointer): P = 6 bytes

Index Entry Size = V + P

Calculation:

Index Entry Size = 9 + 6 = 15 bytes

Important Difference: Secondary index uses record pointers (pointing to individual records), not block pointers. This is because the file is unordered.

Solution - Part 4

Step 4: Calculate Index Blocking Factor (bfr_i)

Index Blocking Factor: Number of index entries per block

bfr_i = ⌊Block Size / Index Entry Size⌋ = ⌊B / (V + P)⌋

Calculation:

bfr_i = ⌊1,024 / 15⌋ = ⌊68.27⌋ = 68 entries/block

Interpretation: Each index block can store 68 index entries.

Solution - Part 5

Step 5: Calculate Number of Index Blocks (b_i)

Number of Index Entries: One entry per record = 30,000 entries (Dense Index)

b_i = ⌈Number of Index Entries / bfr_i⌉ = ⌈r / bfr_i⌉

Calculation:

b_i = ⌈30,000 / 68⌉ = ⌈441.18⌉ = 442 index blocks

Storage for Secondary Index File:

Index Space = 442 blocks × 1,024 bytes = 452,608 bytes ≈ 442 KB

Performance Analysis

Search Performance Comparison

Without Secondary Index (Linear Search):

Average block accesses: b / 2 = 3,000 / 2 = 1,500 blocks

Worst case: 3,000 blocks

With Secondary Index (Binary Search on Index):

Index search: ⌈log₂(442)⌉ = 9 block accesses

Data record access: 1 block access

Total: 9 + 1 = 10 block accesses

Performance Improvement

Speedup: 1,500 / 10 = 150× faster on average!

Index Comparison

Primary vs Clustering vs Secondary Index

Characteristic	Primary Index	Clustering Index	Secondary Index
Ordering Field	Primary Key	Non-key attribute	Any field (key/non-key)
File Organization	Ordered (sorted)	Ordered (sorted)	Unordered
Index Type	Sparse	Sparse/Dense	Dense
No. of Entries	= No. of blocks	= No. of unique values	= No. of records
Pointer Type	Block pointer	Block pointer	Record pointer
Multiple Allowed	No (only one)	No (only one)	Yes (multiple)
Space Overhead	Low	Low	High

Dense vs Sparse Indexing

Dense Index

In dense index, there is an entry in the index file for every search key value in the main file. This makes searching faster but requires more space to store index records itself.

One entry per search key value
Larger space requirement
Can determine record existence from index alone
Note: It is not for every record, it is for every search key value. Sometimes number of records in the main file > number of search keys in the main file, for example if search key is repeated.

Sparse Index

If an index entry is created only for some records of the main file, then it is called sparse index.

No. of index entries in the index file < No. of records in the main file
Smaller space requirement
Data file must be ordered
Typically one entry per block

Note: Dense and sparse are not complementary to each other. Sometimes it is possible that an index is both dense and sparse (e.g., clustering index).

Visual Comparison

Dense vs Sparse: Visual Representation

Sparse Index

Index (Sparse)
10	→ Block 1
30	→ Block 2
50	→ Block 3

Only first record of each block indexed

Dense Index

Index (Dense)
10	→ Record 1
15	→ Record 2
20	→ Record 3
30	→ Record 4
...	...

Every record/search key indexed

Summary

Key Takeaways

Primary Index: Sparse index on ordered file's primary key field
Clustering Index: Index on ordered file's non-key field with duplicate values
Secondary Index: Dense index on any field, file can be unordered
Dense index has entry for every search key value
Sparse index has entries only for some records (typically per block)
Multiple secondary indexes possible, but only one primary/clustering index

Performance Trade-offs:

Space: Sparse < Dense
Search Speed: All provide logarithmic search time
Update Cost: Primary/Clustering > Secondary
Flexibility: Secondary > Primary/Clustering

Implementation Guidelines

When to Use Each Index Type

Use Primary Index When:

Data is naturally ordered by primary key
Frequent searches on primary key
Space efficiency is important
Update frequency is low

Use Clustering Index When:

Frequent queries on non-key attribute with few distinct values
Range queries on clustering field are common
Records with same value should be grouped together
Example: Grouping employees by department

Use Secondary Index When:

Need multiple access paths to data
Frequent queries on non-ordering fields
File cannot be reordered
Example: Searching by email, phone, or other attributes

Clustering and Secondary Indexing

What We'll Learn Today

Types of Indexing

Classification Overview

PRIMARY INDEXING

Key Characteristics

Primary Index Structure

CLUSTERED INDEXING

What is Clustering Index?

Key Characteristics

Clustering Index: Practical Example

Scenario: Employee Database

Clustering Index on Department_ID

Advantages and Disadvantages

Advantages:

Disadvantages:

SECONDARY INDEXING

What is Secondary Index?

Key Characteristics

Secondary Index: Practical Scenario

Scenario: Employee Database with Multiple Access Patterns

Main Data File (Ordered by Employee_ID)

Secondary Index on Salary

Secondary Index: Problem Statement

Given:

Find:

Step 1: Calculate Blocking Factor (bfr)

Step 2: Calculate Number of Blocks (b)

Step 3: Calculate Index Entry Size

Step 4: Calculate Index Blocking Factor (bfri)

Step 5: Calculate Number of Index Blocks (bi)

Search Performance Comparison

Without Secondary Index (Linear Search):

With Secondary Index (Binary Search on Index):

Performance Improvement

Primary vs Clustering vs Secondary Index

Dense vs Sparse Indexing

Dense Index

Sparse Index

Dense vs Sparse: Visual Representation

Sparse Index

Dense Index

Key Takeaways

Performance Trade-offs:

When to Use Each Index Type

Use Primary Index When:

Use Clustering Index When:

Use Secondary Index When:

Step 4: Calculate Index Blocking Factor (bfr_i)

Step 5: Calculate Number of Index Blocks (b_i)