Lecture 14

Clustering and Secondary Indexing

Unit III: Storage & Indexing

Database Systems - MTech First Semester

Cloud & Software Operations Cluster | SOCS | UPES

Dr. Mohsin Dar
Assistant Professor

Learning Objectives

What We'll Learn Today

  • Understand different types of indexing structures
  • Learn about Clustering Index and its characteristics
  • Explore Secondary Index and its applications
  • Compare Dense vs Sparse indexing
  • Analyze index performance through practical calculations
  • Understand multilevel indexing concepts
Index Classification

Types of Indexing

INDEX
Single Level
Multilevel
Primary Index
Clustering Index
Secondary Index
Simple Multilevel
B-Tree
B+ Tree

Classification Overview

  • Single Level: One level of indirection between index and data
  • Multilevel: Multiple levels of indexes for large files
Recap: Primary Index

PRIMARY INDEXING

Key Characteristics

  • Main file is always sorted according to primary key.
  • Indexing is done on Primary Key, therefore called as primary indexing.
  • Index file has two columns: first primary key and second anchor pointer (base address of block).
  • It is an example of Sparse Indexing.
  • Here first record (anchor record) of every block gets an entry in the index file.
  • No. of entries in the index file = No. of blocks acquired by the main file.

Primary Index Structure

Index File (Sparse)
Primary Key (First Record) Block Pointer
K1 → Block 1
K10 → Block 2
K20 → Block 3
... ...
Clustering Index

CLUSTERED INDEXING

What is Clustering Index?

A clustering index is created when the data file is ordered on a non-key attribute that may have duplicate values.

Key Characteristics

  • Main file will be ordered on some non-key attributes.
  • No. of entries in the index file = No. of unique values of the attribute on which indexing is done.
  • It is an example of Sparse as well as Dense indexing.
  • One index entry per distinct value of the clustering field.
  • Records with the same clustering field value are grouped together.
Clustering Index Example

Clustering Index: Practical Example

Scenario: Employee Database

Consider an employee database ordered by Department_ID (non-key field):

Employee_ID Name Department_ID Salary
101 Alice 10 50000
105 Bob 10 55000
102 Charlie 20 60000
108 David 20 62000
103 Eve 30 58000

Clustering Index on Department_ID

Department_ID Block Pointer
10 → Block 1
20 → Block 2
30 → Block 3

Note: Only 3 index entries for potentially hundreds of employee records!

Clustering Index Benefits

Advantages and Disadvantages

Advantages:

  • ✓ Efficient for range queries on clustering field
  • ✓ Reduces number of index entries (sparse)
  • ✓ Groups related records together (better locality)
  • ✓ Faster retrieval for queries on clustering attribute

Disadvantages:

  • ✗ Data file must be ordered on non-key field
  • ✗ Insertions/deletions are expensive
  • ✗ Only one clustering index per file
  • ✗ Requires periodic reorganization
Secondary Index

SECONDARY INDEXING

What is Secondary Index?

A secondary index provides an alternate access path to the data file. It can be created on any field (key or non-key) that is different from the primary ordering field.

Key Characteristics

  • Most common scenario: Suppose we already have a primary indexing on primary key, but there is frequent query on some other attributes, so we may decide to have one more index file with some other attribute.
  • Main file is unordered according to the attribute on which secondary indexing is done.
  • Secondary indexing can be done on key or non-key attribute.
  • No. of entries in the index file is same as the number of entries in the main file.
  • It is an example of dense indexing.
Secondary Index Example

Secondary Index: Practical Scenario

Scenario: Employee Database with Multiple Access Patterns

Primary Index: On Employee_ID (primary key)

Secondary Index: On Salary (for frequent salary-based queries)

Main Data File (Ordered by Employee_ID)

Emp_ID Name Salary
101 Alice 50000
102 Bob 60000
103 Charlie 55000
104 David 50000

Secondary Index on Salary

Salary Record Pointer
50000 → Record 1 (Alice)
50000 → Record 4 (David)
55000 → Record 3 (Charlie)
60000 → Record 2 (Bob)

Note: Secondary index has one entry for each record (Dense Index)

Practical Problem

Secondary Index: Problem Statement

Given:

  • Number of records: r = 30,000
  • Block size: B = 1,024 bytes
  • Record length: R = 100 bytes
  • Ordering key field length: V = 9 bytes
  • Record pointer size: P = 6 bytes
  • Records are fixed-size and unspanned
  • File is unordered

Find:

Design and analyze the secondary index for this file.

Solution - Part 1

Step 1: Calculate Blocking Factor (bfr)

Blocking Factor (bfr): Number of records that can fit in one block

bfr = ⌊Block Size / Record Size⌋ = ⌊B / R⌋

Calculation:

bfr = ⌊1,024 / 100⌋ = ⌊10.24⌋ = 10 records/block

Interpretation: Each data block can store exactly 10 records.

Solution - Part 2

Step 2: Calculate Number of Blocks (b)

Number of Blocks: Total blocks needed to store all records

b = ⌈Total Records / Blocking Factor⌉ = ⌈r / bfr⌉

Calculation:

b = ⌈30,000 / 10⌉ = 3,000 blocks

Storage for Data File:

Total Space = 3,000 blocks × 1,024 bytes = 3,072,000 bytes ≈ 3 MB

Solution - Part 3

Step 3: Calculate Index Entry Size

Secondary Index Entry Components:

Index Entry Size = V + P

Calculation:

Index Entry Size = 9 + 6 = 15 bytes

Important Difference: Secondary index uses record pointers (pointing to individual records), not block pointers. This is because the file is unordered.

Solution - Part 4

Step 4: Calculate Index Blocking Factor (bfri)

Index Blocking Factor: Number of index entries per block

bfri = ⌊Block Size / Index Entry Size⌋ = ⌊B / (V + P)⌋

Calculation:

bfri = ⌊1,024 / 15⌋ = ⌊68.27⌋ = 68 entries/block

Interpretation: Each index block can store 68 index entries.

Solution - Part 5

Step 5: Calculate Number of Index Blocks (bi)

Number of Index Entries: One entry per record = 30,000 entries (Dense Index)

bi = ⌈Number of Index Entries / bfri⌉ = ⌈r / bfri

Calculation:

bi = ⌈30,000 / 68⌉ = ⌈441.18⌉ = 442 index blocks

Storage for Secondary Index File:

Index Space = 442 blocks × 1,024 bytes = 452,608 bytes ≈ 442 KB

Performance Analysis

Search Performance Comparison

Without Secondary Index (Linear Search):

Average block accesses: b / 2 = 3,000 / 2 = 1,500 blocks

Worst case: 3,000 blocks

With Secondary Index (Binary Search on Index):

Index search: ⌈log2(442)⌉ = 9 block accesses

Data record access: 1 block access

Total: 9 + 1 = 10 block accesses

Performance Improvement

Speedup: 1,500 / 10 = 150× faster on average!

Index Comparison

Primary vs Clustering vs Secondary Index

Characteristic Primary Index Clustering Index Secondary Index
Ordering Field Primary Key Non-key attribute Any field (key/non-key)
File Organization Ordered (sorted) Ordered (sorted) Unordered
Index Type Sparse Sparse/Dense Dense
No. of Entries = No. of blocks = No. of unique values = No. of records
Pointer Type Block pointer Block pointer Record pointer
Multiple Allowed No (only one) No (only one) Yes (multiple)
Space Overhead Low Low High
Dense vs Sparse Indexing

Dense vs Sparse Indexing

Dense Index

In dense index, there is an entry in the index file for every search key value in the main file. This makes searching faster but requires more space to store index records itself.

  • One entry per search key value
  • Larger space requirement
  • Can determine record existence from index alone
  • Note: It is not for every record, it is for every search key value. Sometimes number of records in the main file > number of search keys in the main file, for example if search key is repeated.

Sparse Index

If an index entry is created only for some records of the main file, then it is called sparse index.

  • No. of index entries in the index file < No. of records in the main file
  • Smaller space requirement
  • Data file must be ordered
  • Typically one entry per block

Note: Dense and sparse are not complementary to each other. Sometimes it is possible that an index is both dense and sparse (e.g., clustering index).

Visual Comparison

Dense vs Sparse: Visual Representation

Sparse Index

Index (Sparse)
10 → Block 1
30 → Block 2
50 → Block 3

Only first record of each block indexed

Dense Index

Index (Dense)
10 → Record 1
15 → Record 2
20 → Record 3
30 → Record 4
... ...

Every record/search key indexed

Summary

Key Takeaways

  • Primary Index: Sparse index on ordered file's primary key field
  • Clustering Index: Index on ordered file's non-key field with duplicate values
  • Secondary Index: Dense index on any field, file can be unordered
  • Dense index has entry for every search key value
  • Sparse index has entries only for some records (typically per block)
  • Multiple secondary indexes possible, but only one primary/clustering index

Performance Trade-offs:

  • Space: Sparse < Dense
  • Search Speed: All provide logarithmic search time
  • Update Cost: Primary/Clustering > Secondary
  • Flexibility: Secondary > Primary/Clustering
Implementation Guidelines

When to Use Each Index Type

Use Primary Index When:

  • Data is naturally ordered by primary key
  • Frequent searches on primary key
  • Space efficiency is important
  • Update frequency is low

Use Clustering Index When:

  • Frequent queries on non-key attribute with few distinct values
  • Range queries on clustering field are common
  • Records with same value should be grouped together
  • Example: Grouping employees by department

Use Secondary Index When:

  • Need multiple access paths to data
  • Frequent queries on non-ordering fields
  • File cannot be reordered
  • Example: Searching by email, phone, or other attributes