Unit III: Storage & Indexing
Database Systems - MTech First Semester
Cloud & Software Operations Cluster | SOCS | UPES
Dr. Mohsin Dar
Assistant Professor
File Organization refers to the way records are arranged and stored in a file on secondary storage devices.
The organization of data files impacts:
Records are stored in the order they are inserted - no particular ordering.
✓ Advantages: Fast insertion (O(1)), simple structure, no reorganization needed
✗ Disadvantages: Slow search (O(n) - linear scan), inefficient for queries
Records are stored in order based on a key field.
✓ Advantages: Efficient for range queries, binary search possible (O(log2n))
✗ Disadvantages: Expensive insertions/deletions, requires reorganization
Records are distributed across buckets using a hash function.
✓ Advantages: Fast exact-match queries
✗ Disadvantages: Poor for range queries, collisions
| Organization | Search | Insert | Delete | Range Query |
|---|---|---|---|---|
| Heap | O(n) | O(1) | O(n) | O(n) |
| Sequential | O(log n) | O(n) | O(n) | Efficient |
| Hash | O(1) avg | O(1) avg | O(1) avg | Poor |
Problem: Searching through millions of records sequentially is extremely slow!
Consider a database with 1 million records:
An index in a database is like an index in a book:
Just like a book's index helps you find topics quickly, a database index helps locate records efficiently.
Index: A data structure that improves the speed of data retrieval operations on a database table at the cost of additional storage space and slower writes.
An index consists of:
1. Primary Index
Built on the ordering key field of an ordered file. One index entry per data block.
2. Clustering Index
Built on non-key ordering field. Multiple records may have the same value.
3. Secondary Index
Built on any non-ordering field. Can be on key or non-key fields.
Primary Index: A sparse index on the ordering key field of a sequentially ordered file.
| Index Entry | Components |
|---|---|
| K(i) | Key value of first record in block i |
| P(i) | Pointer to block i |
Design and analyze the primary index for this file.
Blocking Factor (bfr): Number of records that can fit in one block
Calculation:
bfr = ⌊1,024 / 100⌋ = ⌊10.24⌋ = 10 records/block
Interpretation: Each data block can store exactly 10 records. The remaining 24 bytes in each block are wasted (internal fragmentation).
Number of Blocks: Total blocks needed to store all records
Calculation:
b = ⌈30,000 / 10⌉ = 3,000 blocks
Storage for Data File:
Total Space = 3,000 blocks × 1,024 bytes = 3,072,000 bytes ≈ 3 MB
Index Entry Components:
Calculation:
Index Entry Size = 9 + 6 = 15 bytes
Each index entry will store:
[ Key of First Record | Block Pointer ]
[ 9 bytes | 6 bytes ] = 15 bytes total
Index Blocking Factor: Number of index entries per block
Calculation:
bfri = ⌊1,024 / 15⌋ = ⌊68.27⌋ = 68 entries/block
Interpretation: Each index block can store 68 index entries.
Utilized space per index block: 68 × 15 = 1,020 bytes
Wasted space per index block: 1,024 - 1,020 = 4 bytes
Number of Index Entries: One entry per data block = 3,000 entries
Calculation:
bi = ⌈3,000 / 68⌉ = ⌈44.12⌉ = 45 index blocks
Storage for Index File:
Index Space = 45 blocks × 1,024 bytes = 46,080 bytes ≈ 45 KB
Average block accesses: b / 2 = 3,000 / 2 = 1,500 blocks
Worst case: 3,000 blocks
Index search: ⌈log2(45)⌉ = 6 block accesses
Data block access: 1 block access
Total: 6 + 1 = 7 block accesses
Speedup: 1,500 / 7 = 214× faster on average!
| Component | Size | Percentage |
|---|---|---|
| Data File | 3,072,000 bytes (≈3 MB) | 98.5% |
| Primary Index | 46,080 bytes (≈45 KB) | 1.5% |
| Total | 3,118,080 bytes | 100% |
Key Insight: The index uses only 1.5% additional space but provides 214× performance improvement for searches!
Space-Performance Tradeoff: This is an excellent tradeoff - minimal space overhead for massive performance gains.
Contains an index entry for every search key value in the data file.
Contains index entries for only some of the search key values.
• 30,000 records → 3,000 data blocks
• Primary index → 45 index blocks
• Search improvement: 1,500 accesses → 7 accesses (214× faster)
• Space overhead: Only 1.5% additional storage
Algorithm:
Time Complexity: O(log2 bi) + 1 block accesses
Algorithm:
Challenge: May require data reorganization
Algorithm:
Scenario: Search for employee with ID = 15,750
Index has 45 blocks with 68 entries each
Binary search iterations: ⌈log2(45)⌉ = 6 accesses
Find: Index entry with key ≤ 15,750
Follow pointer from index to data block
Block contains records starting with key 15,741
1 block access
Block has 10 records (bfr = 10)
Can use binary search or linear scan
Find employee record with ID = 15,750
Total: 7 block accesses vs 1,500 average without index!
Problem: If the index itself becomes very large (doesn't fit in memory), we need multiple disk accesses to search the index!
Create an index on the index!
Preview: We'll study B-Trees and B+ Trees in upcoming lectures - these are sophisticated multilevel index structures!
Best Practice: Analyze your query patterns and update frequencies before choosing an indexing strategy!
Ordered by Employee ID - natural ordering key for HR queries
Ordered by Transaction ID or Timestamp - audit trails and reporting
Ordered by Student ID - academic records and enrollment
Ordered by Product ID - stock tracking and ordering
Ordered by Timestamp - chronological event tracking
Primary index requires ordered data file. Random order won't work!
Block pointer size affects index entry size calculations. Use system-specific values.
Insertions and deletions require index updates. Plan for maintenance overhead.
Always use floor function for blocking factors - partial records don't fit!
Only ONE primary index per file (on ordering key). Use secondary indexes for other fields.
A university database has the following specifications:
Hint: Follow the same steps as our worked example!
Prepare for Next Class: Review primary index concepts and practice the calculation problems!
Dr. Mohsin Dar
Cloud & Software Operations Cluster
UPES