Query Optimization

Database Systems - Lecture 12

Unit II: Transaction Management

M.Tech - First Semester

Database Systems (MTech Semester I)

Dr. Mohsin Dar

Assistant Professor, Cloud & Software Operations Cluster

UPES

Slide 2

Lecture Outline

Introduction to Query Optimization
Cardinality Estimation
- Selection, Projection, Join Operations
- Set Operations and Aggregation
Histograms for Selectivity Estimation
Equivalences of Relational Algebra
Query Plans and Execution Strategies
Cost Estimation Models
Nested Query Optimization
Numerical Practice Problems

Slide 3

Introduction to Query Optimization

What is Query Optimization?

Query optimization is the process of selecting the most efficient execution plan for a given SQL query from multiple possible plans.

Goals of Query Optimization:

Minimize response time - Reduce query execution time
Minimize resource consumption - Reduce CPU, I/O, memory usage
Maximize throughput - Handle more queries concurrently

Query Optimization Process:

Parse SQL query
Generate query tree
Apply algebraic transformations
Enumerate possible plans

Estimate cost for each plan
Select optimal plan
Execute selected plan

Slide 4

Cardinality Estimation

Definition:

Cardinality refers to the number of tuples (rows) in the result of a relational operation.

Why is Cardinality Estimation Important?

Determines cost of query execution plans
Helps choose optimal join algorithms
Guides index selection decisions
Critical for cost-based optimization

Key Statistics Maintained:

Statistic	Symbol	Description
Cardinality	n_r	Number of tuples in relation r
Block Size	b_r	Number of blocks containing r
Tuple Size	l_r	Size of a tuple in relation r
Distinct Values	V(A,r)	Number of distinct values for attribute A

Slide 5

Cardinality Estimation: Selection (σ)

Selection Operation: σ_predicate(r)

Case 1: Equality Condition (A = v)

n_{σ_A=v(r)} = n_r / V(A,r)

Assumption: Values are uniformly distributed

Case 2: Comparison Condition (A ≤ v)

n_{σ_A≤v(r)} = n_r × (v - min(A,r)) / (max(A,r) - min(A,r))

Assumption: Values are uniformly distributed within range

Case 3: Conjunction (cond₁ ∧ cond₂)

n_{σ_c1∧c2(r)} = n_r × s₁ × s₂

where s_i is the selectivity factor of condition i

Assumption: Conditions are independent

Case 4: Disjunction (cond₁ ∨ cond₂)

n_{σ_c1∨c2(r)} = n_r × (1 - (1-s₁) × (1-s₂))

Slide 6

Cardinality Estimation: Join Operations

Natural Join: r ⋈ s

Case 1: Join on Key of r (R ∩ S = primary key of r)

n_r⋈s ≤ n_s

Each tuple in s joins with at most one tuple in r

Case 2: Foreign Key Join (R ∩ S is foreign key from s to r)

n_r⋈s = n_s

Each tuple in s joins with exactly one tuple in r

Case 3: General Case (No key constraints)

n_r⋈s = (n_r × n_s) / max(V(A,r), V(A,s))

where A is the common attribute(s)

Worst Case (Cartesian Product): n_r × n_s

Example:

If Student has 1000 tuples, Enrollment has 5000 tuples, and V(student_id, Student) = 1000:

n_{Student⋈Enrollment} = (1000 × 5000) / 1000 = 5000

Slide 7

Cardinality Estimation: Other Operations

Projection: π_A(r)

n_{π_A(r)} ≤ min(V(A,r), n_r)

If A is a key: n_{π_A(r)} = n_r

If duplicates removed: n_{π_A(r)} = V(A,r)

Set Operations:

Operation	Cardinality Estimate
Union: r ∪ s	n_r + n_s (upper bound)
Intersection: r ∩ s	min(n_r, n_s)
Difference: r - s	n_r (upper bound)

Aggregation: _Aγ_F(B)(r)

Group by A with aggregate function F on B

n_{_Aγ_F(B)(r)} = V(A,r)

Result has one tuple per distinct value of grouping attributes

Slide 8

Histograms for Selectivity Estimation

Why Histograms?

The uniform distribution assumption is often incorrect in real data. Histograms provide more accurate estimates.

Types of Histograms:

Equi-width Histogram: Divide domain into equal-sized buckets
Equi-depth (Equi-height) Histogram: Each bucket contains approximately same number of tuples
Compressed Histogram: Store only non-empty buckets
End-biased Histogram: Separate treatment for high-frequency values

Equi-depth Histogram Example:

Bucket	Range	Count	Distinct Values
1	[1-25]	500	20
2	[26-45]	500	15
3	[46-75]	500	25
4	[76-100]	500	18

Usage: For query σ_age=30(Person), look up bucket 2: estimate ≈ 500/15 ≈ 33 tuples

Slide 9

Equivalences of Relational Algebra (Part 1)

Why Equivalences Matter?

Multiple equivalent expressions can produce the same result with different costs. Query optimizers use these rules to transform queries.

1. Selection Operations:

Cascade: σ_c1∧c2(r) ≡ σ_c1(σ_c2(r))

Commutativity: σ_c1(σ_c2(r)) ≡ σ_c2(σ_c1(r))

2. Projection Operations:

Cascade: π_L1(π_L2(...π_Ln(r))) ≡ π_L1(r)

if L1 ⊆ L2 ⊆ ... ⊆ Ln

3. Join Operations:

Commutativity: r ⋈ s ≡ s ⋈ r

Associativity: (r ⋈ s) ⋈ t ≡ r ⋈ (s ⋈ t)

These properties allow reordering joins for efficiency

Slide 10

Equivalences of Relational Algebra (Part 2)

4. Selection with Join (Push Selection):

σ_c(r ⋈ s) ≡ σ_c(r) ⋈ s

if condition c involves only attributes of r

Benefit: Reduces intermediate result size

5. Projection with Join (Push Projection):

π_L(r ⋈_c s) ≡ π_L(π_L1(r) ⋈_c π_L2(s))

where L1 contains attributes from r needed for join and output

and L2 contains attributes from s needed for join and output

6. Set Operations:

Union: r ∪ s ≡ s ∪ r (commutative)

Union: (r ∪ s) ∪ t ≡ r ∪ (s ∪ t) (associative)

Intersection: r ∩ s ≡ s ∩ r (commutative)

Selection with Union: σ_c(r ∪ s) ≡ σ_c(r) ∪ σ_c(s)

Slide 11

Query Plans

What is a Query Plan?

A query plan (or query execution plan) is a tree-like data structure representing the sequence of operations to execute a query.

Components of a Query Plan:

Operators: Selection, Projection, Join, etc.
Access Methods: Sequential scan, index scan, etc.
Join Algorithms: Nested loop, hash join, merge join
Data Flow: Direction from leaves to root

Example Query:

SELECT S.name, E.grade
FROM Student S, Enrollment E
WHERE S.student_id = E.student_id
  AND S.dept = 'CS'
  AND E.grade > 3.5;
            

Two Possible Query Plans:

Plan A (Less Efficient):

Cartesian Product: Student × Enrollment
Selection: σ_conditions
Projection: π_name,grade

Plan B (More Efficient):

Selection: σ_dept='CS'(Student)
Selection: σ_grade>3.5(Enrollment)
Join: result1 ⋈ result2
Projection: π_name,grade

Slide 12

Cost Estimation

What is Cost?

Cost is a measure of resources consumed by a query execution plan.

Cost Factors:

I/O Cost: Number of disk block transfers (dominant factor)
CPU Cost: Time for in-memory operations
Memory Cost: Buffer space required
Network Cost: Data transfer in distributed systems

Simplified Model: Cost = b_T × t_T + b_S × t_S

where b_T = blocks transferred, t_T = transfer time per block,

b_S = seeks, t_S = seek time

Cost Model Assumptions:

Relations stored on disk
Limited buffer space available
I/O dominates cost (CPU cost often ignored)
Sequential I/O cheaper than random I/O

Slide 13

Cost Estimation for Operations

Selection: σ_A=v(r)

Method	Cost (blocks)
Linear Search	b_r
Binary Search (sorted)	⌈log₂(b_r)⌉ + ⌈n_σ/b_fr⌉ - 1
Primary Index (key)	h_i + 1
Secondary Index (key)	h_i + 1
Secondary Index (non-key)	h_i + n_σ

where h_i = index height, b_fr = blocking factor

Join: r ⋈ s (Cost depends on algorithm)

Algorithm	Cost (blocks)
Nested Loop	b_r + (n_r × b_s)
Block Nested Loop	b_r + (b_r × b_s)
Indexed Nested Loop	b_r + (n_r × c)
Merge Join	b_r + b_s + sort cost
Hash Join	3(b_r + b_s)

where c = cost to access s using index per tuple of r

Slide 14

Nested Queries

What are Nested Queries?

A nested query (or subquery) is a query within another query, typically appearing in the WHERE or HAVING clause.

Types of Nested Queries:

1. Uncorrelated (Independent) Subquery:

SELECT name FROM Student
WHERE dept_id IN (
    SELECT dept_id FROM Department
    WHERE location = 'Delhi'
);
                

Inner query executes once, independent of outer query

2. Correlated Subquery:

SELECT name FROM Student S
WHERE salary > (
    SELECT AVG(salary) FROM Student
    WHERE dept = S.dept
);
                

Inner query references outer query variables; executes for each outer tuple

Slide 15

Nested Query Optimization

Challenges:

Correlated subqueries can be very expensive (O(n²) complexity)
Nested queries may prevent use of efficient join algorithms
Optimizer has limited visibility into nested structure

Optimization Strategies:

1. Subquery Decorrelation (Unnesting):

Transform correlated subquery into a join

Before (Correlated):

SELECT S.name FROM Student S
WHERE EXISTS (
    SELECT * FROM Enrollment E
    WHERE E.student_id = S.student_id
      AND E.grade > 3.5
);
                

After (Join):

SELECT DISTINCT S.name
FROM Student S, Enrollment E
WHERE S.student_id = E.student_id
  AND E.grade > 3.5;
                

Slide 16

Nested Query Optimization (Continued)

2. Materialization:

Execute inner query once, store result, use for all outer tuples

Works well for uncorrelated subqueries
Reduces repeated computation
Requires space for materialized result

3. Semi-Join Execution:

For EXISTS/IN queries, stop searching inner relation once match found

4. Index Usage:

Create temporary index on subquery result if used multiple times

Example Transformation:

Query with NOT IN:

SELECT name FROM Student
WHERE student_id NOT IN (
    SELECT student_id FROM Enrollment
);
            

Optimized as Anti-Join:

SELECT S.name
FROM Student S LEFT OUTER JOIN Enrollment E
  ON S.student_id = E.student_id
WHERE E.student_id IS NULL;
            

Slide 17

Complete Optimization Example

Given Query:

SELECT P.name, D.dept_name
FROM Professor P, Department D, Course C
WHERE P.dept_id = D.dept_id
  AND P.prof_id = C.prof_id
  AND D.location = 'Delhi'
  AND C.credits > 3;
            

Step 1: Initial Query Tree (Cartesian Products)

π_{P.name, D.dept_name}
σ_{P.dept_id=D.dept_id ∧ P.prof_id=C.prof_id ∧ D.location='Delhi' ∧ C.credits>3}
Professor × Department × Course

Cost: Very high due to Cartesian products

Step 2: Push Selections Down

Apply σ_{location='Delhi'} to Department
Apply σ_credits>3 to Course
Reduces size of intermediate results

Slide 18

Complete Optimization Example (Continued)

Step 3: Replace Cartesian Products with Joins

Transform to:

π_{P.name, D.dept_name}
(σ_{location='Delhi'}(Department)) ⋈_{dept_id} Professor ⋈_{prof_id} (σ_credits>3(Course))

Step 4: Choose Join Order

Statistics:

Department: 50 tuples, Delhi filter → 5 tuples
Professor: 500 tuples
Course: 1000 tuples, credits filter → 400 tuples

Option A: (Department ⋈ Professor) ⋈ Course

Cost: ~5×500 + result×400 ≈ 2500 + 50×400 = 22,500

Option B: (Professor ⋈ Course) ⋈ Department

Cost: ~500×400 + result×5 = 200,000 + (large)×5

Best Choice: Option A (start with smallest relation)

Slide 19

Join Ordering: Dynamic Programming

Problem:

For n relations, there are O((2n)!/(n!)) possible join orders. For 10 relations: ~17 billion possibilities!

System-R Style Dynamic Programming Algorithm:

Step 1: Consider all single relations (base case)
Step 2: Consider all ways to join 2 relations
Step 3: Consider all ways to join 3 relations using results from Step 2
Continue... until all n relations joined

For each subset, keep only the cheapest plan

Complexity: O(3ⁿ) - much better than factorial!

Example: Join Order for {R, S, T}

Size	Subsets Considered	Best Plan Stored
1	{R}, {S}, {T}	3 plans
2	{R,S}, {R,T}, {S,T}	3 plans
3	{R,S,T}	1 plan

Slide 20

Heuristic-Based Optimization

Common Heuristics:

Perform selections as early as possible
- Reduces cardinality of intermediate results
Perform projections early
- Reduces tuple size
Perform most restrictive selections first
- Smallest intermediate results
Avoid Cartesian products when possible
- Replace with joins when feasible
Perform joins with smallest results first
- Left-deep trees preferred for pipelining
Use indexes when available
- Especially for selections and joins

Note: Heuristics provide good plans quickly but may not find optimal plan

Slide 21

Summary

Key Takeaways:

Cardinality Estimation: Critical for cost-based optimization; uses statistics and histograms for accuracy
Histograms: Provide better selectivity estimates than uniform distribution assumption
Relational Algebra Equivalences: Enable query transformations; push selections and projections down
Query Plans: Multiple equivalent plans with vastly different costs
Cost Estimation: I/O cost dominates; depends on access methods and join algorithms
Nested Queries: Can be optimized through decorrelation, materialization, or semi-joins
Join Ordering: Dynamic programming finds optimal order efficiently

Optimization Process:

Parse query into logical plan
Apply heuristic transformations
Generate physical plans with different algorithms
Estimate cost using statistics
Select cheapest plan for execution

Slide 22

Numerical Practice Problems

Database Schema for Problems:

Student(student_id, name, dept, age)

n_Student = 10,000 tuples
b_Student = 500 blocks
V(dept, Student) = 20 distinct departments
V(age, Student) = 50 distinct ages (18-67)

Enrollment(student_id, course_id, grade)

n_Enrollment = 50,000 tuples
b_Enrollment = 2000 blocks
V(course_id, Enrollment) = 500 distinct courses
V(grade, Enrollment) = 11 distinct grades (0.0-4.0 in 0.5 steps)

Course(course_id, title, credits, dept)

n_Course = 500 tuples
b_Course = 50 blocks
V(credits, Course) = 4 distinct values (1,2,3,4)

Slide 23

Problem 1: Cardinality Estimation (Selection)

Problem 1A:

Estimate the cardinality of: σ_dept='CS'(Student)

Solution:

Using uniform distribution assumption:

n_σ = n_Student / V(dept, Student) = 10,000 / 20 = 500 tuples

Problem 1B:

Estimate the cardinality of: σ_age≤25(Student)

Solution:

Age range: 18-67 (total range = 49)

Values ≤ 25: 18-25 (range = 7)

n_σ = n_Student × (7/49) = 10,000 × (7/49) ≈ 1,429 tuples

Problem 1C:

Estimate the cardinality of: σ_{dept='CS' ∧ age≤25}(Student)

Solution:

Assuming independence:

n_σ = n_Student × (1/20) × (7/49) = 10,000 × 0.05 × 0.143 ≈ 71 tuples

Slide 24

Problem 2: Cardinality Estimation (Join)

Problem 2A:

Estimate the cardinality of: Student ⋈_{student_id} Enrollment

(Assume student_id is a foreign key in Enrollment referencing Student)

Solution:

Foreign key join - each enrollment tuple joins with exactly one student:

n_join = n_Enrollment = 50,000 tuples

Problem 2B:

Estimate the cardinality of: Enrollment ⋈_{course_id} Course

(Assume course_id is a foreign key in Enrollment referencing Course)

Solution:

Foreign key join:

n_join = n_Enrollment = 50,000 tuples

Problem 2C:

If we join all three: Student ⋈ Enrollment ⋈ Course, what is the estimated cardinality?

Solution:

Since both joins are on foreign keys:

n_join = n_Enrollment = 50,000 tuples

Slide 25

Problem 3: Cost Estimation (Selection)

Problem 3A:

Calculate the cost of σ_dept='CS'(Student) using:

a) Linear search

b) Binary search (assume Student is sorted by dept)

Solution:

a) Linear Search:

Cost = b_Student = 500 block accesses

b) Binary Search:

Expected result size: 500 tuples

Assuming 20 tuples per block (blocking factor = 20)

Cost = ⌈log₂(500)⌉ + ⌈500/20⌉ - 1 = 9 + 25 - 1 = 33 block accesses

Slide 26

Problem 4: Cost Estimation (Join)

Problem 4:

Calculate the cost of Student ⋈_{student_id} Enrollment using:

a) Nested loop join (Student as outer)

b) Block nested loop join (Student as outer, 5 buffer blocks available)

c) Hash join

Solution:

a) Nested Loop Join:

Cost = b_Student + (n_Student × b_Enrollment)
= 500 + (10,000 × 2,000) = 20,000,500 block accesses

b) Block Nested Loop Join:

With 5 buffers: 3 for outer, 1 for inner, 1 for output

Cost = b_Student + (⌈b_Student/3⌉ × b_Enrollment)
= 500 + (⌈500/3⌉ × 2,000) = 500 + (167 × 2,000) = 334,500 block accesses

c) Hash Join:

Cost = 3 × (b_Student + b_Enrollment)
= 3 × (500 + 2,000) = 7,500 block accesses

Slide 27

Problem 5: Query Plan Comparison

Problem 5:

Consider the query:

SELECT S.name, E.grade
FROM Student S, Enrollment E
WHERE S.student_id = E.student_id
  AND S.dept = 'CS'
  AND E.grade >= 3.5
                

Compare two plans:

Plan A: (Student × Enrollment) then select

Plan B: Select from each, then join

Solution:

Plan A Cost (Cartesian Product):

Intermediate result: 10,000 × 50,000 = 500,000,000 tuples

Cost = b_Student × b_Enrollment = 500 × 2,000 = 1,000,000 block accesses

Plus cost of selecting from huge intermediate result

Slide 28

Problem 5: Query Plan Comparison (Continued)

Plan B Cost (Select then Join):

Step 1: σ_dept='CS'(Student)

Result: ~500 tuples, ~25 blocks

Cost₁ = 500 blocks

Step 2: σ_grade≥3.5(Enrollment)

Result: Grades 3.5, 4.0 = 2 out of 11 values

Estimated: 50,000 × (2/11) ≈ 9,091 tuples, ~455 blocks

Cost₂ = 2,000 blocks

Step 3: Hash Join of results

Cost₃ = 3 × (25 + 455) = 1,440 blocks

Total Plan B Cost:

Total = 500 + 2,000 + 1,440 = 3,940 block accesses

Conclusion: Plan B is ~250× more efficient!

Slide 29

Problem 6: Histogram-Based Estimation

Problem 6:

Given an equi-depth histogram for Student.age:

Bucket	Age Range	Count	Distinct
1	18-22	4,000	5
2	23-27	3,000	5
3	28-40	2,000	13
4	41-67	1,000	27

Estimate the cardinality of σ_age=25(Student)

Solution:

Age 25 falls in Bucket 2 (23-27)

Using uniform distribution within the bucket:

n_σ = Count / Distinct = 3,000 / 5 = 600 tuples

Note: Compare with uniform assumption over entire range:

10,000 / 50 = 200 tuples (less accurate!)

Slide 30

Problem 7: Join Order Optimization

Problem 7:

Given three relations R, S, T with statistics:

Relation	Tuples	Blocks
R	1,000	100
S	5,000	250
T	10,000	500

Join conditions: R.a = S.a and S.b = T.b

V(a, R) = 100, V(a, S) = 100 (foreign key from R to S)

V(b, S) = 500, V(b, T) = 500 (foreign key from S to T)

Compare join orders using block nested loop join (3 buffers for outer):

a) (R ⋈ S) ⋈ T

b) (S ⋈ T) ⋈ R

Slide 31

Problem 7: Solution

Order A: (R ⋈ S) ⋈ T

Step 1: R ⋈ S (R outer, S inner)

Cardinality: 1,000 tuples (foreign key join)

Cost₁ = 100 + ⌈100/2⌉ × 250 = 100 + 50 × 250 = 12,600 blocks

Result blocks: assume 100 blocks

Step 2: (R ⋈ S) ⋈ T

Cardinality: 5,000 tuples (foreign key join from S to T)

Cost₂ = 100 + ⌈100/2⌉ × 500 = 100 + 50 × 500 = 25,100 blocks

Total Cost A: 12,600 + 25,100 = 37,700 blocks

Slide 32

Problem 7: Solution (Continued)

Order B: (S ⋈ T) ⋈ R

Step 1: S ⋈ T (S outer, T inner)

Cardinality: 5,000 tuples (foreign key join)

Cost₁ = 250 + ⌈250/2⌉ × 500 = 250 + 125 × 500 = 62,750 blocks

Result blocks: assume 250 blocks

Step 2: (S ⋈ T) ⋈ R

Cardinality: 5,000 tuples

Cost₂ = 250 + ⌈250/2⌉ × 100 = 250 + 125 × 100 = 12,750 blocks

Total Cost B: 62,750 + 12,750 = 75,500 blocks

Conclusion: Order A is approximately 2× better!

General Rule: Join smallest relations first to minimize intermediate results

Slide 33

Problem 8: Nested Query Transformation

Problem 8:

Transform the following correlated nested query into a join:

SELECT S.name, S.dept
FROM Student S
WHERE S.age > (
    SELECT AVG(age)
    FROM Student
    WHERE dept = S.dept
);
                

Analysis:

This is a correlated subquery (references outer S.dept)
Inner query executes once per outer tuple
Cost: O(n_Student²) if no optimization

Slide 34

Problem 8: Solution

Optimized Query (Using Join):

SELECT S.name, S.dept
FROM Student S,
     (SELECT dept, AVG(age) as avg_age
      FROM Student
      GROUP BY dept) AS DeptAvg
WHERE S.dept = DeptAvg.dept
  AND S.age > DeptAvg.avg_age;
                

Steps:

Compute department averages once (instead of per tuple)
Store in temporary table DeptAvg (20 tuples, 1 block)
Join Student with DeptAvg
Apply age comparison

Cost Analysis:

Step 1 - Compute averages: Scan Student + aggregate

Cost₁ = 500 blocks

Step 2 - Join + filter: Hash join

Cost₂ = 3 × (500 + 1) ≈ 1,503 blocks

Total: ~2,003 blocks vs. 500 × 10,000 = 5,000,000 blocks!

Slide 35

Problem 9: Multi-Attribute Selectivity

Problem 9:

Given the following statistics on Enrollment:

n_Enrollment = 50,000
V(course_id, Enrollment) = 500
V(grade, Enrollment) = 11
V(semester, Enrollment) = 8

Estimate cardinality for:

σ_{course_id='CS101' ∨ grade≥3.5}(Enrollment)

Solution:

First, calculate individual selectivities:

s₁ (course_id='CS101'):

s₁ = 1 / V(course_id, Enrollment) = 1/500 = 0.002

s₂ (grade≥3.5):

Grades: 0.0, 0.5, 1.0, ..., 4.0 (11 values)

Grades ≥ 3.5: {3.5, 4.0} = 2 values

s₂ = 2/11 ≈ 0.182

Slide 36

Problem 9: Solution (Continued)

For disjunction (OR), use formula:

s = 1 - (1 - s₁) × (1 - s₂)

Substituting values:

s = 1 - (1 - 0.002) × (1 - 0.182)
s = 1 - (0.998 × 0.818)
s = 1 - 0.816
s = 0.184

Therefore, estimated cardinality:

n_σ = 50,000 × 0.184 = 9,200 tuples

Interpretation: About 18.4% of enrollments match the condition

Note: If we had used simple addition (incorrect!):

s = s₁ + s₂ = 0.002 + 0.182 = 0.184

In this case, it gives the same result, but this is coincidental because s₁ is very small. For larger selectivities, the correct formula matters!

Slide 37

Problem 10: Comprehensive Query Optimization

Problem 10:

Optimize the following query and estimate the cost:

SELECT S.name, C.title, E.grade
FROM Student S, Enrollment E, Course C
WHERE S.student_id = E.student_id
  AND E.course_id = C.course_id
  AND S.dept = 'CS'
  AND C.credits = 4
  AND E.grade >= 3.0;
                

Given Statistics:

Student: 10,000 tuples, 500 blocks, V(dept)=20
Enrollment: 50,000 tuples, 2,000 blocks, V(grade)=11
Course: 500 tuples, 50 blocks, V(credits)=4
All joins are on foreign keys

Slide 38

Problem 10: Solution (Part 1)

Step 1: Apply Selections (Push Down)

σ_dept='CS'(Student):

n₁ = 10,000 / 20 = 500 tuples, ~25 blocks
Cost₁ = 500 blocks (linear scan)

σ_credits=4(Course):

n₂ = 500 / 4 = 125 tuples, ~13 blocks
Cost₂ = 50 blocks (linear scan)

σ_grade≥3.0(Enrollment):

Grades ≥3.0: {3.0, 3.5, 4.0} = 3 out of 11

n₃ = 50,000 × (3/11) ≈ 13,636 tuples, ~682 blocks
Cost₃ = 2,000 blocks (linear scan)

Total Selection Cost: 500 + 50 + 2,000 = 2,550 blocks

Slide 39

Problem 10: Solution (Part 2)

Step 2: Determine Join Order

We have three filtered relations:

Student': 500 tuples, 25 blocks
Course': 125 tuples, 13 blocks
Enrollment': 13,636 tuples, 682 blocks

Best Strategy: Join smallest relations first

Order: (Student' ⋈ Enrollment') ⋈ Course'

Step 3: First Join - Student' ⋈ Enrollment'

Using hash join:

Cost₄ = 3 × (25 + 682) = 3 × 707 = 2,121 blocks

Result cardinality: ~13,636 tuples (foreign key)

Result blocks: ~682 blocks

Slide 40

Problem 10: Solution (Part 3)

Step 4: Second Join - Result ⋈ Course'

Using hash join:

Cost₅ = 3 × (682 + 13) = 3 × 695 = 2,085 blocks

Final result cardinality:

Since each enrollment links to one course, and we filtered to credits=4:

n_final = 13,636 × (125/500) = 3,409 tuples

Step 5: Total Cost Summary

Operation	Cost (blocks)
Select Student	500
Select Course	50
Select Enrollment	2,000
Join Student' ⋈ Enrollment'	2,121
Join Result ⋈ Course'	2,085
TOTAL	6,756 blocks

Slide 41

Additional Practice Problems

Problem 11:

Explain why pushing selections down in a query tree generally improves performance. Give a counter-example where it might not help.

Problem 12:

Calculate the number of different join orders possible for a query with 5 relations. If evaluating each plan takes 1 microsecond, how long would exhaustive enumeration take?

Problem 13:

Given a relation R with n_R=1,000,000 and an equi-depth histogram with 10 buckets, what is the maximum and minimum number of tuples that could match σ_A=v(R) for any value v?

Problem 14:

Compare the cost of evaluating σ_{A=5 ∧ B=10}(R) using:
a) A single index on (A,B)
b) Two separate indexes on A and B with intersection
Assume: n_R=100,000, V(A)=100, V(B)=1000, index height=3

Slide 42

Key Formulas - Quick Reference

Cardinality Estimation:

Selection (equality): n_{σ_A=v(r)} = n_r / V(A,r)

Selection (range): n_{σ_A≤v(r)} = n_r × (v-min)/(max-min)

Conjunction: n_{σ_c1∧c2(r)} = n_r × s₁ × s₂

Disjunction: n_{σ_c1∨c2(r)} = n_r × (1-(1-s₁)(1-s₂))

Join (general): n_r⋈s = (n_r × n_s) / max(V(A,r), V(A,s))

Projection: n_{π_A(r)} ≤ min(V(A,r), n_r)

Aggregation: n_{_Aγ_F(B)(r)} = V(A,r)

Cost Estimation (I/O):

Linear scan: b_r

Nested loop join: b_r + (n_r × b_s)

Block nested loop: b_r + (b_r × b_s) / (M-2)

Hash join: 3(b_r + b_s)

Merge join: b_r + b_s + sort cost

Slide 43

References and Further Reading

Classic Papers:

Selinger et al. (1979) - "Access Path Selection in a Relational Database Management System" (System R)
Graefe & McKenna (1993) - "The Volcano Optimizer Generator"
Chaudhuri (1998) - "An Overview of Query Optimization in Relational Systems"

Questions?

Thank You!

Dr. Mohsin Dar

Cloud & Software Operations Cluster | UPES

Office Hours: As per schedule

Query Optimization

Database Systems - Lecture 12

Database Systems (MTech Semester I)

Dr. Mohsin Dar

Assistant Professor, Cloud & Software Operations Cluster

UPES

Lecture Outline

Introduction to Query Optimization

What is Query Optimization?

Goals of Query Optimization:

Query Optimization Process:

Cardinality Estimation

Definition:

Why is Cardinality Estimation Important?

Key Statistics Maintained:

Cardinality Estimation: Selection (σ)

Selection Operation: σpredicate(r)

Case 1: Equality Condition (A = v)

Case 2: Comparison Condition (A ≤ v)

Case 3: Conjunction (cond1 ∧ cond2)

Case 4: Disjunction (cond1 ∨ cond2)

Cardinality Estimation: Join Operations

Natural Join: r ⋈ s

Case 1: Join on Key of r (R ∩ S = primary key of r)

Case 2: Foreign Key Join (R ∩ S is foreign key from s to r)

Case 3: General Case (No key constraints)

Example:

Cardinality Estimation: Other Operations

Projection: πA(r)

Set Operations:

Aggregation: AγF(B)(r)

Histograms for Selectivity Estimation

Why Histograms?

Types of Histograms:

Equi-depth Histogram Example:

Equivalences of Relational Algebra (Part 1)

Why Equivalences Matter?

1. Selection Operations:

2. Projection Operations:

3. Join Operations:

Equivalences of Relational Algebra (Part 2)

4. Selection with Join (Push Selection):

5. Projection with Join (Push Projection):

6. Set Operations:

Query Plans

What is a Query Plan?

Components of a Query Plan:

Example Query:

Two Possible Query Plans:

Cost Estimation

What is Cost?

Cost Factors:

Cost Model Assumptions:

Cost Estimation for Operations

Selection: σA=v(r)

Join: r ⋈ s (Cost depends on algorithm)

Nested Queries

What are Nested Queries?

Types of Nested Queries:

Nested Query Optimization

Challenges:

Optimization Strategies:

Nested Query Optimization (Continued)

Example Transformation:

Complete Optimization Example

Given Query:

Step 1: Initial Query Tree (Cartesian Products)

Step 2: Push Selections Down

Complete Optimization Example (Continued)

Step 3: Replace Cartesian Products with Joins

Step 4: Choose Join Order

Join Ordering: Dynamic Programming

Problem:

System-R Style Dynamic Programming Algorithm:

Example: Join Order for {R, S, T}

Heuristic-Based Optimization

Common Heuristics:

Summary

Key Takeaways:

Optimization Process:

Selection Operation: σ_predicate(r)

Case 3: Conjunction (cond₁ ∧ cond₂)

Case 4: Disjunction (cond₁ ∨ cond₂)

Projection: π_A(r)

Aggregation: _Aγ_F(B)(r)

Selection: σ_A=v(r)