Query Optimization

Database Systems - Lecture 12

Unit II: Transaction Management
M.Tech - First Semester

Database Systems (MTech Semester I)

Dr. Mohsin Dar

Assistant Professor, Cloud & Software Operations Cluster

UPES

Slide 2

Lecture Outline

  1. Introduction to Query Optimization
  2. Cardinality Estimation
    • Selection, Projection, Join Operations
    • Set Operations and Aggregation
  3. Histograms for Selectivity Estimation
  4. Equivalences of Relational Algebra
  5. Query Plans and Execution Strategies
  6. Cost Estimation Models
  7. Nested Query Optimization
  8. Numerical Practice Problems
Slide 3

Introduction to Query Optimization

What is Query Optimization?

Query optimization is the process of selecting the most efficient execution plan for a given SQL query from multiple possible plans.

Goals of Query Optimization:

  • Minimize response time - Reduce query execution time
  • Minimize resource consumption - Reduce CPU, I/O, memory usage
  • Maximize throughput - Handle more queries concurrently

Query Optimization Process:

  1. Parse SQL query
  2. Generate query tree
  3. Apply algebraic transformations
  4. Enumerate possible plans
  1. Estimate cost for each plan
  2. Select optimal plan
  3. Execute selected plan
Slide 4

Cardinality Estimation

Definition:

Cardinality refers to the number of tuples (rows) in the result of a relational operation.

Why is Cardinality Estimation Important?

  • Determines cost of query execution plans
  • Helps choose optimal join algorithms
  • Guides index selection decisions
  • Critical for cost-based optimization

Key Statistics Maintained:

Statistic Symbol Description
Cardinality nr Number of tuples in relation r
Block Size br Number of blocks containing r
Tuple Size lr Size of a tuple in relation r
Distinct Values V(A,r) Number of distinct values for attribute A
Slide 5

Cardinality Estimation: Selection (σ)

Selection Operation: σpredicate(r)

Case 1: Equality Condition (A = v)

nσA=v(r) = nr / V(A,r)

Assumption: Values are uniformly distributed

Case 2: Comparison Condition (A ≤ v)

nσA≤v(r) = nr × (v - min(A,r)) / (max(A,r) - min(A,r))

Assumption: Values are uniformly distributed within range

Case 3: Conjunction (cond1 ∧ cond2)

nσc1∧c2(r) = nr × s1 × s2

where si is the selectivity factor of condition i

Assumption: Conditions are independent

Case 4: Disjunction (cond1 ∨ cond2)

nσc1∨c2(r) = nr × (1 - (1-s1) × (1-s2))
Slide 6

Cardinality Estimation: Join Operations

Natural Join: r ⋈ s

Case 1: Join on Key of r (R ∩ S = primary key of r)

nr⋈s ≤ ns

Each tuple in s joins with at most one tuple in r

Case 2: Foreign Key Join (R ∩ S is foreign key from s to r)

nr⋈s = ns

Each tuple in s joins with exactly one tuple in r

Case 3: General Case (No key constraints)

nr⋈s = (nr × ns) / max(V(A,r), V(A,s))

where A is the common attribute(s)

Worst Case (Cartesian Product): nr × ns

Example:

If Student has 1000 tuples, Enrollment has 5000 tuples, and V(student_id, Student) = 1000:

nStudent⋈Enrollment = (1000 × 5000) / 1000 = 5000
Slide 7

Cardinality Estimation: Other Operations

Projection: πA(r)

nπA(r) ≤ min(V(A,r), nr)

If A is a key: nπA(r) = nr

If duplicates removed: nπA(r) = V(A,r)

Set Operations:

Operation Cardinality Estimate
Union: r ∪ s nr + ns (upper bound)
Intersection: r ∩ s min(nr, ns)
Difference: r - s nr (upper bound)

Aggregation: AγF(B)(r)

Group by A with aggregate function F on B

nAγF(B)(r) = V(A,r)

Result has one tuple per distinct value of grouping attributes

Slide 8

Histograms for Selectivity Estimation

Why Histograms?

The uniform distribution assumption is often incorrect in real data. Histograms provide more accurate estimates.

Types of Histograms:

  • Equi-width Histogram: Divide domain into equal-sized buckets
  • Equi-depth (Equi-height) Histogram: Each bucket contains approximately same number of tuples
  • Compressed Histogram: Store only non-empty buckets
  • End-biased Histogram: Separate treatment for high-frequency values

Equi-depth Histogram Example:

Bucket Range Count Distinct Values
1 [1-25] 500 20
2 [26-45] 500 15
3 [46-75] 500 25
4 [76-100] 500 18

Usage: For query σage=30(Person), look up bucket 2: estimate ≈ 500/15 ≈ 33 tuples

Slide 9

Equivalences of Relational Algebra (Part 1)

Why Equivalences Matter?

Multiple equivalent expressions can produce the same result with different costs. Query optimizers use these rules to transform queries.

1. Selection Operations:

Cascade: σc1∧c2(r) ≡ σc1c2(r))

Commutativity: σc1c2(r)) ≡ σc2c1(r))

2. Projection Operations:

Cascade: πL1L2(...πLn(r))) ≡ πL1(r)

if L1 ⊆ L2 ⊆ ... ⊆ Ln

3. Join Operations:

Commutativity: r ⋈ s ≡ s ⋈ r

Associativity: (r ⋈ s) ⋈ t ≡ r ⋈ (s ⋈ t)

These properties allow reordering joins for efficiency

Slide 10

Equivalences of Relational Algebra (Part 2)

4. Selection with Join (Push Selection):

σc(r ⋈ s) ≡ σc(r) ⋈ s

if condition c involves only attributes of r

Benefit: Reduces intermediate result size

5. Projection with Join (Push Projection):

πL(r ⋈c s) ≡ πLL1(r) ⋈c πL2(s))

where L1 contains attributes from r needed for join and output

and L2 contains attributes from s needed for join and output

6. Set Operations:

Union: r ∪ s ≡ s ∪ r (commutative)

Union: (r ∪ s) ∪ t ≡ r ∪ (s ∪ t) (associative)

Intersection: r ∩ s ≡ s ∩ r (commutative)

Selection with Union: σc(r ∪ s) ≡ σc(r) ∪ σc(s)

Slide 11

Query Plans

What is a Query Plan?

A query plan (or query execution plan) is a tree-like data structure representing the sequence of operations to execute a query.

Components of a Query Plan:

  • Operators: Selection, Projection, Join, etc.
  • Access Methods: Sequential scan, index scan, etc.
  • Join Algorithms: Nested loop, hash join, merge join
  • Data Flow: Direction from leaves to root

Example Query:

SELECT S.name, E.grade FROM Student S, Enrollment E WHERE S.student_id = E.student_id AND S.dept = 'CS' AND E.grade > 3.5;

Two Possible Query Plans:

Plan A (Less Efficient):
  1. Cartesian Product: Student × Enrollment
  2. Selection: σconditions
  3. Projection: πname,grade
Plan B (More Efficient):
  1. Selection: σdept='CS'(Student)
  2. Selection: σgrade>3.5(Enrollment)
  3. Join: result1 ⋈ result2
  4. Projection: πname,grade
Slide 12

Cost Estimation

What is Cost?

Cost is a measure of resources consumed by a query execution plan.

Cost Factors:

  • I/O Cost: Number of disk block transfers (dominant factor)
  • CPU Cost: Time for in-memory operations
  • Memory Cost: Buffer space required
  • Network Cost: Data transfer in distributed systems

Simplified Model: Cost = bT × tT + bS × tS

where bT = blocks transferred, tT = transfer time per block,

bS = seeks, tS = seek time

Cost Model Assumptions:

Slide 13

Cost Estimation for Operations

Selection: σA=v(r)

Method Cost (blocks)
Linear Search br
Binary Search (sorted) ⌈log2(br)⌉ + ⌈nσ/bfr⌉ - 1
Primary Index (key) hi + 1
Secondary Index (key) hi + 1
Secondary Index (non-key) hi + nσ

where hi = index height, bfr = blocking factor

Join: r ⋈ s (Cost depends on algorithm)

Algorithm Cost (blocks)
Nested Loop br + (nr × bs)
Block Nested Loop br + (br × bs)
Indexed Nested Loop br + (nr × c)
Merge Join br + bs + sort cost
Hash Join 3(br + bs)

where c = cost to access s using index per tuple of r

Slide 14

Nested Queries

What are Nested Queries?

A nested query (or subquery) is a query within another query, typically appearing in the WHERE or HAVING clause.

Types of Nested Queries:

1. Uncorrelated (Independent) Subquery:

SELECT name FROM Student WHERE dept_id IN ( SELECT dept_id FROM Department WHERE location = 'Delhi' );

Inner query executes once, independent of outer query

2. Correlated Subquery:

SELECT name FROM Student S WHERE salary > ( SELECT AVG(salary) FROM Student WHERE dept = S.dept );

Inner query references outer query variables; executes for each outer tuple

Slide 15

Nested Query Optimization

Challenges:

Optimization Strategies:

1. Subquery Decorrelation (Unnesting):

Transform correlated subquery into a join

Before (Correlated):

SELECT S.name FROM Student S WHERE EXISTS ( SELECT * FROM Enrollment E WHERE E.student_id = S.student_id AND E.grade > 3.5 );

After (Join):

SELECT DISTINCT S.name FROM Student S, Enrollment E WHERE S.student_id = E.student_id AND E.grade > 3.5;
Slide 16

Nested Query Optimization (Continued)

2. Materialization:

Execute inner query once, store result, use for all outer tuples

  • Works well for uncorrelated subqueries
  • Reduces repeated computation
  • Requires space for materialized result

3. Semi-Join Execution:

For EXISTS/IN queries, stop searching inner relation once match found

4. Index Usage:

Create temporary index on subquery result if used multiple times

Example Transformation:

Query with NOT IN:

SELECT name FROM Student WHERE student_id NOT IN ( SELECT student_id FROM Enrollment );

Optimized as Anti-Join:

SELECT S.name FROM Student S LEFT OUTER JOIN Enrollment E ON S.student_id = E.student_id WHERE E.student_id IS NULL;
Slide 17

Complete Optimization Example

Given Query:

SELECT P.name, D.dept_name FROM Professor P, Department D, Course C WHERE P.dept_id = D.dept_id AND P.prof_id = C.prof_id AND D.location = 'Delhi' AND C.credits > 3;

Step 1: Initial Query Tree (Cartesian Products)

  • πP.name, D.dept_name
  • σP.dept_id=D.dept_id ∧ P.prof_id=C.prof_id ∧ D.location='Delhi' ∧ C.credits>3
  • Professor × Department × Course

Cost: Very high due to Cartesian products

Step 2: Push Selections Down

  • Apply σlocation='Delhi' to Department
  • Apply σcredits>3 to Course
  • Reduces size of intermediate results
Slide 18

Complete Optimization Example (Continued)

Step 3: Replace Cartesian Products with Joins

Transform to:

  • πP.name, D.dept_name
  • location='Delhi'(Department)) ⋈dept_id Professor ⋈prof_idcredits>3(Course))

Step 4: Choose Join Order

Statistics:

  • Department: 50 tuples, Delhi filter → 5 tuples
  • Professor: 500 tuples
  • Course: 1000 tuples, credits filter → 400 tuples

Option A: (Department ⋈ Professor) ⋈ Course

Cost: ~5×500 + result×400 ≈ 2500 + 50×400 = 22,500

Option B: (Professor ⋈ Course) ⋈ Department

Cost: ~500×400 + result×5 = 200,000 + (large)×5

Best Choice: Option A (start with smallest relation)

Slide 19

Join Ordering: Dynamic Programming

Problem:

For n relations, there are O((2n)!/(n!)) possible join orders. For 10 relations: ~17 billion possibilities!

System-R Style Dynamic Programming Algorithm:

  1. Step 1: Consider all single relations (base case)
  2. Step 2: Consider all ways to join 2 relations
  3. Step 3: Consider all ways to join 3 relations using results from Step 2
  4. Continue... until all n relations joined

For each subset, keep only the cheapest plan

Complexity: O(3n) - much better than factorial!

Example: Join Order for {R, S, T}

Size Subsets Considered Best Plan Stored
1 {R}, {S}, {T} 3 plans
2 {R,S}, {R,T}, {S,T} 3 plans
3 {R,S,T} 1 plan
Slide 20

Heuristic-Based Optimization

Common Heuristics:

  1. Perform selections as early as possible
    • Reduces cardinality of intermediate results
  2. Perform projections early
    • Reduces tuple size
  3. Perform most restrictive selections first
    • Smallest intermediate results
  4. Avoid Cartesian products when possible
    • Replace with joins when feasible
  5. Perform joins with smallest results first
    • Left-deep trees preferred for pipelining
  6. Use indexes when available
    • Especially for selections and joins

Note: Heuristics provide good plans quickly but may not find optimal plan

Slide 21

Summary

Key Takeaways:

  • Cardinality Estimation: Critical for cost-based optimization; uses statistics and histograms for accuracy
  • Histograms: Provide better selectivity estimates than uniform distribution assumption
  • Relational Algebra Equivalences: Enable query transformations; push selections and projections down
  • Query Plans: Multiple equivalent plans with vastly different costs
  • Cost Estimation: I/O cost dominates; depends on access methods and join algorithms
  • Nested Queries: Can be optimized through decorrelation, materialization, or semi-joins
  • Join Ordering: Dynamic programming finds optimal order efficiently

Optimization Process:

  1. Parse query into logical plan
  2. Apply heuristic transformations
  3. Generate physical plans with different algorithms
  4. Estimate cost using statistics
  5. Select cheapest plan for execution
Slide 22

Numerical Practice Problems

Database Schema for Problems:

Student(student_id, name, dept, age)

  • nStudent = 10,000 tuples
  • bStudent = 500 blocks
  • V(dept, Student) = 20 distinct departments
  • V(age, Student) = 50 distinct ages (18-67)

Enrollment(student_id, course_id, grade)

  • nEnrollment = 50,000 tuples
  • bEnrollment = 2000 blocks
  • V(course_id, Enrollment) = 500 distinct courses
  • V(grade, Enrollment) = 11 distinct grades (0.0-4.0 in 0.5 steps)

Course(course_id, title, credits, dept)

  • nCourse = 500 tuples
  • bCourse = 50 blocks
  • V(credits, Course) = 4 distinct values (1,2,3,4)
Slide 23

Problem 1: Cardinality Estimation (Selection)

Problem 1A:

Estimate the cardinality of: σdept='CS'(Student)

Solution:

Using uniform distribution assumption:

nσ = nStudent / V(dept, Student) = 10,000 / 20 = 500 tuples

Problem 1B:

Estimate the cardinality of: σage≤25(Student)

Solution:

Age range: 18-67 (total range = 49)

Values ≤ 25: 18-25 (range = 7)

nσ = nStudent × (7/49) = 10,000 × (7/49) ≈ 1,429 tuples

Problem 1C:

Estimate the cardinality of: σdept='CS' ∧ age≤25(Student)

Solution:

Assuming independence:

nσ = nStudent × (1/20) × (7/49) = 10,000 × 0.05 × 0.143 ≈ 71 tuples
Slide 24

Problem 2: Cardinality Estimation (Join)

Problem 2A:

Estimate the cardinality of: Student ⋈student_id Enrollment

(Assume student_id is a foreign key in Enrollment referencing Student)

Solution:

Foreign key join - each enrollment tuple joins with exactly one student:

njoin = nEnrollment = 50,000 tuples

Problem 2B:

Estimate the cardinality of: Enrollment ⋈course_id Course

(Assume course_id is a foreign key in Enrollment referencing Course)

Solution:

Foreign key join:

njoin = nEnrollment = 50,000 tuples

Problem 2C:

If we join all three: Student ⋈ Enrollment ⋈ Course, what is the estimated cardinality?

Solution:

Since both joins are on foreign keys:

njoin = nEnrollment = 50,000 tuples
Slide 25

Problem 3: Cost Estimation (Selection)

Problem 3A:

Calculate the cost of σdept='CS'(Student) using:

a) Linear search

b) Binary search (assume Student is sorted by dept)

Solution:

a) Linear Search:

Cost = bStudent = 500 block accesses

b) Binary Search:

Expected result size: 500 tuples

Assuming 20 tuples per block (blocking factor = 20)

Cost = ⌈log2(500)⌉ + ⌈500/20⌉ - 1 = 9 + 25 - 1 = 33 block accesses
Slide 26

Problem 4: Cost Estimation (Join)

Problem 4:

Calculate the cost of Student ⋈student_id Enrollment using:

a) Nested loop join (Student as outer)

b) Block nested loop join (Student as outer, 5 buffer blocks available)

c) Hash join

Solution:

a) Nested Loop Join:

Cost = bStudent + (nStudent × bEnrollment)
= 500 + (10,000 × 2,000) = 20,000,500 block accesses

b) Block Nested Loop Join:

With 5 buffers: 3 for outer, 1 for inner, 1 for output

Cost = bStudent + (⌈bStudent/3⌉ × bEnrollment)
= 500 + (⌈500/3⌉ × 2,000) = 500 + (167 × 2,000) = 334,500 block accesses

c) Hash Join:

Cost = 3 × (bStudent + bEnrollment)
= 3 × (500 + 2,000) = 7,500 block accesses
Slide 27

Problem 5: Query Plan Comparison

Problem 5:

Consider the query:

SELECT S.name, E.grade FROM Student S, Enrollment E WHERE S.student_id = E.student_id AND S.dept = 'CS' AND E.grade >= 3.5

Compare two plans:

Plan A: (Student × Enrollment) then select

Plan B: Select from each, then join

Solution:

Plan A Cost (Cartesian Product):

Intermediate result: 10,000 × 50,000 = 500,000,000 tuples

Cost = bStudent × bEnrollment = 500 × 2,000 = 1,000,000 block accesses

Plus cost of selecting from huge intermediate result

Slide 28

Problem 5: Query Plan Comparison (Continued)

Plan B Cost (Select then Join):

Step 1: σdept='CS'(Student)

Result: ~500 tuples, ~25 blocks

Cost1 = 500 blocks

Step 2: σgrade≥3.5(Enrollment)

Result: Grades 3.5, 4.0 = 2 out of 11 values

Estimated: 50,000 × (2/11) ≈ 9,091 tuples, ~455 blocks

Cost2 = 2,000 blocks

Step 3: Hash Join of results

Cost3 = 3 × (25 + 455) = 1,440 blocks

Total Plan B Cost:

Total = 500 + 2,000 + 1,440 = 3,940 block accesses

Conclusion: Plan B is ~250× more efficient!

Slide 29

Problem 6: Histogram-Based Estimation

Problem 6:

Given an equi-depth histogram for Student.age:

Bucket Age Range Count Distinct
1 18-22 4,000 5
2 23-27 3,000 5
3 28-40 2,000 13
4 41-67 1,000 27

Estimate the cardinality of σage=25(Student)

Solution:

Age 25 falls in Bucket 2 (23-27)

Using uniform distribution within the bucket:

nσ = Count / Distinct = 3,000 / 5 = 600 tuples

Note: Compare with uniform assumption over entire range:

10,000 / 50 = 200 tuples (less accurate!)
Slide 30

Problem 7: Join Order Optimization

Problem 7:

Given three relations R, S, T with statistics:

Relation Tuples Blocks
R 1,000 100
S 5,000 250
T 10,000 500

Join conditions: R.a = S.a and S.b = T.b

V(a, R) = 100, V(a, S) = 100 (foreign key from R to S)

V(b, S) = 500, V(b, T) = 500 (foreign key from S to T)

Compare join orders using block nested loop join (3 buffers for outer):

a) (R ⋈ S) ⋈ T

b) (S ⋈ T) ⋈ R

Slide 31

Problem 7: Solution

Order A: (R ⋈ S) ⋈ T

Step 1: R ⋈ S (R outer, S inner)

Cardinality: 1,000 tuples (foreign key join)

Cost1 = 100 + ⌈100/2⌉ × 250 = 100 + 50 × 250 = 12,600 blocks

Result blocks: assume 100 blocks

Step 2: (R ⋈ S) ⋈ T

Cardinality: 5,000 tuples (foreign key join from S to T)

Cost2 = 100 + ⌈100/2⌉ × 500 = 100 + 50 × 500 = 25,100 blocks

Total Cost A: 12,600 + 25,100 = 37,700 blocks

Slide 32

Problem 7: Solution (Continued)

Order B: (S ⋈ T) ⋈ R

Step 1: S ⋈ T (S outer, T inner)

Cardinality: 5,000 tuples (foreign key join)

Cost1 = 250 + ⌈250/2⌉ × 500 = 250 + 125 × 500 = 62,750 blocks

Result blocks: assume 250 blocks

Step 2: (S ⋈ T) ⋈ R

Cardinality: 5,000 tuples

Cost2 = 250 + ⌈250/2⌉ × 100 = 250 + 125 × 100 = 12,750 blocks

Total Cost B: 62,750 + 12,750 = 75,500 blocks

Conclusion: Order A is approximately 2× better!

General Rule: Join smallest relations first to minimize intermediate results

Slide 33

Problem 8: Nested Query Transformation

Problem 8:

Transform the following correlated nested query into a join:

SELECT S.name, S.dept FROM Student S WHERE S.age > ( SELECT AVG(age) FROM Student WHERE dept = S.dept );

Analysis:

  • This is a correlated subquery (references outer S.dept)
  • Inner query executes once per outer tuple
  • Cost: O(nStudent2) if no optimization
Slide 34

Problem 8: Solution

Optimized Query (Using Join):

SELECT S.name, S.dept FROM Student S, (SELECT dept, AVG(age) as avg_age FROM Student GROUP BY dept) AS DeptAvg WHERE S.dept = DeptAvg.dept AND S.age > DeptAvg.avg_age;

Steps:

  1. Compute department averages once (instead of per tuple)
  2. Store in temporary table DeptAvg (20 tuples, 1 block)
  3. Join Student with DeptAvg
  4. Apply age comparison

Cost Analysis:

Step 1 - Compute averages: Scan Student + aggregate

Cost1 = 500 blocks

Step 2 - Join + filter: Hash join

Cost2 = 3 × (500 + 1) ≈ 1,503 blocks

Total: ~2,003 blocks vs. 500 × 10,000 = 5,000,000 blocks!

Slide 35

Problem 9: Multi-Attribute Selectivity

Problem 9:

Given the following statistics on Enrollment:

  • nEnrollment = 50,000
  • V(course_id, Enrollment) = 500
  • V(grade, Enrollment) = 11
  • V(semester, Enrollment) = 8

Estimate cardinality for:

σcourse_id='CS101' ∨ grade≥3.5(Enrollment)

Solution:

First, calculate individual selectivities:

s1 (course_id='CS101'):

s1 = 1 / V(course_id, Enrollment) = 1/500 = 0.002

s2 (grade≥3.5):

Grades: 0.0, 0.5, 1.0, ..., 4.0 (11 values)

Grades ≥ 3.5: {3.5, 4.0} = 2 values

s2 = 2/11 ≈ 0.182
Slide 36

Problem 9: Solution (Continued)

For disjunction (OR), use formula:

s = 1 - (1 - s1) × (1 - s2)

Substituting values:

s = 1 - (1 - 0.002) × (1 - 0.182)
s = 1 - (0.998 × 0.818)
s = 1 - 0.816
s = 0.184

Therefore, estimated cardinality:

nσ = 50,000 × 0.184 = 9,200 tuples

Interpretation: About 18.4% of enrollments match the condition

Note: If we had used simple addition (incorrect!):

s = s1 + s2 = 0.002 + 0.182 = 0.184

In this case, it gives the same result, but this is coincidental because s1 is very small. For larger selectivities, the correct formula matters!

Slide 37

Problem 10: Comprehensive Query Optimization

Problem 10:

Optimize the following query and estimate the cost:

SELECT S.name, C.title, E.grade FROM Student S, Enrollment E, Course C WHERE S.student_id = E.student_id AND E.course_id = C.course_id AND S.dept = 'CS' AND C.credits = 4 AND E.grade >= 3.0;

Given Statistics:

  • Student: 10,000 tuples, 500 blocks, V(dept)=20
  • Enrollment: 50,000 tuples, 2,000 blocks, V(grade)=11
  • Course: 500 tuples, 50 blocks, V(credits)=4
  • All joins are on foreign keys
Slide 38

Problem 10: Solution (Part 1)

Step 1: Apply Selections (Push Down)

σdept='CS'(Student):

n1 = 10,000 / 20 = 500 tuples, ~25 blocks
Cost1 = 500 blocks (linear scan)

σcredits=4(Course):

n2 = 500 / 4 = 125 tuples, ~13 blocks
Cost2 = 50 blocks (linear scan)

σgrade≥3.0(Enrollment):

Grades ≥3.0: {3.0, 3.5, 4.0} = 3 out of 11

n3 = 50,000 × (3/11) ≈ 13,636 tuples, ~682 blocks
Cost3 = 2,000 blocks (linear scan)

Total Selection Cost: 500 + 50 + 2,000 = 2,550 blocks

Slide 39

Problem 10: Solution (Part 2)

Step 2: Determine Join Order

We have three filtered relations:

  • Student': 500 tuples, 25 blocks
  • Course': 125 tuples, 13 blocks
  • Enrollment': 13,636 tuples, 682 blocks

Best Strategy: Join smallest relations first

Order: (Student' ⋈ Enrollment') ⋈ Course'

Step 3: First Join - Student' ⋈ Enrollment'

Using hash join:

Cost4 = 3 × (25 + 682) = 3 × 707 = 2,121 blocks

Result cardinality: ~13,636 tuples (foreign key)

Result blocks: ~682 blocks

Slide 40

Problem 10: Solution (Part 3)

Step 4: Second Join - Result ⋈ Course'

Using hash join:

Cost5 = 3 × (682 + 13) = 3 × 695 = 2,085 blocks

Final result cardinality:

Since each enrollment links to one course, and we filtered to credits=4:

nfinal = 13,636 × (125/500) = 3,409 tuples

Step 5: Total Cost Summary

Operation Cost (blocks)
Select Student 500
Select Course 50
Select Enrollment 2,000
Join Student' ⋈ Enrollment' 2,121
Join Result ⋈ Course' 2,085
TOTAL 6,756 blocks
Slide 41

Additional Practice Problems

Problem 11:

Explain why pushing selections down in a query tree generally improves performance. Give a counter-example where it might not help.

Problem 12:

Calculate the number of different join orders possible for a query with 5 relations. If evaluating each plan takes 1 microsecond, how long would exhaustive enumeration take?

Problem 13:

Given a relation R with nR=1,000,000 and an equi-depth histogram with 10 buckets, what is the maximum and minimum number of tuples that could match σA=v(R) for any value v?

Problem 14:

Compare the cost of evaluating σA=5 ∧ B=10(R) using:
a) A single index on (A,B)
b) Two separate indexes on A and B with intersection
Assume: nR=100,000, V(A)=100, V(B)=1000, index height=3

Slide 42

Key Formulas - Quick Reference

Cardinality Estimation:

Selection (equality): nσA=v(r) = nr / V(A,r)

Selection (range): nσA≤v(r) = nr × (v-min)/(max-min)

Conjunction: nσc1∧c2(r) = nr × s1 × s2

Disjunction: nσc1∨c2(r) = nr × (1-(1-s1)(1-s2))

Join (general): nr⋈s = (nr × ns) / max(V(A,r), V(A,s))

Projection: nπA(r) ≤ min(V(A,r), nr)

Aggregation: nAγF(B)(r) = V(A,r)

Cost Estimation (I/O):

Linear scan: br

Nested loop join: br + (nr × bs)

Block nested loop: br + (br × bs) / (M-2)

Hash join: 3(br + bs)

Merge join: br + bs + sort cost

Slide 43

References and Further Reading

Recommended Textbooks:

  1. Silberschatz, Korth, Sudarshan - "Database System Concepts" (Chapter 13-14)
  2. Ramakrishnan & Gehrke - "Database Management Systems" (Chapter 12-15)
  3. Elmasri & Navathe - "Fundamentals of Database Systems" (Chapter 18-19)

Classic Papers:

  • Selinger et al. (1979) - "Access Path Selection in a Relational Database Management System" (System R)
  • Graefe & McKenna (1993) - "The Volcano Optimizer Generator"
  • Chaudhuri (1998) - "An Overview of Query Optimization in Relational Systems"

Questions?

Thank You!

Dr. Mohsin Dar

Cloud & Software Operations Cluster | UPES

Office Hours: As per schedule