Query optimization is the process of selecting the most efficient execution plan for a given SQL query from multiple possible plans.
Cardinality refers to the number of tuples (rows) in the result of a relational operation.
| Statistic | Symbol | Description |
|---|---|---|
| Cardinality | nr | Number of tuples in relation r |
| Block Size | br | Number of blocks containing r |
| Tuple Size | lr | Size of a tuple in relation r |
| Distinct Values | V(A,r) | Number of distinct values for attribute A |
Assumption: Values are uniformly distributed
Assumption: Values are uniformly distributed within range
where si is the selectivity factor of condition i
Assumption: Conditions are independent
Each tuple in s joins with at most one tuple in r
Each tuple in s joins with exactly one tuple in r
where A is the common attribute(s)
Worst Case (Cartesian Product): nr × ns
If Student has 1000 tuples, Enrollment has 5000 tuples, and V(student_id, Student) = 1000:
If A is a key: nπA(r) = nr
If duplicates removed: nπA(r) = V(A,r)
| Operation | Cardinality Estimate |
|---|---|
| Union: r ∪ s | nr + ns (upper bound) |
| Intersection: r ∩ s | min(nr, ns) |
| Difference: r - s | nr (upper bound) |
Group by A with aggregate function F on B
Result has one tuple per distinct value of grouping attributes
The uniform distribution assumption is often incorrect in real data. Histograms provide more accurate estimates.
| Bucket | Range | Count | Distinct Values |
|---|---|---|---|
| 1 | [1-25] | 500 | 20 |
| 2 | [26-45] | 500 | 15 |
| 3 | [46-75] | 500 | 25 |
| 4 | [76-100] | 500 | 18 |
Usage: For query σage=30(Person), look up bucket 2: estimate ≈ 500/15 ≈ 33 tuples
Multiple equivalent expressions can produce the same result with different costs. Query optimizers use these rules to transform queries.
Cascade: σc1∧c2(r) ≡ σc1(σc2(r))
Commutativity: σc1(σc2(r)) ≡ σc2(σc1(r))
Cascade: πL1(πL2(...πLn(r))) ≡ πL1(r)
if L1 ⊆ L2 ⊆ ... ⊆ Ln
Commutativity: r ⋈ s ≡ s ⋈ r
Associativity: (r ⋈ s) ⋈ t ≡ r ⋈ (s ⋈ t)
These properties allow reordering joins for efficiency
σc(r ⋈ s) ≡ σc(r) ⋈ s
if condition c involves only attributes of r
Benefit: Reduces intermediate result size
πL(r ⋈c s) ≡ πL(πL1(r) ⋈c πL2(s))
where L1 contains attributes from r needed for join and output
and L2 contains attributes from s needed for join and output
Union: r ∪ s ≡ s ∪ r (commutative)
Union: (r ∪ s) ∪ t ≡ r ∪ (s ∪ t) (associative)
Intersection: r ∩ s ≡ s ∩ r (commutative)
Selection with Union: σc(r ∪ s) ≡ σc(r) ∪ σc(s)
A query plan (or query execution plan) is a tree-like data structure representing the sequence of operations to execute a query.
Cost is a measure of resources consumed by a query execution plan.
Simplified Model: Cost = bT × tT + bS × tS
where bT = blocks transferred, tT = transfer time per block,
bS = seeks, tS = seek time
| Method | Cost (blocks) |
|---|---|
| Linear Search | br |
| Binary Search (sorted) | ⌈log2(br)⌉ + ⌈nσ/bfr⌉ - 1 |
| Primary Index (key) | hi + 1 |
| Secondary Index (key) | hi + 1 |
| Secondary Index (non-key) | hi + nσ |
where hi = index height, bfr = blocking factor
| Algorithm | Cost (blocks) |
|---|---|
| Nested Loop | br + (nr × bs) |
| Block Nested Loop | br + (br × bs) |
| Indexed Nested Loop | br + (nr × c) |
| Merge Join | br + bs + sort cost |
| Hash Join | 3(br + bs) |
where c = cost to access s using index per tuple of r
A nested query (or subquery) is a query within another query, typically appearing in the WHERE or HAVING clause.
1. Uncorrelated (Independent) Subquery:
Inner query executes once, independent of outer query
2. Correlated Subquery:
Inner query references outer query variables; executes for each outer tuple
1. Subquery Decorrelation (Unnesting):
Transform correlated subquery into a join
Before (Correlated):
After (Join):
2. Materialization:
Execute inner query once, store result, use for all outer tuples
3. Semi-Join Execution:
For EXISTS/IN queries, stop searching inner relation once match found
4. Index Usage:
Create temporary index on subquery result if used multiple times
Query with NOT IN:
Optimized as Anti-Join:
Cost: Very high due to Cartesian products
Transform to:
Statistics:
Option A: (Department ⋈ Professor) ⋈ Course
Cost: ~5×500 + result×400 ≈ 2500 + 50×400 = 22,500
Option B: (Professor ⋈ Course) ⋈ Department
Cost: ~500×400 + result×5 = 200,000 + (large)×5
Best Choice: Option A (start with smallest relation)
For n relations, there are O((2n)!/(n!)) possible join orders. For 10 relations: ~17 billion possibilities!
For each subset, keep only the cheapest plan
Complexity: O(3n) - much better than factorial!
| Size | Subsets Considered | Best Plan Stored |
|---|---|---|
| 1 | {R}, {S}, {T} | 3 plans |
| 2 | {R,S}, {R,T}, {S,T} | 3 plans |
| 3 | {R,S,T} | 1 plan |
Note: Heuristics provide good plans quickly but may not find optimal plan
Student(student_id, name, dept, age)
Enrollment(student_id, course_id, grade)
Course(course_id, title, credits, dept)
Estimate the cardinality of: σdept='CS'(Student)
Solution:
Using uniform distribution assumption:
Estimate the cardinality of: σage≤25(Student)
Solution:
Age range: 18-67 (total range = 49)
Values ≤ 25: 18-25 (range = 7)
Estimate the cardinality of: σdept='CS' ∧ age≤25(Student)
Solution:
Assuming independence:
Estimate the cardinality of: Student ⋈student_id Enrollment
(Assume student_id is a foreign key in Enrollment referencing Student)
Solution:
Foreign key join - each enrollment tuple joins with exactly one student:
Estimate the cardinality of: Enrollment ⋈course_id Course
(Assume course_id is a foreign key in Enrollment referencing Course)
Solution:
Foreign key join:
If we join all three: Student ⋈ Enrollment ⋈ Course, what is the estimated cardinality?
Solution:
Since both joins are on foreign keys:
Calculate the cost of σdept='CS'(Student) using:
a) Linear search
b) Binary search (assume Student is sorted by dept)
Solution:
a) Linear Search:
b) Binary Search:
Expected result size: 500 tuples
Assuming 20 tuples per block (blocking factor = 20)
Calculate the cost of Student ⋈student_id Enrollment using:
a) Nested loop join (Student as outer)
b) Block nested loop join (Student as outer, 5 buffer blocks available)
c) Hash join
Solution:
a) Nested Loop Join:
b) Block Nested Loop Join:
With 5 buffers: 3 for outer, 1 for inner, 1 for output
c) Hash Join:
Consider the query:
Compare two plans:
Plan A: (Student × Enrollment) then select
Plan B: Select from each, then join
Solution:
Plan A Cost (Cartesian Product):
Intermediate result: 10,000 × 50,000 = 500,000,000 tuples
Plus cost of selecting from huge intermediate result
Plan B Cost (Select then Join):
Step 1: σdept='CS'(Student)
Result: ~500 tuples, ~25 blocks
Step 2: σgrade≥3.5(Enrollment)
Result: Grades 3.5, 4.0 = 2 out of 11 values
Estimated: 50,000 × (2/11) ≈ 9,091 tuples, ~455 blocks
Step 3: Hash Join of results
Total Plan B Cost:
Conclusion: Plan B is ~250× more efficient!
Given an equi-depth histogram for Student.age:
| Bucket | Age Range | Count | Distinct |
|---|---|---|---|
| 1 | 18-22 | 4,000 | 5 |
| 2 | 23-27 | 3,000 | 5 |
| 3 | 28-40 | 2,000 | 13 |
| 4 | 41-67 | 1,000 | 27 |
Estimate the cardinality of σage=25(Student)
Solution:
Age 25 falls in Bucket 2 (23-27)
Using uniform distribution within the bucket:
Note: Compare with uniform assumption over entire range:
Given three relations R, S, T with statistics:
| Relation | Tuples | Blocks |
|---|---|---|
| R | 1,000 | 100 |
| S | 5,000 | 250 |
| T | 10,000 | 500 |
Join conditions: R.a = S.a and S.b = T.b
V(a, R) = 100, V(a, S) = 100 (foreign key from R to S)
V(b, S) = 500, V(b, T) = 500 (foreign key from S to T)
Compare join orders using block nested loop join (3 buffers for outer):
a) (R ⋈ S) ⋈ T
b) (S ⋈ T) ⋈ R
Order A: (R ⋈ S) ⋈ T
Step 1: R ⋈ S (R outer, S inner)
Cardinality: 1,000 tuples (foreign key join)
Result blocks: assume 100 blocks
Step 2: (R ⋈ S) ⋈ T
Cardinality: 5,000 tuples (foreign key join from S to T)
Total Cost A: 12,600 + 25,100 = 37,700 blocks
Order B: (S ⋈ T) ⋈ R
Step 1: S ⋈ T (S outer, T inner)
Cardinality: 5,000 tuples (foreign key join)
Result blocks: assume 250 blocks
Step 2: (S ⋈ T) ⋈ R
Cardinality: 5,000 tuples
Total Cost B: 62,750 + 12,750 = 75,500 blocks
Conclusion: Order A is approximately 2× better!
General Rule: Join smallest relations first to minimize intermediate results
Transform the following correlated nested query into a join:
Analysis:
Optimized Query (Using Join):
Steps:
Cost Analysis:
Step 1 - Compute averages: Scan Student + aggregate
Step 2 - Join + filter: Hash join
Total: ~2,003 blocks vs. 500 × 10,000 = 5,000,000 blocks!
Given the following statistics on Enrollment:
Estimate cardinality for:
σcourse_id='CS101' ∨ grade≥3.5(Enrollment)
Solution:
First, calculate individual selectivities:
s1 (course_id='CS101'):
s2 (grade≥3.5):
Grades: 0.0, 0.5, 1.0, ..., 4.0 (11 values)
Grades ≥ 3.5: {3.5, 4.0} = 2 values
For disjunction (OR), use formula:
Substituting values:
Therefore, estimated cardinality:
Interpretation: About 18.4% of enrollments match the condition
Note: If we had used simple addition (incorrect!):
In this case, it gives the same result, but this is coincidental because s1 is very small. For larger selectivities, the correct formula matters!
Optimize the following query and estimate the cost:
Given Statistics:
Step 1: Apply Selections (Push Down)
σdept='CS'(Student):
σcredits=4(Course):
σgrade≥3.0(Enrollment):
Grades ≥3.0: {3.0, 3.5, 4.0} = 3 out of 11
Total Selection Cost: 500 + 50 + 2,000 = 2,550 blocks
Step 2: Determine Join Order
We have three filtered relations:
Best Strategy: Join smallest relations first
Order: (Student' ⋈ Enrollment') ⋈ Course'
Step 3: First Join - Student' ⋈ Enrollment'
Using hash join:
Result cardinality: ~13,636 tuples (foreign key)
Result blocks: ~682 blocks
Step 4: Second Join - Result ⋈ Course'
Using hash join:
Final result cardinality:
Since each enrollment links to one course, and we filtered to credits=4:
Step 5: Total Cost Summary
| Operation | Cost (blocks) |
|---|---|
| Select Student | 500 |
| Select Course | 50 |
| Select Enrollment | 2,000 |
| Join Student' ⋈ Enrollment' | 2,121 |
| Join Result ⋈ Course' | 2,085 |
| TOTAL | 6,756 blocks |
Explain why pushing selections down in a query tree generally improves performance. Give a counter-example where it might not help.
Calculate the number of different join orders possible for a query with 5 relations. If evaluating each plan takes 1 microsecond, how long would exhaustive enumeration take?
Given a relation R with nR=1,000,000 and an equi-depth histogram with 10 buckets, what is the maximum and minimum number of tuples that could match σA=v(R) for any value v?
Compare the cost of evaluating σA=5 ∧ B=10(R) using:
a) A single index on (A,B)
b) Two separate indexes on A and B with intersection
Assume: nR=100,000, V(A)=100, V(B)=1000, index height=3
Selection (equality): nσA=v(r) = nr / V(A,r)
Selection (range): nσA≤v(r) = nr × (v-min)/(max-min)
Conjunction: nσc1∧c2(r) = nr × s1 × s2
Disjunction: nσc1∨c2(r) = nr × (1-(1-s1)(1-s2))
Join (general): nr⋈s = (nr × ns) / max(V(A,r), V(A,s))
Projection: nπA(r) ≤ min(V(A,r), nr)
Aggregation: nAγF(B)(r) = V(A,r)
Linear scan: br
Nested loop join: br + (nr × bs)
Block nested loop: br + (br × bs) / (M-2)
Hash join: 3(br + bs)
Merge join: br + bs + sort cost
Dr. Mohsin Dar
Cloud & Software Operations Cluster | UPES
Office Hours: As per schedule
Database Systems - MTech Semester I | Lecture 12: Query Optimization
UPES - University of Petroleum and Energy Studies