Query Execution & Algorithms – Exam Notes
1. Query Processor Overview
-
Query compilation
- Parsing: SQL → relational algebra (RA) expression tree
- Rewrite: transform to equivalent but cheaper logical plan
- Physical plan: choose algorithms + iterator passing of tuples
-
Query execution:
run the physical plan using iterators (
Open,
GetNext, Close)
2. Relational Algebra & Bags
Basic SELECT–FROM–WHERE is roughly
\(\pi(\sigma(\text{product of input relations}))\).
- DISTINCT → duplicate elimination \(\delta\)
- GROUP BY → grouping \(\gamma\)
- ORDER BY → sorting \(\tau\)
Sets vs. Bags
- Standard RA: sets (no duplicates)
- SQL: bags (multisets, duplicates allowed)
Bag semantics
Let \(\mathrm{Card}(t, R)\) be the number of copies of tuple \(t\) in \(R\).
-
Bag union:
\[
\mathrm{Card}(t, R \cup S)
= \mathrm{Card}(t, R) + \mathrm{Card}(t, S)
\]
-
Bag intersection:
\[
\mathrm{Card}(t, R \cap S)
= \min\big(\mathrm{Card}(t, R), \mathrm{Card}(t, S)\big)
\]
-
Bag difference:
\[
\mathrm{Card}(t, R - S)
= \max\big(\mathrm{Card}(t, R) - \mathrm{Card}(t, S), 0\big)
\]
RA & SQL mapping
- \(\sigma\) (selection) ↔
WHERE
- \(\pi\) (projection) ↔
SELECT
- \(\times\) (product) ↔
FROM cross product
- Joins ↔
JOIN, NATURAL JOIN, etc.
- \(\delta\) ↔
DISTINCT
- \(\gamma\) ↔
GROUP BY
- \(\tau\) ↔
ORDER BY
3. Basic RA Operators
-
Selection \(\sigma_C(R)\):
filters rows satisfying condition \(C\); attributes unchanged.
-
Projection \(\pi_L(R)\):
keeps attribute list \(L\), may create duplicates (in bags).
-
Product \(R \times S\):
all combinations of tuples from \(R\) and \(S\).
-
Natural join \(R \bowtie S\):
equijoin on common attributes; remove duplicate join columns.
-
Theta-join \(R \bowtie_\theta S\):
join with arbitrary condition \(\theta\).
- Sorting \(\tau_L(R)\): sort on attributes \(L\).
- Duplicate elimination \(\delta(R)\).
-
Grouping \(\gamma_{L, \text{aggs}}(R)\):
group on \(L\), compute aggregates.
4. Expression Trees
- Internal representation of SQL as tree of RA operators.
- Leaves are base relations; internal nodes are operators.
- Query rewrite = tree transformations to reduce cost.
5. Scans & Iterator Model
Scanning a relation \(R\)
- Table scan: read all blocks of \(R\) sequentially.
-
Index scan:
use an index to retrieve qualifying tuples.
-
Sort scan:
scan in sorted order (sorted file or external sort).
Iterator interface
Open(): initialize operator.
GetNext(): return next tuple.
Close(): free resources.
6. Cost Parameters
- \(M\): # of memory buffers (blocks).
- \(B(R)\): # of disk blocks of relation \(R\).
- \(T(R)\): # of tuples of relation \(R\).
- \(V(R, a)\): # of distinct values of attribute \(a\) in \(R\).
- \(V(R, L)\): # of distinct combinations in attribute list \(L\).
Sequential scan of \(R\):
- If \(R\) is clustered: cost \(\approx B(R)\) I/Os.
-
If completely unclustered: cost \(\approx T(R)\) I/Os (one tuple per
block).
7. Classification of Algorithms
-
By method:
sort-based, hash-based, index-based.
-
By passes:
one-pass, two-pass, multipass.
-
By operator type:
unary vs. binary, tuple-at-a-time vs. full-relation.
8. One-Pass Algorithms
8.1 Tuple-at-a-time (selection, projection)
- Need only \(M \ge 1\) block.
- Cost: one scan of relation.
- Algorithm: read tuple, process, output or discard.
8.2 Duplicate Elimination \(\delta(R)\)
-
Use in-memory hash table of tuples seen so far; only output new tuples.
-
Requirement: all distinct tuples fit in memory
(roughly #distinct tuples \(\le\) hash capacity with \(M\) blocks).
8.3 Grouping \(\gamma_{L}(R)\)
-
Maintain one in-memory entry per group (per distinct \(L\)-value).
-
Requirement: \(V(R, L)\) (number of groups) must fit in memory.
8.4 One-pass Binary Operators
General requirement:
\[
\min\big(B(R), B(S)\big) \le M
\]
-
Load smaller relation fully into memory, scan the other, perform:
- Union, intersection, difference.
- Join (build-probe style).
- Cartesian product etc.
- Cost: \(B(R) + B(S)\) I/Os.
9. Nested-Loop Joins (One-and-a-half-pass)
9.1 Tuple-based nested-loop join
for each tuple r in R:
for each tuple s in S:
if join condition holds:
output (r, s)
- Cost \(\approx T(R) \times T(S)\) (too large in practice).
9.2 Block-based nested-loop join
Assume \(B(S) \le B(R)\) and \(B(S) > M\).
- Read \(M-1\) blocks of \(S\) into memory at a time.
- For each chunk of \(S\), scan all blocks of \(R\).
Cost formula:
\[
\text{Cost} =
\frac{B(S)}{M-1} \cdot \big((M-1) + B(R)\big)
= B(S) + \frac{B(S)B(R)}{M-1}
\approx \frac{B(S)B(R)}{M}
\]
10. Two-Pass Sort-Based Algorithms (TPMMS)
10.1 External sort (TPMMS)
- Pass 1: read \(M\) blocks at a time, sort in memory, write runs.
- Pass 2: merge runs into final sorted sequence.
- Requirement: #runs \(\le M-1\).
-
Condition:
\[
\frac{B(R)}{M} \le M-1
\quad\Rightarrow\quad
B(R) \le M(M-1) \approx M^2
\]
10.2 Duplicate elimination via sorting
- Sort \(R\) using TPMMS.
- Merge phase: output tuple only if different from previous output.
-
Cost:
\[
2B(R) + B(R) = 3B(R)\ \text{I/Os}
\]
- Requirement: \(B(R) \le M^2\).
10.3 Sort-based set operations
- Sort \(R\) and \(S\), merge to compute \(\cup, \cap, -\).
-
Cost:
\[
\approx 3\big(B(R)+B(S)\big)
\]
- Requirement: \(B(R) \le M^2, B(S) \le M^2\).
10.4 Sort-merge join
For join \(R(Y) \bowtie S(Y)\):
- Sort \(R\) and \(S\) on join key \(Y\).
- Merge: walk through both sorted lists and join matching keys.
-
Simple cost:
\[
\approx 5\big(B(R) + B(S)\big)
\]
(e.g., 4 for sorting both + 1 for merge)
- Requirement: \(B(R) \le M^2, B(S) \le M^2\).
11. Two-Pass Hash-Based Algorithms
11.1 Hash partitioning
- Use hash function \(h\) to partition relation into buckets.
-
Each bucket has \(\approx B(R)/(M-1)\) blocks; aim:
\(\le M\) so bucket fits in memory.
- Leads to condition \(B(R) \le M^2\) (roughly).
11.2 Hash-based duplicate elimination
- Pass 1: partition \(R\) using hash on all attributes.
- Pass 2: for each bucket, run one-pass \(\delta\).
-
Cost:
\[
3B(R)\ \text{I/Os}
\]
- Requirement: partitions small enough (approx \(B(R) \le M^2\)).
11.3 Hash-based grouping
- Partition by grouping attributes.
- For each bucket, one-pass grouping in memory.
- Cost and requirement similar to duplicate elimination.
11.4 Partitioned hash-join
For \(R \bowtie S\) on attributes \(Y\):
- Pass 1: partition \(R\) and \(S\) into buckets \(R_i, S_i\) using same hash on \(Y\).
- Pass 2: for each \(i\), build hash table on smaller of \(R_i, S_i\) and probe with the other.
-
Cost:
\[
3\big(B(R) + B(S)\big)
\]
-
Requirement (two-pass):
\[
\min\big(B(R), B(S)\big) \le M^2
\]
12. Index-Based Algorithms
12.1 Clustering
- Clustered relation: blocks contain only tuples of that relation.
- Clustering index: data physically ordered by index key.
- Non-clustering index: pointers to tuples scattered across many blocks.
12.2 Index-based selection \(\sigma_{a=c}(R)\)
After index lookup:
-
If index is clustering:
\[
\text{blocks} \approx \frac{B(R)}{V(R,a)}
\]
-
If non-clustering:
\[
\text{I/Os} \approx \frac{T(R)}{V(R,a)}
\]
(each tuple fetch may be a separate I/O).
12.3 Index nested-loop join
Join \(R(X,Y) \bowtie S(Y,Z)\), index on \(S.Y\):
- Scan \(R\); for each tuple \(r\), probe index on \(S\) with \(r.Y\).
- Let expected matches per \(r\) be \(T(S)/V(S,Y)\).
-
Additional I/Os:
-
Non-clustering index:
\(\approx T(R) \cdot T(S)/V(S,Y)\).
-
Clustering index:
\(\approx T(R) \cdot B(S)/V(S,Y)\).
12.4 Joins with sorted (B-tree) indexes
- Sort-merge-like: sort one side, scan the other via ordered index.
- If both sides have sorted indexes, can use zig-zag merge join.
13. Multipass Sort-Based Algorithms
13.1 Sorting a single relation
Let \(s(M,k)\) = max #blocks we can sort in \(k\) passes.
- \(s(M,1) = M\)
- \(s(M,k+1) = M \cdot s(M,k)\)
- \(s(M,k) = M^k\)
-
Condition to sort \(R\) in \(k\) passes:
\[
B(R) \le M^k
\]
-
I/O cost:
\[
2k \cdot B(R)
\]
(read + write each pass).
13.2 Multipass sort-based join
Let \(j(M,k)\) = max total blocks \(B(R)+B(S)\) joinable in \(k\) passes.
- \(j(M,k) \approx M^k\).
-
Condition:
\[
B(R) + B(S) \le M^k
\]
-
Cost:
\[
(2k - 1)\big(B(R) + B(S)\big)
\]
14. Multipass Hash-Based Algorithms
14.1 Unary operations (single relation)
Let \(u(M,k)\) = max #blocks we can handle in \(k\) passes.
- \(u(M,1) = M\) (one-pass case).
-
Recurrence:
\[
u(M, k) = (M-1)\cdot u(M, k-1)
\]
-
Closed form:
\[
u(M,k) = M (M-1)^{k-1} \approx M^k
\]
- Condition for unary hash-based op in \(k\) passes: \(B(R) \le u(M,k)\).
14.2 Binary operations
Let \(j(M,k)\) = max #blocks of the smaller relation we can handle in \(k\) passes.
- \(j(M,1) = M-1\) (fits in memory).
-
Recurrence:
\[
j(M,k) = (M-1) \cdot j(M,k-1)
\Rightarrow j(M,k) = (M-1)^k
\]
-
Condition for k-pass hash-based binary op:
\[
\min\big(B(R), B(S)\big) \le (M-1)^k
\]
15. Quick Strategy Summary
-
One-pass:
- σ, π: always if \(M \ge 1\).
- δ, γ: all distinct tuples/groups fit in memory.
- Binary ops: \(\min(B(R), B(S)) \le M\).
-
Two-pass sort-based:
- Need sorted output or using sort-merge join.
- Condition: \(B \le M^2\) (per relation).
-
Two-pass hash-based:
- For grouping, duplicate elimination, joins without needing order.
- Condition: sizes fit with \(B \le M^2\)-type bounds.
-
Index-based:
- Highly selective σ with good index.
- Joins where one side has clustering/sorted index on join key.
-
Multipass:
- Use when relations exceed \(M^2\).
- Sort-based: \(B \le M^k\).
- Hash-based unary: \(B \le M(M-1)^{k-1}\).
- Hash-based binary (smaller side): \(B \le (M-1)^k\).