Query Execution & Algorithms – Exam Notes

1. Query Processor Overview

Query compilation
- Parsing: SQL → relational algebra (RA) expression tree
- Rewrite: transform to equivalent but cheaper logical plan
- Physical plan: choose algorithms + iterator passing of tuples
Query execution: run the physical plan using iterators (Open, GetNext, Close)

2. Relational Algebra & Bags

Basic SELECT–FROM–WHERE is roughly \(\pi(\sigma(\text{product of input relations}))\).

DISTINCT → duplicate elimination \(\delta\)
GROUP BY → grouping \(\gamma\)
ORDER BY → sorting \(\tau\)

Sets vs. Bags

Standard RA: sets (no duplicates)
SQL: bags (multisets, duplicates allowed)

Bag semantics

Let \(\mathrm{Card}(t, R)\) be the number of copies of tuple \(t\) in \(R\).

Bag union: \[ \mathrm{Card}(t, R \cup S) = \mathrm{Card}(t, R) + \mathrm{Card}(t, S) \]
Bag intersection: \[ \mathrm{Card}(t, R \cap S) = \min\big(\mathrm{Card}(t, R), \mathrm{Card}(t, S)\big) \]
Bag difference: \[ \mathrm{Card}(t, R - S) = \max\big(\mathrm{Card}(t, R) - \mathrm{Card}(t, S), 0\big) \]

RA & SQL mapping

\(\sigma\) (selection) ↔ WHERE
\(\pi\) (projection) ↔ SELECT
\(\times\) (product) ↔ FROM cross product
Joins ↔ JOIN, NATURAL JOIN, etc.
\(\delta\) ↔ DISTINCT
\(\gamma\) ↔ GROUP BY
\(\tau\) ↔ ORDER BY

3. Basic RA Operators

Selection \(\sigma_C(R)\): filters rows satisfying condition \(C\); attributes unchanged.
Projection \(\pi_L(R)\): keeps attribute list \(L\), may create duplicates (in bags).
Product \(R \times S\): all combinations of tuples from \(R\) and \(S\).
Natural join \(R \bowtie S\): equijoin on common attributes; remove duplicate join columns.
Theta-join \(R \bowtie_\theta S\): join with arbitrary condition \(\theta\).
Sorting \(\tau_L(R)\): sort on attributes \(L\).
Duplicate elimination \(\delta(R)\).
Grouping \(\gamma_{L, \text{aggs}}(R)\): group on \(L\), compute aggregates.

4. Expression Trees

Internal representation of SQL as tree of RA operators.
Leaves are base relations; internal nodes are operators.
Query rewrite = tree transformations to reduce cost.

5. Scans & Iterator Model

Scanning a relation \(R\)

Table scan: read all blocks of \(R\) sequentially.
Index scan: use an index to retrieve qualifying tuples.
Sort scan: scan in sorted order (sorted file or external sort).

Iterator interface

Open(): initialize operator.
GetNext(): return next tuple.
Close(): free resources.

6. Cost Parameters

\(M\): # of memory buffers (blocks).
\(B(R)\): # of disk blocks of relation \(R\).
\(T(R)\): # of tuples of relation \(R\).
\(V(R, a)\): # of distinct values of attribute \(a\) in \(R\).
\(V(R, L)\): # of distinct combinations in attribute list \(L\).

Sequential scan of \(R\):

If \(R\) is clustered: cost \(\approx B(R)\) I/Os.
If completely unclustered: cost \(\approx T(R)\) I/Os (one tuple per block).

7. Classification of Algorithms

By method: sort-based, hash-based, index-based.
By passes: one-pass, two-pass, multipass.
By operator type: unary vs. binary, tuple-at-a-time vs. full-relation.

8. One-Pass Algorithms

8.1 Tuple-at-a-time (selection, projection)

Need only \(M \ge 1\) block.
Cost: one scan of relation.
Algorithm: read tuple, process, output or discard.

8.2 Duplicate Elimination \(\delta(R)\)

Use in-memory hash table of tuples seen so far; only output new tuples.
Requirement: all distinct tuples fit in memory (roughly #distinct tuples \(\le\) hash capacity with \(M\) blocks).

8.3 Grouping \(\gamma_{L}(R)\)

Maintain one in-memory entry per group (per distinct \(L\)-value).
Requirement: \(V(R, L)\) (number of groups) must fit in memory.

8.4 One-pass Binary Operators

General requirement:

\[ \min\big(B(R), B(S)\big) \le M \]

Load smaller relation fully into memory, scan the other, perform:
- Union, intersection, difference.
- Join (build-probe style).
- Cartesian product etc.
Cost: \(B(R) + B(S)\) I/Os.

9. Nested-Loop Joins (One-and-a-half-pass)

9.1 Tuple-based nested-loop join

for each tuple r in R:
  for each tuple s in S:
    if join condition holds:
      output (r, s)

Cost \(\approx T(R) \times T(S)\) (too large in practice).

9.2 Block-based nested-loop join

Assume \(B(S) \le B(R)\) and \(B(S) > M\).

Read \(M-1\) blocks of \(S\) into memory at a time.
For each chunk of \(S\), scan all blocks of \(R\).

Cost formula:

\[ \text{Cost} = \frac{B(S)}{M-1} \cdot \big((M-1) + B(R)\big) = B(S) + \frac{B(S)B(R)}{M-1} \approx \frac{B(S)B(R)}{M} \]

10. Two-Pass Sort-Based Algorithms (TPMMS)

10.1 External sort (TPMMS)

Pass 1: read \(M\) blocks at a time, sort in memory, write runs.
Pass 2: merge runs into final sorted sequence.
Requirement: #runs \(\le M-1\).
Condition: \[ \frac{B(R)}{M} \le M-1 \quad\Rightarrow\quad B(R) \le M(M-1) \approx M^2 \]

10.2 Duplicate elimination via sorting

Sort \(R\) using TPMMS.
Merge phase: output tuple only if different from previous output.
Cost: \[ 2B(R) + B(R) = 3B(R)\ \text{I/Os} \]
Requirement: \(B(R) \le M^2\).

10.3 Sort-based set operations

Sort \(R\) and \(S\), merge to compute \(\cup, \cap, -\).
Cost: \[ \approx 3\big(B(R)+B(S)\big) \]
Requirement: \(B(R) \le M^2, B(S) \le M^2\).

10.4 Sort-merge join

For join \(R(Y) \bowtie S(Y)\):

Sort \(R\) and \(S\) on join key \(Y\).
Merge: walk through both sorted lists and join matching keys.
Simple cost: \[ \approx 5\big(B(R) + B(S)\big) \] (e.g., 4 for sorting both + 1 for merge)
Requirement: \(B(R) \le M^2, B(S) \le M^2\).

11. Two-Pass Hash-Based Algorithms

11.1 Hash partitioning

Use hash function \(h\) to partition relation into buckets.
Each bucket has \(\approx B(R)/(M-1)\) blocks; aim: \(\le M\) so bucket fits in memory.
Leads to condition \(B(R) \le M^2\) (roughly).

11.2 Hash-based duplicate elimination

Pass 1: partition \(R\) using hash on all attributes.
Pass 2: for each bucket, run one-pass \(\delta\).
Cost: \[ 3B(R)\ \text{I/Os} \]
Requirement: partitions small enough (approx \(B(R) \le M^2\)).

11.3 Hash-based grouping

Partition by grouping attributes.
For each bucket, one-pass grouping in memory.
Cost and requirement similar to duplicate elimination.

11.4 Partitioned hash-join

For \(R \bowtie S\) on attributes \(Y\):

Pass 1: partition \(R\) and \(S\) into buckets \(R_i, S_i\) using same hash on \(Y\).
Pass 2: for each \(i\), build hash table on smaller of \(R_i, S_i\) and probe with the other.
Cost: \[ 3\big(B(R) + B(S)\big) \]
Requirement (two-pass): \[ \min\big(B(R), B(S)\big) \le M^2 \]

12. Index-Based Algorithms

12.1 Clustering

Clustered relation: blocks contain only tuples of that relation.
Clustering index: data physically ordered by index key.
Non-clustering index: pointers to tuples scattered across many blocks.

12.2 Index-based selection \(\sigma_{a=c}(R)\)

After index lookup:

If index is clustering: \[ \text{blocks} \approx \frac{B(R)}{V(R,a)} \]
If non-clustering: \[ \text{I/Os} \approx \frac{T(R)}{V(R,a)} \] (each tuple fetch may be a separate I/O).

12.3 Index nested-loop join

Join \(R(X,Y) \bowtie S(Y,Z)\), index on \(S.Y\):

Scan \(R\); for each tuple \(r\), probe index on \(S\) with \(r.Y\).
Let expected matches per \(r\) be \(T(S)/V(S,Y)\).
Additional I/Os:
- Non-clustering index: \(\approx T(R) \cdot T(S)/V(S,Y)\).
- Clustering index: \(\approx T(R) \cdot B(S)/V(S,Y)\).

12.4 Joins with sorted (B-tree) indexes

Sort-merge-like: sort one side, scan the other via ordered index.
If both sides have sorted indexes, can use zig-zag merge join.

13. Multipass Sort-Based Algorithms

13.1 Sorting a single relation

Let \(s(M,k)\) = max #blocks we can sort in \(k\) passes.

\(s(M,1) = M\)
\(s(M,k+1) = M \cdot s(M,k)\)
\(s(M,k) = M^k\)
Condition to sort \(R\) in \(k\) passes: \[ B(R) \le M^k \]
I/O cost: \[ 2k \cdot B(R) \] (read + write each pass).

13.2 Multipass sort-based join

Let \(j(M,k)\) = max total blocks \(B(R)+B(S)\) joinable in \(k\) passes.

\(j(M,k) \approx M^k\).
Condition: \[ B(R) + B(S) \le M^k \]
Cost: \[ (2k - 1)\big(B(R) + B(S)\big) \]

14. Multipass Hash-Based Algorithms

14.1 Unary operations (single relation)

Let \(u(M,k)\) = max #blocks we can handle in \(k\) passes.

\(u(M,1) = M\) (one-pass case).
Recurrence: \[ u(M, k) = (M-1)\cdot u(M, k-1) \]
Closed form: \[ u(M,k) = M (M-1)^{k-1} \approx M^k \]
Condition for unary hash-based op in \(k\) passes: \(B(R) \le u(M,k)\).

14.2 Binary operations

Let \(j(M,k)\) = max #blocks of the smaller relation we can handle in \(k\) passes.

\(j(M,1) = M-1\) (fits in memory).
Recurrence: \[ j(M,k) = (M-1) \cdot j(M,k-1) \Rightarrow j(M,k) = (M-1)^k \]
Condition for k-pass hash-based binary op: \[ \min\big(B(R), B(S)\big) \le (M-1)^k \]

15. Quick Strategy Summary

One-pass:
- σ, π: always if \(M \ge 1\).
- δ, γ: all distinct tuples/groups fit in memory.
- Binary ops: \(\min(B(R), B(S)) \le M\).
Two-pass sort-based:
- Need sorted output or using sort-merge join.
- Condition: \(B \le M^2\) (per relation).
Two-pass hash-based:
- For grouping, duplicate elimination, joins without needing order.
- Condition: sizes fit with \(B \le M^2\)-type bounds.
Index-based:
- Highly selective σ with good index.
- Joins where one side has clustering/sorted index on join key.
Multipass:
- Use when relations exceed \(M^2\).
- Sort-based: \(B \le M^k\).
- Hash-based unary: \(B \le M(M-1)^{k-1}\).
- Hash-based binary (smaller side): \(B \le (M-1)^k\).