Two-Phase Multiway Merge-Sort (2PMMS)
Phase 1: Run Generation
- Fill memory with blocks, sort (quicksort), write sorted runs
- Each block read + written once
- #runs ≈ ⌈(N·R)/M⌉, where N=#tuples, R=tuple size, M=memory
Phase 2: Multiway Merge
- Merge (M/B – 1) runs per pass (need 1 output buffer)
- k passes → runs shrink: r / (M/B – 1)k
- Total I/O ≈ (2k+1)(N·R/B)
Exam: Be ready to compute I/O costs for given N,R,M,B and #passes.
Worked Example
- 10M tuples × 100B → 1 GB relation
- Block = 4KB = 40 tuples → 250k blocks
- Memory = 50 MB = 12,800 blocks
- Phase 1: 250k/12.8k ≈ 20 runs → ≈ 89.7 min
- Phase 2: ≈ 89.7 min → Total ≈ 179 min (≈ 3h)
Key: Understand how to derive fill counts, run counts, and total I/O time.
Optimizing Disk Access
Strategies
- Cylindrification: group blocks by cylinders → Phase 1 speedup (90min → 4.5min)
- Multiple disks (striping): parallelize fills, near-linear speedup
- Mirroring: duplicates for reliability, parallel reads
- Elevator algorithm: schedule seeks like elevator floors
- Prefetching / double buffering: overlap I/O and computation
Exam: Be ready to explain why Phase 2 can’t be sped up by cylindrification, and compare scheduling vs striping.
Disk Reliability & RAID
Failure Detection
- Checksums/parity bits: detect errors (prob undetected = 1/2ⁿ for n parity bits)
RAID Levels
- RAID 1: Mirroring, high reliability, cost ×2
- RAID 4: One dedicated parity disk, XOR-based recovery
- RAID 5: Distributed parity, balances load
- RAID 6: Tolerates 2 simultaneous failures, uses redundancy patterns
Exam: Expect to compute recovery steps (XOR recompute), probability of failure, or identify recoverable crash pairs.
Likely Exam Topics
- Explain 2PMMS steps, compute #runs, #passes, I/O time
- Compare performance with/without optimization (cylinders, striping)
- Work small toy examples (few tuples/blocks, minimal memory)
- Disk scheduling: FIFO vs Elevator comparison
- Reliability: parity checks, probability of undetected errors
- RAID 1–6: differences, recovery process, pros/cons
Mix of calculations (I/O time, passes, probabilities) and conceptual questions (why RAID 5 distributes parity, why mirror helps reads).