COMP 6521 – Two-Phase Multiway Merge-Sort (TPMMS) Bag Union

We need to separate two different things:

File size in blocks (how many blocks the files occupy):

T1_records.txt and T2_records.txt have a fixed size:
blocks = ceil(#tuples / 40).
That does not depend on memory.

Same for the final BagUnion_Output.txt:
blocks_out = ceil(#distinct_tuples / 40).
Also independent of memory.
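The two block-count formulas above follow directly from the 40-tuples-per-block packing. A quick sanity check (Python sketch; the tuple counts are made-up examples):

```python
import math

def blocks_needed(num_tuples, tuples_per_block=40):
    """Blocks required to hold num_tuples at 40 tuples per 4 KB block."""
    return math.ceil(num_tuples / tuples_per_block)

# File size depends only on the tuple count, never on memory.
print(blocks_needed(1_000_000))  # 25000
print(blocks_needed(1_000_001))  # 25001 (a partial block still costs a whole block)
```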

Total I/O cost in blocks (what IOTracker reports: blocksRead, blocksWritten):

This does change when you change memory.

With less memory:

More initial runs.

More merge passes.

The same data is read and written multiple times during TPMMS.

So blocksRead and blocksWritten (total I/Os) increase.

With more memory:

Fewer runs, fewer passes.

Each block is read/written fewer times.

Total I/O decreases.
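The memory/I-O trade-off above can be put into a back-of-envelope cost model (a sketch, not the project's required accounting: memory is measured in blocks, and it assumes the classic model where Phase 1 reads and writes every block once, each extra merge pass re-reads and re-writes everything, and the final pass only reads):

```python
import math

def tpmms_io(file_blocks, mem_blocks):
    """Estimate total block I/Os for two-phase multiway merge-sort.

    Phase 1 reads and writes every block once (2*B I/Os).
    Each extra merge pass, needed while runs > mem_blocks - 1,
    re-reads and re-writes every block (another 2*B).
    The final merge pass reads each block once (B I/Os); the
    final output write is counted separately, as in the project.
    """
    runs = math.ceil(file_blocks / mem_blocks)
    io = 2 * file_blocks            # Phase 1: read all, write sorted runs
    while runs > mem_blocks - 1:    # intermediate merge passes
        runs = math.ceil(runs / (mem_blocks - 1))
        io += 2 * file_blocks
    io += file_blocks               # final merge pass: read only
    return io

# 50 000 input blocks: ample memory needs one merge pass,
# tight memory forces an extra pass and more total I/O.
print(tpmms_io(50_000, 13_056))  # 150000
print(tpms := tpmms_io(50_000, 100))  # 250000
```

With enough buffers, total cost stays at 3B; every additional pass adds 2B, which is exactly why blocksRead and blocksWritten grow as memory shrinks while the file sizes stay fixed.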

So:

Unchanged with memory:

Size of the input files in blocks.

Size of the final output in blocks.

Contents of the final output.

Changes with memory:

Number of runs.

Number of merge passes.

Total time.

IOTracker.blocksRead and IOTracker.blocksWritten (total I/O cost).
Github link for project

Design, implement, and evaluate an external-memory algorithm that sorts T1 and T2, then merges them to produce a bag union under restricted memory.

Block: 4 KB | 40 tuples/block | Record: 100 bytes | Memory: 51 MB & 101 MB

Project in one paragraph

Sort tables T1 and T2 externally (Phase 1), then merge the two sorted streams (Phase 2) to compute a bag union. If a tuple t appears n1 times in T1 and n2 times in T2, output exactly one entry t : n1+n2. Track total disk I/Os and time (ms) for each phase, report the number of distinct tuples, and report the number of output blocks (40 tuples per block) that would be needed. Run and compare with 51 MB and 101 MB memory limits.

Data format (fixed-width)

Field        Type  Length (bytes/chars)
Student ID   int   8
First Name   char  10
Last Name    char  10
Department   int   3
Program      int   3
SIN Number   int   9
Address      char  56
Total              100 bytes

Each 4 KB block holds 40 records (4000 B used + 96 B slack). Records are parsed by offsets; there are no delimiters.
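Offset-based parsing of one record can be sketched as follows (Python sketch; field widths are taken from the table above, the field names and dict layout are illustrative):

```python
# Field widths as listed in the table above: (name, type, width).
FIELDS = [
    ("student_id", int, 8),
    ("first_name", str, 10),
    ("last_name",  str, 10),
    ("department", int, 3),
    ("program",    int, 3),
    ("sin",        int, 9),
    ("address",    str, 56),
]
RECORD_SIZE = 100  # fixed width, no delimiters; if the widths sum to
                   # less than RECORD_SIZE, the remainder is padding

def parse_record(raw: bytes):
    """Slice one fixed-width record into a dict by cumulative offsets."""
    rec, off = {}, 0
    for name, typ, width in FIELDS:
        chunk = raw[off:off + width].decode("ascii")
        rec[name] = typ(chunk) if typ is int else chunk.rstrip()
        off += width
    return rec
```

A block loader would apply this 40 times per 4 KB block, stepping by RECORD_SIZE.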

What to implement

Phase 1 – External Sort

  • Read records from blocks; parse by fixed offsets.
  • Generate in-memory sorted runs (≤ memory M), spill to disk.
  • Multiway (k-way) merge runs → sorted T1 and sorted T2.
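The Phase 1 steps above can be sketched in miniature (Python sketch: runs live in memory lists instead of spill files, and `heapq.merge` stands in for the buffered k-way merge over disk runs):

```python
import heapq

def make_runs(records, capacity):
    """Phase 1a: cut the input into sorted runs of at most `capacity`
    records (stand-in for 'fits in memory M'); real code spills each
    run to disk as it is produced."""
    return [sorted(records[i:i + capacity])
            for i in range(0, len(records), capacity)]

def kway_merge(runs):
    """Phase 1b: merge all runs in one pass with a min-heap, which is
    what heapq.merge does internally over the run iterators."""
    return list(heapq.merge(*runs))

print(kway_merge(make_runs([5, 1, 4, 2, 3, 0], capacity=2)))
# [0, 1, 2, 3, 4, 5]
```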

Phase 2 – Merge & Bag Union

  • Merge the two sorted streams in order.
  • Coalesce equal tuples and count multiplicities.
  • Produce t : n in memory; compute output block count (40/blk).
  • Do not include time for the final write in Phase 2 timing.
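The Phase 2 merge-and-coalesce logic can be sketched as (a sketch over in-memory iterables; the real version consumes buffered block streams from the two sorted files):

```python
import heapq
from itertools import groupby

def bag_union(sorted_t1, sorted_t2):
    """Phase 2: merge two sorted streams in a single forward pass and
    coalesce equal tuples, yielding (tuple, n1 + n2) per distinct tuple."""
    merged = heapq.merge(sorted_t1, sorted_t2)
    for key, group in groupby(merged):
        yield key, sum(1 for _ in group)

print(list(bag_union([1, 2, 2, 5], [2, 3, 5])))
# [(1, 1), (2, 3), (3, 1), (5, 2)]
```

Because both inputs are already sorted, duplicates are always adjacent in the merged stream, so one counter per run of equal tuples is enough.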

Performance testing

Team of 3 – clear split of work

Role A – Core Engineer

  • Record parser & comparator (100B fixed).
  • Block I/O (4 KB, 40 rec/blk), loaders/packers.
  • Phase 1: run generation + k-way merge.

Role B – Merge & Correctness

  • Phase 2: two-way merge of sorted T1/T2.
  • Duplicate aggregation → t : n1+n2.
  • Oracle tests & edge cases (small datasets).

Role C – Performance & Report

  • Timing & I/O counters (per phase).
  • Dataset generator (>1M records each).
  • 51MB vs 101MB runs, plots & write-up.
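The counters Role C needs could look like the following (a minimal sketch: the IOTracker name and its blocksRead/blocksWritten fields come from these notes, but the methods and wiring are assumptions):

```python
import time

class IOTracker:
    """Per-phase I/O and timing counters."""
    def __init__(self):
        self.blocksRead = 0
        self.blocksWritten = 0
        self._start = None
        self.elapsed_ms = 0.0

    def read_block(self):       # call once per block read from disk
        self.blocksRead += 1

    def write_block(self):      # call once per block written to disk
        self.blocksWritten += 1

    def start(self):
        self._start = time.perf_counter()

    def stop(self):
        self.elapsed_ms = (time.perf_counter() - self._start) * 1000
```

One tracker per phase keeps the Phase 1 / Phase 2 numbers separate, and resetting (or swapping) trackers between phases makes it easy to exclude the final output write from Phase 2 timing as required.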

Deliverables & dates

What this project is (and isn’t)

Is: a focused external-memory operator (external sort + merge for bag union) with careful measurement.

Is not: a full DBMS (no SQL parser, optimizer, transactions, etc.).

Helpful extras