COMP 6521 – Two-Phase Multiway Merge-Sort (TPMMS) Bag Union

We need to separate two different things:

File size in blocks (how many blocks the files occupy):

T1_records.txt and T2_records.txt have a fixed size:
blocks = ceil(#tuples / 40).
That does not depend on memory.

Same for the final BagUnion_Output.txt:
blocks_out = ceil(#distinct_tuples / 40).
Also independent of memory.
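The two block-count formulas above follow directly from the 40-tuples-per-block packing. A quick sanity check (Python sketch; the tuple counts are made-up examples):

```python
import math

def blocks_needed(num_tuples, tuples_per_block=40):
    """Blocks required to hold num_tuples at 40 tuples per 4 KB block."""
    return math.ceil(num_tuples / tuples_per_block)

# File size depends only on the tuple count, never on memory.
print(blocks_needed(1_000_000))  # 25000
print(blocks_needed(1_000_001))  # 25001 (a partial block still costs a whole block)
```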

Total I/O cost in blocks (what IOTracker reports: blocksRead, blocksWritten):

This does change when you change memory.

With less memory:

More initial runs.

More merge passes.

The same data is read and written multiple times during TPMMS.

So blocksRead and blocksWritten (total I/Os) increase.

With more memory:

Fewer runs, fewer passes.

Each block is read/written fewer times.

Total I/O decreases.
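The memory/I-O trade-off above can be put into a back-of-envelope cost model (a sketch, not the project's required accounting: memory is measured in blocks, and it assumes the classic model where Phase 1 reads and writes every block once, each extra merge pass re-reads and re-writes everything, and the final pass only reads):

```python
import math

def tpmms_io(file_blocks, mem_blocks):
    """Estimate total block I/Os for two-phase multiway merge-sort.

    Phase 1 reads and writes every block once (2*B I/Os).
    Each extra merge pass, needed while runs > mem_blocks - 1,
    re-reads and re-writes every block (another 2*B).
    The final merge pass reads each block once (B I/Os); the
    final output write is counted separately, as in the project.
    """
    runs = math.ceil(file_blocks / mem_blocks)
    io = 2 * file_blocks            # Phase 1: read all, write sorted runs
    while runs > mem_blocks - 1:    # intermediate merge passes
        runs = math.ceil(runs / (mem_blocks - 1))
        io += 2 * file_blocks
    io += file_blocks               # final merge pass: read only
    return io

# 50 000 input blocks: ample memory needs one merge pass,
# tight memory forces an extra pass and more total I/O.
print(tpmms_io(50_000, 13_056))  # 150000
print(tpms := tpmms_io(50_000, 100))  # 250000
```

With enough buffers, total cost stays at 3B; every additional pass adds 2B, which is exactly why blocksRead and blocksWritten grow as memory shrinks while the file sizes stay fixed.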

So:

Unchanged with memory:

Size of the input files in blocks.

Size of the final output in blocks.

Contents of the final output.

Changes with memory:

Number of runs.

Number of merge passes.

Total time.

IOTracker.blocksRead and IOTracker.blocksWritten (total I/O cost).
Github link for project

Design, implement, and evaluate an external-memory algorithm that sorts T1 and T2, then merges them to produce a bag union under restricted memory.

Block: 4 KB | 40 tuples/block | Record: 100 bytes | Memory: 51 MB & 101 MB

Project in one paragraph

Sort tables T1 and T2 externally (Phase 1), then merge the two sorted streams (Phase 2) to compute a bag union. If a tuple t appears n1 times in T1 and n2 times in T2, output exactly one entry t : n1+n2. Track total disk I/Os and time (ms) for each phase, report the number of distinct tuples, and report the number of output blocks (40 tuples per block) that would be needed. Run and compare with 51 MB and 101 MB memory limits.

Data format (fixed-width)

Field        Type  Length (bytes/chars)
Student ID   int   8
First Name   char  10
Last Name    char  10
Department   int   3
Program      int   3
SIN Number   int   9
Address      char  56
Total              100 bytes

Each 4 KB block holds 40 records (4000 B used + 96 B slack). Records are parsed by offsets; there are no delimiters.
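Offset-based parsing of one record can be sketched as follows (Python sketch; field widths are taken from the table above, the field names and dict layout are illustrative):

```python
# Field widths as listed in the table above: (name, type, width).
FIELDS = [
    ("student_id", int, 8),
    ("first_name", str, 10),
    ("last_name",  str, 10),
    ("department", int, 3),
    ("program",    int, 3),
    ("sin",        int, 9),
    ("address",    str, 56),
]
RECORD_SIZE = 100  # fixed width, no delimiters; if the widths sum to
                   # less than RECORD_SIZE, the remainder is padding

def parse_record(raw: bytes):
    """Slice one fixed-width record into a dict by cumulative offsets."""
    rec, off = {}, 0
    for name, typ, width in FIELDS:
        chunk = raw[off:off + width].decode("ascii")
        rec[name] = typ(chunk) if typ is int else chunk.rstrip()
        off += width
    return rec
```

A block loader would apply this 40 times per 4 KB block, stepping by RECORD_SIZE.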

What to implement

Phase 1 – External Sort

  • Read records from blocks; parse by fixed offsets.
  • Generate in-memory sorted runs (≤ memory M), spill to disk.
  • Multiway (k-way) merge runs → sorted T1 and sorted T2.
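The Phase 1 steps above can be sketched in miniature (Python sketch: runs live in memory lists instead of spill files, and `heapq.merge` stands in for the buffered k-way merge over disk runs):

```python
import heapq

def make_runs(records, capacity):
    """Phase 1a: cut the input into sorted runs of at most `capacity`
    records (stand-in for 'fits in memory M'); real code spills each
    run to disk as it is produced."""
    return [sorted(records[i:i + capacity])
            for i in range(0, len(records), capacity)]

def kway_merge(runs):
    """Phase 1b: merge all runs in one pass with a min-heap, which is
    what heapq.merge does internally over the run iterators."""
    return list(heapq.merge(*runs))

print(kway_merge(make_runs([5, 1, 4, 2, 3, 0], capacity=2)))
# [0, 1, 2, 3, 4, 5]
```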

Phase 2 – Merge & Bag Union

  • Merge the two sorted streams in order.
  • Coalesce equal tuples and count multiplicities.
  • Produce t : n in memory; compute output block count (40/blk).
  • Do not include time for the final write in Phase 2 timing.
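The Phase 2 merge-and-coalesce logic can be sketched as (a sketch over in-memory iterables; the real version consumes buffered block streams from the two sorted files):

```python
import heapq
from itertools import groupby

def bag_union(sorted_t1, sorted_t2):
    """Phase 2: merge two sorted streams in a single forward pass and
    coalesce equal tuples, yielding (tuple, n1 + n2) per distinct tuple."""
    merged = heapq.merge(sorted_t1, sorted_t2)
    for key, group in groupby(merged):
        yield key, sum(1 for _ in group)

print(list(bag_union([1, 2, 2, 5], [2, 3, 5])))
# [(1, 1), (2, 3), (3, 1), (5, 2)]
```

Because both inputs are already sorted, duplicates are always adjacent in the merged stream, so one counter per run of equal tuples is enough.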

Performance testing

Team of 3 – clear split of work

Role A – Core Engineer

  • Record parser & comparator (100B fixed).
  • Block I/O (4 KB, 40 rec/blk), loaders/packers.
  • Phase 1: run generation + k-way merge.

Role B – Merge & Correctness

  • Phase 2: two-way merge of sorted T1/T2.
  • Duplicate aggregation → t : n1+n2.
  • Oracle tests & edge cases (small datasets).

Role C – Performance & Report

  • Timing & I/O counters (per phase).
  • Dataset generator (>1M records each).
  • 51MB vs 101MB runs, plots & write-up.
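The counters Role C needs could look like the following (a minimal sketch: the IOTracker name and its blocksRead/blocksWritten fields come from these notes, but the methods and wiring are assumptions):

```python
import time

class IOTracker:
    """Per-phase I/O and timing counters."""
    def __init__(self):
        self.blocksRead = 0
        self.blocksWritten = 0
        self._start = None
        self.elapsed_ms = 0.0

    def read_block(self):       # call once per block read from disk
        self.blocksRead += 1

    def write_block(self):      # call once per block written to disk
        self.blocksWritten += 1

    def start(self):
        self._start = time.perf_counter()

    def stop(self):
        self.elapsed_ms = (time.perf_counter() - self._start) * 1000
```

One tracker per phase keeps the Phase 1 / Phase 2 numbers separate, and resetting (or swapping) trackers between phases makes it easy to exclude the final output write from Phase 2 timing as required.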

Deliverables & dates

What this project is (and isn’t)

Is: a focused external-memory operator (external sort + merge for bag union) with careful measurement.

Is not: a full DBMS (no SQL parser, optimizer, transactions, etc.).

Helpful extras