COMP 479/6791 Project 1 Report

Naive and SPIMI Indexers, Query Processing, and Lossy Dictionary Compression

Fall 2025 – Reuters-21578 Dataset

Abstract

This project implements a complete information retrieval pipeline, including a naive indexer, a SPIMI indexer, Boolean query processing (single and AND queries), and several lossy dictionary compression techniques. We evaluate their efficiency and compression effects on the Reuters-21578 collection and compare SPIMI’s performance to the naive approach on a controlled set of 10,000 term–docID pairs.

1. Data and Pre-Processing

2. Implementation Overview

2.1 Naive Indexer

The naive indexer builds an inverted index using a sort-merge strategy. It collects (term, docID) pairs for each token, sorts them, removes duplicates, and merges identical terms into postings lists. This method is straightforward but requires a complete sort of all pairs in memory.
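The sort-merge strategy above can be sketched in a few lines of Python. Function and variable names here are illustrative, not the author's actual code:

```python
def naive_index(docs):
    """Naive sort-merge indexer. docs: iterable of (doc_id, token_list)."""
    pairs = []
    for doc_id, tokens in docs:
        for term in tokens:
            pairs.append((term, doc_id))
    # Global in-memory sort of all (term, docID) pairs, with duplicates removed.
    pairs = sorted(set(pairs))
    # Merge identical terms into postings lists.
    index = {}
    for term, doc_id in pairs:
        index.setdefault(term, []).append(doc_id)
    return index

docs = [(1, ["oil", "price"]), (2, ["oil", "market"])]
print(naive_index(docs))  # {'market': [2], 'oil': [1, 2], 'price': [1]}
```

The full sort is the bottleneck: every pair must be held in memory before any postings list can be emitted.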

2.2 SPIMI Indexer

The SPIMI indexer (Single-Pass In-Memory Indexing) builds postings lists incrementally as tokens are read. It avoids the expensive global sort by inserting each term into a hash table and directly appending the docID to its list. For timing comparison, both indexers were tested with a limit of 10,000 term–docID pairs.
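A minimal sketch of the SPIMI insertion step, assuming the token stream yields docIDs in nondecreasing order (so the last-element check suffices to suppress duplicates); names are illustrative:

```python
def spimi_index(token_stream):
    """Single-pass indexer. token_stream: iterable of (term, doc_id) pairs."""
    index = {}
    for term, doc_id in token_stream:
        # Hash-table lookup; the postings list is created on first sight of a term.
        postings = index.setdefault(term, [])
        # Append the docID directly -- no global sort of all pairs is ever done.
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index

stream = [("oil", 1), ("price", 1), ("oil", 2), ("market", 2)]
print(spimi_index(stream))  # {'oil': [1, 2], 'price': [1], 'market': [2]}
```

Because each posting is appended as it is read, memory holds only the growing index itself, not a separate pair buffer awaiting a sort.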

3. Query Processing

The system supports Boolean retrieval with single-term queries (returning the postings list of one term) and AND queries (intersecting the postings lists of two terms).
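For AND queries, the standard linear-time merge of two sorted postings lists applies. A sketch (the function name is illustrative):

```python
def intersect(p1, p2):
    """Linear-time intersection of two sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID appears in both lists: part of the AND result.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

print(intersect([1, 3, 5, 8], [2, 3, 8, 9]))  # [3, 8]
```

A single-term query simply returns the term's postings list from the dictionary.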

4. Lossy Dictionary Compression

Five lossy preprocessing methods were implemented to analyze their impact on vocabulary size and token count:

  1. Case folding
  2. Removal of 30 stop words
  3. Removal of 150 stop words
  4. Removal of numeric tokens
  5. Stemming (Porter stemmer)
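The first four filters can be sketched as simple token-list transformations. The stop word set below is a small illustrative subset, not the project's actual 30-word list, and stemming is omitted here since it relies on an external Porter stemmer implementation (e.g. NLTK's):

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # illustrative subset only

def case_fold(tokens):
    return [t.lower() for t in tokens]

def drop_numbers(tokens):
    # Remove purely numeric tokens, allowing decimal points and comma groups.
    return [t for t in tokens if not re.fullmatch(r"\d+([.,]\d+)*", t)]

def drop_stopwords(tokens, stops):
    return [t for t in tokens if t not in stops]

tokens = ["The", "price", "of", "Oil", "rose", "3.5", "percent"]
t = drop_stopwords(drop_numbers(case_fold(tokens)), STOP_WORDS)
print(t)  # ['price', 'oil', 'rose', 'percent']
```

Each filter shrinks either the vocabulary (case folding, stemming), the postings (stop words, numbers), or both, which is what the table below quantifies.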

The table below summarizes token and vocabulary reduction percentages (Δ%) compared to the baseline. Values are generated by CompressionExperiment.py and exported as sub3_table.csv.

(distinct) terms and nonpositional postings with Δ% (relative) and T% (total vs unfiltered)

                     (distinct) terms          nonpositional postings
method               number    Δ%    T%        number       Δ%    T%
unfiltered           65,689     0     0        1,908,060     0     0
no numbers           62,787    −4    −4        1,718,513   −10   −10
case folding         46,110   −27   −30        1,595,814    −7   −16
30 stop words        46,077     0   −30        1,414,899   −11   −26
150 stop words       45,946     0   −30        1,258,849   −21   −34
stemming             35,135   −24   −47        1,197,410    −5   −37

Overall, case folding and stemming yield the largest vocabulary reduction, while stop word removal mainly reduces token count. The effects are consistent with Table 5.1 of the textbook.

5. Timing Comparison: Naive vs SPIMI

Using a 10,000-pair sample, both indexers were timed as follows:

Indexer   Pairs     Time (seconds)
Naive     10,000    0.0127
SPIMI     10,000    0.0102

SPIMI achieved approximately a 20% speed improvement over the naive indexer, as it eliminates the sorting stage. On larger datasets, the advantage of SPIMI would become more pronounced.
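A timing comparison of this kind can be run with time.perf_counter. The two indexer bodies below are compact stand-ins for the project's actual implementations, and the synthetic pair generator is purely illustrative:

```python
import random
import string
import time

def naive(pairs):
    # Sort-merge: global sort with deduplication, then merge into postings.
    index = {}
    for term, doc_id in sorted(set(pairs)):
        index.setdefault(term, []).append(doc_id)
    return index

def spimi(pairs):
    # Single pass: append each docID directly to the term's postings list.
    index = {}
    for term, doc_id in pairs:
        index.setdefault(term, []).append(doc_id)
    return index

random.seed(0)
pairs = [("".join(random.choices(string.ascii_lowercase, k=4)), d)
         for d in range(100) for _ in range(100)]  # 10,000 term-docID pairs

for name, fn in [("naive", naive), ("spimi", spimi)]:
    start = time.perf_counter()
    fn(pairs)
    print(f"{name}: {time.perf_counter() - start:.4f} s")
```

Exact timings depend on the machine and the pair distribution; the consistent finding is that SPIMI avoids the O(n log n) sort that dominates the naive indexer's cost.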

6. Retrieval Comparison

To test retrieval consistency across compression methods, several queries were executed:

Query Normal Case folding 30 stop words 150 stop words No numbers Stemming
Single: reuter
AND: oil AND price
AND: market AND stock

7. Demo and Usage

Example steps to reproduce the experiments:

  1. Run python main.py to build the index and launch interactive query chat.
  2. Run CompressionExperiment.compression_table("reuters21578") to produce sub3_table.csv.
  3. Observe timings for naive and SPIMI (limit = 10,000 pairs) in the console output.

8. Discussion and Future Work

9. Academic Integrity and AI Disclosure

This report template and narrative were partially generated with the help of ChatGPT (OpenAI GPT-5) for formatting and summarization. All code, experiments, and analysis were performed by the author.