COMP 479/6791 Project 1 Report

Naive and SPIMI Indexers, Query Processing, and Lossy Dictionary Compression

Fall 2025 – Reuters-21578 Dataset

Abstract

This project implements a complete information retrieval pipeline, including a naive indexer, a SPIMI indexer, Boolean query processing (single and AND queries), and several lossy dictionary compression techniques. We evaluate their efficiency and compression effects on the Reuters-21578 collection and compare SPIMI’s performance to the naive approach on a controlled set of 10,000 term–docID pairs.

1. Data and Pre-Processing

2. Implementation Overview

2.1 Naive Indexer

The naive indexer builds an inverted index using a sort-merge strategy. It collects (term, docID) pairs for each token, sorts them, removes duplicates, and merges identical terms into postings lists. This method is straightforward but requires a complete sort of all pairs in memory.
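The sort-merge strategy above can be sketched in a few lines of Python. Function and variable names here are illustrative, not the author's actual code:

```python
def naive_index(docs):
    """Naive sort-merge indexer. docs: iterable of (doc_id, token_list)."""
    pairs = []
    for doc_id, tokens in docs:
        for term in tokens:
            pairs.append((term, doc_id))
    # Global in-memory sort of all (term, docID) pairs, with duplicates removed.
    pairs = sorted(set(pairs))
    # Merge identical terms into postings lists.
    index = {}
    for term, doc_id in pairs:
        index.setdefault(term, []).append(doc_id)
    return index

docs = [(1, ["oil", "price"]), (2, ["oil", "market"])]
print(naive_index(docs))  # {'market': [2], 'oil': [1, 2], 'price': [1]}
```

The full sort is the bottleneck: every pair must be held in memory before any postings list can be emitted.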

2.2 SPIMI Indexer

The SPIMI indexer (Single-Pass In-Memory Indexing) builds postings lists incrementally as tokens are read. It avoids the expensive global sort by inserting each term into a hash table and directly appending the docID to its list. For timing comparison, both indexers were tested with a limit of 10,000 term–docID pairs.
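A minimal sketch of the SPIMI insertion step, assuming the token stream yields docIDs in nondecreasing order (so the last-element check suffices to suppress duplicates); names are illustrative:

```python
def spimi_index(token_stream):
    """Single-pass indexer. token_stream: iterable of (term, doc_id) pairs."""
    index = {}
    for term, doc_id in token_stream:
        # Hash-table lookup; the postings list is created on first sight of a term.
        postings = index.setdefault(term, [])
        # Append the docID directly -- no global sort of all pairs is ever done.
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index

stream = [("oil", 1), ("price", 1), ("oil", 2), ("market", 2)]
print(spimi_index(stream))  # {'oil': [1, 2], 'price': [1], 'market': [2]}
```

Because each posting is appended as it is read, memory holds only the growing index itself, not a separate pair buffer awaiting a sort.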

3. Query Processing

The system supports Boolean retrieval with single-term queries (returning the postings list of one term) and AND queries (intersecting the postings lists of two terms).
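For AND queries, the standard linear-time merge of two sorted postings lists applies. A sketch (the function name is illustrative):

```python
def intersect(p1, p2):
    """Linear-time intersection of two sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID appears in both lists: part of the AND result.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

print(intersect([1, 3, 5, 8], [2, 3, 8, 9]))  # [3, 8]
```

A single-term query simply returns the term's postings list from the dictionary.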

4. Lossy Dictionary Compression

Five lossy preprocessing methods were implemented to analyze their impact on vocabulary size and token count:

  1. Case folding
  2. Removal of 30 stop words
  3. Removal of 150 stop words
  4. Removal of numeric tokens
  5. Stemming (Porter stemmer)
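The first four filters can be sketched as simple token-list transformations. The stop word set below is a small illustrative subset, not the project's actual 30-word list, and stemming is omitted here since it relies on an external Porter stemmer implementation (e.g. NLTK's):

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # illustrative subset only

def case_fold(tokens):
    return [t.lower() for t in tokens]

def drop_numbers(tokens):
    # Remove purely numeric tokens, allowing decimal points and comma groups.
    return [t for t in tokens if not re.fullmatch(r"\d+([.,]\d+)*", t)]

def drop_stopwords(tokens, stops):
    return [t for t in tokens if t not in stops]

tokens = ["The", "price", "of", "Oil", "rose", "3.5", "percent"]
t = drop_stopwords(drop_numbers(case_fold(tokens)), STOP_WORDS)
print(t)  # ['price', 'oil', 'rose', 'percent']
```

Each filter shrinks either the vocabulary (case folding, stemming), the postings (stop words, numbers), or both, which is what the table below quantifies.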

The table below summarizes token and vocabulary reduction percentages (Δ%) compared to the baseline. Values are generated by CompressionExperiment.py and exported as sub3_table.csv.

(distinct) terms and nonpositional postings with Δ% (relative) and T% (total vs unfiltered)

                     (distinct) terms          nonpositional postings
method               number    Δ%    T%        number       Δ%    T%
unfiltered           65,689     0     0        1,908,060     0     0
no numbers           62,787    −4    −4        1,718,513   −10   −10
case folding         46,110   −27   −30        1,595,814    −7   −16
30 stop words        46,077     0   −30        1,414,899   −11   −26
150 stop words       45,946     0   −30        1,258,849   −21   −34
stemming             35,135   −24   −47        1,197,410    −5   −37

Overall, case folding and stemming yield the largest vocabulary reduction, while stop word removal mainly reduces token count. The effects are consistent with Table 5.1 of the textbook.

5. Timing Comparison: Naive vs SPIMI

Using a 10,000-pair sample, both indexers were timed as follows:

Indexer   Pairs     Time (seconds)
Naive     10,000    0.0127
SPIMI     10,000    0.0102

SPIMI achieved approximately a 20% speed improvement over the naive indexer, as it eliminates the sorting stage. On larger datasets, the advantage of SPIMI would become more pronounced.
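A timing comparison of this kind can be run with time.perf_counter. The two indexer bodies below are compact stand-ins for the project's actual implementations, and the synthetic pair generator is purely illustrative:

```python
import random
import string
import time

def naive(pairs):
    # Sort-merge: global sort with deduplication, then merge into postings.
    index = {}
    for term, doc_id in sorted(set(pairs)):
        index.setdefault(term, []).append(doc_id)
    return index

def spimi(pairs):
    # Single pass: append each docID directly to the term's postings list.
    index = {}
    for term, doc_id in pairs:
        index.setdefault(term, []).append(doc_id)
    return index

random.seed(0)
pairs = [("".join(random.choices(string.ascii_lowercase, k=4)), d)
         for d in range(100) for _ in range(100)]  # 10,000 term-docID pairs

for name, fn in [("naive", naive), ("spimi", spimi)]:
    start = time.perf_counter()
    fn(pairs)
    print(f"{name}: {time.perf_counter() - start:.4f} s")
```

Exact timings depend on the machine and the pair distribution; the consistent finding is that SPIMI avoids the O(n log n) sort that dominates the naive indexer's cost.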

6. Retrieval Comparison

To test retrieval consistency across compression methods, several queries were executed:

Query Normal Case folding 30 stop words 150 stop words No numbers Stemming
Single: reuter
AND: oil AND price
AND: market AND stock

7. Demo and Usage

Example steps to reproduce the experiments:

  1. Run python main.py to build the index and launch interactive query chat.
  2. Run CompressionExperiment.compression_table("reuters21578") to produce sub3_table.csv.
  3. Observe timings for naive and SPIMI (limit = 10,000 pairs) in the console output.

8. Discussion and Future Work

9. Academic Integrity and AI Disclosure

This report template and narrative were partially generated with the help of ChatGPT (OpenAI GPT-5) for formatting and summarization. All code, experiments, and analysis were performed by the author.