Naive and SPIMI Indexers, Query Processing, and Lossy Dictionary Compression
Fall 2025 – Reuters-21578 Dataset
This project implements a complete information retrieval pipeline, including a naive indexer, a SPIMI indexer, Boolean query processing (single and AND queries), and several lossy dictionary compression techniques. We evaluate their efficiency and compression effects on the Reuters-21578 collection and compare SPIMI’s performance to the naive approach on a controlled set of 10,000 term–docID pairs.
Documents are identified by their NEWID attribute, and their text is extracted from the <TEXT> tag. The naive indexer builds an inverted index using a sort-merge strategy: it collects a (term, docID) pair for each token, sorts all pairs, removes duplicates, and merges identical terms into postings lists. This method is straightforward but requires a complete sort of all pairs in memory.
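A minimal sketch of this sort-merge strategy, using an illustrative in-memory document format (the function name and input shape are assumptions, not the project's actual code):

```python
def naive_index(documents):
    """Sort-merge indexing: collect every (term, docID) pair, sort them
    all, deduplicate, and merge identical terms into postings lists.
    `documents` maps docID -> list of tokens (illustrative format)."""
    pairs = []
    for doc_id, tokens in documents.items():
        for term in tokens:
            pairs.append((term, doc_id))
    pairs.sort()  # global sort of all pairs: the expensive step
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:  # skip duplicates
            postings.append(doc_id)
    return index
```

For example, `naive_index({1: ["oil", "price"], 2: ["oil", "market"]})` yields the postings lists `{"market": [2], "oil": [1, 2], "price": [1]}`.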
The SPIMI indexer (Single-Pass In-Memory Indexing) builds postings lists incrementally as tokens are read. It avoids the expensive global sort by inserting each term into a hash table and directly appending the docID to its list. For timing comparison, both indexers were tested with a limit of 10,000 term–docID pairs.
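A hedged sketch of the SPIMI idea, using a Python dict as the in-memory hash table (the function name and input format are illustrative):

```python
def spimi_index(token_stream):
    """SPIMI: a single pass over (term, docID) pairs with no global sort.
    Each term is looked up in a hash table (here a dict) and the docID
    is appended directly to its postings list."""
    index = {}
    for term, doc_id in token_stream:
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:  # skip repeats within a doc
            postings.append(doc_id)
    # The dictionary is sorted once at the end (at block write-out time
    # in full SPIMI), rather than sorting every pair up front.
    return {term: index[term] for term in sorted(index)}
```

Because only the (much smaller) vocabulary is sorted, not the full pair list, this avoids the dominant cost of the naive approach.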
The system supports Boolean retrieval with single-term queries and conjunctive (AND) queries over two terms.
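AND queries can be answered with the standard merge-intersection of two sorted postings lists; a sketch of that algorithm (not necessarily the project's exact implementation):

```python
def intersect(p1, p2):
    """Merge-intersect two sorted postings lists in O(len(p1) + len(p2))."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:        # docID in both lists
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:       # advance the pointer behind
            i += 1
        else:
            j += 1
    return answer
```

For example, `intersect([1, 3, 5, 8], [2, 3, 8, 9])` returns `[3, 8]`.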
Five lossy preprocessing methods were implemented to analyze their impact on vocabulary size and token count:
The table below reports, for each preprocessing step, the size reduction relative to the preceding row (Δ%) and the cumulative reduction relative to the unfiltered baseline (T%).
Values are generated by CompressionExperiment.py and exported as sub3_table.csv.
| method | distinct terms | Δ% | T% | nonpositional postings | Δ% | T% |
|---|---|---|---|---|---|---|
| unfiltered | 65,689 | 0 | 0 | 1,908,060 | 0 | 0 |
| no numbers | 62,787 | −4 | −4 | 1,718,513 | −10 | −10 |
| case folding | 46,110 | −27 | −30 | 1,595,814 | −7 | −16 |
| 30 stop words | 46,077 | 0 | −30 | 1,414,899 | −11 | −26 |
| 150 stop words | 45,946 | 0 | −30 | 1,258,849 | −21 | −34 |
| stemming | 35,135 | −24 | −47 | 1,197,410 | −5 | −37 |
Overall, case folding and stemming yield the largest vocabulary reduction, while stop word removal mainly reduces the postings (token) count. The effects are consistent with Table 5.1 of the textbook (Manning, Raghavan & Schütze, Introduction to Information Retrieval).
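For illustration, the lossy filters can be sketched as simple token-list transformations. The stop-word set shown is a tiny illustrative subset, and `toy_stem` is a crude suffix-stripping stand-in for the real Porter stemmer used in the experiments:

```python
STOP_30 = {"the", "a", "an", "of", "and", "to", "in"}  # illustrative subset

def no_numbers(tokens):
    """Drop any token containing a digit."""
    return [t for t in tokens if not any(c.isdigit() for c in t)]

def case_fold(tokens):
    """Lowercase every token, merging e.g. 'Oil' and 'oil'."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens, stop_set):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in stop_set]

def toy_stem(tokens):
    """Toy stand-in for a Porter stemmer: strip a few common suffixes."""
    out = []
    for t in tokens:
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        out.append(t)
    return out
```

Applied cumulatively in the table's row order, each filter shrinks the vocabulary and postings further, which is what the T% column tracks.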
Using a 10,000-pair sample, both indexers were timed as follows:
| Indexer | Pairs | Time (seconds) |
|---|---|---|
| Naive | 10,000 | 0.0127 |
| SPIMI | 10,000 | 0.0102 |
SPIMI achieved approximately a 20% speed improvement over the naive indexer, as it eliminates the sorting stage. On larger datasets, the advantage of SPIMI would become more pronounced.
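A sketch of how such a timing comparison can be set up. The workload generation, function names, and the deliberately simplified indexers below are illustrative assumptions, not the project's code:

```python
import random
import time

def time_indexer(build_fn, pairs):
    """Wall-clock one indexer run over a fixed list of (term, docID) pairs."""
    start = time.perf_counter()
    build_fn(pairs)
    return time.perf_counter() - start

# Synthetic workload: 10,000 pairs, mirroring the experiment's sample size.
random.seed(42)
pairs = [(f"term{random.randrange(2000)}", random.randrange(500))
         for _ in range(10_000)]

def naive_build(pairs):
    """Sort-merge baseline: global sort before merging into postings."""
    index = {}
    for term, doc_id in sorted(pairs):
        index.setdefault(term, []).append(doc_id)
    return index

def spimi_build(pairs):
    """SPIMI-style: single pass, append directly, no global sort."""
    index = {}
    for term, doc_id in pairs:
        index.setdefault(term, []).append(doc_id)
    return index

print(f"naive: {time_indexer(naive_build, pairs):.4f}s")
print(f"SPIMI: {time_indexer(spimi_build, pairs):.4f}s")
```

Both builders should produce the same vocabulary; only the time spent differs, and the gap widens as the pair count grows.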
To test retrieval consistency across compression methods, several queries were executed:
| Query | Normal | Case folding | 30 stop words | 150 stop words | No numbers | Stemming |
|---|---|---|---|---|---|---|
| Single: reuter | … | … | … | … | … | … |
| AND: oil AND price | … | … | … | … | … | … |
| AND: market AND stock | … | … | … | … | … | … |
Example steps to reproduce the experiments:
1. Run python main.py to build the index and launch the interactive query chat.
2. Run CompressionExperiment.compression_table("reuters21578") to produce sub3_table.csv.

This report template and narrative were partially generated with the help of ChatGPT (OpenAI GPT-5) for formatting and summarization. All code, experiments, and analysis were performed by the author.