COMP 479/6791

Information Retrieval and Web Search
















Project report

on

Project 2: Spectrum Crawler, Indexer & Document Clustering



Course Instructor:

Sabine Bergler, PhD


















author:

qiantongzhou-40081938

1. Introduction

The objective of this project is to build a small information retrieval and clustering system on top of Concordia’s open access repository, Spectrum (spectrum.library.concordia.ca). Starting from a seed URL, I carried out the following steps:

Crawl and download up to a fixed number of PDF and HTML pages within the allowed domains.

Extract textual content from PDF documents and build an inverted index.

Define queries for sustainability and waste and construct a combined My-collection.

Represent documents using TF–IDF and cluster My-collection using k-means for k = 2, 10, 20.

2. System Design

The system is implemented in Python and is organized into three main components: a crawler, an indexer, and a clustering module. The data flow is illustrated in Figure 1.

Figure 1. System Pipeline Diagram
Crawler (crawler.py: seed URL + robots) → Inverted Index (Indexer.py → inverted_index.json) → My-collection (“sustainability” ∪ “waste”) → k-means (cluster.py: k = 2, 10, 20) → Cluster labels & top terms

The crawler (crawler.py) starts from the Spectrum seed URL and uses a breadth-first search (BFS) strategy with a queue of URLs. It enforces both domain restrictions and robot exclusion rules.
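A minimal sketch of such a BFS crawl loop is given here, under stated assumptions: the function name bfs_crawl, the fetch_links callback, and the example.org URLs are hypothetical and not taken from crawler.py. The callback stands in for the HTTP download and link extraction, so the sketch runs offline. Robot exclusion is checked with the standard library's urllib.robotparser.

```python
from collections import deque
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def bfs_crawl(seed, fetch_links, allowed_domains, robots, max_pages, agent="*"):
    """Visit pages breadth-first, skipping off-domain and robots-disallowed URLs."""
    queue = deque([seed])
    seen = {seed}          # URLs already queued, to avoid revisiting
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if urlparse(url).netloc not in allowed_domains:
            continue       # domain restriction
        if not robots.can_fetch(agent, url):
            continue       # robot exclusion rule
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph standing in for real HTTP requests (made-up URLs).
GRAPH = {
    "https://example.org/": [
        "https://example.org/a.pdf",
        "https://example.org/private/x",
        "https://other.net/",
    ],
    "https://example.org/a.pdf": [],
}

# Hypothetical robots.txt content, parsed from memory instead of the network.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

pages = bfs_crawl(
    "https://example.org/",
    lambda url: GRAPH.get(url, []),
    allowed_domains={"example.org"},
    robots=robots,
    max_pages=10,
)
print(pages)
```

In this toy run only the seed page and the PDF are visited: the /private/ URL is rejected by the robots rules and other.net by the domain filter.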

Component | Library | Role
Crawler (crawler.py) | requests, BeautifulSoup | BFS crawl within allowed domains, obeys robots.txt, discovers PDFs and HTML pages.
PDF Extractor | PyPDF2 | Reads each page of PDF files and extracts raw text for indexing.
HTML Text Reader | BeautifulSoup | Extracts visible text for link discovery.
Indexer (Indexer.py) | Custom InvertedIndex class | Tokenizes and filters text, updates term → doc postings, maintains doc lengths, saves index to JSON.
Clusterer (cluster.py) | scikit-learn KMeans, TfidfVectorizer | Builds TF–IDF features and clusters My-collection for different k values.

The InvertedIndex class maintains:
A mapping from integer docid to URL, and URL back to docid.
An index: term : {docid : term_frequency}.
Document lengths: total valid tokens per document.
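The three structures listed above could be laid out roughly as follows. This is a sketch only: the class name InvertedIndexSketch, its method names, and the JSON layout are assumptions for illustration and may differ from the actual InvertedIndex class in Indexer.py.

```python
import json
from collections import Counter, defaultdict


class InvertedIndexSketch:
    """Illustrative layout of the docid maps, postings, and document lengths."""

    def __init__(self):
        self.doc_to_url = {}              # docid -> URL
        self.url_to_doc = {}              # URL -> docid
        self.index = defaultdict(dict)    # term -> {docid: term_frequency}
        self.doc_lengths = {}             # docid -> number of valid tokens

    def add_document(self, url, tokens):
        """Index an already tokenized document and return its docid."""
        if url in self.url_to_doc:
            return self.url_to_doc[url]   # already indexed, nothing to do
        docid = len(self.doc_to_url)
        self.doc_to_url[docid] = url
        self.url_to_doc[url] = docid
        for term, tf in Counter(tokens).items():
            self.index[term][docid] = tf
        self.doc_lengths[docid] = len(tokens)
        return docid

    def to_json(self):
        """Serialize the index; note that JSON turns integer docids into strings."""
        return json.dumps(
            {
                "docs": self.doc_to_url,
                "index": self.index,
                "doc_lengths": self.doc_lengths,
            }
        )


idx = InvertedIndexSketch()
idx.add_document("https://example.org/a.pdf", ["waste", "waste", "energy"])
idx.add_document("https://example.org/b.pdf", ["energy", "art"])
print(dict(idx.index))
```

Because JSON object keys are always strings, a loader for such a file would need to convert docids back to integers.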

Tokenization and filtering include:
Lowercasing and regex-based extraction of alphabetic tokens with length 3–25.
Removal of tokens with no vowels, which reduces PDF extraction noise and serves as a rough check that a token is an English word.
Removal of purely numeric tokens.
A stop word list (“the”, “and”, “with”, ...) to filter very common non-content terms.
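The filtering rules above can be sketched as a single function. The regular expression, the choice of a, e, i, o, u as the vowel set, and the function name tokenize are assumptions. The stop word set contains only the three words quoted in this report, since the full list is not reproduced here.

```python
import re

# Alphabetic tokens of length 3 to 25; digits and mixed tokens never match.
TOKEN_RE = re.compile(r"\b[a-z]{3,25}\b")
VOWELS = set("aeiou")                    # assumption: "y" is not counted
STOP_WORDS = {"the", "and", "with"}      # partial list, for illustration only


def tokenize(text):
    """Lowercase the text and keep tokens that pass all filters."""
    tokens = []
    for tok in TOKEN_RE.findall(text.lower()):
        if tok in STOP_WORDS:
            continue                     # very common non-content term
        if not VOWELS.intersection(tok):
            continue                     # no vowel: likely PDF noise
        tokens.append(tok)
    return tokens


sample = "The waste and SUSTAINABILITY report, with xkcd qzx 2024 ab"
print(tokenize(sample))
```

With a purely alphabetic pattern like this one, numeric tokens are already excluded at the extraction step, so a separate numeric filter only matters if the extraction pattern is broader.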

Statistic | Value | Notes
Total indexed documents | 194 | After applying domain, robots, and PDF constraints.
Unique terms in index | ≈ 54,438 | After token filtering and stop word removal.
Index storage | inverted_index.json | JSON representation, so the index can be reloaded without recrawling.

3. Collection Construction Results

With the crawl limit set to 500, a helper function returns all documents that contain at least one term including a given substring. Applying it to the substrings “sustainability” and “waste” gives the following collections:

Collection | Definition | Documents (crawl limit 500)
Sustainability | Docs with at least one term containing sustainability. | 32
Waste | Docs with at least one term containing waste. | 28
Sustainability ∩ Waste | Docs in both sustainability and waste sets. | 28
My-collection | Union of sustainability and waste documents. | 42
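The substring lookup and the set operations behind this table can be sketched as follows. The helper name docs_with_substring and the toy index are hypothetical. The toy postings are made up and do not reproduce the counts reported above.

```python
def docs_with_substring(index, substring):
    """Return the docids that have at least one term containing `substring`."""
    docs = set()
    for term, postings in index.items():
        if substring in term:
            docs.update(postings)        # postings keys are docids
    return docs


# Toy index in the form term -> {docid: term_frequency} (made-up data).
toy_index = {
    "sustainability": {0: 3, 1: 1},
    "unsustainability": {2: 1},
    "wastewater": {1: 2, 3: 4},
    "energy": {0: 1, 4: 2},
}

sustainability_docs = docs_with_substring(toy_index, "sustainability")
waste_docs = docs_with_substring(toy_index, "waste")
both = sustainability_docs & waste_docs
my_collection = sustainability_docs | waste_docs

print(sorted(sustainability_docs), sorted(waste_docs))
print(sorted(both), sorted(my_collection))

# Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
assert len(my_collection) == len(sustainability_docs) + len(waste_docs) - len(both)
```

Substring matching also picks up compound and derived terms such as “wastewater” or “unsustainability”, which makes the collections broader than exact term matching would. The final assertion is a quick consistency check that can be applied to any set of collection sizes.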

4. Clustering Methodology

For clustering, each document’s text is reconstructed from the index by repeating tokens according to their term frequencies. We then apply TfidfVectorizer from scikit-learn with max_df = 0.5, so terms appearing in more than 50% of documents are discarded. The vectorizer’s default tokenization parameters are applied on top of our already filtered tokens.
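The reconstruction step and the effect of max_df = 0.5 can be illustrated without any dependencies. The sketch below is not the code in cluster.py: the function names and the toy index are assumptions, and the second function only mimics the document-frequency cut that TfidfVectorizer applies, which drops terms whose document frequency is strictly higher than max_df times the number of documents.

```python
def rebuild_texts(index, doc_ids):
    """Rebuild a bag-of-words string per document by repeating each term tf times."""
    words = {d: [] for d in doc_ids}
    for term in sorted(index):
        for docid, tf in index[term].items():
            if docid in words:
                words[docid].extend([term] * tf)
    return {d: " ".join(w) for d, w in words.items()}


def terms_kept_by_max_df(index, doc_ids, max_df=0.5):
    """Keep terms whose document frequency is at most max_df * number of docs."""
    n_docs = len(doc_ids)
    kept = []
    for term, postings in index.items():
        df = sum(1 for d in postings if d in doc_ids)
        if 0 < df <= max_df * n_docs:
            kept.append(term)
    return sorted(kept)


# Toy index in the form term -> {docid: term_frequency} (made-up data).
toy_index = {
    "waste": {0: 2, 1: 1, 2: 1},   # in 3 of 4 docs: dropped by max_df = 0.5
    "energy": {0: 1, 3: 2},        # in 2 of 4 docs: kept
    "art": {1: 1},                 # in 1 of 4 docs: kept
}
collection = {0, 1, 2, 3}

texts = rebuild_texts(toy_index, collection)
kept_terms = terms_kept_by_max_df(toy_index, collection)
print(texts)
print(kept_terms)
```

The original word order is lost in the rebuilt strings, which does not matter for TF–IDF because it is a bag-of-words representation.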

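Two TF–IDF variants are used: global IDF, computed over the full 194-document index, and local IDF, computed only on the 42 documents in My-collection.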

k-means Clustering Settings

We apply k-means clustering to My-collection under three settings:

k = 2, k = 10, k = 20

For each cluster, we sort the center’s TF–IDF weights and print the top 50 terms. These terms serve as cluster labels and are recorded in the provided 500_docs_50_terms_cluster.txt file.
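The labelling step can be sketched as a sort over one cluster center. The function name top_terms and the toy weights are made up for illustration. With scikit-learn, the center rows would presumably come from KMeans.cluster_centers_ and the vocabulary from TfidfVectorizer.get_feature_names_out(), but that wiring is an assumption about cluster.py rather than something shown in this report.

```python
def top_terms(center, feature_names, n=50):
    """Return the n terms with the largest weights in one cluster center."""
    ranked = sorted(
        zip(center, feature_names),
        key=lambda pair: pair[0],      # sort by weight only
        reverse=True,
    )
    return [name for _, name in ranked[:n]]


# Toy center vector aligned with a toy vocabulary (made-up weights).
features = ["art", "energy", "labour", "waste"]
center = [0.1, 0.7, 0.0, 0.4]

labels = top_terms(center, features, n=2)
print(labels)
```

Sorting all weights is simple and fast enough for a vocabulary of this size; a partial sort would only matter for much larger vocabularies.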

5. Clustering Results

The detailed results can be found in 500_docs_50_terms_cluster.txt.

Global TF–IDF, k = 2

With k = 2 and global IDF, My-collection is split into two broad thematic clusters:

Cluster | Representative Terms | Interpretation
Cluster 0 | energy, bus, vehicles, emissions, electric, hydrogen, fuel, transportation, pricing, consumption, wildlife, species, microclimate | Technical and environmental topics such as energy systems, transportation, environmental impacts, wildlife and ecology.
Cluster 1 | art, humour, digital, preservation, hollywood, writers, union, labour, students, community, design, project | Arts, media, social and labour issues, including Hollywood labour relations and broader cultural projects.

At k = 2, the system roughly separates technical and environmental modeling documents from arts, media, and policy documents that still intersect with sustainability and waste.

Global TF–IDF, k = 10

At k = 10, clusters become more topic-specific. Selected examples are summarized below:

Cluster Theme | Representative Terms | Description
Hollywood Strikes & AI | hollywood, sag, aftra, wga, writers, strike, streaming, residuals, studios, generative, ai, actors | Documents analyzing recent Hollywood writers’ and actors’ strikes, the streaming economy, and the role of generative AI.
Pharmaceutical Patents | patent, pharmaceutical, medicine, novartis, trips, trade, innovation, law, health, advocacy | Work focusing on intellectual property, global health, and access to medicines.
Green Building & Urban Sustainability | greenbuilding, emissions, environmental, sustainable, building, urban, development, reduce, design | Research on sustainable buildings, urban design, and emission reduction strategies.
Ecology & Wildlife | wildlife, species, habitat, temperature, microclimate, hotspots, monitoring | Environmental and ecological studies, often linking waste or sustainability to ecosystem effects.

Global TF–IDF, k = 20

With k = 20, clusters become too small: with only 42 documents in My-collection, 20 clusters contain about two documents each on average.

Local vs Global IDF

When TF–IDF is computed only on the 42 documents in My-collection, we observe that terms that are frequent across the full 194-document index but not particularly common inside My-collection can gain weight and show up as top cluster terms.
Very common sustainability-related words within My-collection are down-weighted, which emphasizes more subtle differences between sustainability papers, for example green building vs. wildlife vs. energy systems.
Local-IDF clusters tend to highlight collection specific vocabulary, while global-IDF clusters highlight terms that distinguish sustainability documents from the broader Spectrum corpus.
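The two effects described above follow directly from the IDF formula. The sketch below uses the smoothed IDF that scikit-learn's TfidfVectorizer applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The collection sizes 42 and 194 come from this report, while the document frequencies are made-up values chosen only to illustrate the direction of each effect.

```python
import math


def smoothed_idf(df, n_docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1


N_LOCAL, N_GLOBAL = 42, 194   # My-collection vs. full index (from the report)

# Hypothetical term A: in 40 local docs and in no other indexed document.
a_local = smoothed_idf(40, N_LOCAL)
a_global = smoothed_idf(40, N_GLOBAL)

# Hypothetical term B: in 5 local docs but in 100 docs of the full index.
b_local = smoothed_idf(5, N_LOCAL)
b_global = smoothed_idf(100, N_GLOBAL)

print(f"term A: local {a_local:.3f}, global {a_global:.3f}")
print(f"term B: local {b_local:.3f}, global {b_global:.3f}")
```

Term A behaves like a common sustainability word: it is almost uninformative inside My-collection but distinctive against the full index. Term B is frequent across the full index yet rare inside My-collection, so it gains weight under local IDF.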

6. Conclusion

Effect of k

Overall, the project demonstrates how crawling, indexing, and clustering can be combined to explore thematic structure in a real academic repository, with a particular focus on sustainability and waste related research.

End of report.