Spectrum Crawler, Indexer, and Clustering Report

1. Introduction

The objective of this project is to build a small information retrieval and clustering system on top of Spectrum, Concordia’s open-access repository (spectrum.library.concordia.ca). Starting from a seed URL, I carried out the following steps:

Crawl and download up to a fixed number of PDF and HTML pages within the allowed domains.

Extract textual content from PDF documents and build an inverted index.

Define queries for sustainability and waste and construct a combined My-collection.

Represent documents using TF–IDF and cluster My-collection using k-means for k = 2, 10, 20.

2. System Design

The system is implemented in Python and is organized into three main components: a crawler, an indexer, and a clustering module. The data flow is illustrated in Figure 1.

Figure 1. System Pipeline Diagram

The crawler (crawler.py) starts from the Spectrum seed URL and uses a breadth-first search (BFS) strategy with a queue of URLs. It enforces both domain restrictions and robot exclusion rules:

Allowed hosts: spectrum.library.concordia.ca and library.concordia.ca.
Robots: uses RobotFileParser to parse robots.txt and checks can_fetch() before requesting any URL.
File limit: a maximum number of files (500 in my code) is enforced through a counter passed into crawlandindex(seed, max_files).
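The crawl loop described above can be sketched as follows. This is a minimal illustration, not the code in crawler.py: the function name bfs_crawl and the get_links hook (which stands in for fetching a page and extracting its links) are hypothetical, and the sketch counts every visited URL against the limit, which may differ from how the real counter works. The robots rules are parsed from in-memory lines so the example runs without network access.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser

ALLOWED_HOSTS = {"spectrum.library.concordia.ca", "library.concordia.ca"}


def bfs_crawl(seed, max_files, get_links, robots, user_agent="*"):
    """Visit URLs breadth-first and return them in visit order.

    get_links(url) is a hypothetical hook returning the hrefs found at url.
    """
    queue = deque([seed])
    seen = {seed}
    visited = []
    while queue and len(visited) < max_files:
        url = queue.popleft()
        # Domain restriction: skip anything outside the allowed hosts.
        if urlparse(url).hostname not in ALLOWED_HOSTS:
            continue
        # Robot exclusion: ask the parsed robots.txt before "requesting".
        if not robots.can_fetch(user_agent, url):
            continue
        visited.append(url)
        for href in get_links(url):
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Toy link graph standing in for real HTTP responses.
SEED = "https://spectrum.library.concordia.ca/"
SITE = {
    SEED: ["/a.pdf", "/private/b.pdf", "https://example.com/c.pdf"],
    SEED + "a.pdf": ["/d.pdf"],
}
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
print(bfs_crawl(SEED, 500, lambda u: SITE.get(u, []), robots))
```

In the toy graph, the disallowed /private/ URL and the off-domain URL are discovered but never visited, while the remaining pages are visited level by level.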

Component	Library	Role
Crawler (crawler.py)	requests, BeautifulSoup	BFS crawl within allowed domains, obeys robots.txt, discovers PDFs and HTML pages.
PDF Extractor	PyPDF2	Reads each page of PDF files and extracts raw text for indexing.
HTML Text Reader	BeautifulSoup	Extracts visible text for link discovery.
Indexer (Indexer.py)	Custom InvertedIndex class	Tokenizes and filters text, updates term-to-document postings, maintains document lengths, saves index to JSON.
Clusterer (cluster.py)	scikit-learn KMeans, TfidfVectorizer	Builds TF–IDF features and clusters My-collection for different k values.

The InvertedIndex class maintains:
A mapping from integer docid to URL, and URL back to docid.
An index mapping each term to its postings: term → {docid: term_frequency}.
Document lengths: total valid tokens per document.
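The three structures listed above could look roughly like the sketch below. Only the class name InvertedIndex comes from the project; the attribute names and the add_document method are assumptions made for illustration.

```python
from collections import Counter, defaultdict


class InvertedIndex:
    """Illustrative sketch; attribute and method names are assumed."""

    def __init__(self):
        self.doc_ids = {}                # URL -> docid
        self.urls = {}                   # docid -> URL
        self.index = defaultdict(dict)   # term -> {docid: term_frequency}
        self.doc_lengths = {}            # docid -> number of valid tokens

    def add_document(self, url, tokens):
        """Index an already tokenized document and return its docid."""
        if url in self.doc_ids:          # do not index the same URL twice
            return self.doc_ids[url]
        docid = len(self.urls)
        self.doc_ids[url] = docid
        self.urls[docid] = url
        counts = Counter(tokens)
        for term, tf in counts.items():
            self.index[term][docid] = tf
        self.doc_lengths[docid] = sum(counts.values())
        return docid


idx = InvertedIndex()
idx.add_document("https://example.org/a.pdf", ["waste", "waste", "energy"])
idx.add_document("https://example.org/b.pdf", ["waste", "policy"])
print(dict(idx.index))
print(idx.doc_lengths)
```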

Tokenization and filtering include:
Lowercasing and regex-based extraction of alphabetic tokens with length 3–25.
Removal of tokens with no vowels, which reduces PDF extraction noise and serves as a rough check that a token is an English word.
Removal of purely numeric tokens.
Removal of stop words (“the”, “and”, “with”, ...) to filter very common non-content terms.
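A minimal sketch of these filtering rules is shown below. The function name tokenize, the exact regular expression, and the choice of "aeiou" as the vowel set are assumptions; the stop word set contains only the three words named above, not the full list used in the project. Because the pattern accepts alphabetic tokens only, purely numeric tokens are never produced in this sketch.

```python
import re

# Alphabetic tokens of length 3 to 25, delimited by word boundaries.
TOKEN_RE = re.compile(r"\b[a-z]{3,25}\b")
VOWELS = set("aeiou")                    # assumed vowel set
STOP_WORDS = {"the", "and", "with"}      # partial list, for illustration only


def tokenize(text):
    """Lowercase, extract alphabetic tokens, drop vowel-free tokens and stop words."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if VOWELS & set(t) and t not in STOP_WORDS]


print(tokenize("The WASTE and xkcd of 2023 sustainability, with pdf-noise qzx!"))
```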

Statistic	Value	Notes
Total indexed documents	194	After applying domain, robots, and PDF constraints.
Unique terms in index	≈ 54,438	After token filtering and stop word removal.
Index storage	inverted_index.json	JSON representation, so the index can be reloaded without recrawling.
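One detail worth keeping in mind when storing the index as JSON is that JSON object keys are always strings, so integer docids come back as strings unless they are converted on load. The sketch below illustrates one way to handle this; the function names save_index and load_index and the file layout are hypothetical, not taken from the project code.

```python
import json


def save_index(path, index, urls, doc_lengths):
    """Write the postings, docid-to-URL map, and document lengths to one JSON file."""
    payload = {"index": index, "urls": urls, "doc_lengths": doc_lengths}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)


def load_index(path):
    """Reload the index and convert docid keys back to integers."""
    with open(path, encoding="utf-8") as fh:
        payload = json.load(fh)
    index = {
        term: {int(docid): tf for docid, tf in postings.items()}
        for term, postings in payload["index"].items()
    }
    urls = {int(docid): url for docid, url in payload["urls"].items()}
    doc_lengths = {int(docid): n for docid, n in payload["doc_lengths"].items()}
    return index, urls, doc_lengths
```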

3. Collection Construction Results

With the crawl limit set to 500, a helper function returns all documents that contain at least one term including a given substring. Applying it to the substrings sustainability and waste gives the following collections:

Sustainability collection: docs where some term contains sustainability.
Waste collection: docs where some term contains waste.
Intersection: docs appearing in both previous sets.
My-collection: union of the sustainability and waste collections.
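The four collections above reduce to a substring scan over the index vocabulary followed by set operations. The sketch below assumes the index is a plain dictionary of the form term → {docid: term_frequency}; the helper name docs_with_substring is hypothetical. Note that substring matching also picks up longer terms such as "wastewater".

```python
def docs_with_substring(index, substring):
    """Return the docids of all documents with a term containing the substring."""
    docs = set()
    for term, postings in index.items():
        if substring in term:
            docs.update(postings)        # adds the docid keys of the postings
    return docs


# Toy index for illustration only.
toy_index = {
    "sustainability": {0: 3, 1: 1},
    "wastewater": {1: 2},
    "waste": {2: 4},
    "hollywood": {3: 5},
}
sustainability_docs = docs_with_substring(toy_index, "sustainability")
waste_docs = docs_with_substring(toy_index, "waste")
intersection = sustainability_docs & waste_docs
my_collection = sustainability_docs | waste_docs
print(sorted(intersection), sorted(my_collection))
```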

Collection	Definition	Number of documents (crawl limit 500)
Sustainability	Docs with at least one term containing sustainability.	32
Waste	Docs with at least one term containing waste.	28
Sustainability ∩ Waste	Docs in both sustainability and waste sets.	28
My-collection	Union of sustainability and waste documents.	42

4. Clustering Methodology

For clustering, we reconstruct each document’s text from the index by repeating each token according to its term frequency. We then apply TfidfVectorizer from scikit-learn with max_df = 0.5, so terms appearing in more than 50% of documents are discarded. The vectorizer’s default tokenization parameters are applied on top of our already filtered tokens.
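The reconstruction step can be sketched as below, assuming the index is a dictionary of the form term → {docid: term_frequency}. The helper name rebuild_texts and the toy index are illustrative. In the toy data, "research" occurs in all four documents and is dropped by max_df = 0.5, while "waste" occurs in exactly half of them and is kept, since only terms above the threshold are discarded.

```python
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer


def rebuild_texts(index):
    """Turn term -> {docid: tf} postings back into one pseudo-text per document."""
    per_doc = defaultdict(list)
    for term in sorted(index):
        for docid, tf in index[term].items():
            per_doc[docid].extend([term] * tf)   # repeat the term tf times
    doc_ids = sorted(per_doc)
    return doc_ids, [" ".join(per_doc[d]) for d in doc_ids]


# Toy index for illustration only.
toy_index = {
    "research": {0: 1, 1: 1, 2: 1, 3: 1},
    "waste": {0: 2, 1: 1},
    "energy": {2: 3},
    "hollywood": {3: 2},
}
doc_ids, texts = rebuild_texts(toy_index)
vectorizer = TfidfVectorizer(max_df=0.5)
X = vectorizer.fit_transform(texts)
print(texts[0])
print(sorted(vectorizer.vocabulary_))
```

Word order is lost in the reconstruction, which does not matter for a bag-of-words representation such as TF–IDF.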

Two TF–IDF variants are used:

1. Global TF–IDF: fit on all 194 documents, then select the rows corresponding to My-collection.
2. Local TF–IDF: fit only on the 42 My-collection documents, then cluster in that space.
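The difference between the two variants is only in which documents the vectorizer is fitted on. A minimal sketch, with the hypothetical helper global_and_local_tfidf and toy texts in place of the real corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def global_and_local_tfidf(texts, my_rows, max_df=0.5):
    """Return ((X_global, vectorizer), (X_local, vectorizer)) for the chosen rows."""
    # Global: document frequencies come from the full corpus; rows are selected afterwards.
    global_vec = TfidfVectorizer(max_df=max_df)
    X_global = global_vec.fit_transform(texts)[my_rows]
    # Local: document frequencies come from the selected documents only.
    local_vec = TfidfVectorizer(max_df=max_df)
    X_local = local_vec.fit_transform([texts[i] for i in my_rows])
    return (X_global, global_vec), (X_local, local_vec)


# Toy corpus: rows 0 and 1 play the role of My-collection.
texts = [
    "waste recycling compost",
    "waste sustainability policy",
    "hollywood strike writers",
    "patent medicine law",
    "energy bus emissions",
    "wildlife species habitat",
]
(X_g, vec_g), (X_l, vec_l) = global_and_local_tfidf(texts, [0, 1])
print("waste" in vec_g.vocabulary_, "waste" in vec_l.vocabulary_)
```

In the toy corpus, "waste" is rare globally and therefore kept in the global space, but it occurs in every selected document, so the local vectorizer removes it through max_df. The same mechanism explains why words that are common inside My-collection carry little or no weight in the local variant.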

k-means Clustering Settings

We apply k-means clustering to My-collection under three settings:

k = 2, k = 10, and k = 20

For each cluster, we sort the cluster center’s TF–IDF weights in descending order and print the top 50 terms. These terms serve as cluster labels and are recorded in the provided 500_docs_50_terms_cluster.txt file.

5. Clustering Results

The detailed results can be found in 500_docs_50_terms_cluster.txt.

Global TF–IDF, k = 2

With k = 2 and global IDF, My-collection is split into two broad thematic clusters:

Cluster	Representative Terms	Interpretation
Cluster 0	energy, bus, vehicles, emissions, electric, hydrogen, fuel, transportation, pricing, consumption, wildlife, species, microclimate	Technical and environmental topics such as energy systems, transportation, environmental impacts, wildlife and ecology.
Cluster 1	art, humour, digital, preservation, hollywood, writers, union, labour, students, community, design, project	Arts, media, social and labour issues, including Hollywood labour relations and broader cultural projects.

At k = 2, the system roughly separates technical and environmental modeling documents from arts, media, and policy documents that still touch on sustainability and waste.

Global TF–IDF, k = 10

At k = 10, clusters become more topic-specific. Selected examples are summarized below:

Cluster Theme	Representative Terms	Description
Hollywood Strikes & AI	hollywood, sag, aftra, wga, writers, strike, streaming, residuals, studios, generative, ai, actors	Documents analyzing recent Hollywood writers’ and actors’ strikes, the streaming economy, and the role of generative AI.
Pharmaceutical Patents	patent, pharmaceutical, medicine, novartis, trips, trade, innovation, law, health, advocacy	Work focusing on intellectual property, global health, and access to medicines.
Green Building & Urban Sustainability	greenbuilding, emissions, environmental, sustainable, building, urban, development, reduce, design	Research on sustainable buildings, urban design, and emission reduction strategies.
Ecology & Wildlife	wildlife, species, habitat, temperature, microclimate, hotspots, monitoring	Environmental and ecological studies, often linking waste or sustainability to ecosystem effects.

Global TF–IDF, k = 20

With k = 20, clusters become too small:

Some clusters capture very specific case studies, such as a single Hollywood-related thesis with many unique terms.
Other clusters are dominated by unusual proper nouns or technical jargon, making them harder to interpret.
Because there are only 42 documents, several clusters end up with very few members, which is not ideal for k-means.

Local vs Global IDF

When TF–IDF is computed only on the 42 documents in My-collection, we observe the following:
Terms that are frequent across the full 194-document index but not particularly common inside My-collection can gain weight and show up as top cluster terms.
Very common sustainability-related words within My-collection are down-weighted, which emphasizes more subtle differences between sustainability papers, for example green building vs. wildlife vs. energy systems.
Local-IDF clusters tend to highlight collection-specific vocabulary, while global-IDF clusters highlight terms that distinguish sustainability documents from the broader Spectrum corpus.

6. Conclusion

Effect of k

k = 2: Produces a clean split between technical and environmental work on one side and arts, media, and policy work on the other. Good for a quick overview, but merges many distinct scientific and social subtopics.
k = 10: Provides a good balance. Several clusters are clearly interpretable, for example Hollywood strikes, pharmaceutical IP, green building, and ecology.
k = 20: Over-segments the small collection. Some clusters are tiny or noisy and hard to label.

Overall, the project demonstrates how crawling, indexing, and clustering can be combined to explore thematic structure in a real academic repository, with a particular focus on sustainability- and waste-related research.

COMP 479/6791

Information Retrieval and Web Search

Project report

on

Project 2: Spectrum Crawler, Indexer & Document Clustering

Course Instructor:

Sabine Bergler, PhD

Author:

qiantongzhou-40081938