The objective of this project is to build a small information retrieval and clustering system on top of Concordia's open-access repository, "spectrum.library.concordia.ca". Starting from a seed URL, the system performs the following steps:
Crawl and download up to a fixed number of PDF and HTML pages within the allowed domains.
Extract textual content from PDF documents and build an inverted index.
Define queries for sustainability and waste and construct a combined My-collection.
Represent documents using TF–IDF and cluster My-collection using k-means for k = 2, 10, 20.
The system is implemented in Python and is organized into three main components: a crawler, an indexer, and a clustering module. The data flow is illustrated in Figure 1.
The crawler (crawler.py) starts from the Spectrum seed URL and uses a breadth-first search (BFS) strategy with a queue of URLs. It enforces both domain restrictions and robot exclusion rules:
| Component | Library | Role |
|---|---|---|
| Crawler crawler.py | requests, BeautifulSoup | BFS crawl within allowed domains, obeys robots.txt, discovers PDFs and HTML pages. |
| PDF Extractor | PyPDF2 | Reads each page of PDF files and extracts raw text for indexing. |
| HTML Text Reader | BeautifulSoup | Extracts visible text for link discovery. |
| Indexer Indexer.py | Custom InvertedIndex class | Tokenizes and filters text, updates term:doc postings, maintains doc lengths, saves index to JSON. |
| Clusterer cluster.py | scikit-learn KMeans, TfidfVectorizer | Builds TF–IDF features and clusters My-collection for different k values. |
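The BFS crawl loop described above can be sketched as follows. This is a minimal illustration, not the actual crawler.py: the names `SEED_URL`, `ALLOWED_DOMAINS`, `MAX_PAGES`, and `crawl` are assumptions, and error handling is reduced to the essentials.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup

SEED_URL = "https://spectrum.library.concordia.ca/"   # illustrative seed
ALLOWED_DOMAINS = {"spectrum.library.concordia.ca"}   # domain restriction
MAX_PAGES = 500                                       # crawl limit used in the report

def crawl(seed=SEED_URL, max_pages=MAX_PAGES):
    # Robot exclusion rules are checked once against the site's robots.txt.
    robots = RobotFileParser(urljoin(seed, "/robots.txt"))
    robots.read()
    queue, seen, fetched = deque([seed]), {seed}, []
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        # Enforce both the domain restriction and robots.txt.
        if urlparse(url).netloc not in ALLOWED_DOMAINS or not robots.can_fetch("*", url):
            continue
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        fetched.append(url)
        # Only HTML pages are parsed for new links; PDFs are kept for extraction.
        if "text/html" in resp.headers.get("Content-Type", ""):
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return fetched
```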
The InvertedIndex class maintains:
- A mapping from integer docid to URL, and from URL back to docid.
- An index: term : {docid : term_frequency}.
- Document lengths: the total number of valid tokens per document.
Tokenization and filtering include:
- Lowercasing and regex-based extraction of alphabetic tokens of length 3–25.
- Removal of tokens containing no vowels, to reduce PDF noise and discard non-English fragments.
- Removal of purely numeric tokens.
- A stop-word list ("the", "and", "with", ...) to filter very common non-content terms.
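The filtering rules above can be condensed into a single function. This is a sketch, assuming the regex conventions described in the report; the stop-word list here is deliberately abridged.

```python
import re

STOP_WORDS = {"the", "and", "with"}  # truncated list, per the report

def tokenize(text):
    # Lowercase, then extract alphabetic tokens of length 3-25.
    # Purely numeric tokens never match, so they are dropped implicitly.
    tokens = re.findall(r"\b[a-z]{3,25}\b", text.lower())
    return [
        t for t in tokens
        if t not in STOP_WORDS
        and re.search(r"[aeiou]", t)  # drop vowel-less PDF noise
    ]
```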
| Statistic | Value | Notes |
|---|---|---|
| Total indexed documents | 194 | After applying domain, robots, and PDF constraints. |
| Unique terms in index | ≈ 54,438 | After token filtering and stop word removal. |
| Index storage | inverted_index.json | JSON representation, so the index can be reloaded without recrawling. |
With the crawl limit set to 500 pages, a helper function returns all documents containing at least one term that includes a given substring. Applying it to the substrings sustainability and waste yields:
| Collection | Definition | Documents (500-page crawl limit) |
|---|---|---|
| Sustainability | Docs with at least one term containing sustainability. | 32 |
| Waste | Docs with at least one term containing waste. | 28 |
| Sustainability ∩ Waste | Docs in both sustainability and waste sets. | 28 |
| My-collection | Union of sustainability and waste documents. | 42 |
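The substring-matching helper and the construction of My-collection can be sketched as follows. The function names are illustrative assumptions; `index` is the term : {docid : tf} mapping from the indexer section.

```python
def docs_with_substring(index, substring):
    # Collect every docid whose postings appear under a term containing the substring.
    docs = set()
    for term, postings in index.items():
        if substring in term:
            docs.update(postings)
    return docs

def build_my_collection(index):
    # My-collection is the union of the two query sets; the intersection
    # is also reported in the table above.
    sus = docs_with_substring(index, "sustainability")
    waste = docs_with_substring(index, "waste")
    return sus | waste, sus & waste
```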
For clustering, each document's text is reconstructed from the index by repeating tokens according to their term frequencies. We then apply TfidfVectorizer from scikit-learn with max_df = 0.5, so terms appearing in more than 50% of documents are discarded, and keep the default tokenization parameters on top of our already filtered tokens.
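The reconstruction-then-vectorize step can be sketched as below, assuming the term : {docid : tf} index structure from earlier; the function names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def reconstruct_texts(index, doc_ids):
    # Rebuild a bag-of-words "document" by repeating each token tf times.
    texts = {d: [] for d in doc_ids}
    for term, postings in index.items():
        for doc, tf in postings.items():
            if doc in texts:
                texts[doc].extend([term] * tf)
    return [" ".join(texts[d]) for d in doc_ids]

def tfidf_matrix(index, doc_ids):
    # max_df=0.5 discards terms appearing in more than 50% of documents.
    vec = TfidfVectorizer(max_df=0.5)
    return vec, vec.fit_transform(reconstruct_texts(index, doc_ids))
```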
Two TF–IDF variants are used: a global variant, with IDF computed over all 194 indexed documents, and a local variant, with IDF computed only on the 42 documents of My-collection.
We apply k-means clustering to My-collection under three settings: k = 2, k = 10, and k = 20. For each cluster, we sort the center's TF–IDF weights and report the top 50 terms, which serve as cluster labels.
The detailed results can be found in the provided 500_docs_50_terms_cluster.txt file.
With k = 2 and global IDF, My-collection is split into two broad thematic clusters:
| Cluster | Representative Terms | Interpretation |
|---|---|---|
| Cluster 0 | energy, bus, vehicles, emissions, electric, hydrogen, fuel, transportation, pricing, consumption, wildlife, species, microclimate | Technical and environmental topics such as energy systems, transportation, environmental impacts, wildlife and ecology. |
| Cluster 1 | art, humour, digital, preservation, hollywood, writers, union, labour, students, community, design, project | Arts, media, social and labour issues, including Hollywood labour relations and broader cultural projects. |
At k = 10, clusters become more topic-specific. Selected examples are summarized below:
| Cluster Theme | Representative Terms | Description |
|---|---|---|
| Hollywood Strikes & AI | hollywood, sag, aftra, wga, writers, strike, streaming, residuals, studios, generative, ai, actors | Documents analyzing recent Hollywood writers’ and actors’ strikes, the streaming economy, and the role of generative AI. |
| Pharmaceutical Patents | patent, pharmaceutical, medicine, novartis, trips, trade, innovation, law, health, advocacy | Work focusing on intellectual property, global health, and access to medicines. |
| Green Building & Urban Sustainability | greenbuilding, emissions, environmental, sustainable, building, urban, development, reduce, design | Research on sustainable buildings, urban design, and emission reduction strategies. |
| Ecology & Wildlife | wildlife, species, habitat, temperature, microclimate, hotspots, monitoring | Environmental and ecological studies, often linking waste or sustainability to ecosystem effects. |
With k = 20, the clusters become too small to yield coherent, distinct themes.
When TF–IDF is computed only on the 42 documents in My-collection, terms that are frequent across the full 194-document index but not especially common inside My-collection can gain weight and appear as top cluster terms. Very common sustainability-related words within My-collection are down-weighted, emphasizing subtler differences between sustainability papers, for example green building vs. wildlife vs. energy systems. Local-IDF clusters tend to highlight collection-specific vocabulary, while global-IDF clusters highlight terms that distinguish sustainability documents from the broader Spectrum corpus.
Overall, the project demonstrates how crawling, indexing, and clustering can be combined to explore thematic structure in a real academic repository, with a particular focus on sustainability and waste related research.
End of report.