Mock Exam – Introduction to Information Retrieval
60 multiple-choice questions. Select exactly one option per question.
I. Boolean Retrieval (Q1–Q10)
Q1. In the Boolean retrieval model, a query is interpreted as:
Q2. The data structure that maps each term to the list of documents containing it is called:
Q3. Which operation is NOT a standard Boolean operator in IR?
Q4. A term–document incidence matrix is typically avoided in large IR systems because it is:
Q5. Which best describes processing the query “A AND B AND NOT C”?
Q6. Why is grep inadequate as an IR engine for large collections?
Q7. Query optimization in Boolean retrieval typically focuses on:
Q8. In an inverted index, postings lists are commonly stored:
Q9. Which query would likely return zero results in strict Boolean retrieval but nonzero in ranked retrieval?
Q10. “Feast or famine” refers to:
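As a study aid for this section, here is a minimal sketch of an inverted index (Q2) and the sorted-postings-list intersection used for Boolean AND (Q5, Q8). The documents and terms are invented for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> sorted list of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

def intersect(p1, p2):
    """Merge two sorted postings lists (Boolean AND) in O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

docs = ["new home sales", "home prices rise", "new car sales"]
idx = build_index(docs)
print(intersect(idx["new"], idx["sales"]))  # → [0, 2]
```

Because both lists are kept sorted by doc ID, the merge walks each list once, which is why postings-list ordering (Q8) matters for query processing cost.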
II. Vocabulary, Tokenization & Postings (Q11–Q20)
Q11. A token is best defined as:
Q12. Normalization addresses:
Q13. Which is a challenge in tokenization?
Q14. A positional index stores, for each posting:
Q15. Phrase queries (e.g., “to be or not”) are supported efficiently by:
Q16. Skip pointers are used to:
Q17. The distinction between type and token implies:
Q18. Case-folding is an example of:
Q19. Handling multi-language documents primarily affects:
Q20. A postings list for term t contains:
III. Dictionaries & Tolerant Retrieval (Q21–Q30)
Q21. A key downside of hash tables for term dictionaries is:
Q22. B-trees are preferred over binary trees for term dictionaries because they:
Q23. The permuterm index is primarily used to support:
Q24. For wildcard query mon* with a B-tree dictionary, a system fetches all terms in range:
Q25. Levenshtein distance counts the minimum number of:
Q26. Soundex is mainly designed to:
Q27. A k-gram index helps with:
Q28. A limitation of hash-based dictionaries is the need to:
Q29. For wildcard query *X, the permuterm idea allows lookup as:
Q30. “Tolerant retrieval” refers to techniques that:
IV. Index Construction (Q31–Q40)
Q31. A key hardware fact driving IR design is that:
Q32. BSBI stands for:
Q33. The SPIMI algorithm’s key idea includes:
Q34. In BSBI, after creating partial indexes for blocks, the next step is to:
Q35. A main advantage of SPIMI over BSBI is:
Q36. MapReduce helps in indexing primarily by:
Q37. Dynamic indexing commonly uses a small in-memory index to:
Q38. Disk I/O efficiency improves by:
Q39. In the Reuters RCV1 example, the number of non-positional postings is on the order of:
Q40. Fault tolerance in large IR systems is often achieved by:
V. Index Compression (Q41–Q50)
Q41. The main motivation for dictionary compression is to:
Q42. Postings compression improves performance because:
Q43. Gap encoding stores:
Q44. Variable Byte (VB) coding uses the high bit of each byte to:
Q45. Gamma coding represents an integer via:
Q46. Heaps’ Law relates vocabulary size M to number of tokens T as:
Q47. Zipf’s Law states that term frequency is approximately proportional to:
Q48. A dictionary-as-string structure typically stores for each term:
Q49. Front-coding is especially useful when:
Q50. In Reuters compression examples, γ-coded postings are typically:
VI. Scoring, TF–IDF & Vector Space Model (Q51–Q60)
Q51. Ranked retrieval addresses the Boolean “feast or famine” by:
Q52. Term frequency (tf) in a document typically:
Q53. Inverse document frequency (idf) downweights terms that:
Q54. A common idf formula is:
Q55. The vector space model represents documents and queries as:
Q56. Cosine similarity between vectors q and d is:
Q57. A limitation of the Jaccard coefficient for ranking is that it:
Q58. In tf-idf weighting, the weight of term t in document d typically increases with:
Q59. Length normalization in cosine similarity primarily:
Q60. Which statement is TRUE?
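To tie together Q52–Q59, here is a minimal sketch of tf-idf weighting with cosine-normalized document vectors, using log-scaled tf and idf = log10(N/df). The three documents are invented for illustration; real systems would also handle stemming, stop words, and query weighting.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """tf-idf weights: (1 + log10 tf) * log10(N / df), then
    length-normalize each document vector (Q58, Q59)."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))                       # document frequency (Q53)
    idf = {t: math.log10(N / df[t]) for t in df}   # idf (Q54)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: (1 + math.log10(c)) * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(v1, v2):
    """Dot product of two unit-length sparse vectors (Q56)."""
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())

docs = ["gold silver truck", "shipment of gold", "delivery of silver"]
vecs = tfidf_vectors(docs)
print(round(cosine(vecs[0], vecs[1]), 3))
```

Because the vectors are pre-normalized, cosine similarity reduces to a dot product, and a document of any length scores 1.0 against itself, which is the point of length normalization in Q59.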