The course aims to put the student in contact with advanced techniques relating to the collection of large document collections and the construction of search engines, with particular attention to the web and its ranking mechanisms. The course requires knowledge of basic algorithmic techniques and network protocols. A subset of the following topics will be covered, also based on the interests of the students:
- Anatomy of a search engine and indexing techniques.
- Information harvesting: algorithmic, technical and social problems in creating crawlers.
- Comparing visiting strategies.
- Global web structure and algorithmic tools for its analysis.
- Web graph representation: compression problems.
- Parallel and distributed crawlers. Consistent hashing techniques.
- Network centrality: closeness, harmonic centrality, Katz, PageRank, dominant eigenvector, etc.
- Techniques for the construction of inverse indices.
- Methodologies for the construction of large distributed indices.
- Methodologies for building large dictionaries: minimal perfect order-preserving maps, prefix dictionaries, lexicographic compression.
- Instantaneous codes for index compression: unary, Elias γ, Elias δ, Golomb, etc.
- Representation of Elias–Fano and quasi-succinct indexes.
- Textual ranking techniques (TF/IDF, BM25, LSI, cosine, etc.).
- Substring indices: trie, suffix trees and suffix vectors.
More detailed information on the actual content is provided in the following section. For information/clarifications, you can write to the teacher (Sebastiano Vigna).
The exam is oral. The date is set by appointment (the appeals are, in fact, purely nominal).
The exam material (both for the corso di laurea in informatica and the corso di laurea in data science & economics) includes the topics explained in class, excluding the latent semantics indexing and the theoretical parts on eigenvalues/eigenvectors and Markov chains (but clearly you need to have an idea to explain the spectral centralities, PageRank, etc.).
For reference, for the corso di laurea in informatica you can consider first the lecture notes, the slides on web-graph compression, on HyperBall, and on the Elias–Fano encoding. From the Manning (see below) chapter 1, chapter 2 up to 2.4.1 excluded, paragraphs 4.1-4.3, paragraph 5.3 (which partially overlaps with the lecture notes), paragraphs 6.2-6.4.3 and paragraphs 8.1-8.4. Chapters 20 and 21 partially overlap with the lecture notes, but they are more complete and can be a useful additional reading.
For the corso di laurea in data science & economics you should consider chapters 1–9 of the lecture notes in English.
Since almost all the topics are relatively recent, or researched, it is not possible to give a unitary text. However, there are numerous books, articles and handouts online that cover the topics discussed, first of all the new book by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. The handouts and slides of Manning's Stanford class are a mine of useful information.
Also available are the slides on web-graph compression, those on HyperBall, those on the Elias–Fano representation and quasi-succinct indices and the animated slides of the peeling example.
Construction and compression of indices
Princeton's class on Algorithms for Massive Data Sets provides several useful handouts, in particular on instantaneous codes for inverted indices and models for the description of gaps and Bloom filters.
The review part of our article on axioms for centrality contains a detailed discussion of the ranking methods discussed in class (Seeley, the dominant eigenvector, Katz, PageRank, proximity, harmonic centrality).
To understand better a realistic version of BM25, there is a recent paper by Robertson, Zaragoza and Taylor in which the authors extend the formula to the multi-index case.
Product QuantizationThere is a very readable survey.
The paper by Kärkkäinen, Sanders and Burkhardt describes a relatively simple algorithm that builds a suffix array in linear time.