URL Microsoft Teams
Program
The course aims to put the student in contact with advanced techniques relating to the collection of large document collections and the construction of search engines, with particular attention to the web and its ranking mechanisms. The course requires knowledge of basic algorithmic techniques and network protocols. A subset of the following topics will be covered, also based on the interests of the students:
- Anatomy of a search engine and indexing techniques.
- Information harvesting: algorithmic, technical and social problems in creating crawlers.
- Comparing visiting strategies.
- Global web structure and algorithmic tools for its analysis.
- Web graph representation: compression problems.
- Parallel and distributed crawlers. Consistent hashing techniques.
- Network centrality: closeness, harmonic centrality, Katz, PageRank, dominant eigenvector, etc.
- Techniques for the construction of inverse indices.
- Methodologies for the construction of large distributed indices.
- Methodologies for building large dictionaries: minimal perfect order-preserving maps, prefix dictionaries, lexicographic compression.
- Prefix-free codes for index compression: unary, Elias γ, Elias δ, Golomb, etc.
- Representation of Elias–Fano and quasi-succinct indexes.
- Textual ranking techniques (TF/IDF, BM25, LSI, cosine, etc.).
- Substring indices: trie, suffix trees and suffix vectors.
More detailed information on the actual content is provided in the following section. For information/clarifications, you can write to the teacher (Sebastiano Vigna).
Exams
The exam is oral. The date is set by appointment (the appeals are, in fact, purely nominal).
The exam material includes the topics explained in class, excluding the latent semantics indexing and the theoretical parts on eigenvalues/eigenvectors and Markov chains (but clearly you need to have an idea to explain the spectral centralities, PageRank, etc.).
For reference, for the corso di laurea in informatica you can consider first the lecture notes, the slides on web-graph compression, on HyperBall, and on the Elias–Fano encoding. From the Manning (see below) chapter 1, chapter 2 up to 2.4.1 excluded, paragraphs 4.1-4.3, paragraph 5.3 (which partially overlaps with the lecture notes), paragraphs 6.2-6.4.3 and paragraphs 8.1-8.4. Chapters 20 and 21 partially overlap with the lecture notes, but they are more complete and can be a useful additional reading.
Recommended Texts
Since almost all the topics are relatively recent, or researched, it is not possible to give a unitary text. However, there are numerous books, articles and handouts online that cover the topics discussed, first of all the book by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. The handouts and slides of Manning's Stanford class are a mine of useful information.
Also available are the slides on web-graph compression, those on HyperBall, those on the Elias–Fano representation and quasi-succinct indices and the animated slides of the peeling example.
Network Standards
The standards on which the Internet and the web are based are available in the form of an RFC or recommendations published by the World Wide Web Consortium.
Crawling
The paper describing our crawler discusses several advanced issues, and the code is available.
Construction and compression of indices
Princeton's class on Algorithms for Massive Data Sets provides several useful handouts, in particular on prefix-free codes for inverted indices and models for the description of gaps and Bloom filters.
Exogenous Ranking
The review part of our article on axioms for centrality contains a detailed discussion of the ranking methods discussed in class (Seeley, the dominant eigenvector, Katz, PageRank, proximity, harmonic centrality).
Endogenous Ranking
To understand better a realistic version of BM25, there is a recent paper by Robertson, Zaragoza and Taylor in which the authors extend the formula to the multi-index case.
Product Quantization
There is a very readable survey.Suffix Arrays
The paper by Kärkkäinen, Sanders and Burkhardt describes a relatively simple algorithm that builds a suffix array in linear time.
Software
Many of the course topics describe the algorithms, codes and techniques used in the software produced by member of the LAW. MG4J is a indexing engine; WebGraph is framework for compressing graphs.