Sebastiano Vigna

I received my Laurea in Mathematics from the University of Milano in 1991, and my Ph.D. in Computer Science from the University of Milano in 1996.

Presently I am a full professor at the Dipartimento di Informatica of the Università degli Studi di Milano, and a member of the LAW. You can email me if you want. Information related to the DI (e.g., phone numbers) can be found on my unimi web page.

Current Interests

More in detail...

Compression of web and social graphs

Snapshots of the web generated very large graphs—you need a compact representation. WebGraph is a framework built to this purpose. Among other things, WebGraph uses new instantaneous codes for the integers and new aggressive algorithmic compression techniques.

Nonetheless, compression algorithms need good orderings of the nodes—orderings that expose the inner structure of the graph. We have made some interesting experiments to find new, efficient orderings, which led to the development of Layered Label Propagation—our current technique to permute graphs so to increase their locality.

Analysis of web and social graphs

Web graphs have special properties whose study requires a sizeable amount of mathematics, but also a careful study of actual web graphs. I participated to the study of the largest web snapshot publicly available in 2013, dispelling a number of myths, and leading to the first open ranking of the World Wide Web.

We developed a new algorithm, called HyperBall, that can approximate quickly a number of geometric centralities and the neighbourhood function of large graphs. The algorithm was used to compute for the first time second-order statistics about the distance distribution of a large number of web and social graphs: the results can be accessed at the LAW web site. In particular, HyperBall was used to show that Facebook has just four degrees of separation and to compute our open ranking.

HyperBall has also been used to compute the scores for the first open ranking of Wikipedia, which you can browse using Wikidata categories. The site is also a living example of our axioms for centrality.

If you're interested in PageRank and similar scores for the web, please have a look at this survey about the history of spectral ranking methods. You might be surprised 8^). We have also studied the paradoxical way PageRank evolves during a crawl, and the way PageRank changes depending on the damping factor.

The development of WebGraph and of the software necessary for our large-scale experiments spun a number of other satellite Java projects whose code is available as free software, including fastutil and MG4J.

Pseudorandom number generators

I designed a few pseudorandom number generators based on linear recurrences. In particular, xorshift128+ is now used in the JavaScript engines of Chrome, Firefox, Safari and Microsoft Edge. xoroshiro128+ is my current best shot at a fast, compact, high-quality, all-purpose PRNG, and it is the default generator in Erlang.

Efficient data structures for large datasets

Very large data sets requires efficient compressed data structures. An exciting recent development in this direction is that of succinct data structures, which implement standard data structures using the minimum possible space without slowing down more than by a constant factor. I proposed very efficient implementations of several basic building blocks used in succinct data structures by means of broadword programming. The building blocks are used throughout the code of the Sux and Sux4J projects, which aim at providing off-the-shelf useful basic succinct data structures such as the Elias–Fano representation of monotone sequences, Genuzio–Ottaviano–Vigna functions, and more.

In the same spirit, we devised and implemented (almost) optimal monotone minimal perfect hash functions and other low-memory, high-speed data structures for very large sets of strings.

Indexing of large document collections

MG4J is a framework for building indices of large document collection based on the classical inverted-index approach. The kind of index constructed is very configurable (e.g., you can choose your preferred coding method), and moreover some new research has gone into providing efficient skips and minimal-interval semantics. Its direct counterpart is Archive4J, which makes it possible to access quickly and in a highly compressed form summaries of information about documents in a collection.

High-performance parallel web crawling

I collaborated with Paolo Boldi, Bruno Codenotti and Massimo Santini to the development of UbiCrawler, a scalable, fault-tolerant and fully distributed web crawler. The first report on the design of UbiCrawler won the Best Poster Award at the Tenth World Wide Web Conference. You can also read a more in-depth report.

A new open-source crawler, BUbiNG, developed with Paolo Boldi, Andrea Marino and Massimo Santini is now available at the LAW web site and through Maven.

Automatic generation of software from conceptual models

ERW is an innovative system for handling complex databases using a web browser. It uses the most recent standards endorsed by the W3C to offer to the user a sophisticated environment, similar to a dedicated client. Moreover, the user interface is generated in a completely automatic way starting from a conceptual description of the database by means of an XML-based language for entity-relationship schemata. Customisation of the resulting application is governed by specification percolation.

Computability on the reals

I proved, in collaboration with Paolo Boldi, the first results comparing the Blum–Shub–Smale model (i.e., computation with exact reals) and Type 2 Theory of Effectivity (i.e., feasible discrete approximations). We gave a precise quantification of the gap between the two models using degree theory: in a slogan, equality is a jump. We also introduced a notion of Turing closure for Archimedean fields, which was used to characterize those fields in which Type 2-decidable sets exist.

Computability problems in anonymous networks

Deciding whether a problem can be solved with a certain knowledge in a network where all processors are identically programmed (an anonymous network) is a problem dating about 20 years. The concept of graph fibration gave rise to an effective characterization of the tasks solvable under a wide range of models and knowledge assumptions.

Topological properties of distributed systems

Faster solutions to classical distributed problems are possible if the way processors label the link interconnecting them respects a global coherence property called sense of direction, and formalized by Flocchini, Mans and Santoro (for instance, a mesh with the standard North/South/Est/West labelling has this property). We showed that weak sense of direction can be decided in logarithmic parallel time, and this implies that the same is true of Cayley colour graphs; this is the first complexity result on Cayley graph (the non-coloured case is not known to be polynomial or NP-complete).

We also proved the first lower bounds for weak sense of direction, using noncostructive techniques based on random graph theory, and we extended these results to regular graphs by proving some new properties of random regular graphs of high degree.

Self-stabilizing systems

Self-stabilizing distributed system must converge to a desired behaviour in spite of an arbitrary initial state. By making self-stabilizing the classical view construction algorithm, we showed that the effective characterization for anonymous network can be extended to self-stabilizing systems. As a (surprising) consequence, there is no difference in distributed computational power between an arbitrary initial state and a chosen initial state if the latter is forced to be the same for all processors.