Recent Preprints

  1.  Sebastiano Vigna. A weighted correlation index for rankings with ties. CoRR, abs/1404.3325, 2014.
    Abstract

    Understanding the correlation between two different scores for the same set of items is a common problem in information retrieval, and the most commonly used statistics that quantifies this correlation is Kendall's τ. However, the standard definition fails to capture that discordances between items with high rank are more important than those between items with low rank. Recently, a new measure of correlation based on average precision has been proposed to solve this problem, but like many alternative proposals in the literature it assumes that there are no ties in the scores. This is a major deficiency in a number of contexts, and in particular while comparing centrality scores on large graphs, as the obvious baseline, indegree, has a very large number of ties in web and social graphs. We propose to extend Kendall's definition in a natural way to take into account weights in the presence of ties. We prove a number of interesting mathematical properties of our generalization and describe an O(n log n) algorithm for its computation. We also validate the usefulness of our weighted measure of correlation using experimental data.

    PDF version.
  2.  Sebastiano Vigna. Supremum-norm convergence for step-asynchronous successive overrelaxation on M-matrices. CoRR, abs/1404.3327, 2014.
    Abstract

    Step-asynchronous successive overrelaxation updates the values contained in a single vector using the usual Gauß–Seidel-like weighted rule, but arbitrarily mixing old and new values, the only constraint being temporal coherence—you cannot use a value before it has been computed. We show that given a nonnegative real matrix A, a σ ≥ ρ(A) and a vector w > 0 such that Aw ≤ σw, every iteration of step-asynchronous successive overrelaxation for the problem (sI − A)x = b, with s > σ reduces geometrically the w-norm of the current error by a factor that we can compute explicitly. Then, we show that given a σ > ρ(A) it is in principle always possible to compute such a w. This property makes it possible to estimate the supremum norm of the absolute error at each iteration without any additional hypothesis on A, even when A is so large that computing the product Ax is feasible, but estimating the supremum norm of (sI − A)-1 is not.

    PDF version.
  3.  Sebastiano Vigna. Further scramblings of Marsaglia's xorshift generators. CoRR, abs/1402.6246, 2014.
    Abstract

    xorshift* generators are a variant of Marsaglia's xorshift generators that eliminate linear artifacts typical of generators based on Z/2Z-linear operations using multiplication by a suitable constant. Shortly after high-dimensional xorshift* generators were introduced, Saito and Matsumoto suggested a different way to eliminate linear artifacts based on addition in Z/232Z, leading to the XSadd generator. Starting from the observation that the lower bits of XSadd are very weak, as its reverse fails systematically several statistical tests, we explore xorshift+, a variant of XSadd using 64-bit operations, which leads, in small dimension, to extremely fast high-quality generators.

    PDF version; more data can be found at the xorshift*/xorshift+ and the PRNG shootout page.
  4.  Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, and Christian Bizer. Graph structure in the web — Revisited, or a trick of the heavy tail. In WWW'14 Companion, pages 427−432. International World Wide Web Conferences Steering Committee, 2014.
    Abstract

    Knowledge about the general graph structure of the World Wide Web is important for understanding the social mechanisms that govern its growth, for designing ranking methods, for devising better crawling algorithms, and for creating accurate models of its structure. In this paper, we describe and analyse a large, publicly accessible crawl of the web that was gathered by the Common Crawl Foundation in 2012 and that contains over 3.5 billion web pages and 128.7 billion links. This crawl makes it possible to observe the evolution of the underlying structure of the World Wide Web within the last 10 years: we analyse and compare, among other features, degree distributions, connectivity, average distances, and the structure of weakly/strongly connected components.

    Our analysis shows that, as evidenced by previous research, some of the features previously observed by Broder et al. are very dependent on artefacts of the crawling process, whereas other appear to be more structural. We confirm the existence of a giant strongly connected component; we however find, as observed by other researchers, very different proportions of nodes that can reach or that can be reached from the giant component, suggesting that the “bow-tie structure” as described by Broder et al. is strongly dependent on the crawling process, and to the best of our current knowledge is not a structural property of the web.

    More importantly, statistical testing and visual inspection of size-rank plots show that the distributions of indegree, outdegree and sizes of strongly connected components are not power laws, contrarily to what was previously reported for much smaller crawls, although they might be heavy tailed. We also provide for the first time accurate measurement of distance-based features, using recently introduced algorithms that scale to the size of our crawl.

    PDF version.
  5.  Sebastiano Vigna. An experimental exploration of Marsaglia's xorshift generators, scrambled. CoRR, abs/1402.6246, 2014.
    Abstract

    Marsaglia proposed recently xorshift generators as a class of very fast, good-quality pseudorandom number generators. Subsequent analysis by Panneton and L'Ecuyer has lowered the expectations raised by Marsaglia's paper, showing several weaknesses of such generators, verified experimentally using the TestU01 suite. Nonetheless, many of the weaknesses of xorshift generators fade away if their result is scrambled by a non-linear operation (as originally suggested by Marsaglia). In this paper we explore the space of possible generators obtained by multiplying the result of a xorshift generator by a suitable constant. We sample generators at 100 equispaced points of their state space and obtain detailed statistics that lead us to choices of parameters that improve on the current ones. We then explore for the first time the space of high-dimensional xorshift generators, following another suggestion in Marsaglia's paper, finding choices of parameters providing periods of length 21024 − 1 and 24096 − 1. The resulting generators are of extremely high quality, faster than current similar alternatives, and generate long-period sequences passing strong statistical tests using only eight logical operations, one addition and one multiplication by a constant.

    PDF version; more data can be found at the xorshift*/xorshift+ and the PRNG shootout page.
  6.  Djamal Belazzougui, Paolo Boldi, Giuseppe Ottaviano, Rossano Venturini, and Sebastiano Vigna. Cache-oblivious peeling of random hypergraphs. CoRR, abs/1312.0526, 2014.
    Abstract

    The computation of a peeling order in a randomly generated hypergraph is the most time-consuming step in a number of constructions, such as perfect hashing schemes, random r-SAT solvers, error-correcting codes, and approximate set encodings. While there exists a straightforward linear time algorithm, its poor I/O performance makes it impractical for hypergraphs whose size exceeds the available internal memory.

    We show how to reduce the computation of a peeling order to a small number of sequential scans and sorts, and analyze its I/O complexity in the cache-oblivious model. The resulting algorithm requires O(sort(n)) I/Os and O(n log n) time to peel a random hypergraph with n edges.

    We experimentally evaluate the performance of our implementation of this algorithm in a real-world scenario by using the construction of minimal perfect hash functions (MPHF) as our test case: our algorithm builds a MPHF of 7.6 billion keys in less than 21 hours on a single machine. The resulting data structure is both more space-efficient and faster than that obtained with the current state-of-the-art MPHF construction for large-scale key sets.

    PDF version.
  7.  Paolo Boldi and Sebastiano Vigna. The push algorithm for spectral ranking. CoRR, abs/1109.4680, 2014.
    Abstract

    The push algorithm was proposed first by Jeh and Widom in the context of personalized PageRank computations (albeit the name “push algorithm” was actually used by Andersen, Chung and Lang in a subsequent paper). In this note we describe the algorithm at a level of generality that make the computation of the spectral ranking of any nonnegative matrix possible. Actually, the main contribution of this note is that the description is very simple (almost trivial), and it requires only a few elementary linear-algebra computations. Along the way, we give new precise ways of estimating the convergence of the algorithm, and describe some of the contribution of the existing literature, which again turn out to be immediate when recast in our framework.

    PDF version.
  8.  Sebastiano Vigna. Fibonacci binning. CoRR, abs/1312.3749, 2013.
    Abstract

    This note argues that when dot-plotting distributions typically found in papers about web and social networks (degree distributions, component-size distributions, etc.), and more generally distributions that have high variability in their tail, an exponentially binned version should always be plotted, too, and suggests Fibonacci binning as a visually appealing, easy-to-use and practical choice.

    PDF version; a Ruby script implementing Fibonacci binning.
  9.  Paolo Boldi and Sebastiano Vigna. Axioms for centrality. CoRR, abs/1308.2140, 2013. To appear in Internet Mathematics.
    Abstract

    Given a social network, which of its nodes are more central? This question has been asked many times in sociology, psychology and computer science, and a whole plethora of centrality measures (a.k.a. centrality indices, or rankings) were proposed to account for the importance of the nodes of a network. In this paper, we try to provide a mathematically sound survey of the most important classic centrality measures known from the literature and propose an axiomatic approach to establish whether they are actually doing what they have been designed for. Our axioms suggest some simple, basic properties that a centrality measure should exhibit.

    Surprisingly, only a new simple measure based on distances, harmonic centrality, turns out to satisfy all axioms; essentially, harmonic centrality is a correction to Bavelas's classic closeness centrality designed to take unreachable nodes into account in a natural way.

    As a sanity check, we examine in turn each measure under the lens of information retrieval, leveraging state-of-the-art knowledge in the discipline to measure the effectiveness of the various indices in locating web pages that are relevant to a query. While there are some examples of such comparisons in the literature, here for the first time we also take into consideration centrality measures based on distances, such as closeness, in an information-retrieval setting. The results closely match the data we gathered using our axiomatic approach.

    Our results suggest that centrality measures based on distances, which in the last years have been neglected in information retrieval in favor of spectral centrality measures, do provide high-quality signals; moreover, harmonic centrality pops up as an excellent general-purpose centrality index for arbitrary directed graphs.

    PDF version.
  10.  Paolo Boldi and Sebastiano Vigna. In-core computation of geometric centralities with HyperBall: A hundred billion nodes and beyond. In Proc. of 2013 IEEE 13th International Conference on Data Mining Workshops (ICDMW 2013). IEEE, 2013.
    Abstract

    Given a social network, which of its nodes are more central? This question has been asked many times in sociology, psychology and computer science, and a whole plethora of centrality measures (a.k.a. centrality indices, or rankings) were proposed to account for the importance of the nodes of a network. In this paper, we approach the problem of computing geometric centralities, such as closeness and harmonic centrality, on very large graphs; traditionally this task requires an all-pairs shortest-path computation in the exact case, or a number of breadth-first traversals for approximated computations, but these techniques yield very weak statistical guarantees on highly disconnected graphs. We rather assume that the graph is accessed in a semi-streaming fashion, that is, that adjacency lists are scanned almost sequentially, and that a very small amount of memory (in the order of a dozen bytes) per node is available in core memory. We leverage the newly discovered algorithms based on HyperLogLog counters, making it possible to approximate a number of geometric centralities at a very high speed and with high accuracy. While the application of similar algorithms for the approximation of closeness was attempted in the MapReduce framework, our exploitation of HyperLogLog counters reduces exponentially the memory footprint, paving the way for in-core processing of networks with a hundred billion nodes using “just” 2TiB of RAM. Moreover, the computations we describe are inherently parallelizable, and scale linearly with the number of available cores.

    PDF version.
  11.  Sebastiano Vigna. Spectral ranking. CoRR, abs/0912.0238, 2009.
    Abstract

    This note tries to attempt a sketch of the history of spectral ranking—a general umbrella name for techniques that apply the theory of linear maps (in particular, eigenvalues and eigenvectors) to matrices that do not represent geometric transformations, but rather some kind of relationship between entities. Albeit recently made famous by the ample press coverage of Google's PageRank algorithm, spectral ranking was devised more than fifty years ago, almost exactly in the same terms, and has been studied in psychology and social sciences. I will try to describe it in precise and modern mathematical terms, highlighting along the way the contributions given by previous scholars.

    Disclaimer

    This is is a work in progress with no claim of completeness. I have tried to collect evidence of spectral techniques in ranking from a number of sources, providing a unified mathematical framework that should make it possible to understand in a precise way the relationship between contributions. Reports of inaccuracies and missing references are more than welcome.

    PDF version.
  12.  Sebastiano Vigna. Broadword implementation of parenthesis queries. Arxiv preprint arxiv:1301.5468, 2013.
    Abstract

    We continue the line of research started in “Broadword Implementation of Rank/Select Queries” proposing broadword (a.k.a. SWAR—“SIMD Within A Register”) algorithms for finding matching closed parentheses and the k-th far closed parenthesis. Our algorithms work in time O(log w) on a word of w bits, and contain no branch and no test instruction. On 64-bit (and wider) architectures, these algorithms make it possible to avoid costly tabulations, while providing a very significant speedup with respect to for-loop implementations.

    PDF version; a complete C++ and Java™ implementation is available at the Sux project home page.

Journals

  1.  Paolo Boldi, Marco Rosa, and Sebastiano Vigna. Robustness of social and web graphs to node removal. Social Network Analysis and Mining, 3:829−842, 2013.
    Abstract

    Given a social network, which of its nodes have a stronger impact in determining its structure? More precisely: which node-removal order has the greatest impact on the network structure? We approach this well-known problem for the first time in a setting that combines both web graphs and social networks. Our experiments are performed on datasets that are orders of magnitude larger than those appearing in the previous literature: this is possible thanks to some recently developed algorithms and software tools that approximate accurately the number of reachable pairs and the distribution of distances in large graphs. Our experiments highlight deep differences in the structure of social networks and web graphs, show significant limitations of previous experimental results; at the same time, they reveal clustering by label propagation as a new and very effective way of locating nodes that are important from a structural viewpoint.

    Public datasets are available at the LAW site; Java™ implementations are available as free software at the WebGraph home page.
  2.  Paolo Boldi and Sebastiano Vigna. E = I + T: The internal extent formula for compacted tries. Inform. Process. Lett., 111:310−313, 2011.
    Abstract

    It is well known that in a binary tree the external path length minus the internal path length is exactly 2n − 2, where n is the number of external nodes. We show that a generalization of the formula holds for compacted tries, replacing the role of paths with the notion of extent, and the value 2n − 2 with the trie measure, an estimation of the number of bits that are necessary to describe the trie.

    PDF version.
  3.  Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Theory and practice of monotone minimal perfect hashing. ACM Journal of Experimental Algorithmics, 16(3):3.2:1−3.2:26, 2011.
    Abstract

    Minimal perfect hash functions have been shown to be useful to compress data in several data management tasks. In particular, order-preserving minimal perfect hash functions have been used to retrieve the position of a key in a given list of keys: however, the ability to preserve any given order leads to an unavoidable Ω(n log n) lower bound on the number of bits required to store the function. Recently, it was observed that very frequently the keys to be hashed are sorted in their intrinsic (i.e., lexicographical) order. This is typically the case of dictionaries of search engines, list of URLs of web graphs, etc. We refer to this restricted version of the problem as monotone minimal perfect hashing. We analyse experimentally the data structures proposed in our paper “Monotone Minimal Perfect Hashing: Searching a Sorted Table with O(1) Accesses”, and along our way we propose some new methods that, albeit asymptotically equivalent or worse, perform very well in practice, and provide a balance between access speed, ease of construction, and space usage.

    PDF version. The algorithms described in the paper are implemented in Sux4J.
  4.  Paolo Boldi, Francesco Bonchi, Carlos Castillo, and Sebastiano Vigna. Viscous democracy for social networks. Commun. ACM, 54:129−137, June 2011.
  5.  Paolo Boldi, Francesco Bonchi, Carlos Castillo, and Sebastiano Vigna. Query reformulation mining: models, patterns, and applications. Information Retrieval, 14:257−289, 2011.
    Abstract

    Understanding query reformulation patterns is a key task towards next generation web search engines. If we can do that, then we can build systems able to understand and possibly predict user intent, providing the needed assistance at the right time, and thus helping users locate information more effectively and improving their web-search experience. As a step in this direction, we build a very accurate model for classifying user query reformulations into broad classes (generalization, specialization, error correction or parallel move), achieving 92% accuracy. We then apply the model to automatically label two very large query logs sampled from different geographic areas, and containing a total of approximately 17 million query reformulations. We study the resulting reformulation patterns, matching some results from previous studies performed on smaller manually annotated datasets, and discovering new interesting reformulation patterns, including connections between reformulation types and topical categories. We annotate two large query-flow graphs with reformulation type information, and run several graph-characterization experiments on these graphs, extracting new insights about the relationships between the different query reformulation types. Finally we study query recommendations based on short random walks on the query-flow graphs. Our experiments show that these methods can match in precision, and often improve, recommendations based on query-click graphs, without the need of users' clicks. Our experiments also show that it is important to consider transition-type labels on edges for having recommendations of good quality.

    Online version.
  6.  Paolo Boldi, Massimo Santini, and Sebastiano Vigna. Permuting web and social graphs. Internet Math., 6(3):257−283, 2010.
    Abstract

    Since the first investigations on web graph compression, it has been clear that the ordering of the nodes of the graph has a fundamental influence on the compression rate (usually expressed as the number of bits per link). The authors of the LINK database, for instance, investigated three different approaches: an extrinsic ordering (URL ordering) and two intrinsic (or coordinate-free) orderings based on the rows of the adjacency matrix (lexicographic and Gray code); they concluded that URL ordering has many advantages in spite of a small penalty in compression. In this paper we approach this issue in a more systematic way, testing some old orderings and proposing some new ones. Our experiments are made in the WebGraph framework, and show that the compression technique and the structure of the graph can produce significantly different results. In particular, we show that for the transpose web graph URL ordering is significantly less effective, and that some new orderings combining host information and Gray/lexicographic orderings outperform all previous methods. In particular, in some large transposed graphs they yield the quite incredible compression rate of 1 bit per link.

    PDF version; datasets and Java™ implementations are available as free software at the LAW site.
  7.  Paolo Boldi, Massimo Santini, and Sebastiano Vigna. PageRank: Functional dependencies. ACM Trans. Inf. Sys., 27(4):1−23, 2009.
    Abstract

    PageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor α that spreads uniformly part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion α=0.85 by Brin and Page is still used. In this paper, we give a mathematical analysis of PageRank when α changes. In particular, we show that, contrarily to popular belief, for real-world graphs values of α close to 1 do not give a more meaningful ranking. Then, we give closed-form formulae for PageRank derivatives of any order, and by proving that the k-th iteration of the Power Method gives exactly the value obtained by truncating the PageRank power series at degree k, we show how to obtain an approximation of the derivatives. Finally, we view PageRank as a linear operator acting on the preference vector and show a tight connection between iterated computation and derivation.

    PDF version.
  8.  Paolo Boldi, Flavio Chierichetti, and Sebastiano Vigna. Pictures from mongolia. Extracting the top elements from a partially ordered set. Theory Comput. Systems, 44(2):269−288, 2009.
    Abstract

    You are back from that very long, marvelous journey. You have a thousand pictures, but your friends and relatives will stand just a few dozens. Choosing is a painful process, in particular when you cannot decide between the silent vastity of that desert and the idyllic picture of that tranquil, majestic lake. We are going to help.

    PDF version.
  9.  Paolo Boldi, Violetta Lonati, Massimo Santini, and Sebastiano Vigna. Graph fibrations, graph isomorphism, and PageRank. RAIRO Inform. Théor., 40:227−253, 2006.
    Abstract

    PageRank is a ranking method that assigns scores to web pages using the limit distribution of a random walk on the web graph. A fibration of graphs is a morphism that is a local isomorphism of in-neighbourhoods, much in the same way a covering projection is a local isomorphism of neighbourhoods. We show that a deep connection relates fibrations and Markov chains with restart, a particular kind of Markov chains that include the PageRank one as a special case. This fact provides constraints on the values that PageRank can assume. Using our results, we show that a recently defined class of graphs that admit a polynomial-time isomorphism algorithm based on the computation of PageRank is really a subclass of fibration-prime graphs, which possess simple, entirely discrete polynomial-time isomorphism algorithms based on classical techniques for graph isomorphism. We discuss efficiency issues in the implementation of such algorithms for the particular case of web graphs, in which O(n) space occupancy (where n is the number of nodes) may be acceptable, but O(m) is not (where m is the number of arcs).

    PDF version.
  10.  Paolo Boldi, Massimo Santini, and Sebastiano Vigna. Paradoxical effects in PageRank incremental computations. Internet Math., 2(3):387−404, 2005.
    Abstract

    Deciding which kind of visit accumulates high-quality pages more quickly is one of the most often debated issue in the design of web crawlers. It is known that breadth-first visits work well, as they tend to discover pages with high PageRank early on in the crawl. Indeed, this visit order is much better than depth first, which is in turn even worse than a random visit; nevertheless, breadth-first can be superseded using an omniscient visit that chooses, at every step, the node of highest PageRank in the frontier.

    This paper discusses a related, and previously overlooked, measure of effectivity for crawl strategies: whether the graph obtained after a partial visit is in some sense representative of the underlying web graph as far as the computation of PageRank is concerned. More precisely, we are interested in determining how rapidly the computation of PageRank over the visited subgraph yields relative ranks that agree with the ones the nodes have in the complete graph; ranks are compared using Kendall's τ.

    We describe a number of large-scale experiments that show the following paradoxical effect: visits that gather PageRank more quickly (e.g., highest-quality-first) are also those that tend to miscalculate PageRank. Finally, we perform the same kind of experimental analysis on some synthetic random graphs, generated using well-known web-graph models: the results are almost opposite to those obtained on real web graphs.

    PDF version; PostScript version; Java™ classes for computing Kendall's τ efficiently are available as free software at the LAW site.
  11.  Paolo Boldi and Sebastiano Vigna. Codes for the World−Wide Web. Internet Math., 2(4):405−427, 2005.
    Abstract

    We introduce a new family of simple, complete instantaneous codes for positive integers, called ζ codes, which are suitable for integers distributed as a power law with small exponent (smaller than 2). The main motivation for the introduction of ζ codes comes from web-graph compression: if nodes are numbered according to URL lexicographical order, gaps in successor lists are distributed according to a power law with small exponent. We give estimates of the expected length of ζ codes against power-law distributions, and compare the results with analogous estimates for the more classical γ, δ and variable-length block codes.

    PDF version (technical report); PostScript version (technical report); to download software and data sets, please look at the WebGraph home page.
  12.  Paolo Boldi and Sebastiano Vigna. Mutable strings in Java: Design, implementation and lightweight text-search algorithms. Sci. Comput. Programming, 54(1):3−23, 2005.
    Abstract

    The Java string classes, String and StringBuffer, lie at the extremes of a spectrum (immutable, reference-based and mutable, content-based). Analogously, available text-search methods on string classes are implemented either as trivial, brute-force double loops, or as very sophisticated and resource-consuming regular-expression search methods. Motivated by our experience in data-intensive text applications, we propose a new string class, MutableString, which tries to get the right balance between extremes in both cases. Mutable strings can be in one of two states, compact and loose, in which they behave more like String and StringBuffer, respectively. Moreover, they support a wide range sophisticated text-search algorithms with a very low resource usage and setup time, using a new, very simple randomised data structure (a generalisation of Bloom filters) that stores an approximation from above of a lattice-valued function. Computing the function value requires a constant number of steps, and the error probability can be balanced with space usage. As a result, we obtain practical implementations of Boyer-Moore type algorithms that can be used with very large alphabets, such as Unicode collation elements. The techniques we develop are very general and amenable to a wide range of applications.

    PDF version; PostScript version; to download MutableString and TextPattern please look at the MG4J project home page.
  13.  Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. UbiCrawler: A scalable fully distributed web crawler. Software: Practice & Experience, 34(8):711−726, 2004.
    Abstract

    We report our experience in implementing UbiCrawler, a scalable distributed web crawler, using the Java programming language. The main features of UbiCrawler are platform independence, linear scalability, graceful degradation in the presence of faults, a very effective assignment function (based on consistent hashing) for partitioning the domain to crawl, and more in general the complete decentralization of every task. The necessity of handling very large sets of data has highlighted some limitation of the Java APIs, which prompted the authors to partially reimplement them.

    PDF version; PostScript version; a Java™ class for computing consistent hashing is available as free software at the LAW site.
  14.  Paolo Boldi and Sebastiano Vigna. Lower bounds for sense of direction in regular graphs. Distr. Comput., 16(4):279−286, 2003.
    Abstract

    A graph G with n vertices and maximum degree ΔG cannot be given weak sense of direction using less than ΔG colours. It is known that n colours are always sufficient, and it was conjectured that just ΔG+1 are really needed, that is, one more colour is sufficient. Nonetheless, it has just been shown that for sufficiently large n there are graphs requiring ω(n/log n) more colours than ΔG. In this paper, using recent results in asymptotic graph enumeration, we show not only that (somehow surprisingly) the same bound holds for regular graphs, but also that it can be improved to Ω(n log log n/log n) We also show that Ω(dG(log log dG)1/2 colours are necessary, where dG is the degree of G.

    PDF version; PostScript version.
  15.  Paolo Boldi and Sebastiano Vigna. Lower bounds for weak sense of direction. J. Discrete Algorithms, 1:119−128, 2003.
    Abstract

    A graph with n and maximum degree Δ cannot be given weak sense of direction using less than Δ colours. It is known that n colours are always sufficient, but it has been conjectured that just Δ+1 are really needed. On the contrary, we show that for sufficiently large n there are graphs requiring Δ+ω(n/log n) colours. We also give simple examples of small graphs requiring Δ+2 colours, which have been verified mechanically.

    PDF version; PostScript version.
  16.  Paolo Boldi and Sebastiano Vigna. Universal dynamic synchronous self-stabilization. Distr. Comput., 15(3):137−153, 2002.
    Abstract

    We prove the existence of a "universal" synchronous self-stabilizing protocol, that is, a protocol that allows a distributed system to stabilize to a desired nonreactive behaviour (as long as a protocol stabilizing to that behaviour exists). Previous proposals required drastic increases in asymmetry and knowledge to work, whereas our protocol does not use any additional knowledge, and does not require more symmetry-breaking conditions than available; thus, it is also stabilizing with respect to dynamic changes in the topology. We prove an optimal quiescence time n+D for a synchronous network of n processors and diameter D; the protocol can be made finite state with a negligible loss in quiescence time. Moreover, an optimal D+1 protocol is given for the case of unique identifiers. As a consequence, we provide an effective proof technique that allows to show whether self-stabilization to a certain behaviour is possible under a wide range of models.

    PDF version; PostScript version.
  17.  Paolo Boldi and Sebastiano Vigna. Complexity of deciding sense of direction. SIAM J. Comput., 29(3):779−789, 2000.
    Abstract

    In this paper we prove that deciding whether a distributed system (represented as a coloured digraph with n nodes) has weak sense of direction is in AC1 (using n6 processors). Moreover, we show that deciding sense of direction is in P. Our algorithms can also be used to decide in AC1 whether a coloured graph is a Cayley colour graph.

    PDF version; PostScript version.
  18.  Paolo Boldi and Sebastiano Vigna. Fibrations of graphs. Discrete Math., 243:21−66, 2002.
    Abstract

    A fibration of graphs is a morphism that is a local isomorphism of in-neighbourhoods, much in the same way a covering projection is a local isomorphism of neighbourhoods. This paper develops systematically the theory of graph fibrations, emphasizing in particular those results that recently found application in the theory of distributed systems.

    PDF version; PostScript version. See also the Graph-fibrations home page.
  19.  Paolo Boldi and Sebastiano Vigna. δ-uniform BSS machines. J. Complexity, 14(2):234−256, 1998.
    Abstract

    A δ-uniform BSS machine is a standard BSS machine which does not rely on exact equality tests. We prove that, for any real closed archimedean field R, a set is δ-uniformly semi-decidable iff it is open and semi-decidable by a BSS machine which is locally time bounded; we also prove that the local time bound is nontrivial. This entails a number of results about BSS machines, in particular the existence of decidable sets whose interior (closure) is not even semi-decidable without adding constants. Finally, we show that the sets semi-decidable by Turing machines are the sets semi-decidable by δ-uniform machines with coefficients in Q or T, the field of Turing computable numbers.

    PDF version; PostScript version.
  20.  Paolo Boldi and Sebastiano Vigna. Equality is a jump. Theoretical Computer Science, 219(1−2):49−64, 1999.
    Abstract

    We define a notion of degree of unsolvability for subsets of Rn (where R is a real closed Archimedean field) and prove that, in contrast to Type 2 computability, the presence of exact equality in the BSS model forces exactly one jump of the unsolvability degree of decidable sets.

    PDF version; PostScript version.
  21.  Sebastiano Vigna. On the relations between distributive computability and the BSS model. Theoretical Computer Science, 162:5−21, 1996.
    Abstract

    This paper presents an equivalence result between computability in the BSS model and in a suitable distributive category. It is proved that the class of functions RmRn (with n,m finite and R a commutative, ordered ring) computable in the BSS model, and the functions distributively computable over a natural distributive graph based on the operations of R coincide. Using this result, a new structural characterization, based on iteration, of the same functions is given.

  22.  Paolo Boldi, Massimo Santini, and Sebastiano Vigna. Measuring with jugs. Theoretical Computer Science, 282(2):259−270, 2002.
    Abstract

    We study the jug problem in its most general form: given a set of jugs of fixed capacities, find out which quantities are measurable, and provide upper and lower bounds on the number of steps necessary for measurements.

    PDF version; PostScript version.
  23.  Paolo Boldi and Sebastiano Vigna. The Turing closure of an Archimedean field. Theoretical Computer Science, 231:143−156, 2000.
    Abstract

    A BSS machine is δ-uniform if it does not use exact tests; such machines are equivalent (modulo parameters) to Type 2 Turing machines. We define a notion of closure related to Turing machines for archimedean fields, and show that such fields admit nontrivial δ-uniformly decidable sets iff they are not Turing closed. Then, the partially ordered set of Turing closed fields is proved isomorphic to the ideal completion of unsolvability degrees.

    PDF version; PostScript version.
    Warning: This paper has been plagiarised.
  24.  Paolo Boldi and Sebastiano Vigna. Minimal sense of direction and decision problems for Cayley graphs. Inform. Process. Lett., 64(6):299−303, 1997.
    Abstract

    Sense of direction is a property of the labelling of (possibly anonymous) networks which allows to assign coherently local identifiers to other processors on the basis of the route followed by incoming messages. A graph has minimal sense of direction whenever it has sense of direction and the number of colours equals its maximum outdegree. We prove that an outregular digraph with minimal weak sense of direction is a Cayley colour graph (in the general sense, i.e., we do not require connectedness). Since Cayley colour graphs are known to possess minimal transitive sense of direction, we obtain a characterization of outregular graphs with minimal (weak,transitive) sense of direction. As a consequence, deciding whether a coloured graph is a Cayley colour graph reduces to deciding whether it has weak sense of direction, which can be done in AC1.

    PDF version; PostScript version.
  25.  Bruno Codenotti, Ivan Gerace, and Sebastiano Vigna. Hardness results and spectral techniques for combinatorial problems on circulant graphs. Linear Algebra Appl., 285(1−3):123−142, 1998.
    Abstract

    We show that computing (and even approximating) maximum clique and minimum graph coloring for circulant graphs is essentially as hard as in the general case. In contrast, we show that, under additional constraints, e.g., prime order and/or spareness, graph isomorphism and minimum graph coloring become easier in the circulant case, and we take advantage of spectral techniques for their efficient computation.

    PDF version; PostScript version.
  26.  Paolo Boldi and Sebastiano Vigna. Coverings that preserve sense of direction. Inform. Process. Lett., 75:175−180, 2000.
    Abstract

    Sense of direction is a property of labelled networks (i.e., arc-coloured graphs) that allows one to assign coherently local identifiers to other processors on the basis of the route followed by incoming messages. We prove that (weak) sense of direction is preserved by the construction of regular coverings (i.e., coverings induced by voltage assignments in a group) whose voltage assignment depends only on colours. Moreover, this construction preserves minimality.

    PDF version; PostScript version.
  27.  Stefano Kasangian and Sebastiano Vigna. The topos of labelled trees: A categorical semantics for SCCS. Fund. Inform., 32:27−45, 1997.
  28.  Nicoletta Sabadini, Sebastiano Vigna, and Robert F.C. Walters. A note on recursive functions. Math. Struct. Comp. Sci., 6:127−139, 1996.

Conference Proceedings

  1.  Paolo Boldi and Sebastiano Vigna. Four degrees of separation, really. In Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), pages 1222−1227. IEEE Computer Society, 2012.
    Abstract

    We recently measured the average distance of users in the Facebook graph, spurring comments in the scientific community as well as in the general press (“Four Degrees of Separation”). A number of interesting criticisms have been made about the meaningfulness, methods and consequences of the experiment we performed. In this paper we want to discuss some methodological aspects that we deem important to underline in the form of answers to the questions we have read in newspapers, magazines, blogs, or heard from colleagues. We indulge in some reflections on the actual meaning of “average distance” and make a number of side observations showing that, yes, 3.74 “degrees of separation” are really few.

    PDF version.
  2.  Sebastiano Vigna. Quasi-succinct indices. In Stefano Leonardi, Alessandro Panconesi, Paolo Ferragina, and Aristides Gionis, editors, Proceedings of the 6th ACM International Conference on Web Search and Data Mining, WSDM'13, pages 83−92. ACM, 2013.
    Abstract

    Compressed inverted indices in use today are based on the idea of gap compression: documents pointers are stored in increasing order, and the gaps between successive document pointers are stored using suitable codes which represent smaller gaps using less bits. Additional data such as counts and positions is stored using similar techniques. A large body of research has been built in the last 30 years around gap compression, including theoretical modeling of the gap distribution, specialized instantaneous codes suitable for gap encoding, and ad hoc document reorderings which increase the efficiency of instantaneous codes. This paper proposes to represent an index using a different architecture based on quasi-succinct representation of monotone sequences. We show that, besides being theoretically elegant and simple, the new index provides expected constant-time operations, space savings, and, in practice, significant performance improvements on conjunctive, phrasal and proximity queries.

    PDF version. Quasi-succinct indices are now the default indices in MG4J. You can download the additional code used to perform the benchmarks described in the paper.
  3.  Lars Backstrom, Paolo Boldi, Marco Rosa, Johan Ugander, and Sebastiano Vigna. Four degrees of separation. In ACM Web Science 2012: Conference Proceedings, pages 45−54. ACM Press, 2012. Best paper award.
    Abstract

    Frigyes Karinthy, in his 1929 short story “Láancszemek” (“Chains”) suggested that any two persons are distanced by at most six friendship links. (The exact wording of the story is slightly ambiguous: “He bet us that, using no more than five individuals, one of whom is a personal acquaintance, he could contact the selected individual […]”. It is not completely clear whether the selected individual is part of the five, so this could actually allude to distance five or six in the language of graph theory, but the “six degrees of separation” phrase stuck after John Guare's 1990 eponymous play. Following Milgram's definition and Guare's interpretation, we will assume that “degrees of separation” is the same as “distance minus one”, where “distance” is the usual path length—the number of arcs in the path.) Stanley Milgram in his famous experiment challenged people to route postcards to a fixed recipient by passing them only through direct acquaintances. The average number of intermediaries on the path of the postcards lay between 4.4 and 5.7, depending on the sample of people chosen.

    We report the results of the first world-scale social-network graph-distance computations, using the entire Facebook network of active users (≈721 million users, ≈69 billion friendship links). The average distance we observe is 4.74, corresponding to 3.74 intermediaries or “degrees of separation”, showing that the world is even smaller than we expected, and prompting the title of this paper. More generally, we study the distance distribution of Facebook and of some interesting geographic subgraphs, looking also at their evolution over time.

    The networks we are able to explore are almost two orders of magnitude larger than those analysed in the previous literature. We report detailed statistical metadata showing that our measurements (which rely on probabilistic algorithms) are very accurate.

    PDF version; Java™ implementations are available as free software at the WebGraph home page; we distribute the HyperANF runs, current degree distributions and property files of the graphs discussed in the paper. The graphs are also described on the LAW dataset page.
  4.  Roi Blanco, Peter Mika, and Sebastiano Vigna. Effective and efficient entity search in RDF data. In Lora Aroyo, Chris Welty, Harith Alani, Jamie Taylor, Abraham Bernstein, Lalana Kagal, Natasha Noy, and Eva Blomqvist, editors, The Semantic Web — ISWC 2011. 10th International Semantic Web Conference, Proceedings, Part I, volume 7031 of Lecture Notes in Computer Science, pages 83−97. Springer, 2011.
    Abstract
    Triple stores have long provided RDF storage as well as data access using expressive, formal query languages such as SPARQL. The new end users of the Semantic Web, however, are mostly unaware of SPARQL and overwhelmingly prefer imprecise, informal keyword queries for searching over data. At the same time, the amount of data on the Semantic Web is approaching the limits of the architectures that provide support for the full expressivity of SPARQL. These factors com- bined have led to an increased interest in semantic search, i.e., access to RDF data using Information Retrieval methods. In this work, we propose a method for effective and efficient entity search over RDF data. We describe an adaptation of the BM25F ranking function for RDF data, and demonstrate that it outperforms other state-of-the-art methods in ranking RDF resources. We also propose a set of new index structures for efficient retrieval and ranking of results. We implement these results using the open-source MG4J framework.

  5.  Paolo Boldi, Marco Rosa, and Sebastiano Vigna. Robustness of social networks: Comparative results based on distance distributions. In Social Informatics, Third International Conference, SocInfo 2011, volume 6894 of Lecture Notes in Computer Science, pages 8−21. Springer, 2011.
    Abstract

    Given a social network, which of its nodes have a stronger impact in determining its structure? More formally: which node-removal order has the greatest impact on the network structure? We approach this well-known problem for the first time in a setting that combines both web graphs and social networks, using datasets that are orders of magnitude larger than those appearing in the previous literature, thanks to some recently developed algorithms and software tools that make it possible to approximate accurately the number of reachable pairs and the distribution of distances in a graph. Our experiments highlight deep differences in the structure of social networks and web graphs, show significant limitations of previous experimental results, and at the same time reveal clustering by label propagation as a new and very effective way of locating nodes that are important from a structural viewpoint.

    PDF version; public datasets are available at the LAW site; Java™ implementations are available as free software at the WebGraph home page.
  6.  Paolo Boldi, Marco Rosa, and Sebastiano Vigna. HyperANF: Approximating the neighbourhood function of very large graphs on a budget. In Sadagopan Srinivasan, Krithi Ramamritham, Arun Kumar, M. P. Ravindra, Elisa Bertino, and Ravi Kumar, editors, Proceedings of the 20th international conference on World Wide Web, pages 625−634. ACM, 2011.
    Abstract

    The neighbourhood function NG(t) of a graph G gives, for each t, the number of pairs of nodes <xy> such that y is reachable from x in less that t hops. The neighbourhood function provides a wealth of information about the graph (e.g., it easily allows one to compute its diameter), but it is very expensive to compute it exactly. Recently, the ANF algorithm (approximate neighbourhood function) has been proposed with the purpose of approximating NG(t) on large graphs. We describe a breakthrough improvement over ANF in terms of speed and scalability. Our algorithm, called HyperANF, uses the new HyperLogLog counters and combines them efficiently through broadword programming; our implementation uses decomposition to exploit multi-core parallelism. With HyperANF, for the first time we can compute in a few hours the neighbourhood function of graphs with billions of nodes with a small error and good confidence using a standard workstation. Then, we turn to the study of the distribution of the distances between reachable nodes (that can be efficiently approximated by means of HyperANF), and discover the surprising fact that its index of dispersion provides a clear-cut characterisation of proper social networks vs. web graphs. We thus propose the spid (Shortest-Paths Index of Dispersion) of a graph as a new, informative statistics that is able to discriminate between the above two types of graphs. We believe this is the first proposal of a significant new non-local structural index for complex networks whose computation is highly scalable.

    PDF version; Java™ implementations are available as free software at the WebGraph home page; public datasets are available at the LAW site.
  7.  Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In Sadagopan Srinivasan, Krithi Ramamritham, Arun Kumar, M. P. Ravindra, Elisa Bertino, and Ravi Kumar, editors, Proceedings of the 20th international conference on World Wide Web, pages 587−596. ACM, 2011.
    Abstract

    We continue the line of research on graph compression started with WebGraph, but we move our focus to the compression of social networks in a proper sense (e.g., LiveJournal): the approaches that have been used for a long time to compress web graphs rely on a specific ordering of the nodes (lexicographical URL ordering) whose extension to general social networks is not trivial. In this paper, we propose a solution that mixes clusterings and orders, and devise a new algorithm, called Layered Label Propagation, that builds on previous work on scalable clustering and can be used to reorder very large graphs (billions of nodes). Our implementation uses decomposition to perform aggressively on multi-core architecture, making it possible to reorder graphs of more than 600 millions nodes in a few hours. Experiments performed on a wide array of web graphs and social networks show that combining the order produced by the proposed algorithm with the WebGraph compression framework provides a major increase in compression with respect to all currently known techniques, both on web graphs and on social networks. These improvements make it possible to analyse in main memory significantly larger graphs.

    PDF version; public datasets and Java™ implementations are available as free software at the LAW site.
  8.  Paolo Boldi, Francesco Bonchi, Carlos Castillo, and Sebastiano Vigna. From “dango” to “japanese cakes”: Query reformulation models and patterns. In Proc. of the 2009 IEEE/WIC/ACM International Conferences on Web Intelligence (WI'09). ACM Press, 2009. Winner of the best paper award.
    Abstract

    Understanding query reformulation patterns is a key step towards next generation web search engines: it can help improving users' web-search experience by predicting their intent, and thus helping them to locate information more effectively.

    As a step in this direction, we build an accurate model for classifying user query reformulations into broad classes (generalization, specialization, error correction or parallel move), achieving 92% accuracy. We apply the model to automatically label two large query logs, creating annotated query-flow graphs. We study the resulting reformulation patterns, finding results consistent with previous studies done on smaller manually annotated datasets, and discovering new interesting patterns, including connections between reformulation types and topical categories.

    Finally, applying our findings to a third query log that is publicly available for research purposes, we demonstrate that our reformulation classifier leads to improved recommendations in a query recommendation system.

    PDF version.
  9.  Paolo Boldi, Francesco Bonchi, Carlos Castillo, and Sebastiano Vigna. Voting in social networks. In Proc. of ACM 18th Conference on Information and Knowledge Management (CIKM), Napa Valley, CA, USA, 2009. ACM Press.
    Abstract

    A voting system is a set of rules that a community adopts to take collective decisions. In this paper we study voting systems for a particular kind of community: electronically mediated social networks. In particular, we focus on delegative democracy (a.k.a. proxy voting) that has recently received increased interest for its ability to combine the benefits of direct and representative systems, and that seems also perfectly suited for electronically mediated social networks. In such a context, we consider a voting system in which users can only express their preference for one among the people they are explicitly connected with, and this preference can be propagated transitively, using an attenuation factor.

    We present this system and we study its properties. We also take into consideration the problem of missing votes, which is particularly relevant in online networks, as some recent case shows. Our experiments on real-world networks provide interesting insight into the significance and stability of the results obtained with the suggested voting system.

    PDF version.
  10.  Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Fast prefix search in little space, with applications. In Mark de Berg and Ulrich Meyer, editors, Algorithms - ESA 2010, 18th Annual European Symposium, Liverpool, UK, September 6-8, 2010. Proceedings, Part I, volume 6346 of Lecture Notes in Computer Science, pages 427−438. Springer, 2010.
    Abstract

    A prefix search returns the strings out of a given collection S that start with a given prefix. Traditionally, prefix search is solved by data structures that are also dictionaries, that is, they actually contain the strings in S. For very large collections stored in slow-access memory, we propose extremely compact data structures that solve weak prefix searches—they return the correct result only if some string in S starts with the given prefix. Our data structures for weak prefix search use O(|S |log ℓ) bits in the worst case, where ℓ is the average string length, as opposed to O(|S| ℓ) bits for a dictionary. We show a lower bound implying that this space usage is optimal.

    PDF version.
  11.  Paolo Boldi, Francesco Bonchi, Carlos Castillo, Debora Donato, and Sebastiano Vigna. Query suggestions using query-flow graphs. In Proceedings of the 2009 workshop on Web Search Click Data, WSCD '09, pages 56−63. ACM, 2009.
    Abstract

    The query-flow graph is an aggregated representation of the latent querying behavior contained in a query log. Intuitively, in the query-flow graph a directed edge from query qi to query qj means that the two queries are likely to be part of the same search mission. Any path over the query-flow graph may be seen as a possible search task, whose likelihood is given by the strength of the edges along the path. An edge (qiqj) is also labelled with some information: e.g., the probability that user moves from qi to qj, or the type of the transition, for instance, the fact that qj is a specialization of qi.

    In this paper we propose, and experimentally study, query recommendations based on short random walks on the query-flow graph. Our experiments show that these methods can match in precision, and often improve, recommendations based on query-click graphs, without using users' clicks. Our experiments also show that it is important to consider transition-type labels on edges for having good quality recommendations.

    Finally, one feature that we had in mind while devising our methods was that of providing diverse sets of recommendations: the experimentation that we conducted provides encouraging results in this sense.

    PDF version.
  12.  Djamal Belazzougui, Paolo Boldi, and Sebastiano Vigna. Dynamic z-fast tries. In Edgar Chávez and Stefano Lonardi, editors, String Processing and Information Retrieval - 17th International Symposium, SPIRE 2010, Los Cabos, Mexico, October 11-13, 2010. Proceedings, volume 6393 of Lecture Notes in Computer Science, pages 159−172. Springer, 2010.
    Abstract

    We describe a dynamic version of the z-fast trie, a new data structure inspired by the research started by the van Emde Boas trees and followed by the development of y-fast tries. The dynamic z-fast trie is a very simple, uniform data structure: given a set S of n variable-length strings, it is formed by a standard compacted trie on S (with two additional pointers per node), endowed with a dictionary of size n−1. With this simple setup, the dynamic z-fast trie provides predecessors/successors in time O(log max{ |x|, |x+|, |x-| }) (x± is the successor/predecessor of x in S) for strings of length linear in the machine-word size w. Prefix queries are answered in time O(log|x|+k), and range queries in time O(log max{ |x|, |y|, |x-|, |y+| }+k), where k is the number of elements in the output and x (and y) represent the input of the prefix (range) queries. Updates are performed within the same bounds in expectation (or with high probability using an appropriate dictionary). We then show a simple modification that makes it possible to handle strings of length up to 2w; in this case, predecessor/successor queries and updates are supported in O(|x|/w + log max{ |x|, |x+|, |x-| }) time, (and O(|x|/B + log max{ |x|, |x+|, |x-| }) I/Os in the cache-oblivious model) with high probability. The space occupied by a dynamic z-fast trie, beside that necessary to store S, is just of 12n pointers, n integers and, in the “long string” case, O(n) signatures of O(w) bits each.

  13.  Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Monotone minimal perfect hashing: Searching a sorted table with O(1) accesses. In Proceedings of the 20th Annual ACM-SIAM Symposium On Discrete Mathematics (SODA), pages 785−794, New York, 2009. ACM Press.
    Abstract

    A minimal perfect hash function maps a set S of n keys into the set {0,1,…, n − 1} bijectively. Classical results state that minimal perfect hashing is possible in constant time using a structure occupying space close to the lower bound of log e bits per element. Here we consider the problem of monotone minimal perfect hashing, in which the bijection is required to preserve the lexicographical ordering of the keys. A monotone minimal perfect hash function can be seen as a very weak form of index that provides ranking just on the set S (and answers randomly outside of S). Our goal is to minimise the description size of the hash function: we show that, for a set S of n elements out of a universe of 2w elements, O(n log log w) bits are sufficient to hash monotonically with evaluation time O(log w). Alternatively, we can get space O(n log w) bits with O(1) query time. Both of these data structures improve a straightforward construction with O(n log w) space and O(log w) query time. As a consequence, it is possible to search a sorted table with O(1) accesses to the table (using additional O(n log log w) bits). Our results are based on a structure (of independent interest) that represents a trie in a very compact way, but admits errors. As a further application of the same structure, we show how to compute the predecessor (in the sorted order of S) of an arbitrary element, using O(1) accesses in expectation and an index of O(n log w) bits, improving the trivial result of O(nw) bits. This implies an efficient index for searching a blocked memory.

    PDF version. The algorithms described in the paper are implemented in Sux4J.
  14.  Paolo Boldi, Francesco Bonchi, Carlos Castillo, Debora Donato, Aristides Gionis, and Sebastiano Vigna. The query-flow graph: model and applications. In Proc. of ACM 17th Conference on Information and Knowledge Management (CIKM), pages 609−618, Napa Valley, CA, USA, 2008. ACM Press.
    Abstract

    Query logs record the queries and the actions of the users of search engines, and as such they contain valuable information about the interests, the preferences, and the behavior of the users, as well as their implicit feedback to search-engine results. Mining the wealth of information available in the query logs has many important applications including query-log analysis, user profiling and personalization, advertising, query recommendation, and more.

    In this paper we introduce the query-flow graph, a graph representation of the interesting knowledge about latent querying behavior. Intuitively, in the query-flow graph a directed edge from query qi to query qj means that the two queries are likely to be part of the same “search mission”. Any path over the query-flow graph may be seen as a searching behavior, whose likelihood is given by the strength of the edges along the path.

    The query-flow graph is an outcome of query-log mining and, at the same time, a useful tool for it. We propose a methodology that builds such a graph by mining time and textual information as well as aggregating queries from different users. Using this approach we build a real-world query-flow graph from a large-scale query log and we demonstrate its utility in concrete applications, namely, finding logical sessions, and query recommendation. We believe, however, that the usefulness of the query-flow graph goes beyond these two applications.

    PDF version.
  15.  Paolo Boldi and Sebastiano Vigna. Efficient optimally lazy algorithms for minimal-interval semantics, 2007.
    Abstract

    Minimal-interval semantics associates with each query over a document a set of intervals, called witnesses, that are incomparable with respect to inclusion (i.e., they form an antichain): witnesses define the minimal regions of the document satisfying the query. Minimal-interval semantics makes it easy to define and compute several sophisticated proximity operators, provides snippets for user presentation, and can be used to rank documents. In this paper we provide algorithms for computing conjunction and disjunction that are linear in the number of intervals and logarithmic in the number of operands; for additional operators, such as ordered conjunction and Brouwerian difference, we provide linear algorithms. In all cases, space is linear in the number of operands. More importantly, we define a formal notion of optimal laziness, and either prove it, or prove its impossibility, for each algorithm. Optimal laziness implies that the algorithms do not assume random access to the input intervals, and read as little input as possible to produce a certain output. We cast our results in the general framework of antichain completions of interval orders, making our algorithms directly applicable to other domains.

    PDF version. The algorithms described in the paper are implemented in MG4J and LaMa4J.
  16.  Sebastiano Vigna. Stanford matrix considered harmful. In Andreas Frommer, Michael W. Mahoney, and Daniel B. Szyld, editors, Web Information Retrieval and Linear Algebra Algorithms, number 07071 in Dagstuhl Seminar Proceedings, 2007. http://arxiv.org/abs/0710.1962.
    PDF version.
  17.  Paolo Boldi, Massimo Santini, and Sebastiano Vigna. Permuting web graphs. In Konstantin Avrachenkov, Debora Donato, and Nelly Litvak, editors, Algorithms and Models for the Web-Graph, 6th International Workshop, WAW 2009, volume 5427 of Lecture Notes in Computer Science, pages 116−126. Springer, 2009.
    Abstract

    Since the first investigations on web graph compression, it has been clear that the ordering of the nodes of the graph has a fundamental influence on the compression rate (usually expressed as the number of bits per link). The authors of the LINK database, for instance, investigated three different approaches: an extrinsic ordering (URL ordering) and two intrinsic (or coordinate-free) orderings based on the rows of the adjacency matrix (lexicographic and Gray code); they concluded that URL ordering has many advantages in spite of a small penalty in compression. In this paper we approach this issue in a more systematic way, testing some old orderings and proposing some new ones. Our experiments are made in the WebGraph framework, and show that the compression technique and the structure of the graph can produce significantly different results. In particular, we show that for the transpose web graph URL ordering is significantly less effective, and that some new orderings combining host information and Gray/lexicographic orderings outperform all previous methods. In particular, in some large transposed graphs they yield the quite incredible compression rate of 1 bit per link.

    PDF version; datasets and Java™ implementations are available as free software at the LAW site.
  18.  Ilaria Bordino, Paolo Boldi, Debora Donato, Massimo Santini, and Sebastiano Vigna. Temporal evolution of the UK web. In ICDM Workshops, pages 909−918. IEEE Computer Society, 2008.
    Abstract

    Recently, a new temporal dataset has been made public: it is made of a series of twelve 100M pages snapshots of the .uk domain. The Web graphs of the twelve snapshots have been merged into a single time-aware graph that provide constant-time access to temporal information. In this paper we present the first statistical analysis performed on this graph, with the goal of checking whether the information contained in the graph is reliable (i.e., whether it depends essentially on appearance and disappearance of pages and links, or on the crawler behaviour). We perform a number of tests that show that the graph is actually reliable, and provide the first public data on the evolution of the Web that use a large scale and a significant diversity in the sites considered.

    PDF version; datasets and Java™ implementations are available as free software at the LAW site.
  19.  Sebastiano Vigna. Distributed, large-scale latent semantic analysis by index interpolation. In Ronny Lempel, Raffaele Perego, and Fabrizio Silvestri, editors, InfoScale '08: Proceedings of the 3rd international conference on Scalable information systems, Vico Equense, Italy, 2008. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering).
    Abstract

    Latent semantic analysis is a well-known technique to extrapolate concepts from a set of documents; it discards noise by reducing the rank of (a variant of) the term/document matrix of a document collection by singular value decomposition. The latter is performed by solving an equivalent symmetric eigenvector problem on a related matrix. Scaling to large set of documents, however, is problematic because every vector-matrix multiplication required by iterative solvers requires a number of multiplications equal to twice the number of postings of the collection. We show how to combine standard search-engine algorithmic tools in such a way to compute (reasonably) quickly the cooccurrence matrix C of a large document collection, and solve directly the associated symmetric eigenvector problem. Albeit the size of C is quadratic in the number of terms, we can distribute its computation among any number of computational unit without increasing the overall number of multiplications. Moreover, our approach is advantageous when the document collection is large, because the number of terms over which latent semantic analysis has to be performed is inherently limited by the size of a language lexicon. We present experiments over a collection with 3.6 billions of postings—two orders of magnitudes larger than any published experiment in the literature.

    PDF version.
  20.  Sebastiano Vigna. Broadword implementation of rank/select queries. In Catherine C. McGeoch, editor, Experimental Algorithms. 7th International Workshop, WEA 2008, number 5038 in Lecture Notes in Computer Science, pages 154−168. Springer−Verlag, 2008.
    Abstract

    Research on succinct data structures (data structures occupying space close to the information-theoretical lower bound, but achieving speed similar to their standard counterparts) has steadily increased in the last few years. However, many theoretical constructions providing asymptotically optimal bounds are unusable in practice because of the very large constants involved. The study of practical implementations of the basic building blocks of such data structures is thus fundamental to obtain practical applications. In this paper we argue that 64-bit and wider architectures are particularly suited to very efficient implementations of rank (counting the number of ones up to a given position) and select (finding the position of the i-th bit set), two essential building blocks of all succinct data structures. Contrarily to typical 32-bit approaches, involving precomputed tables, we use pervasively broadword (a.k.a. SWAR—“SIMD in A Register”) programming, which compensates the constant burden associated to succinct structures by solving problems in parallel in a register. We provide an implementation named rank9 that addresses 264 bits, consumes less space and is significantly faster then current state-of-the-art 32-bit implementations, and a companion select9 structure that selects in nearly constant time using only access to aligned data. For sparsely populated arrays, we provide a simple broadword implementation of the Elias–Fano representation of monotone sequences. In doing so, we develop broadword algorithms to perform selection in a word or in a sequence of words that are of independent interest.

    PDF version; a complete C++ and Java™ implementation is available at the Sux project home page.
  21.  Paolo Boldi, Roberto Posenato, Massimo Santini, and Sebastiano Vigna. Traps and pitfalls of topic-biased PageRank. In William Aiello, Andrei Broder, Jeannette Janssen, and Evangelos Milios, editors, WAW 2006. Fourth Workshop on Algorithms and Models for the Web-Graph, volume 4936 of Lecture Notes in Computer Science, pages 107−116. Springer−Verlag, 2008.
    Abstract

    We discuss a number of issues in the definition, computation and comparison of PageRank values that have been addressed sparsely in the literature, often with contradictory approaches. We study the difference between weakly and strongly preferential PageRank, which patch the dangling nodes with different distributions, extending analytical formulae known for the strongly preferential case, and corroborating our results with experiments on a snapshot of 100 millions of pages of the .uk domain. The experiments show that the two PageRank versions are poorly correlated, and results about each one cannot be blindly applied to the other; moreover, our computations highlight some new concerns about the usage of exchange-based correlation indices (such as Kendall's τ) on approximated rankings.

    PDF version.
  22.  Carlos Castillo, Debora Donato, Luca Becchetti, Paolo Boldi, Stefano Leonardi, Massimo Santini, and Sebastiano Vigna. A reference collection for web spam. SIGIR Forum, 40(2):11−24, 2006.
    Abstract

    We describe the WEBSPAM-UK2006 collection, a large set of Web pages that have been manually annotated with labels indicating if the hosts are include Web spam aspects or not. This is the first publicly available Web spam collection that includes page contents and links, and that has been labelled by a large and diverse set of judges.

  23.  Paolo Boldi and Sebastiano Vigna. Compressed perfect embedded skip lists for quick inverted-index lookups. In Proc. SPIRE 2005, volume 3772 of Lecture Notes in Computer Science, pages 25−28. Springer−Verlag, 2005.
    Abstract

    Large inverted indices are by now common in the construction of web-scale search engines. For faster access, inverted indices are indexed internally so that it is possible to skip quickly over unnecessary documents. The classical approach to skipping dictates that a skip should be positioned every f1/2 document pointers, where f is the overall number of documents where the term appears. We argue that due to the growing size of the web more refined techniques are necessary, and describe how to embed a compressed perfect skip list in an inverted list. We provide statistical models that explain the empirical distribution of the skip data we observe in our experiments, and use them to devise good compression techniques that allow us to limit the waste in space, so that the resulting data structure increases the overall index size by just a few percents, still making it possible to index pointers with a rather fine granularity.

    PDF version of the extended draft.
  24.  Sebastiano Vigna. TruRank: Taking PageRank to the limit. In Fourteenth International World Wide Web Conference (WWW 2005), Special Interest Tracks & Posters, pages 976−977, Chiba, Japan, 2005. ACM Press.
    Abstract

    PageRank is defined as the stationary state of a Markov chain depending on a damping factor α that spreads uniformly part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion α = 0.85 by Brin and Page is still used. It is common belief that values of α closer to 1 give a “truer to the web” PageRank, but a small α accelerates convergence. Recently, however, it has been shown that when α = 1 all pages in the core component are very likely to have rank 0. This behaviour makes it difficult to understand PageRank when α ≈ 1, as it converges to a meaningless value for most pages. We propose a simple and natural modification to the standard preprocessing performed on the adjacency matrix of the graph, resulting in a ranking scheme we call TruRank. TruRank ranks the web with principles almost identical to PageRank, but it gives meaningful values also when α ≈ 1.

    PDF version; PostScript version.
  25.  Paolo Boldi and Sebastiano Vigna. Efficient lazy algorithms for minimal-interval semantics. In Fabio Crestani, Paolo Ferragina, and Mark Sanderson, editors, Proc. SPIRE 2006, number 4209 in Lecture Notes in Computer Science, pages 134−149. Springer−Verlag, 2006.
    Abstract

    Minimal-interval semantics associates with each query over a document a set of intervals, called witnesses, that are incomparable with respect to inclusion (i.e., they form an antichain): witnesses define the minimal regions of the document satisfying the query. Minimal-interval semantics makes it easy to define and compute several sophisticated proximity operators, provides snippets for user presentation, and can be used to rank documents: thus, computing efficiently the antichains obtained by operations such as logic conjunction and disjunction is a basic issue. In this paper we provide the first algorithms for computing such operators that are linear in the number of intervals and logarithmic in the number of input antichains. The space used is linear in the number of antichains. Moreover, the algorithms are lazy—they do not assume random access to the input antichains. These properties make the usage of our algorithms feasible in large-scale web search engines.

    The algorithms described in this paper are by now obsolete. Please refer to our new paper.
  26.  Paolo Boldi, Massimo Santini, and Sebastiano Vigna. PageRank as a function of the damping factor. In Proc. of the Fourteenth International World Wide Web Conference (WWW 2005), pages 557−566, Chiba, Japan, 2005. ACM Press.
    Abstract

    PageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor α that spreads uniformly part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion α=0.85 by Brin and Page is still used. Recently, however, the behaviour of PageRank with respect to changes in α was discovered to be useful in link-spam detection. Moreover, an analytical justification of the value chosen for α is still missing. In this paper, we give the first mathematical analysis of PageRank when α changes. In particular, we show that, contrarily to popular belief, for real-world graphs values of α close to 1 do not give a more meaningful ranking. Then, we give closed-form formulae for PageRank derivatives of any order, and an extension of the Power Method that approximates them with convergence O(tk αt) for the k-th derivative. Finally, we show a tight connection between iterated computation and analytical behaviour by proving that the k-th iteration of the Power Method gives exactly the PageRank value obtained using a Maclaurin polynomial of degree k. The latter result paves the way towards the application of analytical methods to the study of PageRank.

    PDF version; PostScript version.
  27.  Sebastiano Vigna. Reachability problems in entity-relationship schema instances. In Conceptual Modeling—ER 2004. 23rd International Conference on Conceptual Modeling, number 3288 in Lecture Notes in Computer Science, pages 96−109. Springer−Verlag, 2004.
    Abstract

    Recent developments in reification of ER schemata include automatic generation of web-based database administration systems [ERW]. These systems enforce the schema cardinality constraints, but, beyond unsatisfiable schemata, this feature may create unreachable instances. We prove sound and complete characterisations of schemata whose instances satisfy suitable reachability properties; these theorems translate into linear algorithms that can be used to prevent the administrator from reifying schemata with unreachable instances.

    PDF version; PostScript version; to download ERW please look at the ERW home page.
  28.  Paolo Boldi and Sebastiano Vigna. The WebGraph framework I: Compression techniques. In Proc. of the Thirteenth International World Wide Web Conference (WWW 2004), pages 595−601, Manhattan, USA, 2004. ACM Press.
    Abstract

    Studying web graphs is often difficult due to their large size. Recently, several proposals have been published about various techniques that allow to store a web graph in memory in a limited space, exploiting the inner redundancies of the web. The WebGraph framework is a suite of codes, algorithms and tools that aims at making it easy to manipulate large web graphs. This papers presents the compression techniques used in WebGraph, which are centred around referentiation and intervalisation (which in turn are dual to each other). WebGraph can compress the WebBase graph (118 Mnodes, 1 Glinks) in as little as 3.08 bits per link, and its transposed version in as little as 2.89 bits per link.

    PDF version; PostScript version; to download software and data sets, please look at the WebGraph home page.
  29.  Paolo Boldi and Sebastiano Vigna. WebGraph: Things you thought you could not do with Java™. In Proc. of the 3rd International Conference on Principles and Practice of Programming in Java, ACM International Conference Proceedings Series, pages 1−8, Las Vegas, Nevada, USA, 2004. Computer Science Press, Trinity College, Dublin, Ireland.
  30.  Sebastiano Vigna. Automatic generation of content management systems from EER-based specifications. In 18th IEEE International Conference on Automated Software Engineering, pages 259−262, Montréal, Québec, Canada, 2003. IEEE Press. Short paper.
    Abstract

    ERW is an innovative open-source system for handling complex databases using a web browser. Once the details of an enhanced entity-relationship schema have been specified in XML, ERW generates a complete application that lets the user interact with the database. Then, specification percolation makes it possible to customise heavily the application while maintaining the flexibility of a model-driven approach.

    PDF version; PostScript version; to download ERW please look at the ERW home page.
  31.  Paolo Boldi and Sebastiano Vigna. Rethinking Java strings. In Proc. of the 2nd International Conference on Principles and Practice of Programming in Java, ACM International Conference Proceedings Series, pages 27−30, Kilkenny City, Ireland, 2003. Computer Science Press, Inc., New York, NY, USA.
    Abstract

    The Java string classes, String and StringBuffer, lie at the extremes of a spectrum (immutable, reference-based and mutable, content-based). Motivated by data-intensive text applications, we propose a new string class, MutableString, which tries to embody the best of both approaches.

    PDF version; PostScript version; to download MutableString please look at the DSI utilities home page.
  32.  Sebastiano Vigna. Multirelational semantics for extended entity-relationship schemata with applications. In Conceptual Modeling—ER 2002. 21st International Conference on Conceptual Modeling, number 2503 in Lecture Notes in Computer Science, pages 35−49. Springer−Verlag, 2002.
    Abstract

    This paper describes a multirelation semantics for a fragment of the Extended Entity-Relationship schemata formalism based on the bicategorical definition of multirelations. The approach we follow is elementary—we try to introduce as few notions as possible. We claim that bicategorical algebra handles gracefully multirelations and their operations, and that multirelations are essential in a number of applications; moreover, the bicategorical composition of multirelations turns out to correspond to natural joins. From the formal semantics we derive an algorithm that can establish statically the possibility of building parallel ownership paths of weak entities. The ideas described in this paper have been implemented in a free tool, ERW, which lets users edit sets and multirelations instantiating an EER schema via a sophisticated web interface.

    PDF version; PostScript version; to download ERW please look at the ERW home page.
  33.  Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. UbiCrawler: Scalability and fault-tolerance issues. In Poster Proc. of Eleventh International World Wide Web Conference, Honolulu, USA, 2002.
    Abstract

    We report on the implementation of UbiCrawler, a scalable distributed web crawler, analyze its performance, and describe its fault tolerance.

    Online version; Downloadable version; PDF version; PostScript version.
  34.  Sebastiano Vigna. ERW: Entities and relationships on the web. In Poster Proc. of Eleventh International World Wide Web Conference, Honolulu, USA, 2002.
    Abstract

    ERW is an innovative system for handling complex databases using a web browser. It uses the most recent standards endorsed by the W3C to offer to the user a sophisticated environment, similar to a dedicated client. Moreover, the user interface is generated in a completely automatic way starting from a conceptual description of the database by means of an XML-based language for entity-relationship schemata.

    Online version; Downloadable version; PDF version; PostScript version; to download ERW please look at the ERW home page.
  35.  Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. Structural properties of the African web. In Poster Proc. of Eleventh International World Wide Web Conference, Honolulu, USA, 2002.
    Abstract

    In this poster we illustrate some data about the African web. These data have been collected using UbiCrawler, a distributed Web crawler designed and developed by the authors.

    Online version; Downloadable version; PDF version; PostScript version.
  36.  Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. Trovatore: Towards a highly scalable distributed web crawler. In Poster Proc. of Tenth International World Wide Web Conference, pages 140−141, Hong Kong, China, 2001. Winner of the Best Poster Award.
    Abstract

    Trovatore is an ongoing project aimed at realizing an efficient distributed and highly scalable web crawler. This poster illustrates the main ideas behind its design.

    PDF version; PostScript version.
  37.  Paolo Boldi and Sebastiano Vigna. Holographic trees. In LATIN 2002, number 2286 in Lecture Notes in Computer Science, pages 465−478, Cancún, México, 2002. Springer−Verlag.
    Abstract

    It is known that computations of anonymous networks can be reduced to the construction of a certain graph, the minimum base of the network. The crucial step of this construction is the inference of the minimum base from a finite tree that each processor can build (its truncated view). We isolate those trees that make this inference possible, and call them holographic. Intuitively, a tree is holographic if it is enough self-similar to be uniquely extendible to an infinite tree. This possibility depends on a size function for the class of graphs under examination, which we call a holographic bound for the class. Holographic bounds give immediately, for instance, bounds for the quiescence time of self-stabilizing protocols. In this paper we give weakly tight holographic bounds for some classes of graphs.

    PDF version; PostScript version.
  38.  Paolo Boldi and Sebastiano Vigna. An effective characterization of computability in anonymous networks. In Jennifer L. Welch, editor, Distributed Computing. 15th International Conference, DISC 2001, number 2180 in Lecture Notes in Computer Science, pages 33−47. Springer−Verlag, 2001.
    Abstract

    We provide effective (i.e., recursive) characterizations of the relations that can be computed on networks where all processors use the same algorithm, start from the same state, and know at least a bound on the network size. Three activation models are considered (synchronous, asynchronous, interleaved).

    PDF version; PostScript version.
  39.  Charles Meyssonnier, Paolo Boldi, and Sebastiano Vigna. δ-approximable functions. In Proc. CCA 2000 (Computability and Complexity in Analysis), number 2064 in Lecture Notes in Computer Science, pages 187−199, Swansea, Wales, UK, 2001. Springer−Verlag.
    Abstract

    In this paper we study several notions of approximability of functions in the framework of the BSS model. Denoting with φMδ the function computed by a BSS machine M when its comparisons are against δ rather than 0, we study classes of functions f for which φMδf in some sense (pointwise, uniformly, etc.). The main equivalence results show that this notion coincides with Type 2 computability when the convergence speed is recursively bounded. Finally, we study the possibility of extending these results to computations over Archimedean fields.

    PDF version; PostScript version.
  40.  Paolo Boldi and Sebastiano Vigna. Computing anonymously with arbitrary knowledge. In Proc. 18th ACM Symposium on Principles of Distributed Computing, pages 181−188. ACM Press, 1999.
    Abstract

    We provide characterizations of the relations that can be computed with arbitrary knowledge on networks where all processors use the same algorithm and start from the same state (in particular, we do not assume that a bound on the network size is known). Three activation models are considered (synchronous, asynchronous, interleaved).

    PDF version; PostScript version.
  41.  Paolo Boldi and Sebastiano Vigna. Self-stabilizing universal algorithms. In Sukumar Ghosh and Ted Herman, editors, Self−Stabilizing Systems (Proc. of the 3rd Workshop on Self−Stabilizing Systems, Santa Barbara, California, 1997), volume 7 of International Informatics Series, pages 141−156. Carleton University Press, 1997.
  42.  Paolo Boldi, Bruno Codenotti, Peter Gemmell, Shella Shammah, Janos Simon, and Sebastiano Vigna. Symmetry breaking in anonymous networks: Characterizations. In Proc. 4th Israeli Symposium on Theory of Computing and Systems, pages 16−26. IEEE Press, 1996.
  43.  Paolo Boldi and Sebastiano Vigna. On some constructions which preserve sense of direction. In Nicola Santoro and Paul Spirakis, editors, Structure, Information and Communication Complexity. Proc. 3rd Colloquium SIROCCO '96, volume 6 of International Informatics Series, pages 47−57. Carleton University Press, 1997.
  44.  Paolo Boldi and Sebastiano Vigna. Computing vector functions on anonymous networks. In Danny Krizanc and Peter Widmayer, editors, SIROCCO '97. Proc. 4th International Colloquium on Structural Information and Communication Complexity, volume 1 of Proceedings in Informatics, pages 201−214. Carleton Scientific, 1997.
  45.  Pierpaolo Degano, Roberto Gorrieri, and Sebastiano Vigna. On relating some models for concurrency. In TAPSOFT '93: Theory and Practice of Software Development, 4th International Joint Conference CAAP/FASE, number 668 in Lecture Notes in Computer Science, pages 15−30, Orsay, France, 1993.
  46.  Luca Bernardinello, Giorgio De Michelis, Katia Petruni, and Sebastiano Vigna. On the synchronic structure of transition systems. In Jorg Desel, editor, Structures in Concurrency Theory. Proceedings of the International Workshop on Structures in Concurrency Theory (STRICT), Workshops in Computing, pages 69−84, Berlin, 1995. Springer−Verlag.
  47.  Nicoletta Sabadini, Sebastiano Vigna, and Robert F.C. Walters. A notion of refinement for automata. In M. Nivat, C. Rattray, T. Rus, and G. Scollo, editors, Algebraic Methodology and Software Technology (AMAST'93), Proceedings of the Third International Conference on Algebraic Methodology and Software Technology, Workshops in Computing, pages 327−334, University of Twente, The Netherlands, 1993. Springer−Verlag.
  48.  Stefano Kasangian and Sebastiano Vigna. Trees in a distributive category. In Proceedings of the Category Theory '90 Conference in Como, number 1488 in Lecture Notes in Mathematics, pages 237−248. Springer−Verlag, 1991.
  49.  Stefano Kasangian and Sebastiano Vigna. Introducing a calculus of trees. In Proceedings of the International Joint Conference on Theory and Practice of Software Development (TAPSOFT/CAAP '91), number 493 in Lecture Notes in Computer Science, pages 215−240. Springer−Verlag, 1991.
  50.  Pierpaolo Degano, Stefano Kasangian, and Sebastiano Vigna. Applications of the calculus of trees to process description languages. In Proceedings of the CTCS '91 Conference, number 530 in Lecture Notes in Computer Science, pages 282−301, 1991.

Miscellaneous

  1.  Paolo Boldi, Massimo Santini, and Sebastiano Vigna. A large time-aware graph. SIGIR Forum, 42(2):33−38, 2008.
    Abstract

    We describe the techniques developed to gather and distribute in a highly compressed, yet accessible, form a series of twelve snapshot of the .uk web domain. Ad hoc compression techniques made it possible to store the twelve snapshots using just 1.9 bits per link, with constant-time access to temporal information. Our collection makes it possible to study the temporal evolution link-based scores (e.g., PageRank), the growth of online communities, and in general time-dependent phenomena related to the link structure.

    PDF version; datasets and Java™ implementations are available as free software at the LAW site.
  2.  Alessio Orlandi and Sebastiano Vigna. Compressed collections for simulated crawling. SIGIR Forum, 42(2):39−44, 2008.
    Abstract

    Collections are a fundamental tool for reproducible evaluation of information retrieval techniques. We describe a new method for distributing the document lengths and term counts (a.k.a. within-document frequencies) of a web snapshot in a highly compressed and nonetheless quickly accessible form. Our main application is reproducibility of the behaviour of focused crawlers: by coupling our collection with the corresponding web graph compressed with WebGraph we make it possible to apply text-based machine learning tools to the collection, while keeping the data set footprint small. Finally, we describe a collection based on a crawl of 100Mpages of the .uk domain, publicly available in bundle with a Java open-source implementation of our techniques.

    PDF version; datasets and Java™ implementations are available as free software at the LAW site.
  3.  Paolo Boldi and Sebastiano Vigna. A tool for optimal weak sense of direction. Note del Polo (ricerca) 27, Università di Milano, 2000.
    Abstract

    Weak sense of direction is a property of the labelling of (possibly anonymous) networks that allows one to assign coherently local identifiers to other processors on the basis of the route followed by incoming messages. Though there exists a linear algorithm that allows one to assign a weak sense of direction to any given network, the number of colours used in such construction may be as large as the number of processors. It is an open, difficult, yet intriguing problem that of establishing an optimal weak sense of direction for a given graph, that is, a weak sense of direction using as few colours as possible. To attack this problem, we have developed an implicit enumeration algorithm that searches for optimal weak sense of direction; the algorithm is implemented in a publicly available tool, optwsod. Although optwsod can only deal with graphs that are reasonably small, we think that the results obtained (some examples are presented at the end of the paper) are sometimes surprising and may be used as an inspiration to determine optimal weak sense of direction for classes of graphs. This paper reports some details of the design and implementation of optwsod, and some experimental data about its behaviour.

    PDF version; PostScript version. You can also download the tool.
  4.  Sebastiano Vigna. A guided tour in the topos of graphs. Technical Report 199-97, Università di Milano, Dipartimento di Scienze dell'Informazione, 1997.
    Abstract

    In this paper we survey the fundamental constructions of a presheaf topos in the case of the elementary topos of graphs. We prove that the transition graphs of nondeterministic automata (a.k.a. labelled transition systems) are the separated presheaves for the double negation topology, and obtain as an application that their category is a quasitopos.

    PDF version; PostScript version.
  5.  Sebastiano Vigna. On generalized regions. Technical Report 198-97, Università di Milano, 1997.
  6.  Sebastiano Vigna. Distributive Computability. PhD thesis, Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, 1995.
    Abstract

    This thesis presents a series of theoretical results and practical realisations about the theory of computation in distributive categories. Distributive categories have been proposed as a foundational tool for Computer Science in the last years, starting from the papers of R.F.C. Walters. We shall focus on two major topics: distributive computability, i.e., a generalized theory of computability based on distributive categories, and the Imp(G) language, which is a language based on the syntax of distributive categories. The link between the former and the latter is that the functions computed by Imp(G) programs are exactly the distributively computable functions.

    PDF version; PostScript version.