Class Index
- java.lang.Object
-
- it.unimi.di.big.mg4j.index.Index
-
- All Implemented Interfaces:
Serializable
- Direct Known Subclasses:
BitStreamIndex, IndexCluster, QuasiSuccinctIndex
public abstract class Index extends Object implements Serializable
An abstract representation of an index.
Concrete subclasses of this class represent abstract index access information: for instance, the basename or IP address/port, flags, etc. This class makes it easy to build index readers over the index; index readers, in turn, provide document iterators.
This class contains just method declarations and attributes for all data that is common to any form of index. Note that we use an abstract class, rather than an interface, because interfaces do not allow declaring attributes.
We provide static factory methods (e.g., getInstance(CharSequence)) that return an index given a suitable URI string. If the scheme part is mg4j, then the URI is assumed to point at a remote index. Otherwise, it is assumed to be the basename of a local index. In both cases, a query part introduced by ? can specify additional parameters (key=value pairs separated by ;). For instance, the URI example?inmemory=1 will load the index with basename example, caching its content in core memory. Please have a look at the constants in Index.UriKeys (and analogous enums in subclasses) for additional parameters.
If the index is local, by convention this class will locate a property file with extension DiskBasedIndex.PROPERTIES_EXTENSION that is expected to contain a number of key/value pairs (which are quite informative and can be examined manually). In particular, the key Index.PropertyKeys.INDEXCLASS explains which kind of index class should be used to read the index. The file might contain additional keys depending on the value of Index.PropertyKeys.INDEXCLASS (e.g., QuasiSuccinctIndex.PropertyKeys.BYTEORDER).
An index usually exposes term or prefix maps and the size list, but this is not compulsory (the latter, in particular, is necessary with certain codings).
Thread safety
Indices are a natural candidate for multithreaded access. An instance of this class must be thread safe as long as the external data structures provided to its constructors are. For instance, the tool IndexBuilder generates a synchronized ImmutableExternalPrefixMap so that by default the resulting index is thread safe.
As another example, a DiskBasedIndex requires a list of term offsets, term maps, etc. As long as all these data structures are thread safe, the same is true of the index. Data structures created by static factory methods such as DiskBasedIndex.getInstance(CharSequence) are thread safe.
Note that IndexReaders returned by getReader() are not thread safe (even if the method getReader() is). The logic behind this arrangement is that you create as many readers as you need, and then Closeable.close() them. In a multithreaded environment, a pool of index readers can be created, and a custom QueryBuilderVisitor can be used to build DocumentIterators using the given pool of readers. In this case readers are not closed, but rather reused.
Read-once load
Implementations of this class are strongly encouraged to offer read-once constructors and factory methods: property files and other data related to the index (but not to an IndexReader) should be read exactly once, and sequentially. This feature is very useful when combining indices.
- Since:
- 0.9
- Author:
- Paolo Boldi, Sebastiano Vigna
- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes
- class Index.EmptyIndexIterator: An iterator returning no documents based on this index.
- static class Index.PropertyKeys: Symbolic names for properties of an Index.
- static class Index.UriKeys: Keys to be used (downcased) in specifying additional parameters to an MG4J URI.
-
Field Summary
Fields
- String field: The field indexed by this index, or null.
- boolean hasCounts: Whether this index contains counts.
- boolean hasPayloads: Whether this index contains payloads; if true, payload is non-null.
- boolean hasPositions: Whether this index contains positions.
- Index keyIndex: The index used as a key to retrieve intervals.
- int maxCount: The maximum number of positions in a position list, or possibly -1 if this index does not have positions.
- long numberOfDocuments: The number of documents of the collection.
- long numberOfOccurrences: The number of occurrences of the collection, or possibly -1 if it is unknown.
- long numberOfPostings: The number of postings (pairs term/document) of the collection.
- long numberOfTerms: The number of terms of the collection.
- Payload payload: The payload for this index, or null.
- PrefixMap<? extends CharSequence> prefixMap: The prefix map for this index, or null if the prefix map was not loaded.
- Properties properties: The properties of this index.
- ReferenceSet<Index> singletonSet: An immutable singleton set containing just keyIndex.
- IntBigList sizes: The size of each document, or null if sizes are not necessary or not loaded in this index.
- StringMap<? extends CharSequence> termMap: The term map for this index, or null if the term map was not loaded.
- TermProcessor termProcessor: The term processor used to build this index.
-
Constructor Summary
Constructors
- protected Index(long numberOfDocuments, long numberOfTerms, long numberOfPostings, long numberOfOccurrences, int maxCount, Payload payload, boolean hasCounts, boolean hasPositions, TermProcessor termProcessor, String field, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, IntBigList sizes, Properties properties): Creates a new instance, initialising all fields.
-
Method Summary
Methods
- IndexIterator documents(long term): Creates a new IndexReader for this index and uses it to return an index iterator over the documents containing a term.
- IndexIterator documents(CharSequence term): Creates a new IndexReader for this index and uses it to return an index iterator over the documents containing a term; the term is given explicitly, and the index term map is used, if present.
- IndexIterator documents(CharSequence prefix, int limit): Creates a number of instances of IndexReader for this index and uses them to return a MultiTermIndexIterator over the documents containing any term out of a set of terms defined by a prefix; the prefix is given explicitly, and unless the index has a prefix map, an UnsupportedOperationException will be thrown.
- IndexIterator getEmptyIndexIterator()
- IndexIterator getEmptyIndexIterator(long term)
- IndexIterator getEmptyIndexIterator(CharSequence term)
- IndexIterator getEmptyIndexIterator(CharSequence term, long termNumber)
- static Index getInstance(IOFactory ioFactory, CharSequence uri, boolean randomAccess, boolean documentSizes, boolean maps): Returns a new index using the given URI.
- static Index getInstance(CharSequence uri): Returns a new index using the given URI, searching dynamically for term and prefix maps, loading offsets but loading document sizes only if it is necessary.
- static Index getInstance(CharSequence uri, boolean randomAccess): Returns a new index using the given URI, searching dynamically for term and prefix maps and loading document sizes only if it is necessary.
- static Index getInstance(CharSequence uri, boolean randomAccess, boolean documentSizes): Returns a new index using the given URI, searching dynamically for term and prefix maps.
- static Index getInstance(CharSequence uri, boolean randomAccess, boolean documentSizes, boolean maps): Returns a new index using the given URI and no IOFactory.
- IndexReader getReader(): Creates and returns a new IndexReader based on this index, using the default buffer size.
- abstract IndexReader getReader(int bufferSize): Creates and returns a new IndexReader based on this index.
- protected static TermProcessor getTermProcessor(Properties properties)
- void keyIndex(Index newKeyIndex): Sets the index used as a key to retrieve intervals from iterators generated from this index.
-
-
-
Field Detail
-
field
public final String field
The field indexed by this index, or null.
-
properties
public final Properties properties
The properties of this index. It is stored here for convenience (for instance, if custom keys are added to the property file), but it may be null.
-
numberOfDocuments
public final long numberOfDocuments
The number of documents of the collection.
-
numberOfTerms
public final long numberOfTerms
The number of terms of the collection. This field might be set to -1 in some cases (for instance, in certain documental clusters).
-
numberOfOccurrences
public final long numberOfOccurrences
The number of occurrences of the collection, or possibly -1 if it is unknown.
-
numberOfPostings
public final long numberOfPostings
The number of postings (pairs term/document) of the collection.
-
maxCount
public final int maxCount
The maximum number of positions in a position list, or possibly -1 if this index does not have positions.
-
payload
public final Payload payload
The payload for this index, or null.
-
hasPayloads
public final boolean hasPayloads
Whether this index contains payloads; if true, payload is non-null.
-
hasCounts
public final boolean hasCounts
Whether this index contains counts.
-
hasPositions
public final boolean hasPositions
Whether this index contains positions.
-
termProcessor
public final TermProcessor termProcessor
The term processor used to build this index.
-
singletonSet
public ReferenceSet<Index> singletonSet
An immutable singleton set containing just keyIndex.
-
keyIndex
public Index keyIndex
The index used as a key to retrieve intervals. Usually equal to this, but it is settable.
-
termMap
public final StringMap<? extends CharSequence> termMap
The term map for this index, or null if the term map was not loaded.
-
prefixMap
public final PrefixMap<? extends CharSequence> prefixMap
The prefix map for this index, or null if the prefix map was not loaded.
-
sizes
public final IntBigList sizes
The size of each document, or null if sizes are not necessary or not loaded in this index.
-
-
Constructor Detail
-
Index
protected Index(long numberOfDocuments, long numberOfTerms, long numberOfPostings, long numberOfOccurrences, int maxCount, Payload payload, boolean hasCounts, boolean hasPositions, TermProcessor termProcessor, String field, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, IntBigList sizes, Properties properties)
Creates a new instance, initialising all fields.
-
-
Method Detail
-
getTermProcessor
protected static TermProcessor getTermProcessor(Properties properties)
-
getInstance
public static Index getInstance(IOFactory ioFactory, CharSequence uri, boolean randomAccess, boolean documentSizes, boolean maps) throws IOException, org.apache.commons.configuration.ConfigurationException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Returns a new index using the given URI.
- Parameters:
ioFactory - the factory that will be used to perform I/O, or null (implying the IOFactory.FILESYSTEM_FACTORY for disk-based indices).
uri - the URI defining the index.
randomAccess - whether the index should be accessible randomly.
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
maps - if true, term and prefix maps will be guessed and loaded (this feature might not be available with some kinds of index).
- Throws:
IOException
org.apache.commons.configuration.ConfigurationException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
-
getInstance
public static Index getInstance(CharSequence uri, boolean randomAccess, boolean documentSizes, boolean maps) throws IOException, org.apache.commons.configuration.ConfigurationException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Returns a new index using the given URI and no IOFactory.
- Parameters:
uri - the URI defining the index.
randomAccess - whether the index should be accessible randomly.
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
maps - if true, term and prefix maps will be guessed and loaded (this feature might not be available with some kinds of index).
- Throws:
IOException
org.apache.commons.configuration.ConfigurationException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
-
getInstance
public static Index getInstance(CharSequence uri, boolean randomAccess, boolean documentSizes) throws IOException, org.apache.commons.configuration.ConfigurationException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Returns a new index using the given URI, searching dynamically for term and prefix maps.
- Parameters:
uri - the URI defining the index.
randomAccess - whether the index should be accessible randomly.
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
- Throws:
IOException
org.apache.commons.configuration.ConfigurationException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
- See Also:
getInstance(CharSequence, boolean, boolean, boolean)
-
getInstance
public static Index getInstance(CharSequence uri, boolean randomAccess) throws org.apache.commons.configuration.ConfigurationException, IOException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Returns a new index using the given URI, searching dynamically for term and prefix maps and loading document sizes only if it is necessary.
- Parameters:
uri - the URI defining the index.
randomAccess - whether the index should be accessible randomly.
- Throws:
org.apache.commons.configuration.ConfigurationException
IOException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
- See Also:
getInstance(CharSequence, boolean, boolean)
-
getInstance
public static Index getInstance(CharSequence uri) throws org.apache.commons.configuration.ConfigurationException, IOException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Returns a new index using the given URI, searching dynamically for term and prefix maps, loading offsets but loading document sizes only if it is necessary.
- Parameters:
uri - the URI defining the index.
- Throws:
org.apache.commons.configuration.ConfigurationException
IOException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
- See Also:
getInstance(CharSequence, boolean)
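As a rough illustration of the URI convention accepted by the getInstance() methods (a basename or mg4j URI, optionally followed by ? and key=value pairs separated by ;), the following minimal sketch splits such a string using only the standard library. The class IndexUriSketch and its methods are hypothetical and are not part of MG4J; the library's own parsing may differ, for instance in how it validates keys against Index.UriKeys.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not MG4J API): splits an index URI into basename and parameters. */
final class IndexUriSketch {

    /** Returns the part of the URI that precedes the optional '?' query part. */
    static String basename(final CharSequence uri) {
        final String s = uri.toString();
        final int q = s.indexOf('?');
        return q < 0 ? s : s.substring(0, q);
    }

    /** Returns the key=value pairs of the query part (separated by ';'), in order of appearance. */
    static Map<String, String> parameters(final CharSequence uri) {
        final Map<String, String> result = new LinkedHashMap<>();
        final String s = uri.toString();
        final int q = s.indexOf('?');
        if (q < 0) return result; // No query part: no additional parameters.
        for (final String pair : s.substring(q + 1).split(";")) {
            if (pair.isEmpty()) continue;
            final int eq = pair.indexOf('=');
            if (eq < 0) throw new IllegalArgumentException("Missing '=' in parameter: " + pair);
            result.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return result;
    }

    public static void main(final String[] args) {
        // The URI used as an example in the class documentation.
        System.out.println(basename("example?inmemory=1"));
        System.out.println(parameters("example?inmemory=1"));
    }
}
```

With the URI from the class documentation, basename() yields example and parameters() yields a single entry mapping inmemory to 1.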
-
getEmptyIndexIterator
public IndexIterator getEmptyIndexIterator()
-
getEmptyIndexIterator
public IndexIterator getEmptyIndexIterator(long term)
-
getEmptyIndexIterator
public IndexIterator getEmptyIndexIterator(CharSequence term)
-
getEmptyIndexIterator
public IndexIterator getEmptyIndexIterator(CharSequence term, long termNumber)
-
getReader
public IndexReader getReader() throws IOException
Creates and returns a new IndexReader based on this index, using the default buffer size. After that, you can use the reader to read this index.
- Returns:
a new IndexReader to read this index.
- Throws:
IOException
-
getReader
public abstract IndexReader getReader(int bufferSize) throws IOException
Creates and returns a new IndexReader based on this index. After that, you can use the reader to read this index.
- Parameters:
bufferSize - the size of the buffer to be used accessing the reader, or -1 for a default buffer size.
- Returns:
a new IndexReader to read this index.
- Throws:
IOException
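The class documentation suggests pooling readers in multithreaded code, since the readers returned by getReader() are not thread safe. A minimal sketch of such a pool, using only the standard library, might look as follows. ReaderPoolSketch, borrow() and giveBack() are hypothetical names, not MG4J API; in real use the factory would presumably wrap a call to getReader(), whose IOException would need handling because Supplier cannot throw checked exceptions.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical pool (not MG4J API): each borrowed reader is used by one thread at a time. */
final class ReaderPoolSketch<R> {
    /** Readers that are currently not in use. */
    private final ConcurrentLinkedQueue<R> idle = new ConcurrentLinkedQueue<>();
    /** Creates a new reader when no idle one is available. */
    private final Supplier<R> factory;

    ReaderPoolSketch(final Supplier<R> factory) {
        this.factory = factory;
    }

    /** Returns an idle reader if there is one, otherwise a newly created reader. */
    R borrow() {
        final R reader = idle.poll();
        return reader != null ? reader : factory.get();
    }

    /** Hands a reader back for later reuse instead of closing it. */
    void giveBack(final R reader) {
        idle.add(reader);
    }

    public static void main(final String[] args) {
        final AtomicInteger created = new AtomicInteger();
        // Plain objects stand in for readers in this demonstration.
        final ReaderPoolSketch<Object> pool = new ReaderPoolSketch<>(() -> {
            created.incrementAndGet();
            return new Object();
        });
        final Object first = pool.borrow();
        pool.giveBack(first);
        final Object second = pool.borrow();
        System.out.println("reused: " + (first == second) + ", created: " + created.get());
    }
}
```

The pool is unbounded by design here; a bounded variant would block or fail in borrow() once a maximum number of readers had been created.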
-
documents
public IndexIterator documents(long term) throws IOException
Creates a new IndexReader for this index and uses it to return an index iterator over the documents containing a term.
Since the reader is created from scratch, it is essential to dispose of the returned iterator after usage. See IndexReader.documents(long) for a method with the same semantics, but making reader reuse possible.
- Parameters:
term - a term.
- Throws:
IOException - if an exception occurred while accessing the index.
UnsupportedOperationException - if this index is not accessible by term number.
- See Also:
IndexReader.documents(long)
-
documents
public IndexIterator documents(CharSequence term) throws IOException
Creates a new IndexReader for this index and uses it to return an index iterator over the documents containing a term; the term is given explicitly, and the index term map is used, if present.
Since the reader is created from scratch, it is essential to dispose of the returned iterator after usage. See IndexReader.documents(long) for a method with the same semantics, but making reader reuse possible.
Unless the term processor of this index is null, words coming from a query will have to be processed before being used with this method.
- Parameters:
term - a term.
- Throws:
IOException - if an exception occurred while accessing the index.
UnsupportedOperationException - if the term map is not available for this index.
- See Also:
IndexReader.documents(CharSequence)
-
documents
public IndexIterator documents(CharSequence prefix, int limit) throws IOException, TooManyTermsException
Creates a number of instances of IndexReader for this index and uses them to return a MultiTermIndexIterator over the documents containing any term out of a set of terms defined by a prefix; the prefix is given explicitly, and unless the index has a prefix map, an UnsupportedOperationException will be thrown.
- Parameters:
prefix - a prefix.
limit - a limit on the number of terms that will be used to resolve the prefix query; if there are more than limit terms starting with prefix, a TooManyTermsException will be thrown.
- Throws:
UnsupportedOperationException - if this index cannot resolve prefixes.
TooManyTermsException - if there are more than limit terms starting with prefix.
IOException
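The role of limit can be illustrated without any index at all. The following sketch resolves a prefix against a sorted set of terms and fails as soon as more than limit terms match. It is only an analogy under stated assumptions: PrefixLimitSketch, resolve() and the nested TooManyTerms exception are hypothetical stand-ins, and MG4J resolves prefixes through the index prefix map rather than through a TreeSet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical illustration (not MG4J API) of resolving a prefix with a limit on matches. */
final class PrefixLimitSketch {

    /** Hypothetical stand-in for MG4J's TooManyTermsException. */
    static final class TooManyTerms extends Exception {
        TooManyTerms(final String message) {
            super(message);
        }
    }

    /** Returns the terms starting with the given prefix, or throws if there are more than limit. */
    static List<String> resolve(final SortedSet<String> terms, final String prefix, final int limit)
            throws TooManyTerms {
        final List<String> result = new ArrayList<>();
        // In sorted order, the terms sharing a prefix are contiguous and start at the prefix itself.
        for (final String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) break;
            if (result.size() == limit) throw new TooManyTerms("More than " + limit + " terms start with " + prefix);
            result.add(term);
        }
        return result;
    }

    public static void main(final String[] args) throws TooManyTerms {
        final SortedSet<String> terms = new TreeSet<>(Arrays.asList("car", "card", "care", "cat", "dog"));
        System.out.println(resolve(terms, "car", 3));
    }
}
```

In this toy term set, the prefix car matches exactly three terms and therefore fits a limit of 3, whereas the prefix ca matches four terms and would raise the exception.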
-
keyIndex
public void keyIndex(Index newKeyIndex)
Sets the index used as a key to retrieve intervals from iterators generated from this index.
This setter is a compromise between clarity of design and efficiency. Each index iterator is based on an index, and when that index is passed to DocumentIterator.intervalIterator(Index), intervals corresponding to the positions of the term in the current document are returned. Analogously, DocumentIterator.indices() returns a singleton set containing the index. However, when composing indices into clusters, often iterators generated by a local index must act as if they really belong to the global index. This method sets the index that is used as a key to return intervals, and that is contained in singletonSet.
Note that setting this value will only influence index readers created afterwards.
- Parameters:
newKeyIndex - the new index to be used as a key for interval retrieval.
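One way to picture why only readers created afterwards are affected is to assume that a reader copies the key index when it is constructed. The sketch below models just that assumption with two toy classes; SketchIndex and SketchReader are hypothetical, and nothing here is taken from the actual MG4J implementation.

```java
/** Hypothetical model (not MG4J code) of a key index captured at reader creation time. */
final class KeyIndexSketch {

    /** Toy index whose key index is initially the index itself, but is settable. */
    static final class SketchIndex {
        final String name;
        SketchIndex keyIndex = this;

        SketchIndex(final String name) {
            this.name = name;
        }

        /** Sets the key index; assumed to affect only readers created from now on. */
        void keyIndex(final SketchIndex newKeyIndex) {
            keyIndex = newKeyIndex;
        }

        SketchReader getReader() {
            return new SketchReader(this);
        }
    }

    /** Toy reader that remembers the key index that was current when it was created. */
    static final class SketchReader {
        final SketchIndex key;

        SketchReader(final SketchIndex index) {
            key = index.keyIndex;
        }
    }

    public static void main(final String[] args) {
        final SketchIndex local = new SketchIndex("local");
        final SketchIndex global = new SketchIndex("global");
        final SketchReader before = local.getReader();
        local.keyIndex(global);
        final SketchReader after = local.getReader();
        System.out.println(before.key.name + " " + after.key.name);
    }
}
```

Under this assumption, the reader created before the call keeps the local index as its key, while the reader created afterwards uses the global one.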
-
-