Class Combine
- java.lang.Object
-
- it.unimi.di.big.mg4j.tool.Combine
-
- Direct Known Subclasses:
Concatenate, Merge, Paste
public abstract class Combine extends Object
Combines several indices.
Indices may be combined in several different ways. This abstract class contains code that is common to classes such as Merge or Concatenate: essentially, command-line parsing, index opening, and term list fusion are taken care of. Then, the template method combine(int, long) must write into indexWriter the combined inverted list. If, however, metadataOnly is true, indexWriter is null and combine(int, long) must just compute the total frequency, occurrency, and sum of maximum positions.
Note that by combining a single index into a new one you can recompress an index with different compression parameters (which includes the possibility of eliminating positions or counts). It is also possible to build just the metadata associated with an index (term list, frequencies, occurrencies).
The subclasses of this class must implement combine(int, long) so that indices with different sets of features are combined keeping the largest set of features requested by the user. For instance, combining an index with positions and an index with counts, but no positions, should generate an index with counts but no positions.
Warning: a combination requires opening three files per input index, plus a few more files for the output index. If the combination process is interrupted by an exception claiming that there are too many open files, check how to increase the number of files you can open (usually, for instance on UN*X, there is a global and a per-process limit, so be sure to set both).
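The feature rule above can be illustrated with a minimal sketch. It is not MG4J code: the Feature enum and the outputFeatures method are hypothetical names, and the sketch assumes that a feature survives only if the user requested it and every input index provides it, which is one reading of the counts-and-positions example.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration; none of these names belong to MG4J. */
final class FeatureSketch {
    /** Hypothetical feature names loosely mirroring hasCounts, hasPositions and hasPayloads. */
    enum Feature { COUNTS, POSITIONS, PAYLOADS }

    /** Keeps each requested feature only if every input index provides it (an assumption). */
    static Set<Feature> outputFeatures(Set<Feature> requested, List<Set<Feature>> inputs) {
        EnumSet<Feature> result = EnumSet.noneOf(Feature.class);
        result.addAll(requested);
        for (Set<Feature> available : inputs) result.retainAll(available);
        return result;
    }

    public static void main(String[] args) {
        Set<Feature> withPositions = EnumSet.of(Feature.COUNTS, Feature.POSITIONS);
        Set<Feature> countsOnly = EnumSet.of(Feature.COUNTS);
        // Positions are requested, but the second index lacks them, so only counts survive.
        System.out.println(outputFeatures(withPositions, Arrays.asList(withPositions, countsOnly)));
    }
}
```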
Read-once indices, readers, and distributed index combination
If the indices and bitstream index readers involved in the combination are read-once (i.e., opening an index and reading its contents sequentially once causes each file composing the index to be read exactly once), then Combine implementations should be read-once too (Concatenate, Merge and Paste are).
This means, in particular, that index combination can be performed from pipes, which in turn can be filled, for instance, with data coming from the network. In other words, although this class is theoretically based on a number of indices existing on a local disk, those indices can be substituted with suitable pipes filled with remote data without affecting the combination process. For instance, the following bash code creates three sets of pipes for an interleaved index:
    for i in 0 1 2; do
      for e in frequencies occurrencies index offsets posnumbits sumsmaxpos properties sizes terms; do
        mkfifo pipe$i.$e
      done
    done
Each pipe should then be filled with suitable data, for instance obtained from the net (assuming you have indices index0, index1 and index2 on example.com):
    for i in 0 1 2; do
      for e in frequencies occurrencies index offsets posnumbits sumsmaxpos properties sizes terms; do
        (ssh -x example.com cat index$i.$e >pipe$i.$e &)
      done
    done
Now all pipes will be filled with data from the corresponding remote files, and combining the indices pipe0, pipe1 and pipe2 will give the same result as combining index0, index1 and index2 on the remote system.
- Since:
- 1.0
- Author:
- Sebastiano Vigna
-
-
Nested Class Summary
Nested Classes
Modifier and Type, Class, Description
protected static class Combine.GammaCodedIntIterator - A partial IntIterator implementation based on γ-coded integers.
static class Combine.IndexType
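As background for the γ-coded iterator listed above, here is a minimal sketch of Elias γ coding on strings of bits. It is not the MG4J implementation: the class and method names are hypothetical, and it assumes the common convention that a natural number x is stored as the γ code of x + 1 (the position of the highest set bit in unary, followed by the remaining bits).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of Elias gamma coding; not MG4J code. */
final class GammaSketch {
    /** Encodes a natural number 0 <= x < Integer.MAX_VALUE as a string of '0' and '1' characters. */
    static String encode(int x) {
        int v = x + 1;                                   // Shift so that 0 can be encoded (assumption).
        int msb = 31 - Integer.numberOfLeadingZeros(v);  // Position of the highest set bit of v.
        StringBuilder bits = new StringBuilder();
        for (int i = 0; i < msb; i++) bits.append('0');  // msb zeros: the unary part.
        bits.append(Integer.toBinaryString(v));          // The leading 1 of v terminates the unary part.
        return bits.toString();
    }

    /** Decodes a concatenation of codes produced by encode(). */
    static List<Integer> decodeAll(String bits) {
        List<Integer> result = new ArrayList<>();
        int pos = 0;
        while (pos < bits.length()) {
            int msb = 0;
            while (bits.charAt(pos) == '0') { msb++; pos++; }
            pos++;                                       // Skip the terminating 1.
            int v = 1;
            for (int i = 0; i < msb; i++) v = (v << 1) | (bits.charAt(pos++) - '0');
            result.add(v - 1);
        }
        return result;
    }

    public static void main(String[] args) {
        String stream = encode(0) + encode(3) + encode(10);
        System.out.println(stream + " decodes to " + decodeAll(stream));
    }
}
```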
-
Field Summary
Fields
Modifier and Type, Field, Description
protected Properties additionalProperties - Additional properties for the merged index.
protected int bufferSize - The size of I/O buffers.
static int DEFAULT_BUFFER_SIZE - The default buffer size.
protected long[] frequency - For each index, the frequency of the current term (given that it is present).
protected boolean hasCounts - Whether indexWriter has counts.
protected boolean hasPayloads - Whether indexWriter has payloads.
protected boolean hasPositions - Whether indexWriter has positions.
protected boolean haveSumsMaxPos - Whether we have the sum of maximum positions for all indices.
protected Index[] index - The array of indices to be merged.
protected IndexIterator[] indexIterator - An array of index iterators parallel to index (filled by concrete implementations).
protected IndexReader[] indexReader - An array of index readers parallel to index.
protected IndexWriter indexWriter - The index writer for the merged index.
protected String[] inputBasename - The array of input basenames.
protected IOFactory ioFactory - The I/O factory that will be used to create files.
protected int maxCount - The maximum count in the merged index.
protected boolean metadataOnly - Compute only index metadata (sizes, terms and occurrencies).
protected boolean needsSizes - True if the index writer needs sizes (usually, because it uses Golomb or interpolative coding for its positions).
protected long numberOfDocuments - The overall number of documents.
protected long numberOfOccurrences - The overall number of occurrences.
protected int numIndices - The number of indices to be merged.
protected String outputBasename - The output basename.
protected double p - If nonzero, the fraction of space to be used by variable-quantum skip towers.
protected int[] positionArray - A temporary place to write positions.
protected long predictedLengthNumBits - The predicted number of bits for the positions of the next inverted list to be combined.
protected long predictedSize - The predicted size of the non-positional part of the next inverted list to be combined.
protected QuasiSuccinctIndexWriter quasiSuccinctIndexWriter
protected int[][] size - The big array of sizes of the combined index.
protected InputBitStream[] sumsMaxPos - An array of input bit streams, returning the sum of maximum positions for each index (used for variable-quantum computation).
protected ObjectHeapSemiIndirectPriorityQueue<MutableString> termQueue - The queue containing terms.
protected int[] usedIndex - An array partially filled with the indices (as offsets in index) participating in the merge process for the current term.
protected VariableQuantumIndexWriter variableQuantumIndexWriter - A copy of indexWriter which is non-null if indexWriter is an instance of VariableQuantumIndexWriter.
-
Constructor Summary
Constructors
Constructor, Description
Combine(IOFactory ioFactory, String outputBasename, String[] inputBasename, boolean metadataOnly, boolean requireSizes, int bufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags, Combine.IndexType indexType, boolean skips, int quantum, int height, int skipBufferOrCacheSize, long logInterval) - Combines several indices into one.
Combine(IOFactory ioFactory, String outputBasename, String[] inputBasename, IntList delete, boolean metadataOnly, boolean requireSizes, int bufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags, Combine.IndexType indexType, boolean skips, int quantum, int height, int skipBufferOrCacheSize, long logInterval) - Combines several indices into one.
-
Method Summary
Modifier and Type, Method, Description
protected abstract long combine(int numUsedIndices, long occurrency) - Combines several indices.
protected abstract long combineNumberOfDocuments() - Combines the number of documents.
protected abstract int combineSizes(OutputBitStream sizeOutputBitStream) - Combines size lists.
static void main(String[] arg)
static void main(String[] arg, Class<? extends Combine> combineClass)
void run()
protected IntIterator sizes(int numIndex) - Returns an iterator on sizes.
-
-
-
Field Detail
-
DEFAULT_BUFFER_SIZE
public static final int DEFAULT_BUFFER_SIZE
The default buffer size.
- See Also:
- Constant Field Values
-
ioFactory
protected final IOFactory ioFactory
The I/O factory that will be used to create files.
-
numIndices
protected final int numIndices
The number of indices to be merged.
-
index
protected final Index[] index
The array of indices to be merged.
-
indexReader
protected final IndexReader[] indexReader
An array of index readers parallel to index.
-
indexIterator
protected final IndexIterator[] indexIterator
An array of index iterators parallel to index (filled by concrete implementations).
-
metadataOnly
protected final boolean metadataOnly
Compute only index metadata (sizes, terms and occurrencies).
-
sumsMaxPos
protected final InputBitStream[] sumsMaxPos
An array of input bit streams, returning the sum of maximum positions for each index (used for variable-quantum computation).
-
haveSumsMaxPos
protected boolean haveSumsMaxPos
Whether we have the sum of maximum positions for all indices.
-
termQueue
protected ObjectHeapSemiIndirectPriorityQueue<MutableString> termQueue
The queue containing terms.
-
numberOfDocuments
protected final long numberOfDocuments
The overall number of documents.
-
numberOfOccurrences
protected long numberOfOccurrences
The overall number of occurrences.
-
maxCount
protected int maxCount
The maximum count in the merged index.
-
inputBasename
protected final String[] inputBasename
The array of input basenames.
-
outputBasename
protected final String outputBasename
The output basename.
-
bufferSize
protected final int bufferSize
The size of I/O buffers.
-
p
protected final double p
If nonzero, the fraction of space to be used by variable-quantum skip towers.
-
indexWriter
protected IndexWriter indexWriter
The index writer for the merged index.
-
variableQuantumIndexWriter
protected VariableQuantumIndexWriter variableQuantumIndexWriter
A copy of indexWriter which is non-null if indexWriter is an instance of VariableQuantumIndexWriter.
-
quasiSuccinctIndexWriter
protected QuasiSuccinctIndexWriter quasiSuccinctIndexWriter
-
hasCounts
protected final boolean hasCounts
Whether indexWriter has counts.
-
hasPositions
protected final boolean hasPositions
Whether indexWriter has positions.
-
hasPayloads
protected final boolean hasPayloads
Whether indexWriter has payloads.
-
additionalProperties
protected final Properties additionalProperties
Additional properties for the merged index.
-
usedIndex
protected final int[] usedIndex
An array partially filled with the indices (as offsets in index) participating in the merge process for the current term.
-
frequency
protected final long[] frequency
For each index, the frequency of the current term (given that it is present).
-
positionArray
protected int[] positionArray
A temporary place to write positions.
-
needsSizes
protected final boolean needsSizes
True if the index writer needs sizes (usually, because it uses Golomb or interpolative coding for its positions).
-
size
protected int[][] size
The big array of sizes of the combined index. This is set up by combineSizes(OutputBitStream) in the combiners that need it.
-
predictedSize
protected long predictedSize
The predicted size of the non-positional part of the next inverted list to be combined. It will be -1 unless p is not zero.
-
predictedLengthNumBits
protected long predictedLengthNumBits
The predicted number of bits for the positions of the next inverted list to be combined. It will be -1 unless p is not zero.
-
-
Constructor Detail
-
Combine
public Combine(IOFactory ioFactory, String outputBasename, String[] inputBasename, boolean metadataOnly, boolean requireSizes, int bufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags, Combine.IndexType indexType, boolean skips, int quantum, int height, int skipBufferOrCacheSize, long logInterval) throws IOException, org.apache.commons.configuration.ConfigurationException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Combines several indices into one.
- Parameters:
ioFactory - the factory that will be used to perform I/O.
outputBasename - the basename of the combined index.
inputBasename - the basenames of the input indices.
metadataOnly - if true, we save only metadata (term list, frequencies, occurrencies).
requireSizes - if true, the sizes of input indices will be forced to be loaded.
bufferSize - the buffer size for index readers.
writerFlags - the flags for the index writer.
indexType - the type of the index to build.
skips - whether to insert skips in case interleaved is true.
quantum - the quantum of skipping structures; if negative, a percentage of space for variable-quantum indices (irrelevant if skips is false).
height - the height of skipping towers (irrelevant if skips is false).
skipBufferOrCacheSize - the size of the buffer used to temporarily hold inverted lists during the skipping structure construction, or the size of the bit cache used when building a quasi-succinct index.
logInterval - how often we log.
- Throws:
IOException
org.apache.commons.configuration.ConfigurationException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
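The sign convention of the quantum parameter can be sketched as follows. This is an illustration, not the actual constructor code: the class and method names are hypothetical, and the mapping of a negative quantum to the fraction described for the field p is an assumption derived from the two descriptions.

```java
/** Hypothetical helper; not part of MG4J. */
final class QuantumSketch {
    /** Returns the fraction of space for variable-quantum skip towers, or 0 for a fixed quantum. */
    static double spaceFraction(int quantum) {
        // Assumption: a negative quantum -n means "n percent of the index space".
        return quantum < 0 ? -quantum / 100.0 : 0;
    }

    public static void main(String[] args) {
        System.out.println(spaceFraction(-5)); // Variable quantum: five percent of the space.
        System.out.println(spaceFraction(32)); // Fixed quantum of 32: the fraction is zero.
    }
}
```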
-
Combine
public Combine(IOFactory ioFactory, String outputBasename, String[] inputBasename, IntList delete, boolean metadataOnly, boolean requireSizes, int bufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags, Combine.IndexType indexType, boolean skips, int quantum, int height, int skipBufferOrCacheSize, long logInterval) throws IOException, org.apache.commons.configuration.ConfigurationException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Combines several indices into one.
- Parameters:
ioFactory - the factory that will be used to perform I/O.
outputBasename - the basename of the combined index.
inputBasename - the basenames of the input indices.
delete - a monotonically increasing list of integers representing documents that will be deleted from the output index, or null.
metadataOnly - if true, we save only metadata (term list, frequencies, occurrencies).
requireSizes - if true, the sizes of input indices will be forced to be loaded.
bufferSize - the buffer size for index readers.
writerFlags - the flags for the index writer.
indexType - the type of the index to build.
skips - whether to insert skips in case interleaved is true.
quantum - the quantum of skipping structures; if negative, a percentage of space for variable-quantum indices (irrelevant if skips is false).
height - the height of skipping towers (irrelevant if skips is false).
skipBufferOrCacheSize - the size of the buffer used to temporarily hold inverted lists during the skipping structure construction, or the size of the bit cache used when building a quasi-succinct index.
logInterval - how often we log.
- Throws:
IOException
org.apache.commons.configuration.ConfigurationException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
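Because the delete list is monotonically increasing, it can be matched against an increasing sequence of document pointers in a single forward pass. The sketch below only illustrates that idea; the names are hypothetical, plain arrays stand in for IntList and for an inverted list, and nothing here is claimed about how the combiners actually treat deleted documents (for instance, whether the remaining documents are renumbered).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not MG4J code. */
final class DeleteListSketch {
    /** Returns the pointers of postings that do not appear in delete; both arrays must be increasing. */
    static List<Long> surviving(long[] postings, int[] delete) {
        List<Long> kept = new ArrayList<>();
        int d = 0; // Cursor into the delete list; it only moves forward.
        for (long doc : postings) {
            while (d < delete.length && delete[d] < doc) d++;
            if (d == delete.length || delete[d] != doc) kept.add(doc);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Documents 2 and 7 are deleted; 3 is in the delete list but not in this inverted list.
        System.out.println(surviving(new long[] { 0, 2, 5, 7 }, new int[] { 2, 3, 7 }));
    }
}
```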
-
-
Method Detail
-
combineNumberOfDocuments
protected abstract long combineNumberOfDocuments()
Combines the number of documents.
- Returns:
- the number of documents of the combined index.
-
sizes
protected IntIterator sizes(int numIndex) throws IOException
Returns an iterator on sizes.
The purpose of this method is to provide combineSizes(OutputBitStream) implementations with a way to access the size list from a disk file or from Index.sizes transparently. This mechanism is essential to ensure that size files are read exactly once.
The caller should check whether the returned object implements Closeable, and, in this case, invoke Closeable.close() after usage.
- Parameters:
numIndex - the number of an index.
- Returns:
- an iterator on the sizes of the index.
- Throws:
IOException
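The Closeable check requested of callers can be sketched as below. This is a self-contained illustration, not MG4J code: java.util.Iterator stands in for the IntIterator returned by sizes(int), and FileBackedSizes and sumAndClose are hypothetical names.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical illustration; java.util.Iterator stands in for fastutil's IntIterator. */
final class CloseableIteratorSketch {
    /** A stand-in for an iterator backed by a size file; it records whether it was closed. */
    static final class FileBackedSizes implements Iterator<Integer>, Closeable {
        private final int[] sizes;
        private int next;
        boolean closed;

        FileBackedSizes(int[] sizes) { this.sizes = sizes; }

        @Override public boolean hasNext() { return next < sizes.length; }

        @Override public Integer next() {
            if (!hasNext()) throw new NoSuchElementException();
            return sizes[next++];
        }

        @Override public void close() { closed = true; }
    }

    /** Consumes the iterator, returns the sum of the sizes, and closes the iterator if it is Closeable. */
    static long sumAndClose(Iterator<Integer> sizes) {
        try {
            long sum = 0;
            while (sizes.hasNext()) sum += sizes.next();
            return sum;
        } finally {
            if (sizes instanceof Closeable) {
                try {
                    ((Closeable) sizes).close();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }
    }

    public static void main(String[] args) {
        FileBackedSizes onDisk = new FileBackedSizes(new int[] { 3, 4 });
        System.out.println(sumAndClose(onDisk) + " closed=" + onDisk.closed);
        // An in-memory iterator is not Closeable, so nothing is closed here.
        System.out.println(sumAndClose(Arrays.asList(1, 2).iterator()));
    }
}
```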
-
combineSizes
protected abstract int combineSizes(OutputBitStream sizeOutputBitStream) throws IOException
Combines size lists.
- Returns:
- the maximum size of a document in the combined index.
- Throws:
IOException
-
combine
protected abstract long combine(int numUsedIndices, long occurrency) throws IOException
Combines several indices.
When this method is called, exactly numUsedIndices entries of usedIndex contain, in increasing order, the indices containing inverted lists for the current term. Implementations of this method must combine the inverted list and return the total frequency.
- Parameters:
numUsedIndices - the number of valid entries in usedIndex.
occurrency - the occurrency of the term (used only when building Combine.IndexType.QUASI_SUCCINCT indices).
- Returns:
- the total frequency.
- Throws:
IOException
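The contract on usedIndex can be illustrated with a sketch of term list fusion. It is an assumption-laden stand-in, not the code of this class: java.util.PriorityQueue replaces ObjectHeapSemiIndirectPriorityQueue, sorted, duplicate-free string arrays replace the term files, and the class and method names are hypothetical. For each distinct term, the sketch fills usedIndex in increasing order, which is the state in which combine(int, long) would be invoked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration of term list fusion; not MG4J code. */
final class TermFusionSketch {
    /** Returns, for each distinct term in sorted order, the string "term=[indices containing it]". */
    static List<String> fuse(String[][] terms) {
        int n = terms.length;
        int[] cursor = new int[n];    // Position of the current term in each sorted term list.
        int[] usedIndex = new int[n]; // Partially filled, like the usedIndex field.
        // Index numbers ordered by their current term; ties are broken by index number.
        PriorityQueue<Integer> queue = new PriorityQueue<>((a, b) -> {
            int c = terms[a][cursor[a]].compareTo(terms[b][cursor[b]]);
            return c != 0 ? c : Integer.compare(a, b);
        });
        for (int i = 0; i < n; i++) if (terms[i].length > 0) queue.add(i);

        List<String> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            String term = terms[queue.peek()][cursor[queue.peek()]];
            int numUsedIndices = 0;
            // Dequeue every index whose current term equals the smallest one.
            while (!queue.isEmpty() && terms[queue.peek()][cursor[queue.peek()]].equals(term))
                usedIndex[numUsedIndices++] = queue.poll();
            // A combiner would call combine(numUsedIndices, occurrency) at this point.
            result.add(term + "=" + Arrays.toString(Arrays.copyOf(usedIndex, numUsedIndices)));
            // Advance the indices just used and enqueue them again if they have more terms.
            for (int k = 0; k < numUsedIndices; k++) {
                int i = usedIndex[k];
                if (++cursor[i] < terms[i].length) queue.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fuse(new String[][] { { "a", "c" }, { "b", "c" }, { "a" } }));
    }
}
```

A cursor is modified only after its index has been removed from the queue, so the ordering of the elements still enqueued never changes underneath the heap.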
-
run
public void run() throws org.apache.commons.configuration.ConfigurationException, IOException
- Throws:
org.apache.commons.configuration.ConfigurationException
IOException
-
main
public static void main(String[] arg) throws com.martiansoftware.jsap.JSAPException, org.apache.commons.configuration.ConfigurationException, IOException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
- Throws:
com.martiansoftware.jsap.JSAPException
org.apache.commons.configuration.ConfigurationException
IOException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
-
main
public static void main(String[] arg, Class<? extends Combine> combineClass) throws com.martiansoftware.jsap.JSAPException, org.apache.commons.configuration.ConfigurationException, IOException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
- Throws:
com.martiansoftware.jsap.JSAPException
org.apache.commons.configuration.ConfigurationException
IOException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
-
-