Class IndexBuilder
- java.lang.Object
-
- it.unimi.di.big.mg4j.tool.IndexBuilder
-
public class IndexBuilder extends Object
An index builder.
An instance of this class exposes a run() method that will index the DocumentSequence provided at construction time by calling Scan and Combine in sequence. Additionally, a main method provides easy access to index construction.
All indexing parameters are available either as chainable setters that can optionally be called before invoking run(), or as public mutable collections and maps. For instance,
    new IndexBuilder( "foo", sequence ).skips( true ).run();
will build an index with basename foo using skips. If instead we want to index just the first field of the sequence, and use a ShiftAddXorSignedStringMap as a term map, we can use the following code:
    new IndexBuilder( "foo", sequence )
        .termMapClass( ShiftAddXorSignedMinimalPerfectHash.class )
        .indexedFields( 0 ).run();
More sophisticated modifications can be applied using public maps:
    IndexBuilder indexBuilder = new IndexBuilder( "foo", sequence );
    indexBuilder.virtualDocumentGaps.put( 0, 30 );
    indexBuilder.virtualDocumentResolvers.put( 0, someVirtualDocumentResolver );
    indexBuilder.run();
-
-
Field Summary
Fields
Modifier and Type    Field    Description
IntSortedSet    indexedFields    The set of indexed fields (expressed as field indices).
Int2IntMap    virtualDocumentGaps    A map from field indices to virtual gaps.
Int2ObjectMap<VirtualDocumentResolver>    virtualDocumentResolvers    A map from field indices to a corresponding VirtualDocumentResolver.
-
Constructor Summary
Constructors
Constructor    Description
IndexBuilder(String basename, DocumentSequence documentSequence)    Creates a new index builder with default parameters.
-
Method Summary
Modifier and Type    Method    Description
IndexBuilder    batchDirName(String batchDirName)    Sets the temporary directory for batches (default: the directory containing the basename).
IndexBuilder    bufferSize(int bufferSize)    Sets both the scan buffer size and the combine buffer size.
IndexBuilder    builder(DocumentCollectionBuilder builder)    Sets the document collection builder (default: null).
IndexBuilder    combineBufferSize(int bufferSize)    Sets the Combine buffer size (default: Combine.DEFAULT_BUFFER_SIZE).
IndexBuilder    documentsPerBatch(int documentsPerBatch)    Sets the number of documents per batch (default: Scan.DEFAULT_BATCH_SIZE).
IndexBuilder    height(int height)    Sets the skip height (default: BitStreamIndex.DEFAULT_HEIGHT).
IndexBuilder    indexedFields(int... field)    Sets the indexed fields to those provided (default: all fields, but see indexedFields).
IndexBuilder    indexType(Combine.IndexType indexType)    Sets the type of the index to be built (default: Combine.IndexType.QUASI_SUCCINCT).
IndexBuilder    interleaved(boolean interleaved)    Sets the interleaved flag (default: false).
IndexBuilder    ioFactory(IOFactory ioFactory)    Sets the I/O factory (default: IOFactory.FILESYSTEM_FACTORY).
IndexBuilder    keepBatches(boolean keepBatches)    Sets the “keep batches” flag (default: false).
IndexBuilder    logInterval(long logInterval)    Sets the logging time interval (default: ProgressLogger.DEFAULT_LOG_INTERVAL).
static void    main(String[] arg)
IndexBuilder    mapFile(String mapFile)    Sets the name of a file containing a map on the document indices (default: null).
IndexBuilder    maxTerms(int maxTerms)    Sets the maximum number of overall (i.e., cross-field) terms per batch (default: Scan.DEFAULT_BATCH_SIZE).
IndexBuilder    pasteBufferSize(int bufferSize)    Sets the size in bytes of the internal buffer used when pasting indices (default: Paste.DEFAULT_MEMORY_BUFFER_SIZE).
IndexBuilder    payloadWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> payloadWriterFlags)    Sets the writer compression flags for payload-based indices (default: CompressionFlags.DEFAULT_PAYLOAD_INDEX).
IndexBuilder    quantum(int quantum)    Sets the skip quantum (default: BitStreamIndex.DEFAULT_QUANTUM).
IndexBuilder    quasiSuccinctWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> quasiSuccinctWriterFlags)    Sets the writer compression flags for quasi-succinct indices (default: CompressionFlags.DEFAULT_QUASI_SUCCINCT_INDEX).
void    run()    Builds the index.
IndexBuilder    scanBufferSize(int bufferSize)    Sets the Scan buffer size (default: Scan.DEFAULT_BUFFER_SIZE).
IndexBuilder    skipBufferSize(int bufferSize)    Sets the size in bytes of the internal buffer used during the construction of an index with skips (default: SkipBitStreamIndexWriter.DEFAULT_TEMP_BUFFER_SIZE).
IndexBuilder    skips(boolean skips)    Sets the skip flag (default: true).
IndexBuilder    standardWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> standardWriterFlags)    Sets the writer compression flags for standard indices (default: CompressionFlags.DEFAULT_STANDARD_INDEX).
IndexBuilder    termMapClass(Class<? extends StringMap<? extends CharSequence>> termMapClass)    Sets the class used to build the index term map (default: ImmutableExternalPrefixMap).
IndexBuilder    termProcessor(TermProcessor termProcessor)    Sets the term processor (default: DowncaseTermProcessor).
IndexBuilder    virtualDocumentResolver(int field, VirtualDocumentResolver virtualDocumentResolver)    Adds a virtual document resolver to virtualDocumentResolvers.
-
-
-
Field Detail
-
indexedFields
public IntSortedSet indexedFields
The set of indexed fields (expressed as field indices). If left empty, all fields will be indexed, with the proviso that fields of type DocumentFactory.FieldType.VIRTUAL will be indexed only if they have a corresponding VirtualDocumentResolver.
An alternative, chained access to this set is provided by the method indexedFields(int[]).
After calling run(), this set will contain the fields actually indexed.
-
virtualDocumentResolvers
public Int2ObjectMap<VirtualDocumentResolver> virtualDocumentResolvers
A map from field indices to a corresponding VirtualDocumentResolver.
-
virtualDocumentGaps
public Int2IntMap virtualDocumentGaps
A map from field indices to virtual gaps. Only values associated with fields of type DocumentFactory.FieldType.VIRTUAL are meaningful, and the default return value is set to Scan.DEFAULT_VIRTUAL_DOCUMENT_GAP. You can either add entries, or change the default return value.
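The interplay between explicit entries and the default return value can be illustrated with a stdlib-only stand-in. This is a sketch of the lookup semantics only: GapMapSketch is a hypothetical helper, not the Int2IntMap type used by this class, and the numeric values are arbitrary placeholders rather than the real Scan.DEFAULT_VIRTUAL_DOCUMENT_GAP.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an int-to-int map with a default return value (not part of MG4J). */
final class GapMapSketch {
    private final Map<Integer, Integer> gaps = new HashMap<>();
    private int defaultGap;

    GapMapSketch(int defaultGap) {
        this.defaultGap = defaultGap;
    }

    /** Adds an explicit gap for a field index. */
    void put(int field, int gap) {
        gaps.put(field, gap);
    }

    /** Changes the value returned for fields without an explicit entry. */
    void defaultReturnValue(int gap) {
        defaultGap = gap;
    }

    /** Returns the explicit gap for the field, or the default if none was set. */
    int get(int field) {
        return gaps.getOrDefault(field, defaultGap);
    }

    public static void main(String[] args) {
        GapMapSketch gaps = new GapMapSketch(64); // 64 is an arbitrary placeholder default
        gaps.put(0, 30);                          // explicit entry for field 0
        System.out.println(gaps.get(0));          // explicit entry wins
        System.out.println(gaps.get(2));          // no entry: falls back to the default
        gaps.defaultReturnValue(16);              // changing the default affects only unset fields
        System.out.println(gaps.get(2));
    }
}
```

Adding an entry affects a single field, whereas changing the default affects every field that has no entry of its own.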
-
-
Constructor Detail
-
IndexBuilder
public IndexBuilder(String basename, DocumentSequence documentSequence)
Creates a new index builder with default parameters.
Note, in particular, that the resulting index will be a BitStreamHPIndex (unless you require payloads, in which case it will be a BitStreamIndex with skips), and that all terms will be downcased. You can control the type of index more finely using interleaved(boolean) and skips(boolean).
- Parameters:
basename - the basename from which all files will be stemmed.
documentSequence - the document sequence to be indexed.
-
-
Method Detail
-
ioFactory
public IndexBuilder ioFactory(IOFactory ioFactory)
Sets the I/O factory (default: IOFactory.FILESYSTEM_FACTORY).
- Parameters:
ioFactory - the I/O factory.
- Returns:
- this index builder.
-
termProcessor
public IndexBuilder termProcessor(TermProcessor termProcessor)
Sets the term processor (default: DowncaseTermProcessor).
- Parameters:
termProcessor - the term processor.
- Returns:
- this index builder.
-
builder
public IndexBuilder builder(DocumentCollectionBuilder builder)
Sets the document collection builder (default: null).
- Parameters:
builder - a document-collection builder that will be used to build a collection during the indexing phase.
- Returns:
- this index builder.
-
indexedFields
public IndexBuilder indexedFields(int... field)
Sets the indexed fields to those provided (default: all fields, but see indexedFields).
This is a utility method that provides a way to set indexedFields in a chainable way.
- Parameters:
field - a list of fields to be indexed, that will replace the current values in indexedFields.
- Returns:
- this index builder.
- See Also:
indexedFields
-
virtualDocumentResolver
public IndexBuilder virtualDocumentResolver(int field, VirtualDocumentResolver virtualDocumentResolver)
Adds a virtual document resolver to virtualDocumentResolvers.
This is a utility method that provides a way to put an element into virtualDocumentResolvers in a chainable way.
- Parameters:
field - a field index.
virtualDocumentResolver - a virtual document resolver.
- Returns:
- this index builder.
- See Also:
virtualDocumentResolvers
-
scanBufferSize
public IndexBuilder scanBufferSize(int bufferSize)
Sets the Scan buffer size (default: Scan.DEFAULT_BUFFER_SIZE).
- Parameters:
bufferSize - a buffer size for Scan.
- Returns:
- this index builder.
-
combineBufferSize
public IndexBuilder combineBufferSize(int bufferSize)
Sets the Combine buffer size (default: Combine.DEFAULT_BUFFER_SIZE).
- Parameters:
bufferSize - a buffer size for Combine.
- Returns:
- this index builder.
-
bufferSize
public IndexBuilder bufferSize(int bufferSize)
Sets both the scan buffer size and the combine buffer size.
- Parameters:
bufferSize - a buffer size.
- Returns:
- this index builder.
-
skipBufferSize
public IndexBuilder skipBufferSize(int bufferSize)
Sets the size in bytes of the internal buffer used during the construction of an index with skips (default: SkipBitStreamIndexWriter.DEFAULT_TEMP_BUFFER_SIZE).
- Parameters:
bufferSize - a buffer size for SkipBitStreamIndexWriter.
- Returns:
- this index builder.
-
pasteBufferSize
public IndexBuilder pasteBufferSize(int bufferSize)
Sets the size in bytes of the internal buffer used when pasting indices (default: Paste.DEFAULT_MEMORY_BUFFER_SIZE).
- Parameters:
bufferSize - a buffer size for Paste.
- Returns:
- this index builder.
-
documentsPerBatch
public IndexBuilder documentsPerBatch(int documentsPerBatch)
Sets the number of documents per batch (default: Scan.DEFAULT_BATCH_SIZE).
- Parameters:
documentsPerBatch - the number of documents Scan will attempt to add to each batch.
- Returns:
- this index builder.
-
maxTerms
public IndexBuilder maxTerms(int maxTerms)
Sets the maximum number of overall (i.e., cross-field) terms per batch (default: Scan.DEFAULT_BATCH_SIZE).
- Parameters:
maxTerms - the maximum number of overall (i.e., cross-field) terms Scan will attempt to add to each batch.
- Returns:
- this index builder.
-
keepBatches
public IndexBuilder keepBatches(boolean keepBatches)
Sets the “keep batches” flag (default: false). If true, the temporary batch files generated during index construction will not be deleted.
- Parameters:
keepBatches - the new value for the “keep batches” flag.
- Returns:
- this index builder.
-
standardWriterFlags
public IndexBuilder standardWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> standardWriterFlags)
Sets the writer compression flags for standard indices (default: CompressionFlags.DEFAULT_STANDARD_INDEX).
- Parameters:
standardWriterFlags - the flags for standard indices.
- Returns:
- this index builder.
-
quasiSuccinctWriterFlags
public IndexBuilder quasiSuccinctWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> quasiSuccinctWriterFlags)
Sets the writer compression flags for quasi-succinct indices (default: CompressionFlags.DEFAULT_QUASI_SUCCINCT_INDEX).
- Parameters:
quasiSuccinctWriterFlags - the flags for quasi-succinct indices.
- Returns:
- this index builder.
-
payloadWriterFlags
public IndexBuilder payloadWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> payloadWriterFlags)
Sets the writer compression flags for payload-based indices (default: CompressionFlags.DEFAULT_PAYLOAD_INDEX).
- Parameters:
payloadWriterFlags - the flags for payload-based indices.
- Returns:
- this index builder.
-
skips
public IndexBuilder skips(boolean skips)
Sets the skip flag (default: true). If true, the index will have a skipping structure. The flag is a no-op unless you require an interleaved index, as high-performance indices always have skips.
- Parameters:
skips - the new value for the skip flag.
- Returns:
- this index builder.
-
interleaved
public IndexBuilder interleaved(boolean interleaved)
Sets the interleaved flag (default: false). If true, the index will be forced to be an interleaved index (but note that in a number of cases, such as missing index components or payloads, the index will necessarily be interleaved).
- Parameters:
interleaved - the new value for the interleaved flag.
- Returns:
- this index builder.
-
indexType
public IndexBuilder indexType(Combine.IndexType indexType)
Sets the type of the index to be built (default: Combine.IndexType.QUASI_SUCCINCT).
- Parameters:
indexType - the desired index type.
- Returns:
- this index builder.
-
quantum
public IndexBuilder quantum(int quantum)
Sets the skip quantum (default: BitStreamIndex.DEFAULT_QUANTUM).
- Parameters:
quantum - the skip quantum.
- Returns:
- this index builder.
-
height
public IndexBuilder height(int height)
Sets the skip height (default: BitStreamIndex.DEFAULT_HEIGHT).
- Parameters:
height - the skip height.
- Returns:
- this index builder.
-
mapFile
public IndexBuilder mapFile(String mapFile)
Sets the name of a file containing a map on the document indices (default: null).
The provided file must contain integers in DataOutput format. There must be as many integers as there are documents in the collection provided at construction time, and the resulting function must be injective (i.e., there must be no duplicates).
- Parameters:
mapFile - a file representing a document map (or null for no mapping).
- Returns:
- this index builder.
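As an illustration of the file format described for mapFile, the following sketch writes a document map with java.io.DataOutputStream and checks the two stated constraints (one integer per document, no duplicates). It assumes the integers are 32-bit values written with writeInt; the class name DocumentMapFiles and its methods are hypothetical helpers, not part of MG4J.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helpers for document map files (assumes 32-bit integers). */
final class DocumentMapFiles {
    /** Writes one big-endian int per document, as DataOutput.writeInt does. */
    static void write(Path file, int[] map) throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
            for (int target : map) out.writeInt(target);
        }
    }

    /** Reads exactly numberOfDocuments ints; fails if the file is shorter or longer. */
    static int[] read(Path file, int numberOfDocuments) throws IOException {
        int[] map = new int[numberOfDocuments];
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            // readInt throws EOFException (an IOException) if the file is too short.
            for (int i = 0; i < numberOfDocuments; i++) map[i] = in.readInt();
            if (in.read() != -1) throw new IOException("More entries than documents");
        }
        return map;
    }

    /** Returns true if no value occurs twice, i.e. the map is injective. */
    static boolean isInjective(int[] map) {
        Set<Integer> seen = new HashSet<>();
        for (int target : map) {
            if (!seen.add(target)) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("docmap", ".bin");
        int[] map = { 2, 0, 1 }; // document 0 maps to 2, document 1 to 0, document 2 to 1
        write(file, map);
        System.out.println(isInjective(read(file, map.length)));
        Files.delete(file);
    }
}
```

Validating the file before calling run() is a design choice of this sketch; it simply makes a malformed map fail early rather than during indexing.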
-
logInterval
public IndexBuilder logInterval(long logInterval)
Sets the logging time interval (default: ProgressLogger.DEFAULT_LOG_INTERVAL).
- Parameters:
logInterval - the logging time interval.
- Returns:
- this index builder.
-
batchDirName
public IndexBuilder batchDirName(String batchDirName)
Sets the temporary directory for batches (default: the directory containing the basename).
- Parameters:
batchDirName - the name of the temporary directory for batches, or null for the directory containing the basename.
- Returns:
- this index builder.
-
termMapClass
public IndexBuilder termMapClass(Class<? extends StringMap<? extends CharSequence>> termMapClass)
Sets the class used to build the index term map (default: ImmutableExternalPrefixMap).
The only requirement for termMapClass (besides, of course, implementing StringMap) is that of having a public constructor accepting a single parameter of type Iterable<CharSequence>.
- Parameters:
termMapClass - the class used to build the index term map, or null to disable the construction of a term map.
- Returns:
- this index builder.
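The constructor requirement on termMapClass can be checked reflectively before handing a class to the builder. The sketch below shows only the constructor shape: ToyTermMap and NoSuitableConstructor are hypothetical classes that do not implement StringMap, and how IndexBuilder itself locates the constructor is not stated here, so the lookup shown is an assumption about one reasonable way to do it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TermMapConstructorCheck {
    /** Hypothetical class with the required constructor shape (does NOT implement StringMap). */
    public static final class ToyTermMap {
        final List<String> terms = new ArrayList<>();

        public ToyTermMap(Iterable<CharSequence> terms) {
            for (CharSequence term : terms) this.terms.add(term.toString());
        }
    }

    /** Hypothetical class lacking the required constructor. */
    public static final class NoSuitableConstructor {
        public NoSuitableConstructor() {}
    }

    /** Returns true if the class has a public constructor taking a single Iterable. */
    static boolean hasIterableConstructor(Class<?> c) {
        try {
            // Generics are erased at run time, so the lookup uses the raw Iterable type;
            // getConstructor only returns public constructors.
            c.getConstructor(Iterable.class);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** Instantiates the class by passing the term list to its Iterable constructor. */
    static Object instantiate(Class<?> c, Iterable<CharSequence> terms)
            throws ReflectiveOperationException {
        return c.getConstructor(Iterable.class).newInstance(terms);
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        List<CharSequence> terms = Arrays.asList("alpha", "beta");
        System.out.println(hasIterableConstructor(ToyTermMap.class));
        System.out.println(hasIterableConstructor(NoSuitableConstructor.class));
        System.out.println(instantiate(ToyTermMap.class, terms) instanceof ToyTermMap);
    }
}
```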
-
run
public void run() throws org.apache.commons.configuration.ConfigurationException, SecurityException, IOException, URISyntaxException, ClassNotFoundException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Builds the index.
This method simply invokes Scan and Combine using the internally stored settings, and finally builds a StringMap.
If the provided document sequence can be iterated over several times, this method can be called several times, too, rebuilding the index each time.
- Throws:
org.apache.commons.configuration.ConfigurationException
SecurityException
IOException
URISyntaxException
ClassNotFoundException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
-
main
public static void main(String[] arg) throws com.martiansoftware.jsap.JSAPException, InvocationTargetException, NoSuchMethodException, IllegalAccessException, org.apache.commons.configuration.ConfigurationException, ClassNotFoundException, IOException, InstantiationException, URISyntaxException
- Throws:
com.martiansoftware.jsap.JSAPException
InvocationTargetException
NoSuchMethodException
IllegalAccessException
org.apache.commons.configuration.ConfigurationException
ClassNotFoundException
IOException
InstantiationException
URISyntaxException
-
-