Class SimpleCompressedDocumentCollection
- java.lang.Object
  - it.unimi.di.big.mg4j.document.AbstractDocumentSequence
    - it.unimi.di.big.mg4j.document.AbstractDocumentCollection
      - it.unimi.di.big.mg4j.document.SimpleCompressedDocumentCollection

All Implemented Interfaces:
DocumentCollection, DocumentSequence, SafelyCloseable, FlyweightPrototype<DocumentCollection>, Closeable, Serializable, AutoCloseable

public class SimpleCompressedDocumentCollection extends AbstractDocumentCollection implements Serializable
A basic, compressed document collection that can be easily built at indexing time.

Instances of this class record virtual and non-text fields just like ZipDocumentCollection, that is, in a zip file. However, text fields are recorded in a simple but highly efficient format. Terms (and nonterms) are numbered globally, in increasing order, as they are encountered. While we scan each document, we keep track of frequencies for a limited number of terms: a term is encoded with its frequency rank if we know its statistics, or with a special code derived from its global number if we have no statistics about it. Every number involved is written in delta code.

A collection can be exact or approximated: in the latter case, nonwords will not be recorded, and will be turned into spaces when decompressing.

An instance of this collection will be, like any other collection, serialised to a file, but it will refer to several other files whose names are derived from the instance basename. Please use AbstractDocumentSequence.load(CharSequence) to load instances of this collection.

This class suffers from the same scalability problems as ZipDocumentCollection if you compress non-text or virtual fields. Text compression, on the other hand, is extremely efficient and scalable.

- Author: Sebastiano Vigna
- See Also: Serialized Form
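
The text-field scheme described above (frequency rank for terms with known statistics, a special code derived from the global number otherwise) can be illustrated with a minimal sketch. This is not the library's implementation: the class name `RankCodeSketch`, the fixed table of tracked terms, and the escape rule `tracked + globalNumber` are all assumptions made for illustration, since the description does not say how the special code is built.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of rank-based term coding; NOT the MG4J implementation. */
final class RankCodeSketch {
    /** Tracked global term numbers, most frequent first (index = rank). */
    private final int[] byRank;
    /** Inverse of byRank: global term number -> rank. */
    private final Map<Integer, Integer> rankOf = new HashMap<>();

    /** @param byRank tracked global numbers ordered by decreasing frequency (assumed given). */
    RankCodeSketch(int[] byRank) {
        this.byRank = byRank.clone();
        for (int r = 0; r < byRank.length; r++) rankOf.put(byRank[r], r);
    }

    /** Tracked terms get their rank; all others get an escape code placed after the ranks. */
    int encode(int globalNumber) {
        Integer rank = rankOf.get(globalNumber);
        return rank != null ? rank : byRank.length + globalNumber;
    }

    /** Inverts encode(): small codes are ranks, larger codes carry the global number. */
    int decode(int code) {
        return code < byRank.length ? byRank[code] : code - byRank.length;
    }

    public static void main(String[] args) {
        // Assumption: terms 7, 2 and 40 are the three most frequent, in that order.
        RankCodeSketch codec = new RankCodeSketch(new int[] { 7, 2, 40 });
        int[] document = { 7, 7, 2, 13, 40, 7 };
        for (int term : document) {
            int code = codec.encode(term);
            System.out.println(term + " -> " + code + " -> " + codec.decode(code));
        }
    }
}
```

The point of the remapping is that frequent terms receive small integers, and a universal code such as delta code spends fewer bits on small integers.
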

Nested Class Summary

Nested Classes
| Modifier and Type | Class | Description |
|---|---|---|
| protected static class | SimpleCompressedDocumentCollection.FrequencyCodec | A simple codec for integers that remaps frequent numbers to smaller numbers. |

Nested classes/interfaces inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentCollection
AbstractDocumentCollection.PropertyKeys

Field Summary

Fields
| Modifier and Type | Field | Description |
|---|---|---|
| protected static boolean | ASSERTS | |
| static String | DOCUMENT_OFFSETS_EXTENSION | Standard extension for the file containing document offsets stored as δ-encoded gaps. |
| static String | DOCUMENTS_EXTENSION | Standard extension for the file containing encoded documents. |
| static String | NONTERM_OFFSETS_EXTENSION | Standard extension for the file containing nonterm offsets stored as δ-encoded gaps. |
| static String | NONTERMS_EXTENSION | Standard extension for the file containing nonterms in MutableString.writeSelfDelimUTF8(java.io.OutputStream) format. |
| static String | STATS_EXTENSION | Standard extension for the stats file. |
| static String | TERM_OFFSETS_EXTENSION | Standard extension for the file containing term offsets stored as δ-encoded gaps. |
| static String | TERMS_EXTENSION | Standard extension for the file containing terms in MutableString.writeSelfDelimUTF8(java.io.OutputStream) format. |

Fields inherited from interface it.unimi.di.big.mg4j.document.DocumentCollection
DEFAULT_EXTENSION

Constructor Summary

Constructors
| Modifier | Constructor | Description |
|---|---|---|
| protected | SimpleCompressedDocumentCollection(String basename, long documents, long terms, long nonTerms, boolean exact, DocumentFactory factory) | |

Method Summary

| Modifier and Type | Method | Description |
|---|---|---|
| void | close() | Closes this document sequence, releasing all resources. |
| DocumentCollection | copy() | |
| Document | document(long index) | Returns the document given its index. |
| DocumentFactory | factory() | Returns the factory used by this sequence. |
| void | filename(CharSequence filename) | Does nothing. |
| static void | main(String[] arg) | |
| Reference2ObjectMap<Enum<?>,Object> | metadata(long index) | Returns the metadata map for a document. |
| static void | optimize(CharSequence basename) | |
| long | size() | Returns the number of documents in this collection. |
| InputStream | stream(long index) | Returns an input stream for the raw content of a document. |

Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentCollection
ensureDocumentIndex, iterator, printAllDocuments, toString

Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentSequence
finalize, load

Field Detail

ASSERTS
protected static final boolean ASSERTS
- See Also: Constant Field Values

DOCUMENTS_EXTENSION
public static final String DOCUMENTS_EXTENSION
Standard extension for the file containing encoded documents.
- See Also: Constant Field Values

DOCUMENT_OFFSETS_EXTENSION
public static final String DOCUMENT_OFFSETS_EXTENSION
Standard extension for the file containing document offsets stored as δ-encoded gaps.
- See Also: Constant Field Values
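
To make "offsets stored as δ-encoded gaps" concrete, the sketch below stores only the differences between consecutive offsets and writes each difference with the textbook Elias δ code. It is an illustration under stated assumptions, not the file format used by this class: the name `DeltaGapSketch`, the representation of bits as a string, and the choice of coding `gap + 1` (so that a zero gap is representable) are hypothetical, and the library's actual bit-level conventions may differ.

```java
import java.util.Arrays;

/** Hypothetical sketch of gap + Elias delta coding; NOT the MG4J file format. */
final class DeltaGapSketch {
    /** Textbook Elias delta codeword of n >= 1, as a string of '0' and '1'. */
    static String delta(long n) {
        if (n < 1) throw new IllegalArgumentException("n must be >= 1");
        String bin = Long.toBinaryString(n);               // most significant bit first
        String len = Integer.toBinaryString(bin.length()); // bit length of n, in binary
        StringBuilder code = new StringBuilder();
        // Gamma code of the length: (|len| - 1) zeros followed by len itself.
        for (int i = 1; i < len.length(); i++) code.append('0');
        // Then n without its leading 1 bit.
        return code.append(len).append(bin, 1, bin.length()).toString();
    }

    /** Encodes nondecreasing offsets as delta codes of (gap + 1). */
    static String encode(long[] offsets) {
        StringBuilder bits = new StringBuilder();
        long prev = 0;
        for (long offset : offsets) {
            bits.append(delta(offset - prev + 1));
            prev = offset;
        }
        return bits.toString();
    }

    /** Decodes count offsets from a bit string produced by encode(). */
    static long[] decode(String bits, int count) {
        long[] offsets = new long[count];
        long prev = 0;
        int pos = 0;
        for (int i = 0; i < count; i++) {
            int zeros = 0;
            while (bits.charAt(pos) == '0') { zeros++; pos++; }
            int len = Integer.parseInt(bits.substring(pos, pos + zeros + 1), 2);
            pos += zeros + 1;
            long n = Long.parseLong("1" + bits.substring(pos, pos + len - 1), 2);
            pos += len - 1;
            prev += n - 1; // undo the +1 applied by encode()
            offsets[i] = prev;
        }
        return offsets;
    }

    public static void main(String[] args) {
        long[] offsets = { 0, 9, 12, 12, 40 };
        String bits = encode(offsets);
        System.out.println(bits.length() + " bits: " + bits);
        System.out.println(Arrays.toString(decode(bits, offsets.length)));
    }
}
```

Gaps between consecutive offsets are usually much smaller than the offsets themselves, which is why coding gaps with a variable-length code tends to save space.
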

TERMS_EXTENSION
public static final String TERMS_EXTENSION
Standard extension for the file containing terms in MutableString.writeSelfDelimUTF8(java.io.OutputStream) format.
- See Also: Constant Field Values

TERM_OFFSETS_EXTENSION
public static final String TERM_OFFSETS_EXTENSION
Standard extension for the file containing term offsets stored as δ-encoded gaps.
- See Also: Constant Field Values

NONTERMS_EXTENSION
public static final String NONTERMS_EXTENSION
Standard extension for the file containing nonterms in MutableString.writeSelfDelimUTF8(java.io.OutputStream) format.
- See Also: Constant Field Values

NONTERM_OFFSETS_EXTENSION
public static final String NONTERM_OFFSETS_EXTENSION
Standard extension for the file containing nonterm offsets stored as δ-encoded gaps.
- See Also: Constant Field Values

STATS_EXTENSION
public static final String STATS_EXTENSION
Standard extension for the stats file.
- See Also: Constant Field Values

Constructor Detail

SimpleCompressedDocumentCollection
protected SimpleCompressedDocumentCollection(String basename, long documents, long terms, long nonTerms, boolean exact, DocumentFactory factory)

Method Detail

filename
public void filename(CharSequence filename) throws IOException
Description copied from class: AbstractDocumentSequence
Does nothing.
- Specified by: filename in interface DocumentSequence
- Overrides: filename in class AbstractDocumentSequence
- Parameters: filename - the filename of this document sequence.
- Throws: IOException

copy
public DocumentCollection copy()
- Specified by: copy in interface DocumentCollection
- Specified by: copy in interface FlyweightPrototype<DocumentCollection>

document
public Document document(long index) throws IOException
Description copied from interface: DocumentCollection
Returns the document given its index.
- Specified by: document in interface DocumentCollection
- Parameters: index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns: the index-th document.
- Throws: IOException

metadata
public Reference2ObjectMap<Enum<?>,Object> metadata(long index) throws IOException
Description copied from interface: DocumentCollection
Returns the metadata map for a document.
- Specified by: metadata in interface DocumentCollection
- Parameters: index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns: the metadata map for the document.
- Throws: IOException

size
public long size()
Description copied from interface: DocumentCollection
Returns the number of documents in this collection.
- Specified by: size in interface DocumentCollection
- Returns: the number of documents in this collection.

stream
public InputStream stream(long index) throws IOException
Description copied from interface: DocumentCollection
Returns an input stream for the raw content of a document.
- Specified by: stream in interface DocumentCollection
- Parameters: index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns: the raw content of the document as an input stream.
- Throws: IOException

close
public void close() throws IOException
Description copied from interface: DocumentSequence
Closes this document sequence, releasing all resources.
You should always call this method after having finished with this document sequence. Implementations are invited to call this method in a finaliser as a safety net (even better, implement SafelyCloseable), but since there is no guarantee as to when finalisers are invoked, you should not depend on this behaviour.
- Specified by: close in interface AutoCloseable
- Specified by: close in interface Closeable
- Specified by: close in interface DocumentSequence
- Overrides: close in class AbstractDocumentSequence
- Throws: IOException
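
Since this class implements Closeable and AutoCloseable, the advice to always close the sequence maps naturally onto a try-with-resources statement. The sketch below uses a hypothetical stand-in class, `StubSequence`, so that it runs without the library. Its `size()` value and its `close()` without a checked exception are simplifications; with a real collection the instance would come from AbstractDocumentSequence.load(CharSequence), and close() may throw IOException.

```java
import java.io.Closeable;

/** Hypothetical usage sketch; StubSequence stands in for a real document sequence. */
final class CloseSketch {
    /** Uses the sequence and guarantees that close() runs, even if size() throws. */
    static StubSequence useAndClose() {
        StubSequence sequence = new StubSequence();
        try (StubSequence s = sequence) {
            System.out.println("documents: " + s.size());
        }
        return sequence;
    }

    public static void main(String[] args) {
        System.out.println("closed: " + useAndClose().isClosed());
    }
}

/** Models only the closing behaviour; everything else is made up for illustration. */
final class StubSequence implements Closeable {
    private boolean closed;

    long size() {
        if (closed) throw new IllegalStateException("already closed");
        return 3; // made-up number of documents
    }

    boolean isClosed() {
        return closed;
    }

    @Override
    public void close() {
        closed = true; // a real implementation would release its files here
    }
}
```
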

factory
public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.
Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.
- Specified by: factory in interface DocumentSequence
- Returns: the factory used by this sequence.

optimize
public static void optimize(CharSequence basename) throws IOException, ClassNotFoundException
- Throws: IOException, ClassNotFoundException

main
public static void main(String[] arg) throws IOException, com.martiansoftware.jsap.JSAPException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException, org.apache.commons.configuration.ConfigurationException, ClassNotFoundException
- Throws: IOException, com.martiansoftware.jsap.JSAPException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException, org.apache.commons.configuration.ConfigurationException, ClassNotFoundException