Class TRECDocumentCollection
- java.lang.Object
  - it.unimi.di.big.mg4j.document.AbstractDocumentSequence
    - it.unimi.di.big.mg4j.document.AbstractDocumentCollection
      - it.unimi.di.big.mg4j.document.TRECDocumentCollection
- All Implemented Interfaces:
DocumentCollection, DocumentSequence, SafelyCloseable, FlyweightPrototype<DocumentCollection>, Closeable, Serializable, AutoCloseable
public class TRECDocumentCollection extends AbstractDocumentCollection implements Serializable
A collection for the TREC GOV2 data set.
The documents are stored as a set of descriptors, each recording the (possibly gzipped) file that contains the document and the start and stop positions within that file. To manage descriptors later we rely on SegmentedInputStream.
To interpret a file, we read up to <DOC> and place a start marker there; we then advance to the header and store the URI. An intermediate marker is placed at the end of the doc header tag, and a stop marker just before </DOC>.
The resulting SegmentedInputStream has two segments per document. By using a CompositeDocumentFactory, the first segment is parsed by a TRECHeaderDocumentFactory, whereas the second segment is parsed by a user-provided factory (usually, an HtmlDocumentFactory).
The collection provides both sequential access to all documents via the iterator and random access to a given document. However, the two operations are performed very differently: sequential access is much more efficient than calling document(long) repeatedly.
- Author:
- Alessio Orlandi, Luca Natali
- See Also:
- Serialized Form
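The marker placement described in the class overview can be illustrated with a small, self-contained sketch. It is not the MG4J implementation: the class TrecMarkerSketch, its Markers type, and the scan and indexOf helpers are hypothetical names introduced here, and the sketch assumes that the start marker sits right after <DOC>, that the intermediate marker sits right after </DOCHDR>, and that the whole file is already in memory as a byte array.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hypothetical stand-in, not the MG4J parsing code. */
final class TrecMarkerSketch {
    static final byte[] DOC_OPEN = "<DOC>".getBytes(StandardCharsets.US_ASCII);
    static final byte[] DOC_CLOSE = "</DOC>".getBytes(StandardCharsets.US_ASCII);
    static final byte[] DOCHDR_CLOSE = "</DOCHDR>".getBytes(StandardCharsets.US_ASCII);

    /** Hypothetical descriptor: three offsets delimiting two segments. */
    static final class Markers {
        final int start;        // assumed: first byte after <DOC>
        final int intermediate; // assumed: first byte after </DOCHDR>
        final int stop;         // first byte of </DOC>

        Markers(int start, int intermediate, int stop) {
            this.start = start;
            this.intermediate = intermediate;
            this.stop = stop;
        }
    }

    /** Returns the first index >= from at which pattern occurs in data, or -1. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    /** Scans the data and records one Markers triple per well-formed document. */
    static List<Markers> scan(byte[] data) {
        List<Markers> result = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = indexOf(data, DOC_OPEN, pos);
            if (open < 0) break;
            int headerEnd = indexOf(data, DOCHDR_CLOSE, open);
            int close = indexOf(data, DOC_CLOSE, open);
            // Stop at the first malformed document rather than guessing.
            if (headerEnd < 0 || close < 0 || headerEnd > close) break;
            result.add(new Markers(open + DOC_OPEN.length,
                    headerEnd + DOCHDR_CLOSE.length, close));
            pos = close + DOC_CLOSE.length;
        }
        return result;
    }
}
```

Under these assumptions, the bytes in [start, intermediate) form the header segment and the bytes in [intermediate, stop) form the content segment, which mirrors the two segments per document mentioned above.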
-
-
Nested Class Summary
Nested Classes (modifier and type, class, description):
- protected static class TRECDocumentCollection.TRECDocumentDescriptor: A compact description of the location and of the internal segmentation of a TREC document inside a file.
-
Nested classes/interfaces inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentCollection
AbstractDocumentCollection.PropertyKeys
-
-
Field Summary
Fields (modifier and type, field, description):
- static String DEFAULT_BUFFER_SIZE: Default buffer size, set up after some experiments.
- protected ObjectBigArrayBigList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors: The list of document descriptors.
- protected static byte[] DOC_CLOSE
- protected static byte[] DOC_OPEN
- protected static byte[] DOCHDR_CLOSE
- protected static byte[] DOCHDR_OPEN
- protected static byte[] DOCNO_CLOSE
- protected static byte[] DOCNO_OPEN
- protected DocumentFactory factory: The document factory.
- protected String[] file: The list of the files containing the documents.
- protected SegmentedInputStream lastStream: The last returned stream.
- protected boolean useGzip: Whether the files in file are gzipped.
-
Fields inherited from interface it.unimi.di.big.mg4j.document.DocumentCollection
DEFAULT_EXTENSION
-
-
Constructor Summary
Constructors (modifier, constructor, description):
- TRECDocumentCollection(String[] file, DocumentFactory factory, int bufferSize, boolean useGzip): Creates a new TREC collection by parsing the given files.
- protected TRECDocumentCollection(String[] file, DocumentFactory factory, ObjectBigArrayBigList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors, int bufferSize, boolean useGzip): Copy constructor (that is, the one used by copy()).
-
Method Summary
Methods (modifier and type, method, description):
- void close(): Closes this document sequence, releasing all resources.
- TRECDocumentCollection copy()
- Document document(long n): Returns the document given its index.
- protected static boolean equals(byte[] a, int len, byte[] b)
- DocumentFactory factory(): Returns the factory used by this sequence.
- DocumentIterator iterator(): Returns an iterator over the sequence of documents.
- static void main(String[] arg)
- void merge(TRECDocumentCollection other): Merges another collection into this one: the gzFile array is rebuilt with the other object's array appended, and the descriptors are concatenated, rebuilding them all.
- Reference2ObjectMap<Enum<?>,Object> metadata(long index): Returns the metadata map for a document.
- protected void parseContent(int fileIndex, InputStream is)
- long size(): Returns the number of documents in this collection.
- InputStream stream(long n): Returns an input stream for the raw content of a document.
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentCollection
ensureDocumentIndex, printAllDocuments, toString
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentSequence
filename, finalize, load
-
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface it.unimi.di.big.mg4j.document.DocumentSequence
filename
-
-
-
-
Field Detail
-
DEFAULT_BUFFER_SIZE
public static final String DEFAULT_BUFFER_SIZE
Default buffer size, set up after some experiments.
- See Also:
- Constant Field Values
-
file
protected String[] file
The list of the files containing the documents.
-
useGzip
protected final boolean useGzip
Whether the files in file are gzipped.
-
factory
protected DocumentFactory factory
The document factory.
-
descriptors
protected transient ObjectBigArrayBigList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors
The list of document descriptors. We assume that descriptors within the same file are contiguous.
-
lastStream
protected SegmentedInputStream lastStream
The last returned stream.
-
DOC_OPEN
protected static final byte[] DOC_OPEN
-
DOC_CLOSE
protected static final byte[] DOC_CLOSE
-
DOCNO_OPEN
protected static final byte[] DOCNO_OPEN
-
DOCNO_CLOSE
protected static final byte[] DOCNO_CLOSE
-
DOCHDR_OPEN
protected static final byte[] DOCHDR_OPEN
-
DOCHDR_CLOSE
protected static final byte[] DOCHDR_CLOSE
-
-
Constructor Detail
-
TRECDocumentCollection
protected TRECDocumentCollection(String[] file, DocumentFactory factory, ObjectBigArrayBigList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors, int bufferSize, boolean useGzip)
Copy constructor (that is, the one used by copy()). It just initializes final fields.
-
TRECDocumentCollection
public TRECDocumentCollection(String[] file, DocumentFactory factory, int bufferSize, boolean useGzip) throws IOException
Creates a new TREC collection by parsing the given files.
- Parameters:
file - an array of file names containing documents in TREC GOV2 format.
factory - the document factory (usually, a composite one).
bufferSize - the buffer size.
useGzip - true iff the files are gzipped.
- Throws:
IOException
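The bufferSize and useGzip parameters suggest that each file is opened through a buffered stream that is optionally wrapped in a gzip decoder. The following is a minimal sketch of that idea using only the standard library; GzipOpenSketch and its open method are hypothetical names, and nothing here is taken from the MG4J source.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

/** Illustrative only: a hypothetical helper, not part of MG4J. */
final class GzipOpenSketch {
    /** Opens a file with the given buffer size, decompressing it if useGzip is true. */
    static InputStream open(Path file, int bufferSize, boolean useGzip) throws IOException {
        InputStream raw = new BufferedInputStream(Files.newInputStream(file), bufferSize);
        if (!useGzip) return raw;
        try {
            // The GZIPInputStream constructor reads the gzip header and may throw.
            return new GZIPInputStream(raw, bufferSize);
        } catch (IOException e) {
            raw.close(); // do not leak the underlying stream on a bad header
            throw e;
        }
    }
}
```

A caller would typically use the returned stream in a try-with-resources statement, so that the file is closed even if parsing fails.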
-
-
Method Detail
-
equals
protected static boolean equals(byte[] a, int len, byte[] b)
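The Javadoc gives no description for this helper. One plausible reading, offered only as an assumption, is that a is a read buffer of which the first len bytes are valid, and that the method checks whether those bytes are exactly the tag b (for instance DOC_OPEN). The sketch below implements that reading; PrefixEqualsSketch is a hypothetical name and the actual semantics may differ.

```java
/** Illustrative only: one possible meaning of equals(byte[], int, byte[]). */
final class PrefixEqualsSketch {
    /**
     * Returns true iff the first len bytes of a are exactly the bytes of b.
     * Assumption: a length mismatch counts as inequality.
     */
    static boolean equals(byte[] a, int len, byte[] b) {
        if (len != b.length || len > a.length) return false;
        for (int i = 0; i < len; i++) {
            if (a[i] != b[i]) return false;
        }
        return true;
    }
}
```

Comparing against a length-limited buffer in this way avoids allocating a new array or String for every line read from the file.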
-
parseContent
protected void parseContent(int fileIndex, InputStream is) throws IOException
- Throws:
IOException
-
copy
public TRECDocumentCollection copy()
- Specified by:
copy in interface DocumentCollection
- Specified by:
copy in interface FlyweightPrototype<DocumentCollection>
-
size
public long size()
Description copied from interface: DocumentCollection
Returns the number of documents in this collection.
- Specified by:
size in interface DocumentCollection
- Returns:
- the number of documents in this collection.
-
document
public Document document(long n) throws IOException
Description copied from interface: DocumentCollection
Returns the document given its index.
- Specified by:
document in interface DocumentCollection
- Parameters:
n - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the index-th document.
- Throws:
IOException
-
stream
public InputStream stream(long n) throws IOException
Description copied from interface: DocumentCollection
Returns an input stream for the raw content of a document.
- Specified by:
stream in interface DocumentCollection
- Parameters:
n - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the raw content of the document as an input stream.
- Throws:
IOException
-
metadata
public Reference2ObjectMap<Enum<?>,Object> metadata(long index)
Description copied from interface: DocumentCollection
Returns the metadata map for a document.
- Specified by:
metadata in interface DocumentCollection
- Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the metadata map for the document.
-
factory
public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.
Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.
- Specified by:
factory in interface DocumentSequence
- Returns:
- the factory used by this sequence.
-
close
public void close() throws IOException
Description copied from interface: DocumentSequence
Closes this document sequence, releasing all resources.
You should always call this method after having finished with this document sequence. Implementations are invited to call this method in a finaliser as a safety net (even better, implement SafelyCloseable), but since there is no guarantee as to when finalisers are invoked, you should not depend on this behaviour.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Specified by:
close in interface DocumentSequence
- Overrides:
close in class AbstractDocumentSequence
- Throws:
IOException
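Because the class implements AutoCloseable, the advice to always call close() can be followed with a try-with-resources statement instead of relying on a finaliser. The sketch below shows the pattern with a hypothetical stand-in class, FakeCollection, so that it runs without MG4J on the classpath; a real TRECDocumentCollection would take its place.

```java
import java.io.Closeable;
import java.io.IOException;

/** Illustrative only: FakeCollection is a hypothetical stand-in for a document collection. */
final class CloseSketch {
    static final class FakeCollection implements Closeable {
        boolean closed;

        long size() {
            return 3;
        }

        @Override
        public void close() throws IOException {
            closed = true; // a real collection would release its streams here
        }
    }

    /** Uses a collection and returns it after the try block has closed it. */
    static FakeCollection use() throws IOException {
        FakeCollection seen;
        try (FakeCollection collection = new FakeCollection()) {
            seen = collection;
            collection.size(); // work with the collection here
        } // close() is invoked here, even if the body throws
        return seen;
    }
}
```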
-
merge
public void merge(TRECDocumentCollection other)
Merges another collection into this one: the gzFile array is rebuilt with the other object's array appended, and the descriptors are concatenated, rebuilding them all.
It is assumed that the passed object contains no duplicates of documents in the local collection.
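The merge described here can be pictured with a small sketch. It assumes, hypothetically, that each descriptor stores an index into the file array, so that descriptors coming from the other collection must be shifted by the number of files already present. MergeSketch and its Descriptor type are names introduced for illustration, and the real descriptor layout may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a hypothetical model of merging two collections. */
final class MergeSketch {
    /** Hypothetical descriptor: index into the file array plus a start offset. */
    static final class Descriptor {
        final int fileIndex;
        final long start;

        Descriptor(int fileIndex, long start) {
            this.fileIndex = fileIndex;
            this.start = start;
        }
    }

    String[] file;
    final List<Descriptor> descriptors = new ArrayList<>();

    MergeSketch(String... file) {
        this.file = file;
    }

    /** Appends the other collection's files and descriptors to this one. */
    void merge(MergeSketch other) {
        int shift = file.length;
        // Rebuild the file array with the other array appended.
        String[] merged = Arrays.copyOf(file, file.length + other.file.length);
        System.arraycopy(other.file, 0, merged, shift, other.file.length);
        file = merged;
        // Rebuild each incoming descriptor so that it points into the merged array.
        for (Descriptor d : other.descriptors) {
            descriptors.add(new Descriptor(d.fileIndex + shift, d.start));
        }
    }
}
```

Appending in this order keeps descriptors that belong to the same file contiguous, provided they were contiguous in both inputs. The sketch does not check for duplicates, in line with the assumption stated above.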
-
iterator
public DocumentIterator iterator() throws IOException
Description copied from interface: DocumentSequence
Returns an iterator over the sequence of documents.
Warning: this method can be safely called just one time. For instance, implementations based on standard input will usually throw an exception if this method is called twice.
Implementations may decide to override this restriction (in particular, if they implement DocumentCollection). Usually, however, it is not possible to obtain two iterators at the same time on a collection.
- Specified by:
iterator in interface DocumentSequence
- Overrides:
iterator in class AbstractDocumentCollection
- Returns:
- an iterator over the sequence of documents.
- Throws:
IOException
- See Also:
DocumentCollection
-
main
public static void main(String[] arg) throws IOException, com.martiansoftware.jsap.JSAPException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
- Throws:
IOException
com.martiansoftware.jsap.JSAPException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
-
-