Class FileSetDocumentCollection
- java.lang.Object
-
- it.unimi.di.big.mg4j.document.AbstractDocumentSequence
-
- it.unimi.di.big.mg4j.document.AbstractDocumentCollection
-
- it.unimi.di.big.mg4j.document.FileSetDocumentCollection
-
- All Implemented Interfaces:
DocumentCollection, DocumentSequence, SafelyCloseable, FlyweightPrototype<DocumentCollection>, Closeable, Serializable, AutoCloseable
public class FileSetDocumentCollection extends AbstractDocumentCollection implements Serializable
A DocumentCollection corresponding to a given set of files.
This class provides a main method with a flexible syntax that serialises into a document collection a list of files given on the command line or piped into standard input. Optionally, you can provide a parallel list of URIs that will be associated with each file.
Warning: the number of files is limited by Integer.MAX_VALUE.
- See Also:
- Serialized Form
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentCollection
AbstractDocumentCollection.PropertyKeys
-
-
Field Summary
-
Fields inherited from interface it.unimi.di.big.mg4j.document.DocumentCollection
DEFAULT_EXTENSION
-
-
Constructor Summary
Constructors
Constructor: Description
- FileSetDocumentCollection(String[] file, DocumentFactory factory): Builds a document collection corresponding to a given set of files specified as an array.
- FileSetDocumentCollection(String[] file, String[] uri, DocumentFactory factory): Builds a document collection corresponding to a given set of files specified as an array and a parallel array of URIs, one for each file.
-
Method Summary
Modifier and Type, Method: Description
- void close(): Closes this document sequence, releasing all resources.
- FileSetDocumentCollection copy()
- Document document(long index): Returns the document given its index.
- DocumentFactory factory(): Returns the factory used by this sequence.
- static void main(String[] arg)
- Reference2ObjectMap<Enum<?>,Object> metadata(long index): Returns the metadata map for a document.
- long size(): Returns the number of documents in this collection.
- InputStream stream(long index): Returns an input stream for the raw content of a document.
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentCollection
ensureDocumentIndex, iterator, printAllDocuments, toString
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentSequence
filename, finalize, load
-
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface it.unimi.di.big.mg4j.document.DocumentSequence
filename
-
-
-
-
Constructor Detail
-
FileSetDocumentCollection
public FileSetDocumentCollection(String[] file, DocumentFactory factory)
Builds a document collection corresponding to a given set of files specified as an array.
Beware. This class is not guaranteed to work if files are deleted or modified after creation!
- Parameters:
file - an array containing the files that will be contained in the collection.
factory - the factory that will be used to create documents.
-
FileSetDocumentCollection
public FileSetDocumentCollection(String[] file, String[] uri, DocumentFactory factory)
Builds a document collection corresponding to a given set of files specified as an array and a parallel array of URIs, one for each file.
Beware. This class is not guaranteed to work if files are deleted or modified after creation!
- Parameters:
file - an array containing the files that will be contained in the collection.
uri - an array, parallel to file, containing URIs to be associated with each element of file.
factory - the factory that will be used to create documents.
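Since uri is documented as parallel to file, a caller may want to check that the two arrays line up before passing them to the constructor. The following is a minimal sketch of such a check; ParallelArrays and requireParallel are hypothetical names that are not part of MG4J, and this page does not state how the constructor itself reacts to arrays of different lengths.

```java
import java.util.Objects;

/** Hypothetical helper, NOT part of MG4J: checks that two arrays are parallel. */
final class ParallelArrays {
    private ParallelArrays() {}

    /** Throws if either array is null or if the arrays differ in length. */
    static void requireParallel(String[] file, String[] uri) {
        Objects.requireNonNull(file, "file");
        Objects.requireNonNull(uri, "uri");
        if (file.length != uri.length) {
            throw new IllegalArgumentException(
                "file has " + file.length + " elements but uri has " + uri.length);
        }
    }

    public static void main(String[] args) {
        // Illustrative values only; any paths and URIs would do.
        String[] file = { "docs/a.txt", "docs/b.txt" };
        String[] uri = { "http://example.com/a", "http://example.com/b" };
        requireParallel(file, uri);
        // Element i of uri is the URI associated with element i of file.
        for (int i = 0; i < file.length; i++) {
            System.out.println(file[i] + " -> " + uri[i]);
        }
        // The checked arrays could then be passed to
        // FileSetDocumentCollection(String[] file, String[] uri, DocumentFactory factory).
    }
}
```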
-
-
Method Detail
-
factory
public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.
Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.
- Specified by:
factory in interface DocumentSequence
- Returns:
- the factory used by this sequence.
-
size
public long size()
Description copied from interface: DocumentCollection
Returns the number of documents in this collection.
- Specified by:
size in interface DocumentCollection
- Returns:
- the number of documents in this collection.
-
metadata
public Reference2ObjectMap<Enum<?>,Object> metadata(long index)
Description copied from interface: DocumentCollection
Returns the metadata map for a document.
- Specified by:
metadata in interface DocumentCollection
- Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the metadata map for the document.
-
document
public Document document(long index) throws IOException
Description copied from interface: DocumentCollection
Returns the document given its index.
- Specified by:
document in interface DocumentCollection
- Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the index-th document.
- Throws:
IOException
-
stream
public InputStream stream(long index) throws IOException
Description copied from interface: DocumentCollection
Returns an input stream for the raw content of a document.
- Specified by:
stream in interface DocumentCollection
- Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the raw content of the document as an input stream.
- Throws:
IOException
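The methods metadata, document and stream all take a long index between 0 (inclusive) and size() (exclusive), while the files themselves are supplied as a Java array. The sketch below shows one way such a range check and the narrowing from long to an array position can fit together, which also illustrates why an array-backed collection cannot hold more than Integer.MAX_VALUE entries. LongIndexedFiles is a hypothetical stand-in, not the MG4J implementation, and the exception type it throws is an assumption.

```java
/** Hypothetical stand-in, NOT the MG4J class: long-indexed access to a file array. */
final class LongIndexedFiles {
    private final String[] file;

    LongIndexedFiles(String[] file) {
        // Defensive copy so later changes to the caller's array are not visible here.
        this.file = file.clone();
    }

    /** Number of entries, widened to long to mirror the size() signature. */
    long size() {
        return file.length;
    }

    /** Returns the file name at the given index, 0 <= index < size(). */
    String fileAt(long index) {
        if (index < 0 || index >= size()) {
            // Assumption: the exception type used by the real library may differ.
            throw new IndexOutOfBoundsException(
                "Index " + index + " is not in [0.." + size() + ")");
        }
        // The cast is safe: index < file.length, and array lengths fit in an int.
        return file[(int) index];
    }

    public static void main(String[] args) {
        LongIndexedFiles files = new LongIndexedFiles(new String[] { "a.txt", "b.txt" });
        for (long i = 0; i < files.size(); i++) {
            System.out.println(i + ": " + files.fileAt(i));
        }
    }
}
```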
-
copy
public FileSetDocumentCollection copy()
- Specified by:
copy in interface DocumentCollection
- Specified by:
copy in interface FlyweightPrototype<DocumentCollection>
-
close
public void close() throws IOException
Description copied from interface: DocumentSequence
Closes this document sequence, releasing all resources.
You should always call this method after having finished with this document sequence. Implementations are invited to call this method in a finaliser as a safety net (even better, implement SafelyCloseable), but since there is no guarantee as to when finalisers are invoked, you should not depend on this behaviour.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Specified by:
close in interface DocumentSequence
- Overrides:
close in class AbstractDocumentSequence
- Throws:
IOException
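Because this class implements AutoCloseable, a try-with-resources statement is one way to make sure close() is always called, without relying on finalisers. The sketch below demonstrates the pattern with a hypothetical stand-in resource, TrackedResource, so that it runs without MG4J on the classpath; with the real class, the resource declared in the try header would be the collection, and the enclosing code would have to handle IOException.

```java
import java.io.Closeable;

/** Demonstrates try-with-resources using a hypothetical stand-in resource. */
final class CloseDemo {
    /** Hypothetical resource, NOT part of MG4J: remembers whether it was closed. */
    static final class TrackedResource implements Closeable {
        private boolean closed;

        long size() {
            if (closed) throw new IllegalStateException("already closed");
            return 3;
        }

        @Override
        public void close() {
            // An overriding method may declare fewer checked exceptions than Closeable.close().
            closed = true;
        }

        boolean isClosed() {
            return closed;
        }
    }

    /** Uses a resource inside try-with-resources and returns it for inspection. */
    static TrackedResource useAndClose() {
        TrackedResource resource = new TrackedResource();
        try (TrackedResource r = resource) {
            r.size(); // work with the resource while it is open
        } // close() is invoked here, even if the body throws
        return resource;
    }

    public static void main(String[] args) {
        System.out.println("closed after use: " + useAndClose().isClosed());
    }
}
```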
-
main
public static void main(String[] arg) throws IOException, com.martiansoftware.jsap.JSAPException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
- Throws:
IOException
com.martiansoftware.jsap.JSAPException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
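The class description says that main accepts a list of files given on the command line or piped into standard input, but this page does not document the exact syntax. The sketch below only illustrates the general idea of turning piped input into the String[] expected by the constructors. It assumes one file name per line and skips empty lines; both choices are assumptions, and FileListReader is a hypothetical name that is not part of MG4J.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical helper, NOT part of MG4J: collects file names from a character stream. */
final class FileListReader {
    private FileListReader() {}

    /** Reads one file name per line (an assumption), ignoring empty lines. */
    static String[] readFileNames(Reader in) {
        // BufferedReader.lines() wraps any IOException in an UncheckedIOException.
        return new BufferedReader(in).lines()
            .filter(line -> !line.isEmpty())
            .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // A StringReader stands in for standard input here; with real piped input,
        // new InputStreamReader(System.in) could be used instead.
        String[] file = readFileNames(new StringReader("a.txt\nb.txt\n"));
        System.out.println(file.length + " file names read");
    }
}
```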
-
-