java.lang.Object
  it.unimi.di.mg4j.document.AbstractDocumentSequence
    it.unimi.di.mg4j.document.AbstractDocumentCollection
      it.unimi.di.mg4j.document.ZipDocumentCollection
public class ZipDocumentCollection
A document collection stored in a zip file.
Each instance of this class has an associated zip file. Each zip entry corresponds to a document:
the title is recorded in the comment field, whereas the
URI is written with MutableString.writeSelfDelimUTF8(java.io.OutputStream)
directly to the zipped output stream. When building an exact
ZipDocumentCollection, subsequent word/nonword pairs are written in the same way and
delimited by two empty strings. If the collection is not exact, only words are written,
delimited by an empty string. Non-text fields are written directly to the zipped output stream
as serialised objects.
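
The layout described above can be illustrated with a minimal sketch that uses only java.util.zip. It is not the MG4J implementation: DataOutputStream.writeUTF stands in for MutableString.writeSelfDelimUTF8, the entry name (the document index as a decimal string) is an assumption, and ZipLayoutSketch, writeDocument, readTokens and roundTrip are hypothetical names. The sketch also assumes that words are non-empty, so that an empty string can act as a delimiter.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

class ZipLayoutSketch {
    /** Writes one document as one zip entry; the title goes into the entry comment. */
    public static void writeDocument(ZipOutputStream zip, int index, String title, String uri,
            String[] words, String[] nonWords, boolean exact) throws IOException {
        ZipEntry entry = new ZipEntry(Integer.toString(index)); // entry name is an assumption
        entry.setComment(title);
        zip.putNextEntry(entry);
        // Do not close this wrapper: that would close the whole zip stream.
        DataOutputStream out = new DataOutputStream(zip);
        out.writeUTF(uri); // stand-in for writeSelfDelimUTF8
        for (int i = 0; i < words.length; i++) {
            out.writeUTF(words[i]);
            if (exact) out.writeUTF(nonWords[i]); // nonwords are kept only in exact mode
        }
        out.writeUTF("");            // delimiter: one empty string...
        if (exact) out.writeUTF(""); // ...or two in exact mode
        out.flush();
        zip.closeEntry();
    }

    /** Returns title, URI and then the stored tokens of the given document. */
    public static List<String> readTokens(ZipFile zipFile, int index, boolean exact) throws IOException {
        ZipEntry entry = zipFile.getEntry(Integer.toString(index));
        List<String> result = new ArrayList<>();
        result.add(entry.getComment());
        try (DataInputStream in = new DataInputStream(zipFile.getInputStream(entry))) {
            result.add(in.readUTF());
            for (;;) {
                String word = in.readUTF();
                String nonWord = exact ? in.readUTF() : "";
                if (word.isEmpty() && nonWord.isEmpty()) break; // delimiter reached
                result.add(word);
                if (exact) result.add(nonWord);
            }
        }
        return result;
    }

    /** Writes a one-document zip file to a temporary location and reads it back. */
    public static List<String> roundTrip(boolean exact) {
        try {
            File file = File.createTempFile("sketch", ".zip");
            file.deleteOnExit();
            try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(file))) {
                writeDocument(zip, 0, "A title", "http://example.com/0",
                        new String[] { "Hello", "world" }, new String[] { ", ", "!" }, exact);
            }
            try (ZipFile zipFile = new ZipFile(file)) {
                return readTokens(zipFile, 0, exact);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(true));
        System.out.println(roundTrip(false));
    }
}
```

The sketch reads entries through ZipFile rather than ZipInputStream because entry comments are stored in the zip central directory, which a purely sequential reader does not see.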
The collection will produce the same documents as the original sequence whence it was produced, in the following sense:
String.valueOf(int)),
followed by a pair of strings for each fragment (the first string being the document specifier,
and the second being the associated text);
Like any other collection, this collection is serialized to a file, but the serialized object refers to a separate
zip file that contains the documents themselves. Please use AbstractDocumentSequence.load(CharSequence)
to load instances of this collection.
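
The split between a small serialized object and a separate zip file can be sketched in plain Java as follows. This is only an illustration of the idea, not MG4J code: ZipBackedHandle and its members are hypothetical, and keeping the open zip file in a transient field that is reopened on demand is an assumption about how such a class might be written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.ZipFile;

class ZipBackedHandle implements Serializable {
    private static final long serialVersionUID = 1L;

    // Only these small fields end up in the serialized form.
    private final String zipFilename;
    private final int numberOfDocuments;
    private final boolean exact;
    // The open zip file is never serialized; it is reopened lazily (assumption).
    private transient ZipFile zipFile;

    public ZipBackedHandle(String zipFilename, int numberOfDocuments, boolean exact) {
        this.zipFilename = zipFilename;
        this.numberOfDocuments = numberOfDocuments;
        this.exact = exact;
    }

    public String zipFilename() { return zipFilename; }
    public int size() { return numberOfDocuments; }
    public boolean exact() { return exact; }
    public boolean isOpen() { return zipFile != null; }

    /** Opens the zip file named at construction time on first use. */
    public ZipFile ensureOpen() throws IOException {
        if (zipFile == null) zipFile = new ZipFile(zipFilename);
        return zipFile;
    }

    public static byte[] serialize(ZipBackedHandle handle) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(handle);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static ZipBackedHandle deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (ZipBackedHandle) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ZipBackedHandle restored = deserialize(serialize(new ZipBackedHandle("docs.zip", 3, true)));
        System.out.println(restored.zipFilename() + " " + restored.size() + " " + restored.isOpen());
    }
}
```

With a design of this kind the serialized object stays small, but it is only usable while the zip file it names remains reachable at the recorded path.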
Note that the zip format is not designed for a large number of files. This class is mainly a useful example,
and a handy way to quickly build a collection containing all fields at indexing time. For a more efficient
kind of collection, see SimpleCompressedDocumentCollection.
Warning: the Reader returned by Document.content(int)
for documents produced by this factory is obtained simply by concatenating the words and nonwords returned by
the word reader for that field. If the collection is not exact, each nonword is replaced by a space.
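
The effect of this warning can be shown with a small self-contained sketch. ContentSketch and content are hypothetical names, and the code only mimics the concatenation rule stated above; whether the real class also emits a space after the last word is an assumption of this sketch.

```java
class ContentSketch {
    /** Rebuilds field content from word/nonword pairs; in non-exact mode each nonword becomes a space. */
    public static String content(String[] words, String[] nonWords, boolean exact) {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            builder.append(words[i]);
            builder.append(exact ? nonWords[i] : " ");
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        String[] words = { "Hello", "world" };
        String[] nonWords = { ", ", "!" };
        System.out.println("[" + content(words, nonWords, true) + "]");
        System.out.println("[" + content(words, nonWords, false) + "]");
    }
}
```

In exact mode the original punctuation survives, whereas in non-exact mode it is lost, so code that needs the original text should not rely on a non-exact collection.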

| Nested Class Summary | |
|---|---|
| static class | ZipDocumentCollection.PropertyKeys: Symbolic names for common properties of a DocumentCollection. |
| protected static class | ZipDocumentCollection.ZipFactory: A factory tightly coupled to a ZipDocumentCollection. |

| Field Summary | |
|---|---|
| static String | ZIP_EXTENSION |

| Fields inherited from interface it.unimi.di.mg4j.document.DocumentCollection |
|---|
| DEFAULT_EXTENSION |

| Constructor Summary | |
|---|---|
| ZipDocumentCollection(String zipFilename, DocumentFactory underlyingFactory, int numberOfDocuments, boolean exact) | Constructs a document collection (for reading) corresponding to a given zip collection file. |

| Method Summary | |
|---|---|
| void | close(): Closes this document sequence, releasing all resources. |
| ZipDocumentCollection | copy() |
| Document | document(int index): Returns the document given its index. |
| DocumentFactory | factory(): Returns the factory used by this sequence. |
| void | filename(CharSequence filename): Does nothing. |
| DocumentIterator | iterator(): Returns an iterator over the sequence of documents. |
| Reference2ObjectMap<Enum<?>,Object> | metadata(int index): Returns the metadata map for a document. |
| int | size(): Returns the number of documents in this collection. |
| InputStream | stream(int index): Returns an input stream for the raw content of a document. |

| Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentCollection |
|---|
| ensureDocumentIndex, main, printAllDocuments, toString |

| Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentSequence |
|---|
| finalize, load |

| Methods inherited from class java.lang.Object |
|---|
| clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |

| Field Detail |
|---|
public static final String ZIP_EXTENSION
| Constructor Detail |
|---|
public ZipDocumentCollection(String zipFilename,
DocumentFactory underlyingFactory,
int numberOfDocuments,
boolean exact)
Parameters:
zipFilename - the filename of the zip collection.
underlyingFactory - the underlying document factory.
numberOfDocuments - the number of documents.
exact - true iff this is an exact reproduction of the original sequence.

| Method Detail |
|---|
public void filename(CharSequence filename)
throws IOException
Description copied from class: AbstractDocumentSequence
Does nothing.
Specified by: filename in interface DocumentSequence
Overrides: filename in class AbstractDocumentSequence
Parameters: filename - the filename of this document sequence.
Throws: IOException

public ZipDocumentCollection copy()
Specified by: copy in interface DocumentCollection
Specified by: copy in interface FlyweightPrototype<DocumentCollection>

public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence. Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.
Specified by: factory in interface DocumentSequence

public int size()
Description copied from interface: DocumentCollection
Returns the number of documents in this collection.
Specified by: size in interface DocumentCollection

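public Document document(int index)
throws IOException
Description copied from interface: DocumentCollection
Returns the document given its index.
Specified by: document in interface DocumentCollection
Parameters: index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns: the index-th document.
Throws: IOException

public Reference2ObjectMap<Enum<?>,Object> metadata(int index)
Description copied from interface: DocumentCollection
Returns the metadata map for a document.
Specified by: metadata in interface DocumentCollection
Parameters: index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
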
public InputStream stream(int index)
throws IOException
Description copied from interface: DocumentCollection
Returns an input stream for the raw content of a document.
Specified by: stream in interface DocumentCollection
Parameters: index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Throws: IOException

public DocumentIterator iterator()
Description copied from interface: DocumentSequence
Returns an iterator over the sequence of documents.
Warning: this method can be safely called just one time. For instance, implementations based on standard input will usually throw an exception if this method is called twice.
Implementations may decide to override this restriction
(in particular, if they implement DocumentCollection). Usually,
however, it is not possible to obtain two iterators at the
same time on a collection.
Specified by: iterator in interface DocumentSequence
Overrides: iterator in class AbstractDocumentCollection
See Also: DocumentCollection

public void close()
throws IOException
Description copied from interface: DocumentSequence
Closes this document sequence, releasing all resources.
You should always call this method after having finished with this document sequence.
Implementations are invited to call this method in a finaliser as a safety net (even better,
implement SafelyCloseable), but since there
is no guarantee as to when finalisers are invoked, you should not depend on this behaviour.
Specified by: close in interface DocumentSequence
Specified by: close in interface Closeable
Overrides: close in class AbstractDocumentSequence
Throws: IOException