Class WikipediaDocumentCollection
- java.lang.Object
  - it.unimi.di.big.mg4j.document.AbstractDocumentSequence
    - it.unimi.di.big.mg4j.document.AbstractDocumentCollection
      - it.unimi.di.big.mg4j.document.WikipediaDocumentCollection
- All Implemented Interfaces:
DocumentCollection, DocumentSequence, SafelyCloseable, FlyweightPrototype<DocumentCollection>, Closeable, Serializable, AutoCloseable
public class WikipediaDocumentCollection extends AbstractDocumentCollection implements Serializable
A DocumentCollection corresponding to a given set of files in the Yahoo! Wikipedia format.
Warning: this class has no connection whatsoever with WikipediaDocumentSequence.
This class provides a main method with a flexible syntax that serialises into a document collection a list of (possibly gzip'd) files given on the command line or piped into standard input. The files are to be taken from the semantically annotated snapshot of the English Wikipedia distributed by Yahoo!. The position of each record is stored using an EliasFanoMonotoneLongBigList per file, which gives us random access with very little overhead.
Each column of the collection is indexed in parallel, and is accessible using its label as field name. For instance, a query like
Washington ^ WSJ:(B\-E\:PERSON | B\-I\:PERSON)
will search for “Washington”, but only if the term has been annotated as a person name (note the escaping, which is necessary if you use the standard parser). See the it.unimi.di.big.mg4j.search package for more information about the available operators.
See the collection page for more information about the tagging process.
- See Also:
- Serialized Form
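The class description mentions that record positions are kept in one monotone list per file to allow random access. The following is a minimal, standard-library-only sketch of that idea, not MG4J's implementation: a plain long[] stands in for EliasFanoMonotoneLongBigList, records are assumed to be newline-terminated lines, and the names RecordOffsetIndex, buildOffsets and readRecord are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class RecordOffsetIndex {
    /** Scans the file once and returns the byte offset at which each line (record) starts. */
    public static long[] buildOffsets(Path file) throws IOException {
        // Reading the whole file is acceptable for a sketch; a real indexer would stream it.
        byte[] bytes = Files.readAllBytes(file);
        List<Long> starts = new ArrayList<>();
        if (bytes.length > 0) starts.add(0L);
        // A new record starts right after each newline, unless the newline is the last byte.
        for (int i = 0; i < bytes.length - 1; i++) {
            if (bytes[i] == '\n') starts.add((long) (i + 1));
        }
        long[] offsets = new long[starts.size()];
        for (int i = 0; i < offsets.length; i++) offsets[i] = starts.get(i);
        return offsets;
    }

    /** Seeks directly to the given record and reads it, without scanning earlier records. */
    public static String readRecord(Path file, long[] offsets, int index) throws IOException {
        if (index < 0 || index >= offsets.length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long start = offsets[index];
            // The record ends where the next one starts, or at end of file for the last record.
            long end = index + 1 < offsets.length ? offsets[index + 1] : raf.length();
            byte[] buf = new byte[(int) (end - start)];
            raf.seek(start);
            raf.readFully(buf);
            String record = new String(buf, StandardCharsets.UTF_8);
            return record.endsWith("\n") ? record.substring(0, record.length() - 1) : record;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("records", ".txt");
        Files.write(file, "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.UTF_8));
        long[] offsets = buildOffsets(file);
        System.out.println(readRecord(file, offsets, 1));
        Files.delete(file);
    }
}
```

The offsets form an increasing sequence, which is the kind of data a compressed monotone list can store compactly; the sketch keeps them uncompressed for brevity.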
-
Nested Class Summary
Nested Classes
- static class WikipediaDocumentCollection.WhitespaceWordReader
Nested classes/interfaces inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentCollection
AbstractDocumentCollection.PropertyKeys
-
Field Summary
-
Fields inherited from interface it.unimi.di.big.mg4j.document.DocumentCollection
DEFAULT_EXTENSION
-
Constructor Summary
Constructors
- WikipediaDocumentCollection(String[] file, DocumentFactory factory, boolean phrase): Builds a document collection corresponding to a given set of Wikipedia files specified as an array.
- WikipediaDocumentCollection(String[] file, DocumentFactory factory, boolean phrase, boolean gzipped): Builds a document collection corresponding to a given set of (possibly gzip'd) Wikipedia files specified as an array.
- protected WikipediaDocumentCollection(String[] file, DocumentFactory factory, ObjectArrayList<EliasFanoMonotoneLongBigList> pointers, int size, long[] firstDocument, boolean phrase, boolean gzipped)
-
Method Summary
- WikipediaDocumentCollection copy()
- Document document(long index): Returns the document given its index.
- DocumentFactory factory(): Returns the factory used by this sequence.
- DocumentIterator iterator(): Returns an iterator over the sequence of documents.
- static void main(String[] arg)
- Reference2ObjectMap<Enum<?>,Object> metadata(long index): Returns the metadata map for a document.
- long size(): Returns the number of documents in this collection.
- InputStream stream(long index): Returns an input stream for the raw content of a document.
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentCollection
ensureDocumentIndex, printAllDocuments, toString
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentSequence
close, filename, finalize, load
-
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface it.unimi.di.big.mg4j.document.DocumentSequence
close, filename
-
Constructor Detail
-
WikipediaDocumentCollection
public WikipediaDocumentCollection(String[] file, DocumentFactory factory, boolean phrase) throws IOException
Builds a document collection corresponding to a given set of Wikipedia files specified as an array.
Beware: this class is not guaranteed to work if files are deleted or modified after creation!
- Parameters:
file - an array containing the files that will be contained in the collection.
factory - the factory that will be used to create documents.
phrase - whether phrases should be indexed instead of documents.
- Throws:
IOException
-
WikipediaDocumentCollection
public WikipediaDocumentCollection(String[] file, DocumentFactory factory, boolean phrase, boolean gzipped) throws IOException
Builds a document collection corresponding to a given set of (possibly gzip'd) Wikipedia files specified as an array.
Beware: this class is not guaranteed to work if files are deleted or modified after creation!
- Parameters:
file - an array containing the files that will be contained in the collection.
factory - the factory that will be used to create documents.
phrase - whether phrases should be indexed instead of documents.
gzipped - the files in file are gzip'd.
- Throws:
IOException
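How a gzipped flag can change the way a file is opened may be sketched with the standard library alone. This is an illustration under stated assumptions, not the class's actual code: MaybeGzippedReader and open are hypothetical names, and the file content is assumed to be UTF-8 text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class MaybeGzippedReader {
    /** Opens a file as text, decompressing on the fly when gzipped is true. */
    public static BufferedReader open(Path file, boolean gzipped) throws IOException {
        InputStream in = Files.newInputStream(file);
        if (gzipped) {
            try {
                in = new GZIPInputStream(in);
            } catch (IOException e) {
                // The gzip header could not be read: release the raw stream before failing.
                in.close();
                throw e;
            }
        }
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Write a small gzip'd file, then read its first line back.
        Path gz = Files.createTempFile("sample", ".gz");
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(gz))) {
            out.write("first line\nsecond line\n".getBytes(StandardCharsets.UTF_8));
        }
        try (BufferedReader reader = open(gz, true)) {
            System.out.println(reader.readLine());
        }
        Files.delete(gz);
    }
}
```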
-
WikipediaDocumentCollection
protected WikipediaDocumentCollection(String[] file, DocumentFactory factory, ObjectArrayList<EliasFanoMonotoneLongBigList> pointers, int size, long[] firstDocument, boolean phrase, boolean gzipped)
-
Method Detail
-
factory
public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.
Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.
- Specified by:
factory in interface DocumentSequence
- Returns:
- the factory used by this sequence.
-
size
public long size()
Description copied from interface: DocumentCollection
Returns the number of documents in this collection.
- Specified by:
size in interface DocumentCollection
- Returns:
- the number of documents in this collection.
-
metadata
public Reference2ObjectMap<Enum<?>,Object> metadata(long index) throws IOException
Description copied from interface: DocumentCollection
Returns the metadata map for a document.
- Specified by:
metadata in interface DocumentCollection
- Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the metadata map for the document.
- Throws:
IOException
-
document
public Document document(long index) throws IOException
Description copied from interface: DocumentCollection
Returns the document given its index.
- Specified by:
document in interface DocumentCollection
- Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the index-th document.
- Throws:
IOException
-
stream
public InputStream stream(long index) throws IOException
Description copied from interface: DocumentCollection
Returns an input stream for the raw content of a document.
- Specified by:
stream in interface DocumentCollection
- Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
- Returns:
- the raw content of the document as an input stream.
- Throws:
IOException
-
iterator
public DocumentIterator iterator() throws IOException
Description copied from interface: DocumentSequence
Returns an iterator over the sequence of documents.
Warning: this method can safely be called only once. For instance, implementations based on standard input will usually throw an exception if this method is called twice.
Implementations may decide to override this restriction (in particular, if they implement DocumentCollection). Usually, however, it is not possible to obtain two iterators at the same time on a collection.
- Specified by:
iterator in interface DocumentSequence
- Overrides:
iterator in class AbstractDocumentCollection
- Returns:
- an iterator over the sequence of documents.
- Throws:
IOException
- See Also:
DocumentCollection
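The call-once restriction described for iterator() can be illustrated with a small stand-alone sketch. It is not MG4J code: OneShotLineSequence is a hypothetical class, lines of text stand in for documents, and a plain java.util.Iterator is used in place of DocumentIterator. The choice of IllegalStateException for the second call is likewise an assumption made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class OneShotLineSequence implements Iterable<String> {
    private final BufferedReader reader;
    private boolean iteratorRequested;

    public OneShotLineSequence(BufferedReader reader) {
        this.reader = reader;
    }

    /** Returns an iterator over the lines; a second call fails because the stream cannot be rewound. */
    @Override
    public Iterator<String> iterator() {
        if (iteratorRequested) {
            throw new IllegalStateException("iterator() may be called only once on a stream-backed sequence");
        }
        iteratorRequested = true;
        return new Iterator<String>() {
            // Look one line ahead so that hasNext() can answer without consuming input.
            private String next = advance();

            private String advance() {
                try {
                    return reader.readLine();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            @Override
            public boolean hasNext() {
                return next != null;
            }

            @Override
            public String next() {
                if (next == null) throw new NoSuchElementException();
                String current = next;
                next = advance();
                return current;
            }
        };
    }

    public static void main(String[] args) {
        OneShotLineSequence sequence =
                new OneShotLineSequence(new BufferedReader(new StringReader("doc0\ndoc1\n")));
        for (String line : sequence) System.out.println(line);
        try {
            sequence.iterator();
        } catch (IllegalStateException e) {
            System.out.println("second call rejected: " + e.getMessage());
        }
    }
}
```

A collection backed by files with stored record positions does not have this limitation in principle, since it can reopen its files; the sketch covers only the stream-backed case.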
-
copy
public WikipediaDocumentCollection copy()
- Specified by:
copy in interface DocumentCollection
- Specified by:
copy in interface FlyweightPrototype<DocumentCollection>
-
main
public static void main(String[] arg) throws IOException, com.martiansoftware.jsap.JSAPException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
- Throws:
IOException
com.martiansoftware.jsap.JSAPException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException