Package it.unimi.di.big.mg4j.document
Class HtmlDocumentFactory.HtmlDocument
- java.lang.Object
  - it.unimi.di.big.mg4j.document.AbstractDocument
    - it.unimi.di.big.mg4j.document.HtmlDocumentFactory.HtmlDocument
- All Implemented Interfaces:
Document, SafelyCloseable, Closeable, AutoCloseable
- Enclosing class:
HtmlDocumentFactory
protected class HtmlDocumentFactory.HtmlDocument extends AbstractDocument
An HTML document. If a TITLE element is available, it will be used for title() instead of the default value. Parsing is delayed until it is actually necessary, so operations such as getting the document URI do not require parsing.
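The deferred-parsing behaviour described above can be illustrated with a small, self-contained sketch. This is not the MG4J implementation: the class name LazyHtmlSketch, the string metadata keys "uri" and "title", and the regex-based TITLE extraction are all assumptions made for illustration (the real class uses a Reference2ObjectMap keyed by enums and a proper HTML parser).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a lazily parsed HTML document; not MG4J code. */
final class LazyHtmlSketch {
    // Naive TITLE matcher, used only to keep the sketch dependency-free.
    private static final Pattern TITLE =
        Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private final InputStream rawContent;       // raw bytes, untouched until needed
    private final Map<String, Object> metadata; // assumed keys: "uri" and a default "title"
    private boolean parsed;                     // whether rawContent was already consumed
    private String parsedTitle;                 // null if no TITLE element was found

    LazyHtmlSketch(InputStream rawContent, Map<String, Object> metadata) {
        this.rawContent = rawContent;
        this.metadata = metadata;
    }

    boolean isParsed() {
        return parsed;
    }

    /** Metadata-only accessor: never triggers parsing. */
    CharSequence uri() {
        return (CharSequence) metadata.get("uri");
    }

    /** Parses on first use; falls back to the default title from the metadata. */
    CharSequence title() {
        ensureParsed();
        return parsedTitle != null ? parsedTitle : (CharSequence) metadata.get("title");
    }

    private void ensureParsed() {
        if (parsed) return; // parse at most once
        try {
            String html = new String(rawContent.readAllBytes(), StandardCharsets.UTF_8);
            Matcher m = TITLE.matcher(html);
            if (m.find()) parsedTitle = m.group(1).trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        parsed = true;
    }

    public static void main(String[] args) {
        byte[] html = "<html><head><title>Hello</title></head></html>"
            .getBytes(StandardCharsets.UTF_8);
        LazyHtmlSketch doc = new LazyHtmlSketch(
            new ByteArrayInputStream(html),
            Map.of("uri", "http://example.com/", "title", "untitled"));
        System.out.println(doc.uri());      // served from metadata only
        System.out.println(doc.isParsed()); // false: nothing has been parsed yet
        System.out.println(doc.title());    // Hello (this call triggers the parse)
        System.out.println(doc.isParsed()); // true
    }
}
```

The sketch wraps IOException in UncheckedIOException only to keep the example short; the documented ensureParsed() declares IOException instead.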
-
-
Field Summary
Fields
Modifier and Type | Field | Description
protected Reference2ObjectMap<Enum<?>,Object> | metadata |
protected boolean | parsed | Whether we already parsed the document.
protected InputStream | rawContent | The cached raw content.
-
Constructor Summary
Constructors
Modifier | Constructor | Description
protected | HtmlDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata) |
-
Method Summary
Modifier and Type | Method | Description
Object | content(int field) | Returns the content of the given field.
protected void | ensureParsed() |
CharSequence | title() | The title of this document.
String | toString() |
CharSequence | uri() | A URI that is associated with this document.
WordReader | wordReader(int field) | Returns a word reader for the given DocumentFactory.FieldType.TEXT field.
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocument
close, finalize
-
Field Detail
-
metadata
protected final Reference2ObjectMap<Enum<?>,Object> metadata
-
parsed
protected boolean parsed
Whether we already parsed the document.
-
rawContent
protected final InputStream rawContent
The cached raw content.
-
-
Constructor Detail
-
HtmlDocument
protected HtmlDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata)
-
-
Method Detail
-
ensureParsed
protected void ensureParsed() throws IOException
- Throws:
IOException
-
title
public CharSequence title()
Description copied from interface: Document
The title of this document.
- Returns:
the title to be used to refer to this document.
-
toString
public String toString()
- Overrides:
toString in class AbstractDocument
-
uri
public CharSequence uri()
Description copied from interface: Document
A URI that is associated with this document.
- Returns:
the URI associated with this document, or null.
-
content
public Object content(int field) throws IOException
Description copied from interface: Document
Returns the content of the given field.
- Parameters:
field - the field index.
- Returns:
the field content; the actual type depends on the field type, as specified by the DocumentFactory that built this document. For example, the returned object is going to be a Reader if the field type is DocumentFactory.FieldType.TEXT.
- Throws:
IOException
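Because content(int) is declared to return Object, callers have to check the runtime type before using the result. The following sketch shows one way a caller might handle a TEXT field, under the assumption that the field content is a java.io.Reader as stated above. The class FieldContentSketch and the helper readText are hypothetical and not part of MG4J; the sketch takes the already-obtained content object so that it runs without the library.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical caller-side helper; not part of MG4J. */
final class FieldContentSketch {
    /**
     * Drains the content of a TEXT field into a String.
     * Rejects any object that is not a Reader, since other field
     * types return other kinds of objects.
     */
    static String readText(Object content) {
        if (!(content instanceof Reader)) {
            throw new IllegalArgumentException("Not a TEXT field: " + String.valueOf(content));
        }
        Reader reader = (Reader) content;
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // A StringReader stands in for what document.content(field) might return.
        Object content = new StringReader("some field text");
        System.out.println(readText(content)); // some field text
    }
}
```

The helper deliberately does not close the reader; in this sketch, resource cleanup is assumed to be the responsibility of whoever closes the document.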
-
wordReader
public WordReader wordReader(int field)
Description copied from interface: Document
Returns a word reader for the given DocumentFactory.FieldType.TEXT field.
- Parameters:
field - the field index.
- Returns:
a word reader object that should be used to break the given field.
-
-