java.lang.Object
  it.unimi.di.mg4j.document.AbstractDocumentFactory
    it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
      it.unimi.di.mg4j.document.HtmlDocumentFactory

public class HtmlDocumentFactory
extends PropertyBasedDocumentFactory

A factory that provides fields for the body and title of HTML documents.
Internally it uses a BulletParser.
A default encoding can be provided
using the property PropertyBasedDocumentFactory.MetadataKeys.ENCODING.

By default, the WordReader provided by this factory
is just a FastBufferedReader, but you can specify
an alternative word reader using the property
PropertyBasedDocumentFactory.MetadataKeys.WORDREADER.

| Nested Class Summary | |
|---|---|
| protected class | HtmlDocumentFactory.HtmlDocument: An HTML document. |
| static class | HtmlDocumentFactory.MetadataKeys |

| Nested classes/interfaces inherited from interface it.unimi.di.mg4j.document.DocumentFactory |
|---|
| DocumentFactory.FieldType |

| Field Summary |
|---|
| Fields inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory |
|---|
| defaultMetadata |

| Constructor Summary | |
|---|---|
| HtmlDocumentFactory() | |
| HtmlDocumentFactory(Properties properties) | |
| HtmlDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata) | |
| HtmlDocumentFactory(String[] property) | |

| Method Summary | |
|---|---|
| HtmlDocumentFactory | copy(): Returns a copy of this document factory. |
| int | fieldIndex(String fieldName): Returns the index of a field, given its symbolic name. |
| String | fieldName(int field): Returns the symbolic name of a field. |
| DocumentFactory.FieldType | fieldType(int field): Returns the type of a field. |
| Document | getDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata): Returns the document obtained by parsing the given byte stream. |
| int | numberOfFields(): Returns the number of fields present in the documents produced by this factory. |
| protected boolean | parseProperty(String key, String[] values, Reference2ObjectMap<Enum<?>,Object> metadata): Parses a property with given key and value, adding it to the given map. |

| Methods inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory |
|---|
| ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey |

| Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentFactory |
|---|
| ensureFieldIndex, toString |

| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |

| Constructor Detail |
|---|
public HtmlDocumentFactory(Properties properties)
                    throws ConfigurationException
Throws:
    ConfigurationException

public HtmlDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)

public HtmlDocumentFactory(String[] property)
                    throws ConfigurationException
Throws:
    ConfigurationException

public HtmlDocumentFactory()

| Method Detail |
|---|
protected boolean parseProperty(String key,
                                String[] values,
                                Reference2ObjectMap<Enum<?>,Object> metadata)
                         throws ConfigurationException
Description copied from class: PropertyBasedDocumentFactory
Parses a property with given key and value, adding it to the given map.
Currently this implementation just parses the PropertyBasedDocumentFactory.MetadataKeys.LOCALE property.
Subclasses should do their own parsing, returning true in case of success and
returning super.parseProperty() otherwise.
Overrides:
    parseProperty in class PropertyBasedDocumentFactory
Parameters:
    key - the property key.
    values - the property value; this is an array, because properties may have a list of comma-separated values.
    metadata - the metadata map.
Throws:
    ConfigurationException

public HtmlDocumentFactory copy()
Returns a copy of this document factory.

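The delegation contract described for parseProperty() (handle the keys you know, return true, and otherwise fall back to super.parseProperty()) can be illustrated with a minimal sketch. The classes below are hypothetical stand-ins written for this example, not the MG4J classes: a plain Map replaces Reference2ObjectMap, the key names "locale" and "maxlength" are invented, and the "key=value" input format is an assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a base factory; NOT the MG4J implementation.
class BaseFactory {
    /** Handles the "locale" key only and reports false for any other key. */
    protected boolean parseProperty(String key, String[] values, Map<String, Object> metadata) {
        if ("locale".equalsIgnoreCase(key)) {
            metadata.put("locale", Locale.forLanguageTag(single(key, values)));
            return true;
        }
        return false;
    }

    /** Returns the only value of a property, rejecting empty or multiple values. */
    protected static String single(String key, String[] values) {
        if (values.length != 1) {
            throw new IllegalArgumentException("Property " + key + " needs exactly one value");
        }
        return values[0];
    }
}

// Hypothetical subclass: parses its own key, delegates everything else upwards.
class MaxLengthFactory extends BaseFactory {
    @Override
    protected boolean parseProperty(String key, String[] values, Map<String, Object> metadata) {
        if ("maxlength".equalsIgnoreCase(key)) {
            metadata.put("maxlength", Integer.valueOf(single(key, values)));
            return true;
        }
        return super.parseProperty(key, values, metadata);
    }
}

class ParsePropertyDemo {
    /** Parses "key=value[,value...]" strings (an assumed format) into a metadata map. */
    static Map<String, Object> parse(String... properties) {
        MaxLengthFactory factory = new MaxLengthFactory();
        Map<String, Object> metadata = new HashMap<>();
        for (String property : properties) {
            int eq = property.indexOf('=');
            if (eq < 0) throw new IllegalArgumentException("Missing '=' in " + property);
            String key = property.substring(0, eq).trim();
            // Values form an array because a property may list several comma-separated values.
            String[] values = property.substring(eq + 1).split(",");
            if (!factory.parseProperty(key, values, metadata)) {
                throw new IllegalArgumentException("Unknown property " + key);
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(parse("maxlength=1024", "locale=it-IT"));
    }
}
```

In this sketch a key that no class in the chain recognizes surfaces as an exception in the caller; how the real factory reports unknown keys is not stated on this page.
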
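public int numberOfFields()
Description copied from interface: DocumentFactory
Returns the number of fields present in the documents produced by this factory.
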
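public String fieldName(int field)
Description copied from interface: DocumentFactory
Returns the symbolic name of a field.
Parameters:
    field - the index of a field (between 0 inclusive and DocumentFactory.numberOfFields() exclusive).
Returns:
    the symbolic name of the field-th field.

public int fieldIndex(String fieldName)
Description copied from interface: DocumentFactory
Returns the index of a field, given its symbolic name.
Parameters:
    fieldName - the name of a field of this factory.
Returns:
    the index of the field named fieldName.

public DocumentFactory.FieldType fieldType(int field)
Description copied from interface: DocumentFactory
Returns the type of a field. The possible types are defined in DocumentFactory.FieldType.
Parameters:
    field - the index of a field (between 0 inclusive and DocumentFactory.numberOfFields() exclusive).
Returns:
    the type of the field-th field.
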
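The relationship between numberOfFields(), fieldName(), fieldIndex() and fieldType() can be sketched with a small stand-in class. Everything here is hypothetical and is not the MG4J implementation: the field names "body" and "title" and their order are assumptions based on the class description, the one-value FieldType enum is a placeholder, and returning -1 for an unknown name is a choice made for this sketch only.

```java
// Hypothetical two-field factory modelling the index/name contract; NOT MG4J code.
class TwoFieldFactory {
    /** Placeholder for a field-type enum; the real set of types is not listed here. */
    enum FieldType { TEXT }

    // Assumed field names and order.
    private static final String[] NAMES = { "body", "title" };

    int numberOfFields() {
        return NAMES.length;
    }

    /** Valid indices run from 0 inclusive to numberOfFields() exclusive. */
    String fieldName(int field) {
        ensureFieldIndex(field);
        return NAMES[field];
    }

    /** Inverse of fieldName(); -1 for an unknown name is an assumption of this sketch. */
    int fieldIndex(String fieldName) {
        for (int i = 0; i < NAMES.length; i++) {
            if (NAMES[i].equals(fieldName)) return i;
        }
        return -1;
    }

    FieldType fieldType(int field) {
        ensureFieldIndex(field);
        return FieldType.TEXT;
    }

    private void ensureFieldIndex(int field) {
        if (field < 0 || field >= numberOfFields()) {
            throw new IndexOutOfBoundsException("Field index out of range: " + field);
        }
    }
}

class FieldContractDemo {
    public static void main(String[] args) {
        TwoFieldFactory factory = new TwoFieldFactory();
        // Enumerate every field by index, then map each name back to its index.
        for (int i = 0; i < factory.numberOfFields(); i++) {
            String name = factory.fieldName(i);
            System.out.println(i + " " + name + " " + factory.fieldType(i)
                + " " + factory.fieldIndex(name));
        }
    }
}
```

The loop in main shows the usual way to walk the fields: indices are the primary handle, and names are resolved to indices once rather than on every access.
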
public Document getDocument(InputStream rawContent,
                            Reference2ObjectMap<Enum<?>,Object> metadata)
                     throws IOException
Description copied from interface: DocumentFactory
Returns the document obtained by parsing the given byte stream.
The parameter metadata compensates for the lack of a simple keyword-based
parameter-passing system in Java. This method might take several different types of “suggestions”
which have been collected by the collection: typically, the document title, a URI representing
the document, its MIME type, its encoding and so on. Some of this information might be
set by default (as happens, for instance, in a PropertyBasedDocumentFactory).
Implementations of this method must consult the metadata provided by the collection, possibly
completing them with default factory metadata, and then proceed to the document construction.
Parameters:
    rawContent - the raw content from which the document should be extracted; it must not be closed, as resource management is a responsibility of the DocumentCollection.
    metadata - a map from enums (e.g., keys taken in PropertyBasedDocumentFactory) to various kinds of objects.
Throws:
    IOException

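The rule that an implementation consults the per-document metadata first and completes it with the factory defaults can be sketched as follows. This is a hypothetical model, not the MG4J code: the Keys enum, the EnumMap-based maps and the helper names are invented for the example (resolve only borrows the name of a method listed among the inherited ones), and the sketch returns the decoded text instead of a Document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.EnumMap;
import java.util.Map;

class MetadataDemo {
    /** Hypothetical metadata keys, standing in for an enum such as MetadataKeys. */
    enum Keys { TITLE, URI, ENCODING }

    /** Per-document metadata wins; the factory defaults fill the gaps. */
    static Object resolve(Keys key, Map<Keys, Object> metadata, Map<Keys, Object> defaults) {
        Object value = metadata.get(key);
        return value != null ? value : defaults.get(key);
    }

    /** Decodes the raw bytes using the resolved encoding, leaving the stream open. */
    static String readAll(InputStream rawContent, Map<Keys, Object> metadata,
                          Map<Keys, Object> defaults) throws IOException {
        String encoding = (String) resolve(Keys.ENCODING, metadata, defaults);
        if (encoding == null) throw new IllegalArgumentException("No encoding available");
        // rawContent is deliberately not closed: the caller owns the stream.
        Reader reader = new InputStreamReader(rawContent, encoding);
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        for (int n; (n = reader.read(buffer)) != -1; ) {
            text.append(buffer, 0, n);
        }
        return text.toString();
    }

    /** Convenience wrapper: documentEncoding may be null to fall back to the default. */
    static String decode(byte[] raw, String documentEncoding, String defaultEncoding) {
        Map<Keys, Object> defaults = new EnumMap<>(Keys.class);
        defaults.put(Keys.ENCODING, defaultEncoding);
        Map<Keys, Object> metadata = new EnumMap<>(Keys.class);
        if (documentEncoding != null) metadata.put(Keys.ENCODING, documentEncoding);
        try {
            return readAll(new ByteArrayInputStream(raw), metadata, defaults);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "caff\u00e8".getBytes(java.nio.charset.StandardCharsets.ISO_8859_1);
        // No per-document encoding: the factory default is used.
        System.out.println(decode(latin1, null, "ISO-8859-1"));
    }
}
```

Keeping the fallback in a single resolve step means every metadata key follows the same precedence, so a collection can override any default without the factory needing special cases.
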