Class AnchorExtractor
java.lang.Object
  it.unimi.dsi.parser.callback.DefaultCallback
    it.unimi.di.big.mg4j.util.parser.callback.AnchorExtractor

All Implemented Interfaces:
Callback

public class AnchorExtractor extends DefaultCallback
A callback extracting anchor text. When instantiating the extractor, you can specify the maximum number of characters to be considered before the anchor, in the anchor, and after the anchor. For the anchor itself and for the text after it, just the first characters are taken into consideration; for the text before the anchor, just the last ones.

At the end of parsing, the result (the list of anchors) is available in anchors, whose elements provide the content of the href attribute and the text of the anchor together with the text around it. The text is, however, modified so that word fragments at the beginning of the pre-anchor context, or at the end of the post-anchor context, are cut away.

For example, a fragment like:
...foo fOO FOO FOO ANCHOR TEXT BAR BAR BAr bar...
(where the uppercase part represents the pre- and post-anchor context) generates the element
Anchor("xxx", "FOO FOO ANCHOR TEXT BAR BAR")
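
The windowing and trimming rule described above can be sketched in plain Java, without the library. This is only an illustration of the documented behavior, not the actual implementation of this class: the class name AnchorContextSketch, the method context, and the explicit anchorStart/anchorEnd offsets are hypothetical, and whitespace is assumed to be the only word separator.

```java
final class AnchorContextSketch {
    // Hypothetical helper: concatenates the trimmed pre-anchor context, the
    // (possibly truncated) anchor text, and the trimmed post-anchor context.
    // The anchor occupies text[anchorStart, anchorEnd).
    static String context(String text, int anchorStart, int anchorEnd,
                          int maxPreAnchor, int maxAnchor, int maxPostAnchor) {
        // Keep only the LAST maxPreAnchor characters before the anchor.
        int preStart = Math.max(0, anchorStart - maxPreAnchor);
        // If that window starts in the middle of a word, drop the fragment
        // and the whitespace that follows it.
        if (preStart > 0 && preStart < anchorStart
                && !Character.isWhitespace(text.charAt(preStart - 1))
                && !Character.isWhitespace(text.charAt(preStart))) {
            while (preStart < anchorStart && !Character.isWhitespace(text.charAt(preStart))) preStart++;
            while (preStart < anchorStart && Character.isWhitespace(text.charAt(preStart))) preStart++;
        }

        // Keep only the FIRST maxAnchor characters of the anchor text.
        int anchorStop = Math.min(anchorEnd, anchorStart + maxAnchor);

        // Keep only the FIRST maxPostAnchor characters after the anchor.
        int postEnd = Math.min(text.length(), anchorEnd + maxPostAnchor);
        // If that window ends in the middle of a word, drop the fragment
        // and the whitespace that precedes it.
        if (postEnd < text.length() && postEnd > anchorEnd
                && !Character.isWhitespace(text.charAt(postEnd - 1))
                && !Character.isWhitespace(text.charAt(postEnd))) {
            while (postEnd > anchorEnd && !Character.isWhitespace(text.charAt(postEnd - 1))) postEnd--;
            while (postEnd > anchorEnd && Character.isWhitespace(text.charAt(postEnd - 1))) postEnd--;
        }

        return text.substring(preStart, anchorStart)
                + text.substring(anchorStart, anchorStop)
                + text.substring(anchorEnd, postEnd);
    }

    public static void main(String[] args) {
        String text = "foo fOO FOO FOO ANCHOR TEXT BAR BAR BAr bar";
        // "ANCHOR TEXT" occupies offsets 16 (inclusive) to 27 (exclusive);
        // 11 characters of context are requested on each side.
        System.out.println(context(text, 16, 27, 11, 100, 11));
        // prints: FOO FOO ANCHOR TEXT BAR BAR
    }
}
```

With 11 characters of context on each side, the raw windows are "OO FOO FOO " and " BAR BAR BA"; the fragments "OO" and "BA" are cut away, which reproduces the text of the element shown in the example above.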

Nested Class Summary
Nested Classes
Modifier and Type    Class                     Description
static class         AnchorExtractor.Anchor    A class representing an anchor.

Field Summary
Fields
Modifier and Type                     Field      Description
ObjectList<AnchorExtractor.Anchor>    anchors    The resulting list of anchors.
static boolean                        DEBUG
static Logger                         LOGGER

Fields inherited from interface it.unimi.dsi.parser.callback.Callback
EMPTY_CALLBACK_ARRAY

Constructor Summary
Constructors
Constructor                                                                              Description
AnchorExtractor(int maxPreAnchor, int maxAnchor, int maxPostAnchor)                      Creates a new anchor extractor.
AnchorExtractor(int maxPreAnchor, int maxAnchor, int maxPostAnchor, String delimiter)    Creates a new anchor extractor.

Method Summary
Modifier and Type    Method                                                                       Description
boolean              characters(char[] characters, int offset, int length, boolean flowBroken)
void                 configure(BulletParser parser)
void                 endDocument()
boolean              endElement(Element element)
void                 startDocument()
boolean              startElement(Element element, Map<Attribute,MutableString> attrMap)

Methods inherited from class it.unimi.dsi.parser.callback.DefaultCallback
cdata, getInstance

Field Detail
LOGGER
public static final Logger LOGGER

DEBUG
public static final boolean DEBUG
See Also:
Constant Field Values

anchors
public final ObjectList<AnchorExtractor.Anchor> anchors
The resulting list of anchors.
Constructor Detail
AnchorExtractor
public AnchorExtractor(int maxPreAnchor, int maxAnchor, int maxPostAnchor)
Creates a new anchor extractor.
Parameters:
maxPreAnchor - maximum number of characters before an anchor.
maxAnchor - maximum number of characters in an anchor.
maxPostAnchor - maximum number of characters after an anchor.

AnchorExtractor
public AnchorExtractor(int maxPreAnchor, int maxAnchor, int maxPostAnchor, String delimiter)
Creates a new anchor extractor.
Parameters:
maxPreAnchor - maximum number of characters before an anchor.
maxAnchor - maximum number of characters in an anchor.
maxPostAnchor - maximum number of characters after an anchor.
delimiter - a token that will be inserted to delimit the anchor text, or null for no delimiter.

Method Detail
configure
public void configure(BulletParser parser)
Specified by:
configure in interface Callback
Overrides:
configure in class DefaultCallback

startDocument
public void startDocument()
Specified by:
startDocument in interface Callback
Overrides:
startDocument in class DefaultCallback

endDocument
public void endDocument()
Specified by:
endDocument in interface Callback
Overrides:
endDocument in class DefaultCallback

startElement
public boolean startElement(Element element, Map<Attribute,MutableString> attrMap)
Specified by:
startElement in interface Callback
Overrides:
startElement in class DefaultCallback

endElement
public boolean endElement(Element element)
Specified by:
endElement in interface Callback
Overrides:
endElement in class DefaultCallback

characters
public boolean characters(char[] characters, int offset, int length, boolean flowBroken)
Specified by:
characters in interface Callback
Overrides:
characters in class DefaultCallback