Package org.htmlcleaner
Class HtmlCleaner
java.lang.Object
org.htmlcleaner.HtmlCleaner
Main HtmlCleaner class.
It represents the public interface to the user. Its task is to call the tokenizer with the specified source HTML, traverse the produced token list and create the internal object model. It also offers a set of methods to write the resulting XML to a string, file or any output stream.
Typical usage is the following:
// create an instance of HtmlCleaner
HtmlCleaner cleaner = new HtmlCleaner();
// take default cleaner properties
CleanerProperties props = cleaner.getProperties();
// customize cleaner's behavior with property setters
props.setXXX(...);
// Clean HTML taken from simple string, file, URL, input stream,
// input source or reader. Result is root node of created
// tree-like structure. Single cleaner instance may be safely used
// multiple times.
TagNode node = cleaner.clean(...);
// optionally find parts of the DOM or modify some nodes
TagNode[] myNodes = node.getElementsByXXX(...);
// and/or
Object[] myObjects = node.evaluateXPath(xPathExpression);
// and/or
aNode.removeFromTree();
// and/or
aNode.addAttribute(attName, attValue);
// and/or
aNode.removeAttribute(attName);
// and/or
cleaner.setInnerHtml(aNode, htmlContent);
// and/or do some other tree manipulation/traversal
// serialize a node to a file, output stream, DOM, JDom...
new XXXSerializer(props).writeXmlXXX(aNode, ...);
myJDom = new JDomSerializer(props, true).createJDom(aNode);
myDom = new DomSerializer(props, true).createDOM(aNode);
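A concrete variant of the typical usage might look like the sketch below. It assumes the HtmlCleaner 2.x jar is on the classpath; the class name TypicalUsageExample, the helper extractLinks and the sample markup are illustrative and not part of the library. setOmitComments and getElementsByName stand in for the setXXX and getElementsByXXX placeholders.

```java
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

class TypicalUsageExample {

    /** Cleans the markup and returns the href value of every "a" element found. */
    static String[] extractLinks(String html) {
        HtmlCleaner cleaner = new HtmlCleaner();

        // Customize the default properties before cleaning.
        CleanerProperties props = cleaner.getProperties();
        props.setOmitComments(true);

        // The result is the root node of the cleaned tree.
        TagNode root = cleaner.clean(html);

        // Recursively collect all anchor elements below the root.
        TagNode[] anchors = root.getElementsByName("a", true);
        String[] hrefs = new String[anchors.length];
        for (int i = 0; i < anchors.length; i++) {
            hrefs[i] = anchors[i].getAttributeByName("href");
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Deliberately sloppy input: unclosed tags and a comment.
        String html = "<p>See the <a href=\"/docs\">documentation<!-- note -->";
        for (String href : extractLinks(html)) {
            System.out.println(href);
        }
    }
}
```

Because a single cleaner instance may be used multiple times, a real application would likely create the HtmlCleaner once and reuse it rather than constructing one per call as this helper does.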
-
Field Summary
Fields
static int HTML_4
static int HTML_5
-
Constructor Summary
Constructors
HtmlCleaner()
    Constructor - creates cleaner instance with default tag info provider, default version and default properties.
HtmlCleaner(CleanerProperties properties)
    Constructor - creates the instance with default tag info provider and specified properties.
HtmlCleaner(ITagInfoProvider tagInfoProvider)
    Constructor - creates the instance with specified tag info provider and default properties.
HtmlCleaner(ITagInfoProvider tagInfoProvider, CleanerProperties properties)
    Constructor - creates the instance with specified tag info provider and specified properties.
-
Method Summary
protected void addPruneNode(TagNode node, org.htmlcleaner.CleanTimeValues cleanTimeValues)
TagNode clean(InputStream in)
TagNode clean(InputStream in, String charset)
protected TagNode clean(Reader reader, org.htmlcleaner.CleanTimeValues cleanTimeValues)
    Basic version of the cleaning call.
TagNode clean(URL url)
    Deprecated.
TagNode clean(URL url, String charset)
    Deprecated.
protected Set<ITagNodeCondition> getAllowTagSet(org.htmlcleaner.CleanTimeValues cleanTimeValues)
getAllTags(org.htmlcleaner.CleanTimeValues cleanTimeValues)
String getInnerHtml(TagNode node)
    For the specified node, returns its content as a string.
protected Set<ITagNodeCondition> getPruneTagSet(org.htmlcleaner.CleanTimeValues cleanTimeValues)
TagInfo getTagInfo(String tagName, org.htmlcleaner.CleanTimeValues cleanTimeValues)
    Returns a TagInfo object for the specified tag name.
protected void handleInterruption()
    Called whenever the thread is interrupted.
void initCleanerTransformations(Map transInfos)
protected boolean isRemovingNodeReasonablySafe(TagNode startTagToken)
void setInnerHtml(TagNode node, String content)
    For the specified tag node, defines its HTML content.
-
Field Details
-
HTML_4
public static int HTML_4
-
HTML_5
public static int HTML_5
-
-
Constructor Details
-
HtmlCleaner
public HtmlCleaner()
Constructor - creates cleaner instance with default tag info provider, default version and default properties.
-
HtmlCleaner
public HtmlCleaner(ITagInfoProvider tagInfoProvider)
Constructor - creates the instance with specified tag info provider and default properties.
- Parameters:
tagInfoProvider - Provider for tag filtering and balancing
-
HtmlCleaner
public HtmlCleaner(CleanerProperties properties)
Constructor - creates the instance with default tag info provider and specified properties.
- Parameters:
properties - Properties used during parsing and serializing
-
HtmlCleaner
public HtmlCleaner(ITagInfoProvider tagInfoProvider, CleanerProperties properties)
Constructor - creates the instance with specified tag info provider and specified properties.
- Parameters:
tagInfoProvider - Provider for tag filtering and balancing
properties - Properties used during parsing and serializing
-
-
Method Details
-
clean
-
clean
- Throws:
IOException
-
clean
- Throws:
IOException
-
clean
Deprecated. Unmanaged network I/O does not handle proxies, slow servers or broken connections well. The HtmlCleaner caller should manage the connections itself and just provide the HtmlCleaner library with a stream.
- Parameters:
url
charset
- Throws:
IOException
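The deprecation note asks callers to manage the network connection and hand the library only a stream. One way to do that is sketched below, assuming Java 11+ and its java.net.http client, which is not part of HtmlCleaner. The class name ManagedFetch, its helper methods and the chosen redirect policy are illustrative assumptions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class ManagedFetch {

    /** Builds a client that gives up on connections that cannot be established in time. */
    static HttpClient newClient(Duration connectTimeout) {
        return HttpClient.newBuilder()
                .connectTimeout(connectTimeout)
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    /** Builds a GET request that fails if the server is too slow to respond. */
    static HttpRequest buildRequest(String url, Duration responseTimeout) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(responseTimeout)
                .GET()
                .build();
    }

    /** Opens the response body as a stream; the caller is responsible for closing it. */
    static InputStream open(HttpClient client, HttpRequest request)
            throws IOException, InterruptedException {
        HttpResponse<InputStream> response =
                client.send(request, HttpResponse.BodyHandlers.ofInputStream());
        if (response.statusCode() / 100 != 2) {
            response.body().close();
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }
}
```

The stream returned by open could then be passed to clean(InputStream in) or clean(InputStream in, String charset) and closed by the caller afterwards, for example in a try-with-resources block.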
-
clean
Deprecated. Creates instance from the content downloaded from the specified URL. The HTML encoding is resolved by trying the following in sequence: 1. reading the Content-Type response header, 2. analyzing META tags at the beginning of the HTML, 3. using the platform's default charset.
- Parameters:
url
- Throws:
IOException
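The three-step encoding lookup described for this method can be sketched in plain Java as follows. This is an illustration of the described sequence only, not HtmlCleaner's actual implementation; the class name CharsetResolutionSketch, the regular expressions and the idea of inspecting a byte prefix of the document are all assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetResolutionSketch {

    // Matches charset=NAME, with an optional quote before the name.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9][\\w.:+-]*)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /**
     * @param contentTypeHeader value of the Content-Type response header, may be null
     * @param head              the first bytes of the document
     */
    static Charset resolve(String contentTypeHeader, byte[] head) {
        // 1. Content-Type response header
        Charset fromHeader = find(contentTypeHeader);
        if (fromHeader != null) {
            return fromHeader;
        }
        // 2. META tags at the beginning of the HTML. ISO-8859-1 maps every byte
        //    to a character, so ASCII markup survives whatever the real encoding is.
        String prefix = new String(head, StandardCharsets.ISO_8859_1);
        Matcher meta = META_TAG.matcher(prefix);
        while (meta.find()) {
            Charset fromMeta = find(meta.group());
            if (fromMeta != null) {
                return fromMeta;
            }
        }
        // 3. Platform default
        return Charset.defaultCharset();
    }

    /** Returns the first supported charset named in the text, or null if there is none. */
    private static Charset find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        while (m.find()) {
            String name = m.group(1);
            if (Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        }
        return null;
    }
}
```

A caller that fetches the document itself could use a helper like this to choose the charset argument for clean(InputStream in, String charset).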
-
clean
- Throws:
IOException
-
clean
- Throws:
IOException
-
clean
- Throws:
IOException
-
clean
protected TagNode clean(Reader reader, org.htmlcleaner.CleanTimeValues cleanTimeValues) throws IOException
Basic version of the cleaning call.
- Parameters:
reader - (not closed)
- Returns:
- An instance of TagNode object which is the root of the XML tree.
- Throws:
IOException
-
isRemovingNodeReasonablySafe
- Parameters:
startTagToken
- Returns:
- true if the node has neither an id attribute nor a class attribute
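Read literally, the documented return value amounts to a check like the one below. This is a minimal sketch that models a node's attributes as a plain map. The class name SafeRemovalSketch and the map-based signature are assumptions, and the real method takes a TagNode and may apply further criteria.

```java
import java.util.Map;

class SafeRemovalSketch {

    /**
     * Returns true when the attribute map contains neither "id" nor "class",
     * the two attributes named in the method's documentation.
     */
    static boolean isRemovingReasonablySafe(Map<String, String> attributes) {
        return !attributes.containsKey("id") && !attributes.containsKey("class");
    }
}
```

Since the method is protected, a subclass of HtmlCleaner could presumably override it to apply a stricter or looser rule.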
-
getProperties
-
getPruneTagSet
-
getAllowTagSet
-
addPruneNode
-
getTagInfo
Returns a TagInfo object for the specified tag name. If the tag is foreign markup, we leave it as null, because we may get name clashes, e.g. svg:title. However, we do handle the tag if it is embedded content within the correct namespace (e.g. SVG, MathML).
- Parameters:
tagName
cleanTimeValues
- Returns:
- a TagInfo object, or null if no matching TagInfo is found
-
getAllTags
-
getTagInfoProvider
- Returns:
- ITagInfoProvider instance for this HtmlCleaner
-
getTransformations
- Returns:
- Transformations defined for this instance of cleaner
-
getInnerHtml
For the specified node, returns its content as a string.
- Parameters:
node
- Returns:
- node's content as string
-
setInnerHtml
For the specified tag node, defines its HTML content. This causes the cleaner to re-clean the given HTML portion and insert it inside the node in place of the previous content.
- Parameters:
node
content
-
initCleanerTransformations
- Parameters:
transInfos
-
-
handleInterruption
protected void handleInterruption()
Called whenever the thread is interrupted. Currently this is a placeholder, but it could hold cleanup methods and user interaction.
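The general pattern behind such a hook can be sketched without the library: a long-running loop checks the thread's interrupt flag and calls an overridable method before stopping. The class InterruptibleWorker, its process method and the counter below are hypothetical and only illustrate the idea; they do not show how HtmlCleaner itself detects interruption.

```java
import java.util.List;

class InterruptibleWorker {

    /** Number of times the hook has run; kept only so the behavior is observable. */
    int interruptionsHandled = 0;

    /** Overridable hook in the spirit of handleInterruption(); could hold cleanup code. */
    protected void handleInterruption() {
        interruptionsHandled++;
    }

    /** Processes items until done or until the current thread is interrupted. */
    int process(List<String> items) {
        int processed = 0;
        for (String item : items) {
            // isInterrupted() reads the flag without clearing it,
            // so callers further up the stack can still see it.
            if (Thread.currentThread().isInterrupted()) {
                handleInterruption();
                break;
            }
            processed++;
        }
        return processed;
    }
}
```

With HtmlCleaner itself, the equivalent would presumably be a subclass that overrides the protected handleInterruption() method.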
-