java 如何使用 Jsoup 搜索评论(“<!-- -->”)?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/7541843/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to search for comments ("<!-- -->") using Jsoup?
提问by 87element
I would like to remove those tags with their content from source HTML.
我想从源 HTML 中删除那些带有内容的标签。
回答by dlamblin
When searching you basically use Elements.select(selector)
where selector
is defined by this API. However comments are not elements technically, so you may be confused here, still they are nodes identified by the node name #comment
.
搜索时,您基本上使用Elements.select(selector)
where selector
is defined by this API。然而,注释在技术上不是元素,所以你可能会在这里感到困惑,它们仍然是由节点名称标识的节点#comment
。
Let's see how that might work:
让我们看看它是如何工作的:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
public class RemoveComments {
public static void main(String... args) {
String h = "<html><head></head><body>" +
"<div><!-- foo --><p>bar<!-- baz --></div><!--qux--></body></html>";
Document doc = Jsoup.parse(h);
removeComments(doc);
doc.html(System.out);
}
private static void removeComments(Node node) {
for (int i = 0; i < node.childNodeSize();) {
Node child = node.childNode(i);
if (child.nodeName().equals("#comment"))
child.remove();
else {
removeComments(child);
i++;
}
}
}
}
回答by Michael Conrad
With JSoup 1.11+ (possibly older version) you can apply a filter:
使用 JSoup 1.11+(可能是旧版本),您可以应用过滤器:
private void removeComments(Element article) {
article.filter(new NodeFilter() {
@Override
public FilterResult tail(Node node, int depth) {
if (node instanceof Comment) {
return FilterResult.REMOVE;
}
return FilterResult.CONTINUE;
}
@Override
public FilterResult head(Node node, int depth) {
if (node instanceof Comment) {
return FilterResult.REMOVE;
}
return FilterResult.CONTINUE;
}
});
}
回答by byte mamba
reference @dlamblin https://stackoverflow.com/a/7541875/4712855this code get comment html
参考@dlamblin https://stackoverflow.com/a/7541875/4712855此代码获取评论html
public static void getHtmlComments(Node node) {
for (int i = 0; i < node.childNodeSize();i++) {
Node child = node.childNode(i);
if (child.nodeName().equals("#comment")) {
Comment comment = (Comment) child;
child.after(comment.getData());
child.remove();
}
else {
getHtmlComments(child);
}
}
}
回答by Feuerrabe
This is a variation of the first example using a functional programming approach. The easiest way to find all comments, which are immediate children of the current node is to use .filter()
on a stream of .childNodes()
这是使用函数式编程方法的第一个示例的变体。查找所有评论(当前节点的直接子节点)的最简单方法是.filter()
在.childNodes()
public void removeComments(Element e) {
e.childNodes().stream()
.filter(n -> n.nodeName().equals("#comment")).collect(Collectors.toList())
.forEach(n -> n.remove());
e.children().forEach(elem -> removeComments(elem));
}
Full example:
完整示例:
package demo;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.stream.Collectors;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class Demo {
public static void removeComments(Element e) {
e.childNodes().stream()
.filter(n -> n.nodeName().equals("#comment")).collect(Collectors.toList())
.forEach(n -> n.remove());
e.children().forEach(elem -> removeComments(elem));
}
public static void main(String[] args) throws MalformedURLException, IOException {
Document doc = Jsoup.parse(new URL("https://en.wikipedia.org/"), 500);
// do not try this with JDK < 8
String userHome = System.getProperty("user.home");
PrintStream out = new PrintStream(new FileOutputStream(userHome + File.separator + "before.html"));
out.print(doc.outerHtml());
out.close();
removeComments(doc);
out = new PrintStream(new FileOutputStream(userHome + File.separator + "after.html"));
out.print(doc.outerHtml());
out.close();
}
}
}