Java Jsoup - extracting text

Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/10177867/

Time: 2020-10-30 23:56:47  Source: igfitidea

Jsoup - extracting text

Tags: java, iteration, jsoup, text-extraction

Asked by Eugene Retunsky

I need to extract text from a node like this:

<div>
    Some text <b>with tags</b> might go here.
    <p>Also there are paragraphs</p>
    More text can go without paragraphs<br/>
</div>

And I need to build:

Some text <b>with tags</b> might go here.
Also there are paragraphs
More text can go without paragraphs

Element.text returns all the content of the div as a single string. Element.ownText returns only the text that is not inside child elements. Neither is what I need. Iterating through children ignores text nodes.

Is there a way to iterate over the contents of an element so that text nodes are returned as well? E.g.:

  • Text node - Some text
  • Node <b> - with tags
  • Text node - might go here.
  • Node <p> - Also there are paragraphs
  • Text node - More text can go without paragraphs
  • Node <br> - <empty>

Answered by Vadim Ponomarev

Element.children() returns an Elements object, which is a list of Element objects. If you look at the parent class, Node, you'll see methods that give you access to arbitrary nodes, not just Elements, such as Node.childNodes().

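import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

// The wrapper class name is arbitrary; it only makes the snippet compilable.
public class ChildNodesExample {

    public static void main(String[] args) throws IOException {
        String str = "<div>" +
                "    Some text <b>with tags</b> might go here." +
                "    <p>Also there are paragraphs</p>" +
                "    More text can go without paragraphs<br/>" +
                "</div>";

        Document doc = Jsoup.parse(str);
        Element div = doc.select("div").first();
        int i = 0;

        // childNodes() yields text nodes and elements alike, in document order
        for (Node node : div.childNodes()) {
            i++;
            System.out.println(String.format("%d %s %s",
                    i,
                    node.getClass().getSimpleName(),
                    node.toString()));
        }
    }
}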

Result:

1 TextNode 
 Some text 
2 Element <b>with tags</b>
3 TextNode  might go here. 
4 Element <p>Also there are paragraphs</p>
5 TextNode  More text can go without paragraphs
6 Element <br/>
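
Walking the child nodes like this is enough to build the three lines the question asks for. Below is a minimal sketch of that idea, not the answerer's code. To keep it runnable without extra dependencies it uses the JDK's built-in org.w3c.dom parser instead of Jsoup, which assumes the fragment is well-formed XML. The names MixedContentLines, toLines and flush are made up for illustration, and the rule "p starts a new line, br ends a line, everything else stays inline as a tag" is an assumption about what the asker wants. With Jsoup the same walk would go over Node.childNodes().

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class MixedContentLines {

    /** Parses a well-formed fragment and returns one string per output line. */
    static List<String> toLines(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            return toLines(root);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
    }

    /** Walks the direct children of parent, text nodes included, in document order. */
    static List<String> toLines(Element parent) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() == Node.TEXT_NODE) {
                current.append(node.getNodeValue());
            } else if (node.getNodeType() == Node.ELEMENT_NODE) {
                String name = node.getNodeName();
                if (name.equals("p")) {
                    // assumed block element: its text goes on a line of its own
                    flush(current, lines);
                    current.append(node.getTextContent());
                    flush(current, lines);
                } else if (name.equals("br")) {
                    // line break: close the current line
                    flush(current, lines);
                } else {
                    // assumed inline element: keep the tag (attributes are ignored here)
                    current.append('<').append(name).append('>')
                           .append(node.getTextContent())
                           .append("</").append(name).append('>');
                }
            }
        }
        flush(current, lines);
        return lines;
    }

    /** Collapses whitespace, stores the line if it is not empty, and resets the buffer. */
    static void flush(StringBuilder current, List<String> lines) {
        String line = current.toString().replaceAll("\\s+", " ").trim();
        if (!line.isEmpty()) {
            lines.add(line);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        String xml = "<div>"
                + "    Some text <b>with tags</b> might go here."
                + "    <p>Also there are paragraphs</p>"
                + "    More text can go without paragraphs<br/>"
                + "</div>";
        for (String line : toLines(xml)) {
            System.out.println(line);
        }
    }
}
```

For the div from the question this prints the three requested lines, with the b tag kept inline on the first one. Nested block elements are not handled; that would need a recursive version of the same walk.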

Answered by Charles

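for (Element el : doc.select("body").select("*")) {
    // textNodes() returns only the text nodes that are direct children of el
    for (TextNode node : el.textNodes()) {
        // Assumption: the source shows only the truncated tail "node.text() ));",
        // so printing each node's text is a guess at the original statement.
        System.out.println(node.text());
    }
}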
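
One thing to watch with a per-element loop like the one above: each element contributes only its own direct text nodes, so the text comes out grouped by element rather than in reading order. The sketch below shows that effect. It uses the JDK's org.w3c.dom parser rather than Jsoup so that it runs without dependencies, which assumes well-formed input. The names PerElementTextOrder and directTextPerElement are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class PerElementTextOrder {

    /** Collects the direct text children of every element, one element at a time. */
    static List<String> directTextPerElement(String xml) {
        List<String> texts = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // "*" matches every element, returned in document order
            NodeList elements = doc.getElementsByTagName("*");
            for (int i = 0; i < elements.getLength(); i++) {
                NodeList children = elements.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    // only direct text children count; nested elements get their own turn
                    if (child.getNodeType() == Node.TEXT_NODE) {
                        String text = child.getNodeValue().trim();
                        if (!text.isEmpty()) {
                            texts.add(text);
                        }
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
        return texts;
    }

    public static void main(String[] args) {
        String xml = "<div>"
                + "    Some text <b>with tags</b> might go here."
                + "    <p>Also there are paragraphs</p>"
                + "    More text can go without paragraphs<br/>"
                + "</div>";
        directTextPerElement(xml).forEach(System.out::println);
    }
}
```

For the div from the question, the three pieces of text that belong directly to the div come first, followed by "with tags" and then "Also there are paragraphs". If the original interleaving matters, walking the child nodes in order is the safer route.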

Answered by John Zoetebier

Assuming you want text only (no tags), my solution is below.
The output is:
Some text with tags might go here. Also there are paragraphs. More text can go without paragraphs

import java.io.IOException;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

// The wrapper class name is arbitrary; it only makes the snippet compilable.
public class StripTagsExample {

    public static void main(String[] args) throws IOException {
        String str =
                    "<div>"
                +   "    Some text <b>with tags</b> might go here."
                +   "    <p>Also there are paragraphs.</p>"
                +   "    More text can go without paragraphs<br/>"
                +   "</div>";

        Document doc = Jsoup.parse(str);
        Element div = doc.select("div").first();
        StringBuilder builder = new StringBuilder();
        stripTags(builder, div.childNodes());
        System.out.println("Text without tags: " + builder.toString());
    }

    /**
     * Strip tags from a List of type <code>Node</code>
     * @param builder StringBuilder : input and output
     * @param nodesList List of type <code>Node</code>
     */
    public static void stripTags(StringBuilder builder, List<Node> nodesList) {

        for (Node node : nodesList) {
            String nodeName = node.nodeName();

            if (nodeName.equalsIgnoreCase("#text")) {
                // text node: append its content
                builder.append(node.toString());
            } else {
                // element: recurse into its children
                stripTags(builder, node.childNodes());
            }
        }
    }
}

Answered by Haydar Ghasemi

You can use TextNode for this purpose:

List<TextNode> bodyTextNode = doc.getElementById("content").textNodes();
String html = "";
for (TextNode txNode : bodyTextNode) {
    html += txNode.text();
}