Java jSoup: get text from a <span> class

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/9728854/

Time: 2020-08-16 06:48:58  Source: igfitidea

Tags: java, parsing, jsoup

Asked by Jo S.

I have a part of the HTML file with the following format:

<h6 class="uiStreamMessage" data-ft="_____"> 
   <span class="messageBody" data-ft="____"> Welcome
   </span>
</h6>

The file contains spans with other classes, but I only want the text of ALL 'messageBody' spans, which will then be inserted into the database.

I've tried:

Elements links = doc.select("span.messageBody");
for (Element link : links) {
    String message = link.text();
    // code to insert message into the DB
}

and even

Elements links = doc.select("h6.uiStreamMessage span.messageBody");

Neither works. I couldn't find a solution anywhere else. Please help.

EDIT

I've realised it's a nested span within the html file:

<h6 class="uiStreamMessage" data-ft=""> 
   <span class="messageBody" data-ft="">Twisted<a href="http://"><span>http://</span>
   <span class="word_break"></span>www.tb.net/</a> Balloons
   </span>
</h6>

Only sometimes is there another span inside the 'messageBody' span. How do I get ALL the text within the 'messageBody' span?

Accepted answer by Rodri_gore

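String html = "<h6 class='uiStreamMessage' data-ft=''><span class='messageBody' data-ft=''>Twisted<a href='http://'><span>http://</span><span class='word_break'></span>www.tb.net/</a> Balloons</span></h6>";
Document doc = Jsoup.parse(html);
// Select span.messageBody elements that are direct children of h6.uiStreamMessage.
Elements elements = doc.select("h6.uiStreamMessage > span.messageBody");
for (Element e : elements) {
    // text(): combined text of the element and all of its descendants.
    System.out.println("All text:" + e.text());
    // ownText(): only the text owned directly by the span, excluding child elements.
    System.out.println("Only messageBody text:" + e.ownText());
}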

For the Facebook page https://www.facebook.com/pages/The-Nanyang-Chronicle/141387533074:

try {
    // timeout(0) is treated by jsoup as an infinite timeout.
    Document doc = Jsoup.connect("https://www.facebook.com/pages/The-Nanyang-Chronicle/141387533074").timeout(0).get();

    // This code assumes the stream markup sits inside HTML comments within code.hidden_elem elements.
    Elements elements = doc.select("code.hidden_elem");
    for (Element e : elements) {
        // Strip the comment markers so the inner markup can be parsed as regular HTML.
        String eHtml = e.html().replace("<!--", "").replace("-->", "");
        Document eWithoutComment = Jsoup.parse(eHtml);
        Elements elem = eWithoutComment.select("h6.uiStreamMessage > span.messageBody");
        for (Element eb : elem) {
            System.out.println(eb.text());
        }
    }
} catch (IOException ex) {
    System.err.println("Error:" + ex.getMessage());
}
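
The key step in the snippet above is removing the comment delimiters before re-parsing. A minimal sketch of just that step, using only the standard library, might look like the following. The class name, the helper stripCommentMarkers, and the sample string are made up for illustration and are not part of jsoup or of the original answer.

```java
public class CommentMarkerDemo {

    // Hypothetical helper (not part of jsoup): drops the HTML comment
    // delimiters but keeps everything that was between them.
    static String stripCommentMarkers(String html) {
        return html.replace("<!--", "").replace("-->", "");
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the inner HTML of a code.hidden_elem element.
        String hidden = "<!-- <h6 class=\"uiStreamMessage\">"
                + "<span class=\"messageBody\">Welcome</span></h6> -->";

        // After stripping, the markup is ordinary HTML that a parser can see.
        String visible = stripCommentMarkers(hidden).trim();
        System.out.println(visible);
    }
}
```

The resulting string is what would then be handed to Jsoup.parse, as the answer does. Note that this simple replacement also strips the delimiters of any genuine comments in the input, which is probably acceptable here but is worth keeping in mind.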

Answer by B. Anderson

Not sure why it's not working for you. Here is my code. It prints Welcome to the console.

String html = "<h6 class=\"uiStreamMessage\" data-ft=\"_____\">" + 
    "<span class=\"messageBody\" data-ft=\"____\"> Welcome</span>" +
    "</h6>";

Document doc = Jsoup.parse(html);
for (Element e : doc.select("span.messageBody")) {
    System.out.println(e.text());
}

This is essentially the same code you have, so there must be something else at play here.
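
One way to narrow down what that "something else" might be is to check where the target markup sits in the HTML the program actually received. In the accepted answer, for example, the markup is wrapped in HTML comments, so a selector never sees it. The following is a rough sketch using only the standard library; the class, the enum, the locate helper, and the sample strings are hypothetical, and the check only looks at the first occurrence of the search string.

```java
public class MarkupLocationCheck {

    enum Location { MISSING, VISIBLE, INSIDE_COMMENT }

    // Hypothetical helper: reports where the first occurrence of needle
    // sits in the raw HTML relative to HTML comment delimiters.
    static Location locate(String rawHtml, String needle) {
        int at = rawHtml.indexOf(needle);
        if (at < 0) {
            return Location.MISSING;
        }
        // Find the nearest comment open and close markers before the match.
        int lastOpen = rawHtml.lastIndexOf("<!--", at);
        int lastClose = rawHtml.lastIndexOf("-->", at);
        // An open marker that has not been closed yet means the match is commented out.
        return lastOpen > lastClose ? Location.INSIDE_COMMENT : Location.VISIBLE;
    }

    public static void main(String[] args) {
        String needle = "class=\"messageBody\"";

        // Made-up inputs covering the three cases.
        String plain = "<h6><span class=\"messageBody\">Welcome</span></h6>";
        String commented = "<code class=\"hidden_elem\"><!-- "
                + "<span class=\"messageBody\">Welcome</span> --></code>";
        String absent = "<p>nothing here</p>";

        System.out.println(locate(plain, needle));
        System.out.println(locate(commented, needle));
        System.out.println(locate(absent, needle));
    }
}
```

If the result is MISSING, the server probably sent different HTML than the browser shows (for instance, content filled in later by scripts). If it is INSIDE_COMMENT, stripping the comment markers and re-parsing, as in the accepted answer, is one option.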
