
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/6026615/

Date: 2020-10-30 14:01:53  Source: igfitidea

How to extract content from a <div> tag in Java

Tags: java, html, extract

Asked by kyo21

I have a serious problem. I would like to extract the content from tags such as:

<div class="main-content">
    <div class="sub-content">Sub content here</div>
      Main content here </div>

The output I would expect is:

Sub content here
Main content here


I've tried using a regex, but the result isn't very good. Using:

Pattern.compile("<div>(\\S+)</div>");

returns all the strings before the first </div> tag.
So, could anyone help me, please?

Answer by MarcoS

I'd recommend avoiding regex for parsing HTML. You can easily do what you ask by using Jsoup:


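import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void main(String[] args) {
    String html = "<html><head/><body><div class=\"main-content\">" +
            "<div class=\"sub-content\">Sub content here</div>" +
            "Main content here </div></body></html>";
    Document document = Jsoup.parse(html);
    // Select every <div>; the outer div comes before the nested one
    Elements divs = document.select("div");
    for (Element div : divs) {
        // ownText() is the text directly inside this element, without its children's text
        System.out.println(div.ownText());
    }
}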
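To make clearer what ownText() gives you (only the text sitting directly inside an element, not the text of its child elements), here is a minimal sketch of the same idea using just the JDK's built-in XML DOM parser instead of Jsoup. It assumes the markup is well-formed XML, which real-world HTML often is not, so it is an illustration rather than a replacement for Jsoup. The class and method names (OwnTextSketch, ownText, divTexts) are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class OwnTextSketch {
    // Hypothetical helper: concatenates only the direct text-node children of a node.
    public static String ownText(Node node) {
        StringBuilder sb = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    // Hypothetical helper: own text of every <div>, in document order.
    // Assumes the input is well-formed XML; fails otherwise.
    public static List<String> divTexts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList divs = doc.getElementsByTagName("div");
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < divs.getLength(); i++) {
                texts.add(ownText(divs.item(i)));
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div class=\"main-content\">"
                + "<div class=\"sub-content\">Sub content here</div>"
                + "Main content here </div>";
        for (String text : divTexts(xml)) {
            System.out.println(text);
        }
    }
}
```

As with the Jsoup version, the outer div is visited first, so its own text is printed before the nested div's text.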


In response to comment: if you want to put the content of the div elements into an array of Strings, you can simply do:

    String[] divsTexts = new String[divs.size()];
    for (int i = 0; i < divs.size(); i++) {
        divsTexts[i] = divs.get(i).ownText();
    }


In response to comment: if you have nested elements and you want to get the own text of each element, then you can use the jQuery-style multiple selector syntax. Here's an example:

public static void main(String[] args) {
    String html = "<html><head/><body><div class=\"main-content\">" +
            "<div class=\"sub-content\">" +
            "<p>a paragraph <b>with some bold text</b></p>" +
            "Sub content here</div>" +
            "Main content here </div></body></html>";
    Document document = Jsoup.parse(html);
    Elements divs = document.select("div, p, b");
    for (Element div : divs) {
        System.out.println(div.ownText());
    }
}

The code above will parse the following HTML:


<html>
<head />
<body>
<div class="main-content">
<div class="sub-content">
<p>a paragraph <b>with some bold text</b></p>
Sub content here</div>
Main content here</div>
</body>
</html>

and print the following output:


Main content here
Sub content here
a paragraph
with some bold text
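On the advice above to avoid regex for HTML: the sketch below, which uses only java.util.regex, shows a few ways the pattern from the question breaks down. It assumes the pattern was meant to be written with an escaped backslash ("\\S+") in Java source; the class and method names (RegexPitfall, firstMatch) are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexPitfall {
    // The pattern from the question, with the backslash escaped for a Java string literal.
    static final Pattern QUESTION_PATTERN = Pattern.compile("<div>(\\S+)</div>");

    // Returns the first captured group, or null when the pattern does not match.
    public static String firstMatch(String html) {
        Matcher m = QUESTION_PATTERN.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Matches only in the simplest case: no attributes, no whitespace in the content.
        System.out.println(firstMatch("<div>text</div>"));
        // No match: with a class attribute, the literal "<div>" never appears.
        System.out.println(firstMatch("<div class=\"sub-content\">Sub content here</div>"));
        // No match: \S+ cannot span the spaces inside the content.
        System.out.println(firstMatch("<div>Sub content here</div>"));
        // Nested divs: the greedy \S+ swallows the inner tags along with the text.
        System.out.println(firstMatch("<div><div>a</div>b</div>"));
    }
}
```

Patching the pattern for each of these cases tends to uncover another one, which is why a real HTML parser such as Jsoup is the more robust choice here.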

Answer by Ankit

<div class="main-content" id="mainCon">
    <div class="sub-content" id="subCon">Sub content here</div>
 Main content here </div>

To get the result you mentioned from this code, use document.getElementById("mainCon").innerHTML. It will give you "Main content here" along with the sub div, which you will then have to parse.

Similarly, for the sub div you can use the same kind of snippet, i.e. document.getElementById("subCon").innerHTML