Java: how to extract content from a <div> tag
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/6026615/
How to extract content from a <div> tag in Java
Asked by kyo21
I have a serious problem. I would like to extract the content from tags such as:
<div class="main-content">
<div class="sub-content">Sub content here</div>
Main content here </div>
The output I would expect is:
Sub content here
Main content here
I've tried using regex, but the result isn't very good. Using:
Pattern.compile("<div>(\\S+)</div>");
returns all the strings before the first </div> tag.
So, could anyone help me, please?
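For reference, here is a minimal sketch of how such patterns behave on the markup from the question. It uses only java.util.regex. The class name DivRegexDemo, the helper captures, and the second, more permissive pattern are hypothetical and only for illustration; they are not from the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DivRegexDemo {
    // Returns every group-1 capture of the regex found in the input.
    static List<String> captures(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<div class=\"main-content\">"
                + "<div class=\"sub-content\">Sub content here</div>"
                + "Main content here </div>";

        // The literal "<div>" never matches "<div class=...>", and \S+ cannot span spaces.
        System.out.println(captures("<div>(\\S+)</div>", html)); // prints []

        // Hypothetical variant: allow attributes and any text without "<".
        // It only finds the innermost div; the outer div's text is lost.
        System.out.println(captures("<div[^>]*>([^<]*)</div>", html)); // prints [Sub content here]
    }
}
```

Even the relaxed pattern cannot recover "Main content here", because regular expressions have no notion of nested elements.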
Answered by MarcoS
I'd recommend avoiding regex for parsing HTML. You can easily do what you ask with Jsoup:
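import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DivOwnText {
    public static void main(String[] args) {
        String html = "<html><head/><body><div class=\"main-content\">" +
                "<div class=\"sub-content\">Sub content here</div>" +
                "Main content here </div></body></html>";
        Document document = Jsoup.parse(html);
        // Select every div; ownText() returns only the text directly inside
        // each element, without the text of its child elements.
        Elements divs = document.select("div");
        for (Element div : divs) {
            System.out.println(div.ownText());
        }
    }
}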
In response to a comment: if you want to put the content of the div elements into an array of Strings, you can simply do:
// One array slot per selected div, filled with that div's own text.
String[] divsTexts = new String[divs.size()];
for (int i = 0; i < divs.size(); i++) {
    divsTexts[i] = divs.get(i).ownText();
}
In response to a comment: if you have nested elements and you want to get the own text of each element, then you can use the jQuery-style multiple selector syntax. Here's an example:
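import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class NestedOwnText {
    public static void main(String[] args) {
        String html = "<html><head/><body><div class=\"main-content\">" +
                "<div class=\"sub-content\">" +
                "<p>a paragraph <b>with some bold text</b></p>" +
                "Sub content here</div>" +
                "Main content here </div></body></html>";
        Document document = Jsoup.parse(html);
        // A comma-separated selector matches any of the listed tags.
        Elements elements = document.select("div, p, b");
        for (Element element : elements) {
            System.out.println(element.ownText());
        }
    }
}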
The code above will parse the following HTML:
<html>
<head />
<body>
<div class="main-content">
<div class="sub-content">
<p>a paragraph <b>with some bold text</b></p>
Sub content here</div>
Main content here</div>
</body>
</html>
and print the following output:
Main content here
Sub content here
a paragraph
with some bold text
Answered by Ankit
<div class="main-content" id="mainCon">
<div class="sub-content" id="subCon">Sub content here</div>
Main content here </div>
To get the result you mentioned from this markup, use document.getElementById("mainCon").innerHTML (browser JavaScript).
It returns "Main content here" together with the markup of the sub div, so you still have to parse that part out yourself.
Similarly, for the sub div you can use the same snippet, i.e. document.getElementById("subCon").innerHTML
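The snippets above are browser-side DOM calls. As a rough Java counterpart, here is a minimal sketch that uses only the JDK's W3C DOM API (javax.xml.parsers and org.w3c.dom) instead of Jsoup. It assumes the markup is well-formed XML, which real-world HTML often is not. The class name DomOwnText and the helpers ownText and divTexts are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomOwnText {
    // Concatenates the text nodes that are direct children of the element,
    // skipping anything nested inside child elements.
    static String ownText(Element element) {
        StringBuilder sb = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    // Parses well-formed markup and returns the own text of every div, in document order.
    static List<String> divTexts(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            NodeList divs = doc.getElementsByTagName("div");
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < divs.getLength(); i++) {
                texts.add(ownText((Element) divs.item(i)));
            }
            return texts;
        } catch (Exception e) {
            // Assumption: malformed input is treated as a caller error.
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String markup = "<div class=\"main-content\" id=\"mainCon\">"
                + "<div class=\"sub-content\" id=\"subCon\">Sub content here</div>"
                + "Main content here </div>";
        for (String text : divTexts(markup)) {
            System.out.println(text);
        }
    }
}
```

For HTML that is not well-formed, a lenient parser such as Jsoup, as in the answer above, is usually the safer choice.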