How to parse html from javascript variables with Jsoup in Java?
Original source: http://stackoverflow.com/questions/17922129/
Notice: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.
Asked by Caballero
I'm using Jsoup to parse an HTML file and pull all the visible text from its elements. The problem is that some HTML fragments sit inside JavaScript variables, and those are naturally ignored. What would be the best way to get those fragments out?
Example:
<!DOCTYPE html>
<html>
<head>
<script>
var html = "<span>some text</span>";
</script>
</head>
<body>
<p>text</p>
</body>
</html>
In this example Jsoup only picks up the text from the p tag, which is what it's supposed to do. How do I pick up the text from the span in var html? The solution must be applied to thousands of different pages, so I can't rely on something like the JavaScript variable having the same name.
Answer by Daniel B
You can use Jsoup to parse all the <script> tags into DataNode objects.
DataNode
A data node, for contents of style, script tags etc, where contents should not show in text().
Elements scriptTags = doc.getElementsByTag("script");
This will give you all the Elements with the tag <script>.
You can then use the getWholeData() method to extract the contents of each node.
// Get the data contents of this node. String getWholeData()
for (Element tag : scriptTags) {
    for (DataNode node : tag.dataNodes()) {
        System.out.println(node.getWholeData());
    }
}
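To tie the pieces above together, here is a minimal sketch that collects the raw contents of every <script> tag and, separately, turns an HTML fragment into visible text. The class name ScriptDataSketch and the helper names scriptData and visibleText are made up for illustration, and it assumes the jsoup jar is on the classpath. Locating the HTML string inside the script contents is a separate step that this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
class ScriptDataSketch {

    // Returns the raw contents of every <script> tag on the page.
    static List<String> scriptData(String pageHtml) {
        Document doc = Jsoup.parse(pageHtml);
        List<String> result = new ArrayList<>();
        for (Element tag : doc.getElementsByTag("script")) {
            for (DataNode node : tag.dataNodes()) {
                result.add(node.getWholeData());
            }
        }
        return result;
    }

    // Parses an HTML fragment (e.g. a string taken out of a script) and returns its visible text.
    static String visibleText(String htmlFragment) {
        return Jsoup.parseBodyFragment(htmlFragment).body().text();
    }

    public static void main(String[] args) {
        String page = "<html><head><script>var html = \"<span>some text</span>\";</script></head>"
                + "<body><p>text</p></body></html>";
        for (String data : scriptData(page)) {
            System.out.println(data); // raw JavaScript source of each script tag
        }
        System.out.println(visibleText("<span>some text</span>"));
    }
}
```

Once a fragment has been cut out of the script contents, passing it through a second Jsoup parse like this avoids stripping tags by hand.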
Answer by KK4SBB
I am not so sure about the answer, but I saw a similar situation before here.
You probably can use Jsoup and manual parsing to get the text according to that answer.
I just modified that piece of code for your specific case:
Document doc = ...
Element script = doc.select("script").first(); // Get the first <script> element
Pattern p = Pattern.compile("(?is)html = \"(.+?)\""); // Regex capturing the value assigned to html
Matcher m = p.matcher(script.html()); // you have to use html() here and NOT text()! text() will drop the 'html' part
while (m.find()) {
    System.out.println(m.group());  // the whole match, including the leading: html =
    System.out.println(m.group(1)); // the captured value only
}
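Since the question rules out relying on the variable name, one possible variation is to match any quoted string literal in the script source and keep only those that look like markup. The sketch below uses only java.util.regex. The class name HtmlLiteralFinder and the method findHtmlLiterals are hypothetical, and the approach is a heuristic: it does not understand JavaScript comments, regex literals, template literals, or strings built by concatenation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class HtmlLiteralFinder {

    // Matches a double- or single-quoted JavaScript string literal, allowing backslash escapes.
    private static final Pattern STRING_LITERAL = Pattern.compile(
            "\"((?:\\\\.|[^\"\\\\])*)\"|'((?:\\\\.|[^'\\\\])*)'", Pattern.DOTALL);

    // Rough test for "looks like markup": an opening or closing tag somewhere in the literal.
    private static final Pattern TAG = Pattern.compile("</?[a-zA-Z][^>]*>");

    // Returns every string literal in the script source that contains something tag-like.
    static List<String> findHtmlLiterals(String scriptSource) {
        List<String> result = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(scriptSource);
        while (m.find()) {
            String body = m.group(1) != null ? m.group(1) : m.group(2);
            // Undo the most common escapes (\" \' \/); other escapes are left untouched.
            String unescaped = body.replaceAll("\\\\([\"'/])", "$1");
            if (TAG.matcher(unescaped).find()) {
                result.add(unescaped);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String script = "var html = \"<span>some text</span>\"; var other = 'no markup';";
        for (String literal : findHtmlLiterals(script)) {
            System.out.println(literal); // prints: <span>some text</span>
        }
    }
}
```

Each literal found this way can then be handed to Jsoup as an HTML fragment to pull out its visible text.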
Hope it will be helpful.