How do I parse HTML from a JavaScript variable using Jsoup in Java?

Question

Asked by Caballero

I'm using Jsoup to parse an HTML file and pull all the visible text from its elements. The problem is that some HTML fragments sit inside JavaScript variables, and Jsoup ignores them. What is the best way to extract those fragments?

Example:

<!DOCTYPE html>
<html>
<head>
    <script>
        var html = "<span>some text</span>";
    </script>
</head>
<body>
    <p>text</p>
</body>
</html>

In this example Jsoup only picks up the text from the p tag, which is what it is supposed to do. How do I pick up the text from the span inside var html? The solution must work across thousands of different pages, so I can't rely on things like the JavaScript variable always having the same name.

Answer 1

Answered by Daniel B

Jsoup exposes the contents of every <script> tag as DataNode objects.

DataNode
A data node, for contents of style, script tags etc, where contents should not show in text().

 Elements scriptTags = doc.getElementsByTag("script");

This returns every <script> element in the document.

You can then call getWholeData() on each DataNode to get the script source as a string.

// Get the data contents of this node.
String    getWholeData()

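// Requires org.jsoup.nodes.DataNode, org.jsoup.nodes.Element and org.jsoup.select.Elements.
// Print the raw JavaScript source of every <script> element.
for (Element tag : scriptTags) {
    for (DataNode node : tag.dataNodes()) {
        System.out.println(node.getWholeData());
    }
}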

Jsoup API - DataNode

Answer 2

Answered by KK4SBB

I'm not sure this fully answers the question, but I have seen a similar situation before here.

Going by that answer, you can probably combine Jsoup with some manual parsing to get the text.

I modified that piece of code for your specific case:

Document doc = ...
Element script = doc.select("script").first(); // the first <script> element

// Capture the string assigned to the variable named "html"; whitespace around '=' is optional
Pattern p = Pattern.compile("(?is)html\\s*=\\s*\"(.+?)\"");
// Use html() here, NOT text(): text() does not include the contents of a <script> element
Matcher m = p.matcher(script.html());

while (m.find()) {
    System.out.println(m.group());  // the whole match, including the variable name
    System.out.println(m.group(1)); // the captured value only
}
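
The regex above depends on the variable being named html, which the question rules out. Below is a minimal sketch of a name-independent variant: it scans the script source for any quoted string literal and keeps those that look like they contain a tag. The class ScriptStringLiterals and the method extractHtmlLiterals are hypothetical names made up for this illustration, not part of Jsoup, and the approach is a heuristic rather than a real JavaScript parser. It uses only java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Jsoup): pulls HTML-looking string literals out of JavaScript source.
class ScriptStringLiterals {

    // A double- or single-quoted literal; backslash escapes are allowed inside.
    private static final Pattern STRING_LITERAL =
            Pattern.compile("\"((?:\\\\.|[^\"\\\\])*)\"|'((?:\\\\.|[^'\\\\])*)'");

    // Heuristic: treat a literal as HTML if it contains something shaped like a tag.
    private static final Pattern TAG = Pattern.compile("</?[a-zA-Z][^>]*>");

    static List<String> extractHtmlLiterals(String scriptSource) {
        List<String> fragments = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(scriptSource);
        while (m.find()) {
            // Group 1 is set for double-quoted literals, group 2 for single-quoted ones.
            String raw = m.group(1) != null ? m.group(1) : m.group(2);
            // Undo the escapes most often seen in embedded HTML: \" \' and \/
            String value = raw.replaceAll("\\\\([\"'/])", "$1");
            if (TAG.matcher(value).find()) {
                fragments.add(value);
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        String js = "var html = \"<span>some text</span>\"; var name = 'plain';";
        for (String fragment : extractHtmlLiterals(js)) {
            System.out.println(fragment); // prints <span>some text</span>
        }
    }
}
```

Each returned fragment could then be passed to Jsoup.parse(fragment).text() to get its visible text, with the script source coming from script.html() or from getWholeData() on the script's DataNode objects. Because this is plain pattern matching, an apostrophe inside a JavaScript comment, a regex literal, a template literal, or HTML assembled by string concatenation can all throw it off, so treat the results as candidates to check.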

Hope it will be helpful.
