Java XPath can't find a table by id

Question

Asked by Dean Schulze

I'm doing some screen scraping using WATIJ, but it can't read HTML tables (throws NullPointerExceptions or UnknownObjectExceptions). To overcome this I read the HTML and run it through JTidy to get well-formed XML.

I want to parse it with XPath, but it can't find a <table ...> by id even though the table is there in the XML, plain as day. Here is my code:

XPathFactory factory=XPathFactory.newInstance();  
XPath xPath=factory.newXPath();  
InputSource inputSource = new InputSource(new StringReader(tidyHtml));  
XPathExpression xPathExpression=xPath.compile("//table[@id='searchResult']");  
String expression = "//table[@id='searchResult']";
String table = xPath.evaluate(expression, inputSource);
System.out.println("table = " + table);

The table is an empty String.

The table is in the XML, however. If I print the tidyHtml String it shows:

 <table
   class="ApptableDisplayTag"
   id="searchResult"
   style="WIDTH: 99%">

I haven't used XPath before so maybe I'm missing something.

Can anyone set me straight? Thanks.

Answer 1

Answered by Michael Cheng

I don't know anything about JTidy, but for WATIJ, I believe the reason you are getting the NullPointerException and UnknownObjectException errors is that your XPath uses lowercase node names. Say you are using "//table[@id='searchResult']" as the XPath to look up the table in WATIJ. That won't actually work because "table" is in lower case. For WATIJ, all node names need to be in upper case, e.g. "//TABLE[@id='searchResult']". As an example, to print the number of rows of that table using WATIJ, you'd do the following:

import watij.runtime.ie.IE;
import static watij.finders.SymbolFactory.*;

public class Example {
    public static void main(String[] args) {
        IE ie = new IE();
        ie.start("your_url_goes_here");
        System.out.println(ie.table(xpath, "//TABLE[@id='searchResult']").rowCount());
        ie.close();
    }
}

This code or answer may not be right since I only started using WATIJ today, though I did run into this exact problem with XPaths. It took me a couple of hours of searching and testing before I noticed how all the XPaths were cased on this page: WATIJ User Guide. Once I changed the casing in my XPaths, WATIJ was able to locate the objects, so this should work for you as well.

Answer 2

Answered by Dean Schulze

The solution was to drop WATIJ and switch to Google WebDriver. WebDriver documents how different browsers handle case in xpath statements.

Answer 3

Answered by user207421

Double quotes are definitely not required, and neither is uppercase. Namespaces and/or DTD are more likely the answer.
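
To illustrate the namespace point, here is a minimal sketch. It assumes the tidied markup carries the usual XHTML default namespace and is parsed with a namespace-aware parser; the class name, method name, and sample markup are made up for the example. Under XPath 1.0 an unprefixed name such as `table` only matches elements in no namespace, so the original expression finds nothing, while a `local-name()` test does.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NamespaceSketch {
    // Hypothetical sample: XHTML with a default namespace, similar to tidied output.
    static final String XHTML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
      + "<table id=\"searchResult\"><tr><td>row</td></tr></table>"
      + "</body></html>";

    // Parses with namespace awareness on and returns the first matching node, or null.
    public static Node select(String xml, String expression) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        return (Node) XPathFactory.newInstance().newXPath()
                                  .evaluate(expression, doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        // Unprefixed name test: does not match elements in the XHTML namespace.
        Node plain = select(XHTML, "//table[@id='searchResult']");
        // local-name() ignores the namespace, so this one matches.
        Node byLocalName = select(XHTML, "//*[local-name()='table'][@id='searchResult']");
        System.out.println("plain name test:  " + (plain == null ? "not found" : "found"));
        System.out.println("local-name() test: " + (byLocalName == null ? "not found" : "found"));
    }
}
```

Registering a NamespaceContext on the XPath object and writing a prefixed expression is the other common option; `local-name()` is used here only because it keeps the sketch short.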

Answer 4

Answered by Philip

Unique ID attributes need to be accessed with the id() function: id('search')

Answer 5

Answered by potyl

I have never used the XPath API of Java directly; I have always used it through dom4j or in other languages (Perl and C). But I have a good understanding of how it normally works. First, you should probably parse the input as a DOM document; this will help a lot. Also, if you know that your document has IDs, you should parse it while loading the DTD or Schema that describes it. That way the XML parser will mark and identify the nodes that have proper IDs. Once you have done this, you can use your code with the DOM tree.
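
The "parse to a DOM document first" suggestion could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name, method name, and sample markup are invented, and the sample has no namespace or DTD. It asks for XPathConstants.NODE so the result is the element itself instead of its string value.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DomXPathSketch {
    // Parses the markup into a DOM tree, then returns the first element
    // matching the expression, or null when nothing matches.
    public static Element findFirst(String xml, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        // XPathConstants.NODE returns the node itself; the two-argument
        // evaluate() used in the question returns only its string value.
        return (Element) xPath.evaluate(expression, document, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample markup without a namespace declaration.
        String sample = "<html><body><table class=\"ApptableDisplayTag\" id=\"searchResult\">"
                      + "<tr><td>row</td></tr></table></body></html>";
        Element table = findFirst(sample, "//table[@id='searchResult']");
        System.out.println(table == null ? "not found" : "found <" + table.getTagName() + ">");
    }
}
```

Working with the returned Element also makes it easy to tell "no match" (null) apart from "matched an element with no text", which the string-returning call cannot distinguish.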

The documentation of [XPath.evaluate(expression, item)](http://java.sun.com/j2se/1.5.0/docs/api/javax/xml/xpath/XPath.html#evaluate(java.lang.String,%20java.lang.Object)) shows that the second argument should be a Node or a NodeList. This is probably why you're getting so many UnknownObjectExceptions.

If your XML parser is able to recognize the ID elements then you can access an element having an ID with the following XPath expression:

XPathExpression xPathExpression=xPath.compile("id('searchResult')");
xPathExpression.evaluate(document); // document is a DOM document instance

Using the XPath function id() is the most efficient way of accessing elements, provided the elements use an ID attribute that has been declared as such in the DTD or Schema.
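
As a rough illustration of that condition, the sketch below declares the id attribute as type ID in an internal DTD subset and then evaluates id('searchResult'). The class name, method name, and sample documents are invented for the example, and it assumes the JDK's built-in parser and XPath implementation, which as far as I know pick up ID declarations from the DTD even without validation.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class IdFunctionSketch {
    static final String BODY =
        "<html><body><table id=\"searchResult\"><tr><td>row</td></tr></table></body></html>";

    // Hypothetical document whose internal DTD subset declares table/@id as an ID.
    static final String WITH_DTD =
        "<!DOCTYPE html [\n"
      + "  <!ATTLIST table id ID #IMPLIED>\n"
      + "]>\n" + BODY;

    // Parses the document and returns the node selected by id('...'), or null.
    public static Node byId(String xml, String id) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return (Node) XPathFactory.newInstance().newXPath()
                .evaluate("id('" + id + "')", doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        // With the ATTLIST declaration the parser knows "id" is an ID attribute.
        System.out.println("with DTD:    " + (byId(WITH_DTD, "searchResult") == null ? "not found" : "found"));
        // Without any declaration, "id" is just an ordinary attribute to the parser.
        System.out.println("without DTD: " + (byId(BODY, "searchResult") == null ? "not found" : "found"));
    }
}
```

If no DTD or schema is available, an attribute predicate such as [@id='searchResult'] does not depend on any ID declaration.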

Answer 6

Answered by Yevgeny Simkin

Your XPath is correct... whatever it is that's failing, it isn't that.

Answer 7

Answered by Dean Schulze

It looks like the problem is mostly with JTidy. I can get xpath to parse the JTidy-ied result by doing the following:

1. Remove all "<&amp>nbsp;". JTidy returns xhtml with "<&amp>nbsp;" outside of tags.
2. Remove the ...
3. In the ... tag, remove the xmlns=... attribute.
4. Remove the "head" tags.

(I used some funny formatting because HTML entities won't display when typed properly.)
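
The cleanup steps that survive in this post (dropping the non-breaking-space entities, the xmlns attribute, and the head element) could be sketched as below. This is only an assumption about how one might do it with plain string replacement: the class name, method names, regular expressions, and sample markup are all invented, and the step whose target is elided above is deliberately left out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TidyCleanupSketch {
    // Applies three of the manual cleanup steps with simple string operations.
    public static String cleanUp(String xhtml) {
        return xhtml
            .replace("&nbsp;", "")                       // entity is undefined without the DTD
            .replaceAll("(?s)<head\\b.*?</head>", "")    // drop the head element
            .replaceFirst("\\s+xmlns=\"[^\"]*\"", "");   // drop the default namespace
    }

    // Parses the cleaned markup and looks up the table by its id attribute.
    public static Node findTable(String cleaned) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(cleaned)));
        return (Node) XPathFactory.newInstance().newXPath()
                .evaluate("//table[@id='searchResult']", doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the tidied page.
        String tidyHtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title></head>"
                        + "<body>&nbsp;<table id=\"searchResult\"><tr><td>row</td></tr></table></body></html>";
        String cleaned = cleanUp(tidyHtml);
        System.out.println(cleaned);
        System.out.println(findTable(cleaned) == null ? "not found" : "found");
    }
}
```

Regex-based cleanup like this is fragile on real pages, so it is best treated as a diagnostic aid for finding out which part of the tidied output trips up the parser.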

JTidy also puts newlines in the middle of the text content of ... elements.

I'll have to look at other HTML -> XML conversion options. I gave Cobra a quick try, but it also failed to find my table by Id. I haven't tried manually cleaning up the result from Cobra, so I don't know how it compares to JTidy.

If you know of an HTML parser that returns good XML please let me know.
