Automatically generating HTTP screen-scraping Java code

Question

Asked by Dónal

I need to screen scrape some data from a website, because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself, using Apache's HTTP client library to make the relevant HTTP calls to download the data. I figured out which calls I needed to make by clicking through the relevant screens in a browser while using the Charles web proxy to log the corresponding HTTP calls.
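For readers who have not done this by hand, replaying a recorded session might look roughly like the sketch below. It is only an illustration, not the asker's code: it swaps in the JDK's built-in `java.net.http.HttpClient` (Java 11+) instead of Apache's HTTP client library so that it has no dependencies, and the class name `SessionReplay`, its helper methods, and any URLs or form fields passed to them are hypothetical.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for replaying HTTP calls that were logged by a proxy.
class SessionReplay {
    // One client for the whole session, so cookies set by earlier
    // responses (e.g. a login) are sent with later requests.
    private final HttpClient client = HttpClient.newBuilder()
            .cookieHandler(new CookieManager())
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    // Encodes form fields as application/x-www-form-urlencoded.
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Builds a form POST equivalent to one recorded in the proxy log.
    static HttpRequest formPost(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }

    // Builds a plain GET for a page seen in the proxy log.
    static HttpRequest get(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    // Sends a request and returns the response body as text.
    String send(HttpRequest request) throws IOException, InterruptedException {
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

A session would then be a sequence of `send(formPost(...))` and `send(get(...))` calls on a single `SessionReplay` instance, in the order the proxy recorded them.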

As you can imagine, this is a fairly tedious process, and I'm wondering if there's a tool that can actually generate the Java code that corresponds to a browser session. I expect the generated code wouldn't be as pretty as code written manually, but I could always tidy it up afterwards. Does anyone know if such a tool exists? Selenium is one possibility I'm aware of, though I'm not sure if it supports this exact use case.

Thanks, Don

Answer 1

Answered by j pimmel

I would also add +1 for HtmlUnit, since its functionality is very powerful: if you need behaviour 'as though a real browser was scraping and using the page', that's definitely the best option available. HtmlUnit executes (if you want it to) the JavaScript in the page.

It currently has full-featured support for all the main JavaScript libraries and will execute JS code that uses them. Correspondingly, you can get handles to the JavaScript objects in the page programmatically within your test.

If, however, the scope of what you are trying to do is smaller, more along the lines of reading some of the HTML elements, and you don't much care about JavaScript, then using NekoHTML should suffice. It's similar to JDom, giving programmatic (rather than XPath) access to the tree. You would probably need to use Apache's HttpClient to retrieve pages.
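To give a feel for that lighter-weight approach of reading HTML elements without running JavaScript, here is a minimal sketch. It does not use NekoHTML; as an assumption made to keep the example dependency-free, it swaps in the lenient HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`), which reports tags through callbacks rather than building a tree. The class name `LinkExtractor` and its method are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the href of every <a> element in a page.
class LinkExtractor {
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Called once per opening tag; only anchors are of interest here.
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            // The last argument tells the parser to ignore charset declarations,
            // since the input is already a decoded String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }
}
```

The page text passed to `extractLinks` would come from whatever HTTP client retrieved it; a tree-building parser such as NekoHTML is the better fit when you need to navigate between elements rather than just collect them.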

Answer 2

Answered by Nicholas

The manageability.org blog has an entry that lists a whole bunch of web page scraping tools for Java. I can't seem to reach it right now, but I did find a text-only representation in Google's cache here.

Answer 3

Answered by Marc Novakowski

You should take a look at HtmlUnit. It was designed for testing websites but works great for screen scraping and navigating through multiple pages. It takes care of cookies and other session-related stuff.

Answer 4

Answered by Sumit Ghosh

Personally, HtmlUnit and Selenium are my two favorite tools for screen scraping.

Answer 5

Answered by laz

A tool called The Grinder allows you to script a session to a site by going through its proxy. The output is Python (runnable in Jython).
