Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/2471049/

Date: 2020-10-29 21:21:04  Source: igfitidea

Scrape data from HTML pages using Java, output to database

java scraper

Asked by Tanith

I need to know how to create a scraper (in Java) to gather data from HTML pages and output to a database...do not have a clue where to start so any information you can give me on this would be great. Also, you can't be too basic or simple here...thanks :)

Answered by codaddict

First you need to get familiar with an HTML DOM parser in Java, like JTidy. This will help you extract the stuff you want from an HTML file. Once you have the essential stuff, you can use JDBC to put it in the database.

It might be tempting to use regular expressions for this job. But don't. HTML is not a regular language, so regexes are not the way to go.

Answered by The Don

I am running a scraper using JSoup. I'm a noob, yet I found it to be very intuitive and easy to work with. It is also capable of parsing a wide range of sources: HTML, XML, RSS, etc.

I experimented with HtmlUnit with little to no success.

Answered by mickthompson

A HUGE percentage of websites are built on malformed HTML code.
It is essential that you use something like HtmlCleaner to clean up the source code that you want to parse.
Then you can successfully use XPath to extract nodes and regex to parse specific parts of the strings you extracted from the page.
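A minimal sketch of that "XPath for nodes, regex for string fragments" step, using only the JDK's built-in XML and XPath classes. It assumes the page has already been cleaned into well-formed XHTML; the sample markup, the `item` class, the price pattern, and the names `XPathRegexSketch` and `extractPrices` are hypothetical and only there for illustration. The sample deliberately has no `xmlns` or DOCTYPE; real XHTML usually needs namespace-aware XPath handling.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathRegexSketch {

    // Hypothetical sample standing in for cleaned, well-formed XHTML.
    static final String XHTML =
        "<html><body><ul>"
        + "<li class=\"item\">Widget - Price: $12.50</li>"
        + "<li class=\"item\">Gadget - Price: $7.00</li>"
        + "</ul></body></html>";

    // The regex only sees short strings already extracted by XPath, never the whole page.
    static final Pattern PRICE = Pattern.compile("\\$(\\d+\\.\\d{2})");

    static List<String> extractPrices(String xhtml) {
        try {
            // Parse the XHTML string into a DOM document.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // XPath selects the nodes of interest.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList items = (NodeList) xpath.evaluate(
                "//li[@class='item']", doc, XPathConstants.NODESET);

            // Regex pulls the specific fragment out of each node's text.
            List<String> prices = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Matcher m = PRICE.matcher(items.item(i).getTextContent());
                if (m.find()) {
                    prices.add(m.group(1));
                }
            }
            return prices;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractPrices(XHTML)); // prints [12.50, 7.00]
    }
}
```

If the remote page changes, only the XPath expression and the pattern should need adjusting.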

At least this is the technique I used.

You can use the xHtml that is returned from HtmlCleaner as a sort of Interface between your Application and the remote Page you're trying to parse. You should test against this and in the case the remote page changes you just have to extract the new xHtml cleaned by HtmlCleaner, re-adapt the XPath Queries to extract what you need and re-test your Application code against the new Interface.

In case you want to create a multithreaded 'scraper', be aware that HtmlCleaner is not thread safe (refer to my post here).
This post can give you an idea of how to parse correctly formatted xHtml using XPath.
Good Luck! ;)
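One common way to use a non-thread-safe component from several threads is to give each thread its own instance. The sketch below shows that pattern with `ThreadLocal`. Note that `PageCleaner` is a hypothetical stand-in written for this example, not HtmlCleaner's actual API, and `cleanAll` is likewise an illustrative name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerThreadCleanerSketch {

    // Hypothetical stand-in for a cleaner that keeps mutable state and is not thread safe.
    static class PageCleaner {
        private final StringBuilder buffer = new StringBuilder();

        String clean(String html) {
            buffer.setLength(0);          // shared mutable state: unsafe if two threads interleave here
            buffer.append(html.trim());
            return buffer.toString();
        }
    }

    // Each worker thread lazily creates its own cleaner, so no instance is shared.
    private static final ThreadLocal<PageCleaner> CLEANER =
        ThreadLocal.withInitial(PageCleaner::new);

    static List<String> cleanAll(List<String> pages, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String page : pages) {
                Callable<String> task = () -> CLEANER.get().clean(page);
                futures.add(pool.submit(task));
            }
            List<String> cleaned = new ArrayList<>();
            for (Future<String> f : futures) {
                cleaned.add(f.get());     // collected in submission order
            }
            return cleaned;
        } catch (Exception e) {
            throw new IllegalStateException("cleaning failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(cleanAll(List.of("  <p>a</p> ", "<p>b</p>  "), 2));
    }
}
```

Creating a fresh cleaner per page would also work; a per-thread instance just avoids repeated construction.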

Note: at the time I implemented my scraper, HtmlCleaner did a better job of normalizing the pages I wanted to parse. In some cases jTidy failed at the same job, so I'd suggest you give it a try.

Answered by Stefan De Boey

i successfully used the lobo browser API in a project that scraped HTML pages. the lobo browser project offers a browser, but you can also use the API behind it very easily. it will also execute javascript, and if that javascript manipulates the DOM, that will be reflected in the DOM when you inspect it. so, in short, the API allows you to mimic a browser; you can also work with cookies and stuff.

now for getting the data out of the HTML, i would first transform the HTML to valid XHTML. you can use jtidy for this. since XHTML is valid XML, you can use XPath to retrieve the data you want very easily. if you try to write code that parses the data from the raw HTML, your code will become a mess quickly. therefore i'd use XPath.

Once you have the data, you can insert it into a DB with JDBC, or maybe use Hibernate if you want to avoid writing too much SQL.
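For the plain JDBC route, a minimal sketch could look like the following. The table `scraped_items`, its columns, the `ScrapedItem` class, and the method `insertAll` are hypothetical names chosen for illustration. The JDBC URL is passed on the command line, and a matching driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class JdbcInsertSketch {

    // Hypothetical holder for one scraped row.
    static class ScrapedItem {
        final String title;
        final String price;

        ScrapedItem(String title, String price) {
            this.title = title;
            this.price = price;
        }
    }

    // Hypothetical table and column names.
    static final String INSERT_SQL =
        "INSERT INTO scraped_items (title, price) VALUES (?, ?)";

    // Inserts all items as one batch and returns the number of statements executed.
    static int insertAll(Connection conn, List<ScrapedItem> items) {
        try (PreparedStatement stmt = conn.prepareStatement(INSERT_SQL)) {
            for (ScrapedItem item : items) {
                // Bound parameters keep scraped text from being interpreted as SQL.
                stmt.setString(1, item.title);
                stmt.setString(2, item.price);
                stmt.addBatch();
            }
            return stmt.executeBatch().length;
        } catch (SQLException e) {
            throw new IllegalStateException("insert failed", e);
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 1) {
            System.out.println("usage: java JdbcInsertSketch <jdbc-url>");
            return;
        }
        // Assumes the scraped_items table already exists in the target database.
        try (Connection conn = DriverManager.getConnection(args[0])) {
            int n = insertAll(conn, List.of(
                new ScrapedItem("Widget", "12.50"),
                new ScrapedItem("Gadget", "7.00")));
            System.out.println("inserted " + n + " rows");
        }
    }
}
```

Using a `PreparedStatement` rather than string concatenation matters here because scraped text is untrusted input.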

Answered by giri

Using JTidy you can scrape data from HTML. Then you can use JDBC to store it.
