How to extract data from a website using Java?

Note: this page is a translated mirror of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/2044017/

Time: 2020-08-13 02:57:12  Source: igfitidea

java screen-scraping

Asked by giri

I am familiar with the Java programming language. I would like to extract data from a website and store it in a database running on my machine. Is that possible in Java? If so, which API should I use? For example, there are a number of schools listed on a website. How can I extract that data and store it in my database using Java?

Accepted answer by lucas

What you're referring to is commonly called 'screen scraping'. There are a variety of ways to do this in Java; however, I prefer HtmlUnit. While it was designed as a way to test web functionality, you can use it to fetch a remote web page and parse it.

I would recommend using a good error-tolerant HTML parser like TagSoup to extract exactly what you're looking for from the HTML.
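
As a rough illustration of the parsing step, here is a minimal sketch that pulls the text of every `<li>` element out of an HTML string. Note that it does not use TagSoup or HtmlUnit: to stay dependency-free it swaps in the lenient HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`). The class name `ListItemExtractor`, the method `extractListItems`, and the sample markup are all hypothetical, and a real page would almost certainly need different selection logic.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text content of <li> elements.
public class ListItemExtractor {

    /** Returns the trimmed text of every <li> element found in the given HTML. */
    public static List<String> extractListItems(Reader html) throws IOException {
        List<String> items = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private StringBuilder current; // non-null only while inside an <li>

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.LI) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current != null) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.LI && current != null) {
                    items.add(current.toString().trim());
                    current = null;
                }
            }
        };
        // The third argument tells the parser to ignore any charset declared in the markup.
        new ParserDelegator().parse(html, callback, true);
        return items;
    }

    public static void main(String[] args) throws IOException {
        // Sample markup standing in for a page that lists schools (an assumption).
        String html = "<ul><li>School A</li><li>School B</li></ul>";
        System.out.println(extractListItems(new StringReader(html)));
    }
}
```

The same callback shape should carry over to a SAX-style parser such as TagSoup, but that is left unverified here.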

Answer by almathie

Depending on what you are really trying to do, you can use many different solutions.

If you just want to fetch the HTML code of a web page, then URL.getContent() may be your solution. Here is a short tutorial:

http://www.javacoffeebreak.com/books/extracts/javanotesv3/c10/s4.html
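
For reference, fetching the raw HTML with only the standard library might look like the sketch below. It uses `URL.openStream()` rather than `URL.getContent()`, since `openStream()` hands back the bytes directly. The class name `PageFetcher`, the `example.com` address, and the assumption that the page is UTF-8 encoded are illustrative and not taken from the original answer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: downloads a page and returns its body as a String.
public class PageFetcher {

    /** Reads the whole stream and decodes it as UTF-8 (an assumption about the page). */
    public static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Opens the given address and returns everything it serves. */
    public static String fetch(String address) throws IOException {
        try (InputStream in = new URL(address).openStream()) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address; pass a real URL as the first argument.
        String address = args.length > 0 ? args[0] : "http://example.com/";
        System.out.println(fetch(address));
    }
}
```

This only retrieves the markup; picking the interesting values out of it still requires an HTML parser.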

EDIT: I didn't understand that he was looking for a way to parse the HTML code. Some tools have been suggested above. Sorry about that.

Answer by Alex Dean

You definitely need a good parser like NekoHTML.

Here's an example of using NekoHTML, albeit using Groovy (a Java-based scripting language) rather than Java itself:

http://www.keplarllp.com/blog/2010/01/better-competitive-intelligence-through-scraping-with-groovy

Answer by vietspider

You can use VietSpider XML, available from:

http://sourceforge.net/projects/binhgiang/files/

Download VietSpider3_16_XML_Windows.zip or VietSpider3_16_XML_Linux.zip

VietSpider Web Data Extractor: the software crawls data from websites (a data scraper), formats it to the XML standard (Text, CDATA), then stores it in a relational database. The product supports various RDBMSs such as Oracle, MySQL, SQL Server, H2, HSQL, Apache Derby, Postgres … VietSpider Crawler supports sessions (login, query by form input), multi-downloading, JavaScript handling, proxies (and multi-proxy by automatically scanning proxies from websites) …
