How to extract data from a website using Java?

Question

Asked by giri

I am familiar with the Java programming language. I would like to extract data from a website and store it in a database running on my machine. Is that possible in Java? If so, which API should I use? For example, a number of schools are listed on a website. How can I extract that data and store it in my database using Java?

Answer 1

Accepted answer by lucas

What you're referring to is commonly called "screen scraping". There are a variety of ways to do this in Java; however, I prefer HtmlUnit. While it was designed as a way to test web functionality, you can use it to fetch a remote web page and parse it.

I would recommend using an HTML parser with good error handling, such as TagSoup, to extract exactly what you're looking for from the HTML.
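As a rough illustration of the parser-based extraction described above, here is a minimal sketch that collects the text of every element with a given tag name through the standard SAX interfaces. The class name `ElementTextExtractor`, the `li` tag, and the sample markup are assumptions made for illustration, not part of the original answer. The sketch runs against the JDK's built-in SAX parser, which only accepts well-formed markup; TagSoup's parser implements the same `XMLReader` interface, so, assuming the TagSoup jar is on the classpath, it could be passed in instead to cope with messy real-world HTML.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the name is chosen for this sketch only.
class ElementTextExtractor {

    /** Returns the trimmed text of every element whose name matches {@code tag}. */
    static List<String> extract(XMLReader reader, String markup, String tag) {
        List<String> results = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            // Non-null only while the parser is inside a matching element.
            private StringBuilder current;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (tag.equalsIgnoreCase(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (current != null && tag.equalsIgnoreCase(qName)) {
                    results.add(current.toString().trim());
                    current = null;
                }
            }
        });
        try {
            reader.parse(new InputSource(new StringReader(markup)));
        } catch (IOException | SAXException e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
        return results;
    }

    /**
     * The JDK's own SAX parser (strict, well-formed input only).
     * Assumption: with TagSoup available, new org.ccil.cowan.tagsoup.Parser()
     * could be used here instead for lenient HTML parsing.
     */
    static XMLReader jdkReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not create SAX parser", e);
        }
    }
}
```

Calling `ElementTextExtractor.extract(ElementTextExtractor.jdkReader(), markup, "li")` on a page fragment that lists schools as `<li>` items would return the school names as a list of strings, which could then be written to the database.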

Answer 2

Answered by almathie

Depending on what you are really trying to do, you can use many different solutions.


If you just want to fetch the HTML code of a web page, then URL.getContent() may be your solution. Here is a short tutorial:

http://www.javacoffeebreak.com/books/extracts/javanotesv3/c10/s4.html
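For the simple case of fetching a page's raw HTML, a minimal sketch using only the standard library might look like the following. It uses `URL.openStream()` rather than `URL.getContent()`, since the stream can be read directly as text. The class name `PageFetcher` is hypothetical, and decoding as UTF-8 is an assumption; a real page may declare a different character set.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is chosen for this sketch only.
class PageFetcher {

    /** Reads the whole stream and decodes it as UTF-8 (an assumption about the page's encoding). */
    static String readAll(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Opens the given address and returns the response body as a string. */
    static String fetch(String address) {
        // A malformed address also surfaces as an IOException here.
        try (InputStream in = new URL(address).openStream()) {
            return readAll(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as `PageFetcher.fetch("http://example.com/")` would return the page source as a single string. This only retrieves the markup; pulling specific values out of it still needs an HTML parser.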

EDIT: I didn't realize he was looking for a way to parse the HTML code. Some tools have been suggested above. Sorry about that.

Answer 3

Answered by Alex Dean

You definitely need a good parser like NekoHTML.


Here's an example of using NekoHTML, albeit using Groovy (a Java-based scripting language) rather than Java itself:


http://www.keplarllp.com/blog/2010/01/better-competitive-intelligence-through-scraping-with-groovy

Answer 4

Answered by vietspider

You can use VietSpider XML, available from:

http://sourceforge.net/projects/binhgiang/files/

Download VietSpider3_16_XML_Windows.zip or VietSpider3_16_XML_Linux.zip


VietSpider Web Data Extractor: the software crawls data from websites (a data scraper), formats it to the XML standard (Text, CDATA), and then stores it in a relational database. The product supports various RDBMSs such as Oracle, MySQL, SQL Server, H2, HSQL, Apache Derby, Postgres… The VietSpider crawler supports sessions (login, query by form input), multi-downloading, JavaScript handling, proxies (and multi-proxy by automatically scanning for proxies from websites)…
