
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/71491/

How do you grab text from a webpage (Java)?

java, html, html-content-extraction

Asked by ansgri

I'm planning to write a simple J2SE application to aggregate information from multiple web sources.

The most difficult part, I think, is extracting meaningful information from web pages when it isn't available as RSS or Atom feeds. For example, I might want to extract a list of questions from stackoverflow, but I absolutely don't need that huge tag cloud or navbar.

What technique/library would you advise?

Updates/Remarks

  • Speed doesn't matter — as long as it can parse about 5MB of HTML in less than 10 minutes.
  • It should be really simple.

Answered by jatanp

You may use HTMLParser (http://htmlparser.sourceforge.net/) in combination with URL#getInputStream() to parse the content of HTML pages hosted on the Internet.

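For the fetching half, here is a minimal JDK-only sketch (the UTF-8 charset is an assumption for illustration); the returned string would then be handed to HTMLParser for the actual extraction:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class PageFetcher {
    // Downloads the raw HTML of a page; feed the result to HTMLParser.
    public static String fetch(String address) throws Exception {
        URLConnection conn = new URL(address).openConnection();
        StringBuilder html = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8")); // charset assumed
        try {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        } finally {
            in.close();
        }
        return html.toString();
    }
}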

Answered by Joe Liversedge

If you want to take advantage of any structural or semantic markup, you might want to explore converting the HTML to XML and using XQuery to extract the information in a standard form. Take a look at this IBM developerWorks article for some typical code, excerpted below (they're outputting HTML, which is, of course, not required):

<table>
{
  for $d in //td[contains(a/small/text(), "New York, NY")]
  for $row in $d/parent::tr/parent::table/tr
  where contains($d/a/small/text()[1], "New York")
  return <tr><td>{data($row/td[1])}</td> 
           <td>{data($row/td[2])}</td>              
           <td>{$row/td[3]//img}</td> </tr>
}
</table>
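
To actually run such a query from Java you need an XQuery engine. A sketch using Saxon's s9api (Saxon is my assumption, not the answer's; page.xml stands in for the HTML after it has been tidied into well-formed XML):

import java.io.File;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.*;

public class XQueryDemo {
    public static void main(String[] args) throws SaxonApiException {
        Processor proc = new Processor(false); // false = open-source edition
        XQueryCompiler compiler = proc.newXQueryCompiler();
        XQueryExecutable exec = compiler.compile(
                "//td[contains(a/small/text(), 'New York, NY')]");
        XQueryEvaluator evaluator = exec.load();
        evaluator.setSource(new StreamSource(new File("page.xml"))); // hypothetical input
        for (XdmItem item : evaluator.evaluate()) {
            System.out.println(item.getStringValue());
        }
    }
}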

Answered by James Law

You could look at how HttpUnit does it. They use a couple of decent HTML parsers; one is NekoHTML. As for fetching the data, you can use what's built into the JDK (HttpURLConnection), or use Apache's:

http://hc.apache.org/httpclient-3.x/

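For illustration, a sketch of the HttpClient 3.x route (error handling kept deliberately simple):

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class HttpClientFetch {
    public static String fetch(String url) throws Exception {
        HttpClient client = new HttpClient();
        GetMethod get = new GetMethod(url);
        try {
            int status = client.executeMethod(get);
            if (status != 200) {
                throw new IllegalStateException("Unexpected HTTP status: " + status);
            }
            return get.getResponseBodyAsString();
        } finally {
            get.releaseConnection(); // always hand the connection back
        }
    }
}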

Answered by Alexandre Victoor

You can use NekoHTML to parse your HTML document. You will get a DOM document. You can then use XPath to retrieve the data you need.

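A sketch of that combination, using NekoHTML's DOMParser and the JDK's built-in XPath support (note: NekoHTML upper-cases element names by default, hence //A; verify against your configuration):

import java.net.URL;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NekoXPathDemo {
    public static void main(String[] args) throws Exception {
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(
                new URL("http://stackoverflow.com/").openStream()));
        Document doc = parser.getDocument();

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList links = (NodeList) xpath.evaluate("//A", doc, XPathConstants.NODESET);
        for (int i = 0; i < links.getLength(); i++) {
            System.out.println(links.item(i).getTextContent());
        }
    }
}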

Answered by Maxim

If your "web sources" are regular websites using HTML (as opposed to structured XML format like RSS) I would suggest to take a look at HTMLUnit.

This library, while targeted at testing, is really a general-purpose "Java browser". It is built on Apache HttpClient, the NekoHTML parser, and Rhino for JavaScript support. It provides a really nice API to the web page and makes it easy to traverse a website.

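A sketch of what that API looks like (method names shift a bit across HtmlUnit versions, so treat this as illustrative rather than exact):

import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitDemo {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        HtmlPage page = webClient.getPage("http://stackoverflow.com/");
        System.out.println(page.getTitleText());
        // Every anchor on the page, as seen by the headless "browser"
        List<HtmlAnchor> anchors = page.getAnchors();
        for (HtmlAnchor anchor : anchors) {
            System.out.println(anchor.asText() + " -> " + anchor.getHrefAttribute());
        }
    }
}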

Answered by Eric DeLabar

Have you considered taking advantage of RSS/Atom feeds? Why scrape the content when it's usually available to you in a consumable format? There are libraries for consuming RSS in just about any language you can think of, and it'll be a lot less dependent on the markup of the page than attempting to scrape the content.

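If you do go the feed route, a library such as ROME (my suggestion; the answer doesn't name one) reduces consuming RSS/Atom to a few lines. The feed URL below is hypothetical:

import java.net.URL;
import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;

public class FeedDemo {
    public static void main(String[] args) throws Exception {
        URL feedUrl = new URL("http://example.com/feed.rss"); // hypothetical feed
        SyndFeed feed = new SyndFeedInput().build(new XmlReader(feedUrl));
        for (Object o : feed.getEntries()) { // raw List in older ROME versions
            SyndEntry entry = (SyndEntry) o;
            System.out.println(entry.getTitle() + " -> " + entry.getLink());
        }
    }
}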

If you absolutely MUST scrape content, look for microformats in the markup; most blogs (especially WordPress-based blogs) have them by default. There are also libraries and parsers available for locating and extracting microformats from webpages.

Finally, aggregation services/applications such as Yahoo Pipes may be able to do this work for you without reinventing the wheel.

Answered by VNVN

Check this out: http://www.alchemyapi.com/api/demo.html

They return pretty good results and have an SDK for most platforms. It's not just text extraction; they do keyword analysis and so on as well.

Answered by Vhaerun

If you want to do it the old-fashioned way, you need to connect a socket to the web server's port and then send the following data:

GET /file.html HTTP/1.0
Host: site.com
<ENTER>
<ENTER>

Then use Socket#getInputStream, read the data with a BufferedReader, and parse it however you like.

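Put together, a minimal JDK-only sketch (site.com and /file.html are the answer's placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class RawHttpGet {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket("site.com", 80); // placeholder host
        try {
            PrintWriter out = new PrintWriter(socket.getOutputStream());
            // HTTP wants CRLF line endings; the blank line ends the headers
            out.print("GET /file.html HTTP/1.0\r\n");
            out.print("Host: site.com\r\n");
            out.print("\r\n");
            out.flush();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // status line, headers, then the HTML
            }
        } finally {
            socket.close();
        }
    }
}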

Answered by Vhaerun

In short, you may either parse the whole page and pick out the things you need (for speed, I recommend looking at SAXParser), or run the HTML through a regexp that trims off all of the HTML... You can also convert it all into a DOM, but that's going to be expensive, especially if you're shooting for decent throughput.

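A sketch of the regexp approach; it is deliberately crude and will mangle edge cases (script bodies, comments, attribute values containing '>'), which is the usual trade-off with regex-over-HTML:

public class TagStripper {
    public static String stripTags(String html) {
        return html
                .replaceAll("(?si)<script.*?</script>", " ") // drop script bodies first
                .replaceAll("<[^>]*>", " ")                  // then drop remaining tags
                .replaceAll("\\s+", " ")                     // collapse whitespace
                .trim();
    }
}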

Answered by Vhaerun

You seem to want to screen scrape. You would probably want to write a framework with an adapter/plugin per source site (as each site's format will differ) that parses the HTML source and extracts the text. You would probably use Java's I/O API to connect to the URL and stream the data via InputStreams.

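A sketch of what such a per-site plugin contract might look like (all names here are illustrative, not from the answer):

import java.util.List;

// One adapter per source site; the framework picks the first adapter
// that claims a URL and hands it the downloaded HTML.
public interface SiteAdapter {
    boolean canHandle(String url);          // does this adapter know the site's layout?
    List<String> extractText(String html);  // pull the meaningful text out of the page
}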