Web Scraping with Scala

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same license and attribute it to the original authors (not the translator). Original source: http://stackoverflow.com/questions/14745634/

Web Scraping with Scala

scala, web-scraping, libraries

Asked by Michael Tingley

Just wondering if anyone knows of a web-scraping library that takes advantage of Scala's succinct syntax. So far, I've found Chafe, but it seems poorly documented and maintained. I'm wondering if anyone out there has done scraping with Scala and has advice. (I'm trying to integrate into an existing Scala framework rather than use a scraper written in, say, Python.)

Accepted answer by Adam Gent

First, there is a plethora of HTML scraping libraries on the JVM; all you need to do is pimp one of them (the "pimp my library" pattern). A sketch of that pattern follows the list below.

The four I have used are:

  • HtmlUnit - emulates a browser and will even run JavaScript
  • Jericho - preserves formatting; ideal if you want to edit the scraped HTML
  • NekoHtml
  • JSoup - did not work with Scala for me, though it might
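
As a rough sketch of the "pimp my library" pattern, the snippet below enriches Jericho's Source with a Scala-friendly links method via an implicit class. The RichSource and links names are made up for this example; it assumes Scala 2.13 (for scala.jdk.CollectionConverters) and the jericho-html artifact on the classpath, so treat it as an illustration rather than a drop-in solution.

```scala
import net.htmlparser.jericho.{HTMLElementName, Source}
import scala.jdk.CollectionConverters._

object JerichoPimps {
  // "Pimp my library": enrich Jericho's Source with a Scala-friendly method.
  implicit class RichSource(val source: Source) extends AnyVal {
    /** All (href, link text) pairs found in the document. */
    def links: Seq[(String, String)] =
      source.getAllElements(HTMLElementName.A).asScala.toSeq.map { a =>
        (Option(a.getAttributeValue("href")).getOrElse(""), a.getTextExtractor.toString.trim)
      }
  }
}

object PimpDemo extends App {
  import JerichoPimps._
  val html = """<html><body><a href="https://example.com">Example</a></body></html>"""
  new Source(html).links.foreach { case (href, text) => println(s"$text -> $href") }
}
```

The same shape works for any of the libraries above: keep the Java API underneath and expose only the Scala-friendly surface you actually need.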

I have used Selenium, but never for scraping. Scala has a wrapper around Selenium.

I would recommend pimping an existing Java library over some half-baked Scala library.

Answer by overthink

I don't have a Scala-specific recommendation, but for the JVM in general I've had good success with:

  • JSoup - you can use CSS selectors to "scrape" the document. Really nice to work with. (A short sketch follows this list.)
  • Use Tagsoup to convert your input HTML to XML, then use XML processors to "scrape". (See the sketch after the paragraph below.)
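
As a minimal illustration of the JSoup route (assuming the org.jsoup:jsoup artifact and Scala 2.13 for scala.jdk.CollectionConverters; the URL is just a placeholder):

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object JsoupDemo extends App {
  // Fetch the page and query it with CSS selectors.
  val doc = Jsoup.connect("https://example.com").get()

  // Every anchor that has an href, as (link text, absolute URL) pairs.
  val links = doc.select("a[href]").asScala.map(a => (a.text, a.absUrl("href")))

  links.foreach { case (text, href) => println(s"$text -> $href") }
}
```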

The Tagsoup route actually works quite well with Scala since Scala's built-in XML "dsl" is pretty concise (if you can forgive its perf issues and occasional API weirdness). Also, Tagsoup will handle nearly any garbage document you give it. It also has niceties like built-in understanding of many HTML entities that other SAXParsers will choke on as being undeclared.

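A small sketch of that combination, assuming the org.ccil.cowan.tagsoup:tagsoup and org.scala-lang.modules:scala-xml dependencies; the inline HTML stands in for a real, messy page:

```scala
import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
import scala.xml.{Elem, XML}

object TagSoupDemo extends App {
  // TagSoup turns messy, real-world HTML into well-formed XML events,
  // which scala-xml can then load and query with its projection operators.
  val loader = XML.withSAXParser(new SAXFactoryImpl().newSAXParser())

  val messyHtml =
    """<html><body><p>Unclosed paragraph <a href="https://example.com">Example</body></html>"""
  val doc: Elem = loader.loadString(messyHtml)

  // "Scrape" all hrefs out of the cleaned-up document.
  val hrefs = (doc \\ "a").map(a => (a \ "@href").text)
  hrefs.foreach(println)
}
```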

tl;dr: JSoup + CSS selectors if possible, otherwise TagSoup + Scala XML. If slow is OK, run TagSoup first, then JSoup on the result.

Answer by scalapeno

I'd recommend Goose: https://github.com/jiminoc/goose

It's not as general-purpose as you might need, but if you are scraping article content from popular sites, it may work out of the box. It also provides a framework to work from if you want to extend their code to cover other sites.
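
For flavor, here is a hedged sketch of what using Goose can look like. The class and member names (Goose, Configuration, extractContent, title, cleanedArticleText) are recalled from the project's README and should be checked against the version you actually pull in; the URL is a placeholder.

```scala
import com.gravity.goose.{Configuration, Goose}

object GooseDemo extends App {
  // Extract the main article content from a page (API names assumed from Goose's README).
  val goose   = new Goose(new Configuration)
  val article = goose.extractContent("https://example.com/some-article")

  println(article.title)
  println(article.cleanedArticleText.take(500)) // first 500 chars of the extracted body
}
```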