Web Scraping with Scala

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same license and attribute it to the original authors (not the translator). Original source: http://stackoverflow.com/questions/14745634/

Web Scraping with Scala

scala, web-scraping, libraries

Asked by Michael Tingley

Just wondering if anyone knows of a web-scraping library that takes advantage of Scala's succinct syntax. So far, I've found Chafe, but it seems poorly documented and maintained. I'm wondering if anyone out there has done scraping with Scala and has advice. (I'm trying to integrate into an existing Scala framework rather than use a scraper written in, say, Python.)

Accepted answer by Adam Gent

First, there is a plethora of HTML scraping libraries on the JVM; all you need to do is pimp one of them (the "pimp my library" pattern). A sketch of that pattern follows the list below.

The four I have used are:

  • HtmlUnit - emulates a browser and will even run JavaScript
  • Jericho - preserves formatting; ideal if you want to edit the scraped HTML
  • NekoHtml
  • JSoup - did not work with Scala for me, though it might
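
As a rough sketch of the "pimp my library" pattern, the snippet below enriches Jericho's Source with a Scala-friendly links method via an implicit class. The RichSource and links names are made up for this example; it assumes Scala 2.13 (for scala.jdk.CollectionConverters) and the jericho-html artifact on the classpath, so treat it as an illustration rather than a drop-in solution.

```scala
import net.htmlparser.jericho.{HTMLElementName, Source}
import scala.jdk.CollectionConverters._

object JerichoPimps {
  // "Pimp my library": enrich Jericho's Source with a Scala-friendly method.
  implicit class RichSource(val source: Source) extends AnyVal {
    /** All (href, link text) pairs found in the document. */
    def links: Seq[(String, String)] =
      source.getAllElements(HTMLElementName.A).asScala.toSeq.map { a =>
        (Option(a.getAttributeValue("href")).getOrElse(""), a.getTextExtractor.toString.trim)
      }
  }
}

object PimpDemo extends App {
  import JerichoPimps._
  val html = """<html><body><a href="https://example.com">Example</a></body></html>"""
  new Source(html).links.foreach { case (href, text) => println(s"$text -> $href") }
}
```

The same shape works for any of the libraries above: keep the Java API underneath and expose only the Scala-friendly surface you actually need.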

I have used Selenium, but never for scraping. Scala has a wrapper around Selenium.

I would recommend pimping an existing Java library over some half-baked Scala library.

Answer by overthink

I don't have a Scala-specific recommendation, but for the JVM in general I've had good success with:

  • JSoup - you can use CSS selectors to "scrape" the document. Really nice to work with. (A short sketch follows this list.)
  • Use Tagsoup to convert your input HTML to XML, then use XML processors to "scrape". (See the sketch after the paragraph below.)
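
As a minimal illustration of the JSoup route (assuming the org.jsoup:jsoup artifact and Scala 2.13 for scala.jdk.CollectionConverters; the URL is just a placeholder):

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object JsoupDemo extends App {
  // Fetch the page and query it with CSS selectors.
  val doc = Jsoup.connect("https://example.com").get()

  // Every anchor that has an href, as (link text, absolute URL) pairs.
  val links = doc.select("a[href]").asScala.map(a => (a.text, a.absUrl("href")))

  links.foreach { case (text, href) => println(s"$text -> $href") }
}
```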

The Tagsoup route actually works quite well with Scala since Scala's built-in XML "dsl" is pretty concise (if you can forgive its perf issues and occasional API weirdness). Also, Tagsoup will handle nearly any garbage document you give it. It also has niceties like built-in understanding of many HTML entities that other SAXParsers will choke on as being undeclared.

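A small sketch of that combination, assuming the org.ccil.cowan.tagsoup:tagsoup and org.scala-lang.modules:scala-xml dependencies; the inline HTML stands in for a real, messy page:

```scala
import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
import scala.xml.{Elem, XML}

object TagSoupDemo extends App {
  // TagSoup turns messy, real-world HTML into well-formed XML events,
  // which scala-xml can then load and query with its projection operators.
  val loader = XML.withSAXParser(new SAXFactoryImpl().newSAXParser())

  val messyHtml =
    """<html><body><p>Unclosed paragraph <a href="https://example.com">Example</body></html>"""
  val doc: Elem = loader.loadString(messyHtml)

  // "Scrape" all hrefs out of the cleaned-up document.
  val hrefs = (doc \\ "a").map(a => (a \ "@href").text)
  hrefs.foreach(println)
}
```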

tl;dr: JSoup + CSS selectors if possible, otherwise TagSoup + Scala XML. If slow is OK, run TagSoup first, then JSoup on the result.

Answer by scalapeno

I'd recommend Goose: https://github.com/jiminoc/goose

It's not as general-purpose as you might need, but if you are scraping article content from popular sites, it may work out of the box. It also provides a framework to work from if you want to extend their code to cover other sites.
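
For flavor, here is a hedged sketch of what using Goose can look like. The class and member names (Goose, Configuration, extractContent, title, cleanedArticleText) are recalled from the project's README and should be checked against the version you actually pull in; the URL is a placeholder.

```scala
import com.gravity.goose.{Configuration, Goose}

object GooseDemo extends App {
  // Extract the main article content from a page (API names assumed from Goose's README).
  val goose   = new Goose(new Configuration)
  val article = goose.extractContent("https://example.com/some-article")

  println(article.title)
  println(article.cleanedArticleText.take(500)) // first 500 chars of the extracted body
}
```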