Options for HTML scraping?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/2861/
Asked by Mark Harrison
I'm thinking of trying Beautiful Soup, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement; I'm actually interested in hearing about other languages as well.
The story so far:
- Python
- Ruby
- .NET
- Perl
- Java
- JavaScript
- PHP
- Most of them
Accepted answer by Joey deVilla
Answer by Jon Galloway
In the .NET world, I recommend the HTML Agility Pack. It's not nearly as simple as some of the options above (like HTMLSQL), but it's very flexible. It lets you manipulate poorly formed HTML as if it were well-formed XML, so you can use XPath or just iterate over nodes.
Answer by Cristian
BeautifulSoup is a great way to go for HTML scraping. My previous job had me doing a lot of scraping, and I wish I had known about BeautifulSoup when I started. It's like the DOM with a lot more useful options, and it's a lot more Pythonic. If you want to try Ruby, they ported BeautifulSoup as RubyfulSoup, but it hasn't been updated in a while.
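As a minimal sketch of the kind of query this makes easy (assuming a current bs4 install; the URL, tags, and attributes are placeholders):

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = urlopen("http://example.com/").read()
soup = BeautifulSoup(html, "html.parser")

# Grab every link's text and href, even from sloppy markup.
for a in soup.find_all("a"):
    print(a.get_text(strip=True), a.get("href"))
```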
Other useful tools are HTMLParser or sgmllib.SGMLParser, which are part of the standard Python library. These work by calling methods every time you enter/exit a tag and encounter HTML text. They're like Expat, if you're familiar with that. These libraries are especially useful if you are going to parse very large files, where creating a DOM tree would be slow and expensive.
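A rough event-driven sketch using the standard-library HTMLParser (the tag and attribute it looks for are just illustrative):

```python
from html.parser import HTMLParser  # stdlib; HTMLParser.HTMLParser in Python 2

class LinkCollector(HTMLParser):
    """Collects href values without ever building a full DOM tree."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag encountered in the stream.
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

parser = LinkCollector()
parser.feed('<p>See <a href="/docs">the docs</a>.</p>')
print(parser.links)  # ['/docs']
```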
Regular expressions aren't really necessary on their own. BeautifulSoup accepts regular expressions, so if you need their power you can use them there. I say go with BeautifulSoup unless you need speed and a smaller memory footprint. If you find a better HTML parser for Python, let me know.
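For instance, a compiled pattern can stand in for an attribute value in a search (a small sketch; the class-name pattern and markup are made up):

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<img class="userpic-42" title="me">', "html.parser")

# Any attribute filter can be a regular expression instead of a literal string.
for img in soup.find_all("img", class_=re.compile(r"^userpic")):
    print(img["title"])  # 'me'
```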
Answer by deadprogrammer
I found HTMLSQL to be a ridiculously simple way to screen-scrape. It takes literally minutes to get results with it.
The queries are super-intuitive, like:
SELECT title from img WHERE $class == 'userpic'
There are now some other alternatives that take the same approach.
Answer by akaihola
The Python lxml library acts as a Pythonic binding for the libxml2 and libxslt libraries. I particularly like its XPath support and pretty-printing of the in-memory XML structure. It also supports parsing broken HTML. And I don't think you can find other Python libraries/bindings that parse XML faster than lxml.
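A small sketch of both points, XPath selection over repaired markup and pretty-printing (the markup and XPath expression are invented for illustration):

```python
from lxml import etree, html

# lxml.html tolerates broken markup (here, unclosed <li> tags).
tree = html.fromstring("<ul><li class='item'>one<li class='item'>two</ul>")

# XPath selection over the repaired tree.
print(tree.xpath("//li[@class='item']/text()"))   # ['one', 'two']

# Pretty-print the in-memory structure that lxml built.
print(etree.tostring(tree, pretty_print=True).decode())
```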
Answer by andrewrk
For Perl, there's WWW::Mechanize.
Answer by filippo
Python has several options for HTML scraping in addition to Beautiful Soup. Here are some others:

- mechanize: similar to Perl's WWW::Mechanize. Gives you a browser-like object to interact with web pages.
- lxml: Python binding to libxml2. Supports various options to traverse and select elements (e.g. XPath and CSS selection).
- scrapemark: high-level library using templates to extract information from HTML.
- pyquery: allows you to make jQuery-like queries on XML documents (see the sketch after this list).
- scrapy: a high-level scraping and web crawling framework. It can be used to write spiders, for data mining, and for monitoring and automated testing.
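As an illustration of the jQuery-style selection pyquery offers (a sketch only; the markup and selectors are placeholders):

```python
from pyquery import PyQuery as pq  # pip install pyquery

d = pq("<div><img class='userpic' title='me'><img class='banner'></div>")

# CSS selectors and chained calls, much like jQuery in the browser.
print(d("img.userpic").attr("title"))  # 'me'
print(len(d("img")))                   # 2
```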
Answer by filippo
'Simple HTML DOM Parser' is a good option for PHP. If you're familiar with jQuery or JavaScript selectors, you will find yourself at home.
Answer by cookie_monster
Why has no one mentioned JSOUP yet for Java? http://jsoup.org/
Answer by akaihola
The templatemaker utility from Adrian Holovaty (of Django fame) uses a very interesting approach: you feed it variations of the same page and it "learns" where the "holes" for variable data are. It's not HTML-specific, so it would be good for scraping any other plaintext content as well. I've also used it for PDFs and HTML converted to plaintext (with pdftotext and lynx, respectively).
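Roughly, the workflow looks like the sketch below, adapted from memory of the project's own announcement example; the exact method names (Template, learn, as_text, extract) are assumptions and should be checked against the library before relying on them:

```python
from templatemaker import Template  # assumption: the package exposes a Template class

t = Template()
# Feed it two variations of the same fragment...
t.learn('<b>this and that</b>')
t.learn('<b>alex and sue</b>')

# ...and it infers where the variable "holes" are.
print(t.as_text('!'))                       # e.g. '<b>! and !</b>'
print(t.extract('<b>larry and curly</b>'))  # e.g. ('larry', 'curly')
```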