Options for HTML scraping?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/2861/
Asked by Mark Harrison
I'm thinking of trying Beautiful Soup, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement; I'm actually interested in hearing about other languages as well.
The story so far:
- Python
- Ruby
- .NET
- Perl
- Java
- JavaScript
- PHP
- Most of them
Accepted answer by Joey deVilla
Answer by Jon Galloway
In the .NET world, I recommend the HTML Agility Pack. It's not nearly as simple as some of the options above (like HTMLSQL), but it's very flexible. It lets you manipulate poorly formed HTML as if it were well-formed XML, so you can use XPath or just iterate over nodes.
Answer by Cristian
BeautifulSoup is a great way to go for HTML scraping. My previous job had me doing a lot of scraping, and I wish I had known about BeautifulSoup when I started. It's like the DOM with a lot more useful options, and it's a lot more Pythonic. If you want to try Ruby, they ported BeautifulSoup as RubyfulSoup, but it hasn't been updated in a while.
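As a minimal sketch of the kind of query this makes easy (assuming a current bs4 install; the URL, tags, and attributes are placeholders):

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = urlopen("http://example.com/").read()
soup = BeautifulSoup(html, "html.parser")

# Grab every link's text and href, even from sloppy markup.
for a in soup.find_all("a"):
    print(a.get_text(strip=True), a.get("href"))
```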
Other useful tools are HTMLParser or sgmllib.SGMLParser, which are part of the standard Python library. These work by calling methods every time you enter/exit a tag and encounter HTML text. They're like Expat, if you're familiar with that. These libraries are especially useful if you are going to parse very large files, where creating a DOM tree would be slow and expensive.
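A rough event-driven sketch using the standard-library HTMLParser (the tag and attribute it looks for are just illustrative):

```python
from html.parser import HTMLParser  # stdlib; HTMLParser.HTMLParser in Python 2

class LinkCollector(HTMLParser):
    """Collects href values without ever building a full DOM tree."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag encountered in the stream.
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

parser = LinkCollector()
parser.feed('<p>See <a href="/docs">the docs</a>.</p>')
print(parser.links)  # ['/docs']
```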
Regular expressions aren't really necessary on their own. BeautifulSoup accepts regular expressions, so if you need their power you can use them there. I say go with BeautifulSoup unless you need speed and a smaller memory footprint. If you find a better HTML parser for Python, let me know.
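For instance, a compiled pattern can stand in for an attribute value in a search (a small sketch; the class-name pattern and markup are made up):

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<img class="userpic-42" title="me">', "html.parser")

# Any attribute filter can be a regular expression instead of a literal string.
for img in soup.find_all("img", class_=re.compile(r"^userpic")):
    print(img["title"])  # 'me'
```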
Answer by deadprogrammer
I found HTMLSQL to be a ridiculously simple way to screen-scrape. It takes literally minutes to get results with it.
The queries are super-intuitive, like:
SELECT title from img WHERE $class == 'userpic'
There are now some other alternatives that take the same approach.
Answer by akaihola
The Python lxml library acts as a Pythonic binding for the libxml2 and libxslt libraries. I particularly like its XPath support and pretty-printing of the in-memory XML structure. It also supports parsing broken HTML. And I don't think you can find other Python libraries/bindings that parse XML faster than lxml.
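A small sketch of both points, XPath selection over repaired markup and pretty-printing (the markup and XPath expression are invented for illustration):

```python
from lxml import etree, html

# lxml.html tolerates broken markup (here, unclosed <li> tags).
tree = html.fromstring("<ul><li class='item'>one<li class='item'>two</ul>")

# XPath selection over the repaired tree.
print(tree.xpath("//li[@class='item']/text()"))   # ['one', 'two']

# Pretty-print the in-memory structure that lxml built.
print(etree.tostring(tree, pretty_print=True).decode())
```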
Answer by andrewrk
For Perl, there's WWW::Mechanize.
Answer by filippo
Python has several options for HTML scraping in addition to Beautiful Soup. Here are some others:

- mechanize: similar to Perl's WWW::Mechanize. Gives you a browser-like object to interact with web pages.
- lxml: Python binding to libxml2. Supports various options to traverse and select elements (e.g. XPath and CSS selection).
- scrapemark: high-level library using templates to extract information from HTML.
- pyquery: allows you to make jQuery-like queries on XML documents (see the sketch after this list).
- scrapy: a high-level scraping and web crawling framework. It can be used to write spiders, for data mining, and for monitoring and automated testing.
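As an illustration of the jQuery-style selection pyquery offers (a sketch only; the markup and selectors are placeholders):

```python
from pyquery import PyQuery as pq  # pip install pyquery

d = pq("<div><img class='userpic' title='me'><img class='banner'></div>")

# CSS selectors and chained calls, much like jQuery in the browser.
print(d("img.userpic").attr("title"))  # 'me'
print(len(d("img")))                   # 2
```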
Answer by filippo
'Simple HTML DOM Parser' is a good option for PHP. If you're familiar with jQuery or JavaScript selectors, you will find yourself at home.
Answer by cookie_monster
Why has no one mentioned JSOUP yet for Java? http://jsoup.org/
Answer by akaihola
The templatemaker utility from Adrian Holovaty (of Django fame) uses a very interesting approach: you feed it variations of the same page and it "learns" where the "holes" for variable data are. It's not HTML-specific, so it would be good for scraping any other plaintext content as well. I've also used it for PDFs and HTML converted to plaintext (with pdftotext and lynx, respectively).
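Roughly, the workflow looks like the sketch below, adapted from memory of the project's own announcement example; the exact method names (Template, learn, as_text, extract) are assumptions and should be checked against the library before relying on them:

```python
from templatemaker import Template  # assumption: the package exposes a Template class

t = Template()
# Feed it two variations of the same fragment...
t.learn('<b>this and that</b>')
t.learn('<b>alex and sue</b>')

# ...and it infers where the variable "holes" are.
print(t.as_text('!'))                       # e.g. '<b>! and !</b>'
print(t.extract('<b>larry and curly</b>'))  # e.g. ('larry', 'curly')
```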