"Smart" way of parsing and using website data?

Note: this page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute the original authors (not the translator). Original: http://stackoverflow.com/questions/1223458/

"Smart" way of parsing and using website data?

html, web-services, parsing, webpage, html-content-extraction

Asked by bluebit

How does one intelligently parse data returned by search results on a page?

For example, let's say I would like to create a web service that searches for online books by parsing the search results of many book providers' websites. I could get the raw HTML of each page and apply some regexes to extract the data my web service needs, but if any of the websites change the formatting of their pages, my code breaks!

RSS is indeed a marvelous option, but many sites don't have an XML/JSON-based search.

Are there any kits out there that help extract information from pages automatically? A crazy idea would be to have a fuzzy AI module recognize patterns on a search results page and parse the results accordingly...

Accepted answer by BobMcGee

I've done some of this recently, and here are my experiences.

There are three basic approaches:

  1. Regular Expressions.
    • Most flexible, easiest to use with loosely-structured info and changing formats.
    • Harder to do structural/tag analysis, but easier to do text matching.
    • Built-in validation of data formatting.
    • Harder to maintain than the others, because you have to write a regular expression for each pattern you want to use to extract/transform the document.
    • Generally slower than options 2 and 3.
    • Works well for lists of similarly-formatted items.
    • A good regex development/testing tool and some sample pages will help. I've got good things to say about RegexBuddy here. Try their demo.
    • I've had the most success with this. The flexibility lets you work with nasty, brutish, in-the-wild HTML code.
  2. Convert HTML to XHTML and use XML extraction tools. Clean up the HTML, convert it to legal XHTML, and use XPath/XQuery/X-whatever to query it as XML data (a Python sketch of this approach follows this list).
    • Tools: TagSoup, HTMLTidy, etc.
    • Quality of the HTML-to-XHTML conversion is VERY important, and highly variable.
    • Best solution if the data you want is structured by the HTML layout and tags (data in HTML tables, lists, DIV/SPAN groups, etc.).
    • Most suitable for getting link structures, nested tables, images, lists, and so forth.
    • Should be faster than option 1, but slower than option 3.
    • Works well if content formatting changes/is variable, but document structure/layout does not.
    • If the data isn't structured by HTML tags, you're in trouble.
    • Can be used with option 1.
  3. Parser generator (ANTLR, etc.) -- create a grammar for parsing & analyzing the page.
    • I have not tried this because it was not suitable for my (messy) pages.
    • Most suitable if the HTML is highly structured, very constant, regular, and never changes.
    • Use this if there are easy-to-describe patterns in the document that don't involve HTML tags but do involve recursion or complex behaviors.
    • Does not require XHTML input.
    • FASTEST throughput, generally.
    • Big learning curve, but easier to maintain.
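
To make option 2 concrete, here is a minimal Python sketch of the same idea (my illustration; the answer itself doesn't name these tools). It uses lxml.html, which tolerates tag soup much like TagSoup/HTMLTidy would, and then queries the repaired tree with XPath; the markup and class name are hypothetical:

    from lxml import html

    # Messy, non-XHTML input: the <div> and <a> tags are never closed
    raw = '<html><body><div class="result"><a href="/book/42">Dune'

    # lxml.html repairs the tag soup into a proper tree for us
    doc = html.fromstring(raw)

    # Query the repaired tree as structured data with XPath
    for link in doc.xpath('//div[@class="result"]/a'):
        print(link.text_content(), link.get("href"))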

I've tinkered with web harvest for option 2, but I find its syntax to be kind of weird: a mix of XML and a pseudo-Java scripting language. If you like Java, and like XML-style data extraction (XPath, XQuery), that might be the ticket for you.

Edit: if you use regular expressions, make sure you use a library with lazy quantifiers and capturing groups! PHP's older regex libraries lack these, and they're indispensable for matching data between open/close tags in HTML.

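A small Python illustration of that point (my example; the markup is hypothetical). The lazy quantifier .*? stops at the first closing tag instead of swallowing the rest of the document, and the capturing group pulls out just the text between the tags:

    import re

    page = '<td class="title">Dune</td><td class="title">Hyperion</td>'

    # Greedy (.*) would match across BOTH cells in one go;
    # lazy (.*?) stops at the first </td>, and the group captures the title text.
    titles = re.findall(r'<td class="title">(.*?)</td>', page, re.DOTALL)
    print(titles)  # ['Dune', 'Hyperion']
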
Answer by Aiden Bell

Without a fixed HTML structure to parse, I would hate to maintain regular expressions for finding data. You might have more luck parsing the HTML through a proper parser that builds the tree. Then select elements ... that would be more maintainable.

Obviously the best way is some XML output from the engine with fixed markup that you can parse and validate. I would think that an HTML parsing library with some 'in the dark' probing of the produced tree would be simpler to maintain than regular expressions.

This way, you just have to check on <a href="blah" class="cache_link">... turning into <a href="blah" class="cache_result">... or whatever.

Bottom line, grepping specific elements with regexp would be grim. A better approach is to build a DOM-like model of the page and look for 'anchors' to character data in the tags.

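A minimal Python sketch of that approach using BeautifulSoup (my choice of library; the answer doesn't prescribe one). It anchors on the class attribute from the example above and fails visibly when the expected structure is gone:

    from bs4 import BeautifulSoup

    page = '<a href="blah" class="cache_link">cached</a>'
    soup = BeautifulSoup(page, "html.parser")

    links = soup.find_all("a", class_="cache_link")
    if not links:
        # The site probably renamed the class (e.g. to cache_result): fail loudly
        raise RuntimeError("no cache_link anchors found; did the page layout change?")
    for link in links:
        print(link["href"])
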
Or send an email to the site making the case for an XML API ... you might get hired!

Answer by Rich Seller

You don't say what language you're using. In Java land you can use TagSoup and XPath to help minimise the pain. There's an example from this blog (of course the XPath can get a lot more complicated as your needs dictate):

import java.net.URL;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.jaxen.jdom.JDOMXPath;

URL url = new URL("http://example.com");
SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser"); // build a JDOM tree from a SAX stream provided by TagSoup
Document doc = builder.build(url);
// XHTML elements live in the XHTML namespace, so bind a prefix for the XPath
JDOMXPath titlePath = new JDOMXPath("/h:html/h:head/h:title");
titlePath.addNamespace("h", "http://www.w3.org/1999/xhtml");
String title = ((Element) titlePath.selectSingleNode(doc)).getText();
System.out.println("Title is " + title);

I'd recommend externalising the XPath expressions so you have some measure of protection if the site changes.

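Language aside, externalising the expressions can be as simple as a config file the scraper reads at startup. A hypothetical Python sketch of the idea (the selector name and expression are invented for illustration):

    import json
    from lxml import html

    # In practice this JSON would live in its own file, so a site change
    # means editing config rather than redeploying code (hypothetical example)
    selectors = json.loads('{"title": "//head/title/text()"}')

    doc = html.fromstring("<html><head><title>Example</title></head><body/></html>")
    print(doc.xpath(selectors["title"]))  # ['Example']
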
Here's an example XPath I'm definitely not using to screenscrape this site. No way, not me:

"//h:div[contains(@class,'question-summary')]/h:div[@class='summary']//h:h3"

Answer by Jon Galloway

You haven't mentioned which technology stack you're using. If you're parsing HTML, I'd use a parsing library:

  • Beautiful Soup (Python)
  • HTML Agility Pack (.NET)

There are also webservices that do exactly what you're describing, both commercial and free. They scrape sites and offer webservice interfaces.

And a generic webservice that offers some screen scraping is Yahoo Pipes; see this previous Stack Overflow question on that.

Answer by Jared

It isn't foolproof, but you may want to look at a parser such as Beautiful Soup. It won't magically find the same info if the layout changes, but it's a lot easier than writing complex regular expressions. Note that this is a Python module.

Answer by Al.

Unfortunately 'scraping' is the most common solution, as you said: attempting to parse HTML from websites. You could detect structural changes to the page and flag an alert for you to fix, so a change at their end doesn't result in bum data (one way to do this is sketched below). Until the semantic web is a reality, that's pretty much the only way to guarantee a large dataset.

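One way to detect such structural changes, sketched in Python (my illustration, not from the answer): fingerprint the page's tag-and-class skeleton and raise an alert when the fingerprint stored from the last successful run no longer matches.

    import hashlib
    from html.parser import HTMLParser

    class SkeletonParser(HTMLParser):
        """Collects only tag names and class attributes: the page's 'shape'."""
        def __init__(self):
            super().__init__()
            self.skeleton = []

        def handle_starttag(self, tag, attrs):
            self.skeleton.append(tag + "." + str(dict(attrs).get("class", "")))

    def structure_fingerprint(html_text):
        parser = SkeletonParser()
        parser.feed(html_text)
        return hashlib.sha256("\n".join(parser.skeleton).encode()).hexdigest()

    # Hypothetical pages: same text, but the site renamed a class attribute
    old_page = '<div class="result"><a class="cache_link">x</a></div>'
    new_page = '<div class="result"><a class="cache_result">x</a></div>'
    if structure_fingerprint(new_page) != structure_fingerprint(old_page):
        print("page structure changed; scraper needs attention")
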
Alternatively you can stick to small datasets provided by APIs. Yahoo are working very hard to provide searchable data through APIs (see YDN), and I think the Amazon API opens up a lot of book data, etc.

Hope that helps a little bit!

EDIT: And if you're using PHP, I'd recommend SimpleHTMLDOM.

Answer by BaroqueBobcat

Have you looked into using an HTML manipulation library? Ruby has some pretty nice ones, e.g. hpricot.

With a good library you could specify the parts of the page you want using CSS selectors or XPath. These would be a good deal more robust than using regexps.

Example from the hpricot wiki:

 doc = Hpricot(open("qwantz.html"))
 (doc/'div img[@src^="http://www.qwantz.com/comics/"]')
   #=> Elements[...]

I am sure you could find a library that does similar things in .NET or Python, etc.

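For instance, a rough Python equivalent of the hpricot example above, using BeautifulSoup's CSS selector support (my sketch; the file name comes from the Ruby example):

    from bs4 import BeautifulSoup

    with open("qwantz.html") as f:
        soup = BeautifulSoup(f, "html.parser")

    # Same idea as the hpricot selector: images under a div
    # whose src starts with the comics URL
    images = soup.select('div img[src^="http://www.qwantz.com/comics/"]')
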
Answer by filippo

Try googling for screen scraping + the language you prefer. I know several options for Python; you may find the equivalent for your preferred language:

  • Beautiful Soup
  • mechanize: similar to Perl's WWW::Mechanize. Gives you a browser-like object to interact with web pages
  • lxml: Python binding to libxml2/libxslt
  • scrapemark: uses templates to scrape pieces of pages
  • pyquery: allows you to make jQuery-style queries on XML/XHTML documents (a short sketch follows this list)
  • scrapy: a high-level scraping and web-crawling framework for writing spiders to crawl and parse web pages
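
As a tiny illustration of the jQuery-style approach, a pyquery sketch (my example; the markup and selector are hypothetical):

    from pyquery import PyQuery as pq

    d = pq('<div class="book"><a href="/b/1">Dune</a></div>')

    # jQuery-style selector: anchors inside .book divs
    for a in d('div.book a').items():
        print(a.text(), a.attr('href'))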

Depending on the website to scrape you may need to use one or more of the approaches above.

Answer by Alex Black

Parsley at http://www.parselets.com looks pretty slick.

It lets you define 'parslets' using JSON, where you define what to look for on the page, and it then parses that data out for you.

Answer by cdarwin

As others have said, you can use an HTML parser that builds a DOM representation and query it with XPath/XQuery. I found a very interesting article here: Java theory and practice: Screen-scraping with XQuery - http://www.ibm.com/developerworks/xml/library/j-jtp03225.html
