Parsing HTML on the iPhone

Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original question on Stack Overflow: http://stackoverflow.com/questions/405749/

Time: 2020-08-28 22:56:49. Source: igfitidea

Tags: iphone, html, parsing, html-content-extraction

Asked by Sophie Alpert

Can anyone recommend a C or Objective-C library for HTML parsing? It needs to handle messy HTML code that won't quite validate.

Does such a library exist, or am I better off just trying to use regular expressions?

Accepted answer by Sophie Alpert

Looks like libxml2.2 comes in the SDK, and libxml/HTMLparser.h claims the following:

This module implements an HTML 4.0 non-verifying parser with API compatible with the XML parser ones. It should be able to parse "real world" HTML, even if severely broken from a specification point of view.

That sounds like what I need, so I'm probably going to use that.

Answer by Albaregar

I found hpple quite useful for parsing messy HTML. The hpple project is an Objective-C wrapper around the XPathQuery library for parsing HTML. With it you can send an XPath query and receive the result.

Requirements:

- Add libxml2 includes to your project

  1. Menu Project->Edit Project Settings
  2. Search for setting "Header Search Paths"
  3. Add a new search path "${SDKROOT}/usr/include/libxml2"
  4. Enable recursive option

- Add the libxml2 library to your project

  1. Menu Project->Edit Project Settings
  2. Search for setting "Other Linker Flags"
  3. Add a new search flag "-lxml2"

- From hpple, get the following source code files and add them to your project:

  1. TFHpple.h
  2. TFHpple.m
  3. TFHppleElement.h
  4. TFHppleElement.m
  5. XPathQuery.h
  6. XPathQuery.m

- Work through the w3schools XPath Tutorial to get comfortable with the XPath language.

Code Example

#import "TFHpple.h"

NSData *data = [[NSData alloc] initWithContentsOfFile:@"example.html"];

// Create the parser
TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:data];

// Get all the cells of the 2nd row of the 3rd table
NSArray *elements = [xpathParser searchWithXPathQuery:@"//table[3]/tr[2]/td"];

// Access the first cell, if the query matched anything
if ([elements count] > 0) {
    TFHppleElement *element = [elements objectAtIndex:0];

    // Get the text within the cell tag
    NSString *content = [element content];
    NSLog(@"%@", content);
}

// Manual reference counting; omit these two lines under ARC
[xpathParser release];
[data release];

Known issues

As hpple is a wrapper over XPathQuery, which is itself another wrapper, this option is probably not the most efficient. If performance is an issue in your project, I recommend writing your own lightweight solution based on the hpple and XPathQuery library code.

Answer by DavidAWalsh

Just in case anyone got here by googling for a nice XPath parser and went off and used TFHpple: note that TFHpple uses XPathQuery. This is pretty good, but it has a memory leak.

In the function PerformXPathQuery, if the nodes are found to be nil, it jumps out before cleaning up.

So where you see this bit of code, add in the two cleanup lines:

xmlNodeSetPtr nodes = xpathObj->nodesetval;
if (!nodes)
{
    NSLog(@"Nodes was nil.");
    /* Cleanup */
    xmlXPathFreeObject(xpathObj);
    xmlXPathFreeContext(xpathCtx);
    return nil;
}

If you are doing a LOT of parsing, it's a vicious leak. Now.... how do I get my night back :-)
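
The fix above adds the frees to the early return. A different way to guard against this kind of leak, not the one used in XPathQuery, is to route every exit through a single cleanup label. The sketch below is plain C with made-up stand-ins (`acquire`, `release_resource`, `run_query`, `live_allocations`) in place of the libxml2 context and result objects, so that the bookkeeping can be checked with a counter.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical counter of resources currently held; it should be back
 * at zero after every call to run_query(). */
int live_allocations = 0;

/* Stand-in for creating an XPath context or result object. */
void *acquire(void)
{
    void *p = malloc(1);
    if (p != NULL)
        live_allocations++;
    return p;
}

/* Stand-in for the matching free function; safe to call with NULL. */
void release_resource(void *p)
{
    if (p != NULL) {
        free(p);
        live_allocations--;
    }
}

/* Returns 1 when "nodes" were found and 0 otherwise. Every path after
 * the first allocation leaves through the cleanup label. */
int run_query(int nodes_found)
{
    int ok = 0;
    void *ctx = NULL;
    void *obj = NULL;

    ctx = acquire();
    if (ctx == NULL)
        goto cleanup;

    obj = acquire();
    if (obj == NULL)
        goto cleanup;

    if (!nodes_found)
        goto cleanup;   /* the kind of early exit that leaked */

    ok = 1;

cleanup:
    release_resource(obj);
    release_resource(ctx);
    return ok;
}
```

With this shape, adding a new early exit later cannot skip the frees, because there is only one place where the function returns.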

Answer by Ben Reeves

I wrote a lightweight wrapper around libxml which may be useful:

Objective-C-HMTL-Parser

Answer by tcurdt

This probably depends on how messy the HTML is and what you want to extract. But usually Tidy does quite a good job. It is written in C, and I guess you should be able to build and statically link it for the iPhone. You can easily install the command line version and test the results first.

Answer by tcurdt

You may want to check out ElementParser. It provides "just enough" parsing of HTML and XML. Nice interfaces make walking around XML / HTML documents very straightforward. http://touchtank.wordpress.com/

Answer by tore

How about using the WebKit component, and possibly third-party packages such as jQuery, for tasks such as these? Wouldn't it be possible to fetch the HTML data in an invisible component and take advantage of the very mature selectors of the JavaScript frameworks?

Answer by Wulkanman

We use Convertigo to parse HTML on the server side and return clean, neat JSON web services to our mobile apps.

Answer by dnolen

Google's GData Objective-C API reimplements NSXMLElement and other related classes that Apple removed from the iPhone SDK. You can find it at http://code.google.com/p/gdata-objectivec-client/. I've used it for handling messaging via Jabber. Of course, if your HTML is malformed (missing closing tags), this might not help much.