Parse HTML using C
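Notice: this page is a translation of a popular Stack Overflow question, licensed under CC BY-SA 4.0. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow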
Original question: http://stackoverflow.com/questions/1527883/
Parse HTML using C
Asked by
I need to grab some content from an HTML page (valid XHTML). I fetch the page using curl and store it in memory.
I played with the idea of using regular expressions with the PCRE library, but I simply couldn't find any examples of using it with C. Then I moved on to look at HTML parsers, and again there is not a good selection. All I could find was a sparsely documented libxml module called HTMLparser.
Are there any alternatives? If not, are there any examples for what I have already found?
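For the regex route mentioned in the question, here is a minimal sketch. It uses the POSIX `<regex.h>` interface rather than PCRE, because POSIX regex is part of the C library on most Unix-like systems; the PCRE API is different. The helper name `extract_title` and the pattern are illustrative assumptions, and a regex like this only copes with simple, predictable markup, which is one reason the answers point to real parsers.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper: copy the text between <title> and </title> into out.
 * Returns 0 on success, -1 if nothing matched or the pattern failed to compile. */
int extract_title(const char *html, char *out, size_t out_size)
{
    regex_t re;
    regmatch_t m[2];   /* m[0] = whole match, m[1] = first capture group */
    int rc = -1;

    if (out_size == 0)
        return -1;

    /* [^<]* keeps the capture inside a single run of text (no nested tags). */
    if (regcomp(&re, "<title[^>]*>([^<]*)</title>",
                REG_EXTENDED | REG_ICASE) != 0)
        return -1;

    if (regexec(&re, html, 2, m, 0) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len >= out_size)
            len = out_size - 1;          /* truncate to fit the buffer */
        memcpy(out, html + m[1].rm_so, len);
        out[len] = '\0';
        rc = 0;
    }

    regfree(&re);
    return rc;
}
```

Call it with the page buffer that curl filled in and a destination array, for example `char title[128]; extract_title(page, title, sizeof title);`, and check the return value before using `title`.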
Accepted answer by Byron Whitlock
You want to use HTML Tidy for this. The libcurl page has some source code to get you going; it demonstrates traversing the DOM tree. You don't need an XML parser, and it doesn't fail on badly formatted HTML.
Answer by Michael Krelin - hacker
I would use libhtmltidy plus whatever XML parser you like, such as expat or libxml. It depends on what you're looking for.
Answer by Anton Kochkov
Google recently created a pure C99 library for parsing HTML, specifically HTML5. It is easy to use in any C program and is actively developed.
Answer by Tony Miller
If you want to parse XML using C, then by far the best way to proceed is to use the LibXML library. The main page is at http://xmlsoft.org/. In addition to downloads, the site has explicit code examples that specifically show how to handle parsing. I know for a fact that you can get precompiled versions for Mac and Windows; most Linux and BSD distributions already include it, and you can build it from source if you wish.