Xcode: get an HTML page as XML code

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/9210280/

Time: 2020-09-14 23:16:20  Source: igfitidea

Get an HTML page as XML code

Tags: html, xml, xcode, nsxmlparser

Asked by Guy Daher

I just learned how to parse data in Xcode using NSXMLParser.

To do that I obviously need XML files, but I am still a beginner at web programming.

I am having difficulty getting an XML file from a web page. I tried converting HTML to XML with several tools, but I am still not getting the format I want.

The format that I want should be similar to this:

<?xml version="1.0" encoding="UTF-8"?>
<Books>
    <Book id="1">
        <title>Circumference</title>
        <author>Nicholas Nicastro</author>
        <summary>Eratosthenes and the Ancient Quest to Measure the Globe.</summary>
    </Book>
    <Book id="2">
        <title>Copernicus Secret</title>
        <author>Hyman Repcheck</author>
        <summary>How the scientific revolution began</summary>
    </Book>
</Books>

So how can I get a format like this from a webpage?

And one more thing: for anyone who knows NSXMLParser in Xcode, is this the right way to extract data from websites? I mean getting an XML file, adding it to the project's resources, and then extracting the data from it?

Accepted answer by Paaske

Well-formed HTML can be treated as XML. So if you want to extract data from a given website, you need to fetch the HTML (the source of the page), parse it "as is", and then look for the data you need.

A simple website may look like this:

<html>
  <head>
    <title>My website</title>
  </head>
  <body>
    <h1>welcome</h1>
    Text
    <p>paragraph</p>
  </body>
</html>

As you can see, this is well-formed XML. If you are interested in the <title>, parse this XML and look for the <title> tag.
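
To make that concrete, here is a minimal Swift sketch of looking for the <title> tag in the page above. It assumes Swift's XMLParser (the Swift name for NSXMLParser); the TitleFinder class and findTitle(in:) function are hypothetical names made up for this example, and the page is an inline string rather than something downloaded.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux; not needed on Apple platforms
#endif

// Hypothetical delegate that collects the text inside the first <title> element.
final class TitleFinder: NSObject, XMLParserDelegate {
    private(set) var title: String?
    private var insideTitle = false
    private var buffer = ""

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String]) {
        if elementName == "title" && title == nil {
            insideTitle = true
            buffer = ""
        }
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        // Text may arrive in several chunks, so accumulate it.
        if insideTitle { buffer += string }
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        if elementName == "title" && insideTitle {
            title = buffer.trimmingCharacters(in: .whitespacesAndNewlines)
            insideTitle = false
        }
    }
}

// The sample page from above, inlined for the example.
let page = """
<html>
  <head>
    <title>My website</title>
  </head>
  <body>
    <h1>welcome</h1>
    Text
    <p>paragraph</p>
  </body>
</html>
"""

// Returns the title text, or nil if the markup is not well-formed or has no <title>.
func findTitle(in markup: String) -> String? {
    let parser = XMLParser(data: Data(markup.utf8))
    let finder = TitleFinder()
    parser.delegate = finder
    return parser.parse() ? finder.title : nil
}

print(findTitle(in: page) ?? "no title found")
```

The same delegate pattern should work for the <Books> format in the question: watch for the element names you care about and collect the characters between the start and end callbacks.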

The problem is that browsers are not strict about the well-formedness of HTML. A missing end tag for <p> is often tolerated. An XML parser is normally not that "nice" and reports an error instead.
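
A small Swift sketch of that strictness, assuming XMLParser (NSXMLParser in Objective-C). The helper name wellFormednessError(in:) and both sample strings are made up for illustration; the exact error text depends on the platform, so the sketch only checks whether an error occurred.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux; not needed on Apple platforms
#endif

// Returns nil when the markup parses cleanly, otherwise a description of the failure.
func wellFormednessError(in markup: String) -> String? {
    let parser = XMLParser(data: Data(markup.utf8))
    // No delegate is needed just to check well-formedness.
    if parser.parse() { return nil }
    return parser.parserError?.localizedDescription ?? "unknown parse error"
}

let strictPage = "<html><body><p>paragraph</p></body></html>"
let sloppyPage = "<html><body><p>paragraph</body></html>"  // the </p> end tag is missing

for (name, markup) in [("strict", strictPage), ("sloppy", sloppyPage)] {
    if let problem = wellFormednessError(in: markup) {
        print("\(name): parse failed (\(problem))")
    } else {
        print("\(name): parsed without errors")
    }
}
```

A browser would render both pages the same way, while the parser rejects the second one. That is why real-world HTML often needs a more forgiving HTML parser or a clean XML source.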

Very often websites have RSS/Atom feeds. These are pure XML and should always be well-formed. These feeds exist to provide data that XML parsers can interpret easily.
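
As a rough illustration, this Swift sketch collects the item titles from an RSS document with XMLParser. The feed content and the FeedTitleCollector class are invented for the example; in a real app the data would be downloaded from the site's feed URL first.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux; not needed on Apple platforms
#endif

// Hypothetical delegate that gathers the <title> of every <item> in an RSS feed.
final class FeedTitleCollector: NSObject, XMLParserDelegate {
    private(set) var itemTitles: [String] = []
    private var insideItem = false
    private var insideTitle = false
    private var buffer = ""

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String]) {
        if elementName == "item" {
            insideItem = true
        } else if insideItem && elementName == "title" {
            // Only titles nested in an <item> count; the channel title is skipped.
            insideTitle = true
            buffer = ""
        }
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        if insideTitle { buffer += string }
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        if elementName == "title" && insideTitle {
            itemTitles.append(buffer.trimmingCharacters(in: .whitespacesAndNewlines))
            insideTitle = false
        } else if elementName == "item" {
            insideItem = false
        }
    }
}

// Made-up sample feed, inlined so the example is self-contained.
let sampleFeed = """
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title></item>
    <item><title>Second post</title></item>
  </channel>
</rss>
"""

let feedParser = XMLParser(data: Data(sampleFeed.utf8))
let collector = FeedTitleCollector()
feedParser.delegate = collector
let feedParsed = feedParser.parse()
print(feedParsed ? collector.itemTitles.joined(separator: ", ") : "feed could not be parsed")
```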
