Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/6239756/

Date: 2020-09-06 14:47:02  Source: igfitidea

How to create/write a simple XML parser from scratch?

Tags: xml, d, xml-parsing

Asked by XP1

Rather than code samples, I want to know what the simplified, basic steps are, in English.

How is a good parser designed? I understand that regex should not be used in a parser, but how big a role does regex play in parsing XML?

What is the recommended data structure to use? Should I use linked lists to store and retrieve nodes, attributes, and values?

I want to learn how to create an XML parser so that I can write one in the D programming language.

Answered by Michael Kay

If you don't know how to write a parser, then you need to do some reading. Get hold of any book on compiler-writing (many of the best ones were written 30 or 40 years ago, e.g. Aho and Ullman) and study the chapters on lexical analysis and syntax analysis. XML is essentially no different, except that the lexical and grammar phases are not as clearly isolated from each other as in some languages.

One word of warning: if you want to write a fully conformant XML parser, then 90% of your effort will be spent getting edge cases right in obscure corners of the spec, dealing with things such as parameter entities that most XML users aren't even aware of.

Answered by GolezTrol

There is a difference between a parser and a nodelist. The parser is the piece that takes a bunch of plain text XML and tries to determine what nodes are in there. Then there is an internal structure you save the nodes in. In a layer over that structure you find the DOM, the Document Object Model. This is a structure of nested nodes that make up your XML document. The parser only needs to know the generic DOM interface to create nodes.

I wouldn't use regex as a parser for this. I think the best thing is to just traverse the string char by char and check whether what you get matches what you should get.
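
As a rough illustration of the char-by-char approach, here is a minimal sketch in D that walks the input one character at a time and splits it into tags and text. The names `TokenKind`, `Token` and `tokenize` are made up for this example, and it assumes that `>` never appears inside a tag, so comments, CDATA sections and quoted attribute values containing `>` are not handled.

```d
/// Kinds of token this sketch recognises (hypothetical names).
enum TokenKind { tag, text }

struct Token {
    TokenKind kind;
    string value; // tag contents without the angle brackets, or raw text
}

/// Walks the input one character at a time and splits it into tags and text.
/// Throws if a '<' is never closed by a '>'.
Token[] tokenize(string input) {
    Token[] tokens;
    size_t i = 0;
    while (i < input.length) {
        if (input[i] == '<') {
            size_t start = i + 1;
            // advance until the closing '>' of this tag
            while (i < input.length && input[i] != '>') i++;
            if (i == input.length)
                throw new Exception("unterminated tag");
            tokens ~= Token(TokenKind.tag, input[start .. i]);
            i++; // skip the '>'
        } else {
            size_t start = i;
            // everything up to the next '<' is character data
            while (i < input.length && input[i] != '<') i++;
            tokens ~= Token(TokenKind.text, input[start .. i]);
        }
    }
    return tokens;
}
```

A later pass would then check that the tags it produces are what the grammar expects at each point.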

But why not use any of the existing XML parsers? There are many possibilities in encoding data. Many exceptions. And if your parser doesn't manage them all, it is hardly worth the title of XML parser.

Answered by ratchet freak

For an event-based parser, the user needs to pass it some functions (startNode(name, attrs), endNode(name) and someText(txt), likely through an interface), and the parser calls them when needed as it passes over the file.

The parser will have a while loop that alternates between reading until < and reading until >, and does the proper conversions to the parameter types:

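import std.stdio : File;
import std.string : chomp, strip;
import std.array : split;

// Callbacks the parser invokes as it passes over the file.
interface EventParser {
    void startNode(string name, string[string] attrs);
    void endNode(string name);
    void someText(string txt);
}

void parse(EventParser p, File file){
    string str;
    while((str = file.readln('<')).length != 0){
        // not using a rewritable buffer, to take advantage of slicing,
        // but converting to an implementation with a rewritable buffer is quick
        str = str.chomp("<");
        if(str.length > 0) p.someText(str);

        // read the tag up to '>'; nothing left means the file ended after the text
        str = file.readln('>').chomp(">").strip();
        if(str.length == 0) break;

        // remember the '/' markers, then drop them so they stay out of the name and attrs
        bool closing = str[0] == '/';
        if(closing) str = str[1 .. $];
        bool selfClosing = str.length > 0 && str[$-1] == '/';
        if(selfClosing) str = str[0 .. $-1];

        // split str into name and attrs (naive: quoted values containing whitespace break this)
        auto parts = str.split();
        if(parts.length == 0) continue; // "<>" or "</>": no name to report
        string name = parts[0];
        string[string] attrs;
        foreach(attribute; parts[1..$]){
            auto splitAttr = attribute.split("=");
            if(splitAttr.length == 2) attrs[splitAttr[0]] = splitAttr[1];
        }

        if(closing) p.endNode(name);
        else {
            p.startNode(name, attrs);
            if(selfClosing) p.endNode(name); // self-closing tag
        }
    }
}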


You can build a DOM parser on top of an event-based parser. The basic functionality you'll need for each node is getChildren, getParent, getName and getAttributes (with setters when building ;) ).
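
The node classes themselves are not shown in this answer. A minimal sketch of what they might look like, assuming public fields stand in for getChildren, getParent, getName and getAttributes, with appendChild as the setter used while building. The placeholder names "#root" and "#text" are made up for this example.

```d
/// Base node: every node knows its parent, its name and its children.
class DOMNode {
    DOMNode parent;
    string name;
    DOMNode[] children;

    this(DOMNode parent, string name) {
        this.parent = parent;
        this.name = name;
    }

    /// Setter used while the tree is being built.
    void appendChild(DOMNode child) {
        child.parent = this;
        children ~= child;
    }
}

/// Top node that holds the document element.
class RootNode : DOMNode {
    this() { super(null, "#root"); }
}

/// An element with its attributes.
class ElementNode : DOMNode {
    string[string] attributes;

    this(DOMNode parent, string name, string[string] attrs) {
        super(parent, name);
        attributes = attrs;
    }
}

/// A run of character data; its parent is set by appendChild.
class TextNode : DOMNode {
    string text;

    this(string text) {
        super(null, "#text");
        this.text = text;
    }
}
```

The constructor signatures are chosen to match the calls made by the DOM-building event handler that follows.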

The object for the DOM parser, with the methods described above:

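// DOMNode, RootNode, ElementNode and TextNode are assumed node classes
// offering appendChild, parent and name.
class DOMEventParser : EventParser{
    DOMNode root;     // stays at the top of the tree
    DOMNode current;  // node currently being filled
    this(){
        // allocate per instance; a `new` in a field initializer would be shared by all instances
        root = new RootNode();
        current = root;
    }
    override void startNode(string name, string[string] attrs){
        DOMNode tmp = new ElementNode(current, name, attrs);
        current.appendChild(tmp);
        current = tmp;
    }
    override void endNode(string name){
        assert(name == current.name); // the closing tag must match the open element
        current = current.parent;
    }
    override void someText(string txt){
        current.appendChild(new TextNode(txt));
    }
}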

When parsing ends, the root node will hold the root of the DOM tree.

Note: I didn't put any verification code in there to ensure the correctness of the XML.

Edit: the parsing of the attributes has a bug in it. Instead of splitting on whitespace, a regex is better for that.
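
A rough sketch of that regex approach, using D's std.regex. The helper name `parseAttributes` and the pattern are assumptions made for this example, not part of the original answer. It only accepts quoted values and does not resolve entities or reject malformed input.

```d
import std.regex : regex, matchAll;

/// Extracts name="value" or name='value' pairs from the inside of a tag.
string[string] parseAttributes(string tagBody) {
    // group 1: attribute name; group 2: double-quoted value; group 3: single-quoted value
    auto attrPattern = regex(`([^\s=/]+)\s*=\s*(?:"([^"]*)"|'([^']*)')`);
    string[string] attrs;
    foreach (m; matchAll(tagBody, attrPattern)) {
        // only one of the two value groups takes part in any given match
        attrs[m[1]] = m[2].length ? m[2] : m[3];
    }
    return attrs;
}
```

Because the quoted value is matched as a unit, whitespace inside it no longer splits the attribute, which is the case the whitespace split gets wrong.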

Answered by Julio Guerra

A parser must fit the needs of your input language. In your case, simple XML. The first thing to know about XML is that it is context-free and absolutely not ambiguous, everything is wrapped between two tokens, and this is what makes XML famous: it is easy to parse. Finally, XML is always simply represented by a tree structure. As stated, you can simply parse your XML and execute code in the meantime, or parse the XML, generating the tree, and then execute code according to this tree.

D provides a very interesting way to write an XML parser very easily (the handlers below are in the style of the std.xml module), for example:

doc.onStartTag["pointlight"] = (ElementParser xml)
{
  debug writefln("Parsing pointlight element");

  auto l = new DistantLight(to!int(xml.tag.attr["x"]),
                            to!int(xml.tag.attr["y"]),
                            to!int(xml.tag.attr["z"]),
                            to!ubyte(xml.tag.attr["red"]),
                            to!ubyte(xml.tag.attr["green"]),
                            to!ubyte(xml.tag.attr["blue"]));
  lights ~= l;

  xml.parse();
};

Answered by Samuel Lampa

Since D is rather closely related to Java, maybe generating an XML parser with ANTLR (there are most probably XML EBNF grammars for ANTLR already, which you could then use) and then converting the generated Java parser code to D could be an option? At least that would give you a starting point, and you could then put some effort into optimizing the code specifically for D ...

At least ANTLR is not at all as hard as many seem to think. I got started after knowing nothing about it, by watching 3-4 of this great set of screencasts on ANTLR.

Btw, I found ANTLRWorks a breeze to work with (as opposed to the Eclipse plugin used in the screencast ... but the screencast content applies anyway).

Just my 0.02c.

Answered by Mauve Ranger

The first element in the document should be the prolog. This states the xml version, the encoding, whether the file is standalone, and maybe some other stuff. The prolog opens with <?.

After the prolog, there are tags with metadata. The special tags, like comments, doctypes, and element definitions, should start with <!. Processing instructions start with <?. It is possible to have nested tags here, as the <!DOCTYPE tag can have <!ELEMENT and <!ATTLIST tags in a DTD-style XML document; see Wikipedia for a thorough example.

There should be exactly one top level element. It's the only one without a <! or a <? preceding it. There may be more metadata tags after the top level element; process those first.

For the explicit parsing: First identify tags (they all start with <), then determine what kind of tag it is and what its closure looks like. <!-- is a comment tag, and cannot have -- anywhere except at its end. <? ends with ?>. <! ends with >. To repeat: <!DOCTYPE can have tags nested before its closure, and there may be other nested tags I don't know of.
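
The classification step above could be sketched in D roughly as follows. `TagKind`, `classify` and `closer` are hypothetical names for this example. It assumes the input starts at a `<`, and it does not handle CDATA sections or the declarations nested inside a <!DOCTYPE.

```d
import std.algorithm.searching : startsWith;

/// Kinds of markup distinguished by their opening characters.
enum TagKind { comment, processingInstruction, declaration, closing, element }

/// Decides what kind of markup begins at the start of `s` (which must begin with '<').
TagKind classify(string s) {
    // test the longest prefix first: "<!--" also starts with "<!"
    if (s.startsWith("<!--")) return TagKind.comment;
    if (s.startsWith("<?"))   return TagKind.processingInstruction;
    if (s.startsWith("<!"))   return TagKind.declaration;
    if (s.startsWith("</"))   return TagKind.closing;
    return TagKind.element;
}

/// The delimiter that ends each kind of markup.
string closer(TagKind kind) {
    switch (kind) {
        case TagKind.comment:               return "-->";
        case TagKind.processingInstruction: return "?>";
        default:                            return ">";
    }
}
```

With the kind and its closing delimiter known, the parser can scan forward for that delimiter before deciding how to treat the contents.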

Once you find a tag, you'll want to find its closing tag. Check if the tag is self closing first; otherwise, find its closure.
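
One common way to pair opening and closing tags is a stack of open element names. The sketch below assumes the tags have already been reduced to bare names ("a", "/a", "br/") with attributes stripped; `isBalanced` is a made-up helper name.

```d
/// Checks that the given tag bodies (the text between '<' and '>') nest properly.
bool isBalanced(string[] tags) {
    string[] open; // stack of currently open element names
    foreach (t; tags) {
        if (t.length == 0) return false;
        if (t[$ - 1] == '/') continue;        // self-closing: nothing to match
        if (t[0] == '/') {
            // a closing tag must match the most recently opened element
            if (open.length == 0 || open[$ - 1] != t[1 .. $]) return false;
            open = open[0 .. $ - 1];          // pop
        } else {
            open ~= t;                        // push
        }
    }
    return open.length == 0;                  // everything opened was closed
}
```

For example, the sequence a, b, /b, br/, /a is accepted, while a, b, /a, /b is rejected because /a arrives while b is still open.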

For data structures: I would recommend a tree structure, where each element is a node, and each node has an indexed/mapped list of subelements.
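
A minimal sketch of such a tree node in D, assuming children are kept both in document order and in a map keyed by tag name. `Node`, `add` and `childrenNamed` are hypothetical names for this example.

```d
/// One element of the tree. Children are kept in document order and also
/// indexed by tag name for direct lookup.
class Node {
    string name;
    string[string] attributes;
    string text;
    Node[] children;        // document order
    Node[][string] byName;  // tag name -> children with that name

    this(string name) { this.name = name; }

    /// Appends a child and records it in the name index.
    Node add(Node child) {
        children ~= child;
        byName[child.name] ~= child;
        return child;
    }

    /// All direct children with the given tag name (empty if there are none).
    Node[] childrenNamed(string tag) {
        if (auto found = tag in byName) return *found;
        return null;
    }
}
```

Keeping both the ordered array and the map costs some memory but allows lookup by name without scanning every child.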

Obviously, a full parser will require a lot more research; I hope this is enough to get you started.
