Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors on Stack Overflow (not to this page).
Original question: http://stackoverflow.com/questions/20421316/
What does HTML Parsing mean?
Asked by LightningBolt
I have heard of HTML Parser libraries like Simple HTML DOM and HTML Parser. I have also heard of questions containing HTML Parsing. What does it mean to parse HTML?
Answered by Anshu Dwibhashi
Contrary to what Spudley said, parsing basically means resolving something (a sentence) into its component parts and describing their syntactic roles.
According to Wikipedia, parsing or syntactic analysis is the process of analysing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).
In your case, HTML parsing is basically: taking in HTML code and extracting relevant information like the title of the page, paragraphs in the page, headings in the page, links, bold text etc.
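As a rough illustration of "taking in HTML code and extracting relevant information", here is a minimal sketch built on Python 3's standard html.parser module. The names PageInfoParser and extract_page_info are hypothetical helpers made up for this example, and the sketch assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser


class PageInfoParser(HTMLParser):
    """Collects the title, headings and link targets of a page (hypothetical helper)."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.links = []
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADINGS:
            self._current = tag
            if tag != "title":
                self.headings.append("")  # start a new, empty heading
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current in self.HEADINGS:
            self.headings[-1] += data


def extract_page_info(html):
    """Parse the HTML text and return the collected pieces as a dict."""
    parser = PageInfoParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "headings": parser.headings, "links": parser.links}
```

Calling extract_page_info on a page's source returns a dictionary with the keys "title", "headings" and "links". Real-world pages are often malformed, so a dedicated library may cope better than this sketch.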
Parsers:
A computer program that parses content is called a parser. There are in general 2 kinds of parsers:
Top-down parsing: Top-down parsing can be viewed as an attempt to find left-most derivations of an input stream by searching for parse trees using a top-down expansion of the given formal grammar rules. Tokens are consumed from left to right. Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand sides of grammar rules.
Bottom-up parsing: A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. LR parsers are examples of bottom-up parsers. Another term used for this type of parser is shift-reduce parsing.
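To make the top-down idea concrete, here is a minimal sketch of a hand-written recursive descent parser, one common way to parse top-down. The small arithmetic grammar, the tuple-based tree and the names tokenize, Parser and parse are all assumptions chosen for illustration; they do not come from any particular library.

```python
import re

# Assumed grammar for this sketch:
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'
TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split the input into number and operator tokens, left to right."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        token = self.peek()
        if token is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return token

    def expr(self):
        # Start from the top rule and expand downwards, one method per rule.
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.next()
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.next()
            node = (op, node, self.factor())
        return node

    def factor(self):
        token = self.next()
        if token == "(":
            node = self.expr()
            if self.next() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def parse(text):
    """Return a nested-tuple parse tree such as ('+', 1, 2)."""
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

For example, parse("1 + 2 * (3 - 4)") yields ('+', 1, ('*', 2, ('-', 3, 4))). Each grammar rule becomes one method, which is why the call stack mirrors the parse tree from the root down.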
A few example parsers:
Top-down parsers:
Bottom-up parsers:
- Precedence parser
- BC (bounded context) parsing
- LR parser (Left-to-right, Rightmost derivation)
- Simple LR (SLR) parser
- LALR parser
- Canonical LR (LR(1)) parser
- GLR parser
- CYK parser
- Recursive ascent parser
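As a rough illustration of the bottom-up family, here is a small shift-reduce sketch in the style of an operator-precedence parser. It is only one simple member of the family and not an LR parser. The function name parse_bottom_up, the precedence table and the tuple-based tree are assumptions made for this example, and the input is assumed to be a well-formed arithmetic expression.

```python
import re

# Assumed precedence table: a higher number binds more tightly.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def parse_bottom_up(text):
    """Build a nested-tuple tree from the leaves upwards (well-formed input assumed)."""
    tokens = re.findall(r"\d+|[-+*/()]", text)
    operands = []   # stack of finished subtrees
    operators = []  # stack of operators and "(" still waiting for operands

    def reduce():
        # Replace the two topmost subtrees and the top operator by one subtree.
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    for token in tokens:
        if token.isdigit():
            operands.append(int(token))   # shift a number
        elif token == "(":
            operators.append(token)       # shift "("
        elif token == ")":
            while operators[-1] != "(":
                reduce()
            operators.pop()               # discard the matching "("
        else:
            # Reduce while the stacked operator binds at least as tightly
            # as the incoming one (this makes operators left-associative).
            while (operators and operators[-1] != "("
                   and PRECEDENCE[operators[-1]] >= PRECEDENCE[token]):
                reduce()
            operators.append(token)       # shift the operator
    while operators:
        reduce()
    return operands[0]
```

Here the smallest pieces (numbers) are recognised first and are then combined into larger and larger subtrees, which is the reverse of the top-down direction.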
Example parser:
Here's an example HTML parser in Python:
from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
Here's the output:
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
Answered by Spudley
Parsing in general applies to any computer language, and is the process of taking the code as text and producing a structure in memory that the computer can understand and work with.
Specifically for HTML, HTML parsing is the process of taking raw HTML code, reading it, and generating a DOM tree object structure from it.
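To give a feel for what "generating a DOM tree object structure" could look like, here is a minimal sketch that turns HTML text into a tree of nodes using Python 3's standard html.parser module. The Node and TreeBuilder classes and the build_tree and dump functions are hypothetical names for this example. A real browser DOM is far richer and follows the error-recovery rules of the HTML standard, which this sketch does not attempt.

```python
from html.parser import HTMLParser


class Node:
    """A very small stand-in for a DOM element (hypothetical, not a real DOM API)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Node objects and text strings


class TreeBuilder(HTMLParser):
    VOID = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb back up to the parent of the nearest open element with this tag;
        # an end tag that matches nothing is ignored.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.current.children.append(data)


def build_tree(html):
    """Parse the HTML text and return the root node of the resulting tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def dump(node, depth=0):
    """Return the tree as an indented outline, one list entry per node."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        if isinstance(child, Node):
            lines.extend(dump(child, depth + 1))
        else:
            lines.append("  " * (depth + 1) + repr(child))
    return lines
```

Once the tree exists, a program can walk it (for example with print("\n".join(dump(build_tree(html))))) instead of working with the raw text, which is the practical benefit of parsing into a structure.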