HTML Tag Parsing
Warning: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/14348346/
Asked by kobik
How can I parse the Name: and Value text from within the tag with DIHtmlParser? I tried doing it with TCLHtmlParser from Clever Components, but it failed. My second question is whether DIHtmlParser can parse individual tags, for example by looping through a tag's sub tags. It's a total nightmare for such a simple problem.
<div class="tvRow tvFirst hasLabel tvFirst" title="example1">
<label class="tvLabel">Name:</label>
<span class="tvValue">Value</span>
<div class="clear"></div></div>
<div class="tvRow tvFirst hasLabel tvFirst" title="example2">
<label class="tvLabel">Name:</label>
<span class="tvValue">Value</span>
<div class="clear"></div></div>
Answer by kobik
You could use the IHTMLDocument2 DOM to parse whatever elements you need from the HTML:
uses ActiveX, MSHTML;

const
  HTML =
    '<div class="tvRow tvFirst hasLabel tvFirst" title="example1">' +
    '<label class="tvLabel">Name:</label>' +
    '<span class="tvValue">Value</span>' +
    '<div class="clear"></div>' +
    '</div>';

procedure TForm1.Button1Click(Sender: TObject);
var
  doc: OleVariant;
  el: OleVariant;
  i: Integer;
begin
  // Late-bound (OleVariant) access to the MSHTML document object
  doc := coHTMLDocument.Create as IHTMLDocument2;
  doc.write(HTML);
  doc.close;
  ShowMessage(doc.body.innerHTML);
  // Walk every element under <body>; tag names are compared in upper case
  for i := 0 to doc.body.all.length - 1 do
  begin
    el := doc.body.all.item(i);
    if (el.tagName = 'LABEL') and (el.className = 'tvLabel') then
      ShowMessage(el.innerText);
    if (el.tagName = 'SPAN') and (el.className = 'tvValue') then
      ShowMessage(el.innerText);
  end;
end;
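The question also asks about looping through the sub tags of each row. A minimal sketch of that idea follows, reusing only the late-bound members from the answer above plus the `all` and `title` members of each row element. It is an illustration, not part of the original answer: the program name `TvRowPairs`, the procedure `PrintRows`, and the choice of a console program with an explicit `CoInitialize` call are all assumptions.

```pascal
program TvRowPairs; // hypothetical name, for illustration only

{$APPTYPE CONSOLE}

uses
  ActiveX, ComObj, MSHTML, Variants; // ComObj enables late-bound OleVariant calls

const
  HTML =
    '<div class="tvRow tvFirst hasLabel tvFirst" title="example1">' +
    '<label class="tvLabel">Name:</label>' +
    '<span class="tvValue">Value</span>' +
    '<div class="clear"></div>' +
    '</div>';

// Hypothetical helper: prints title, label text and value text for each tvRow div.
procedure PrintRows(const Source: string);
var
  doc, row, children, child: OleVariant;
  i, j: Integer;
  cls, tag, rowTitle, LabelText, ValueText: string;
begin
  doc := coHTMLDocument.Create as IHTMLDocument2;
  doc.write(Source);
  doc.close;
  for i := 0 to doc.body.all.length - 1 do
  begin
    row := doc.body.all.item(i);
    tag := row.tagName;
    cls := row.className;
    // className holds the whole class attribute, so look for the tvRow token
    if (tag <> 'DIV') or (Pos(' tvRow ', ' ' + cls + ' ') = 0) then
      Continue;
    rowTitle := row.title;
    LabelText := '';
    ValueText := '';
    // row.all lists the elements nested inside this row only
    children := row.all;
    for j := 0 to children.length - 1 do
    begin
      child := children.item(j);
      tag := child.tagName;
      if tag = 'LABEL' then
        LabelText := child.innerText
      else if tag = 'SPAN' then
        ValueText := child.innerText;
    end;
    Writeln(rowTitle, ' | ', LabelText, ' | ', ValueText);
  end;
end;

begin
  CoInitialize(nil); // assumption: COM is not yet initialised on this thread
  try
    PrintRows(HTML);
  finally
    CoUninitialize;
  end;
end.
```

Scoping the inner loop to `row.all` keeps each label paired with the value from the same row, which the flat loop over `doc.body.all` does not guarantee once several rows are present.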
I wanted to mention another very nice HTML parser I found today: htmlp (Delphi Dom HTML Parser and Converter). Obviously it's not as flexible as IHTMLDocument2, but it's very easy to work with, fast, free, and supports Unicode in older Delphi versions.
Sample usage:
uses HtmlParser, DomCore;

// Returns the <body> element of the parsed document, or nil if there is none
function GetDocBody(HtmlDoc: TDocument): TElement;
var
  i: integer;
  node: TNode;
begin
  Result := nil;
  for i := 0 to HtmlDoc.documentElement.childNodes.length - 1 do
  begin
    node := HtmlDoc.documentElement.childNodes.item(i);
    if node.nodeName = 'body' then
    begin
      Result := node as TElement;
      Break;
    end;
  end;
end;

procedure THTMLForm.Button2Click(Sender: TObject);
var
  HtmlParser: THtmlParser;
  HtmlDoc: TDocument;
  i: Integer;
  body, el: TElement;
  node: TNode;
begin
  HtmlParser := THtmlParser.Create;
  try
    // HTML is the string constant from the IHTMLDocument2 example
    HtmlDoc := HtmlParser.parseString(HTML);
    try
      body := GetDocBody(HtmlDoc);
      if Assigned(body) then
        for i := 0 to body.childNodes.length - 1 do
        begin
          node := body.childNodes.item(i);
          if (node is TElement) then
          begin
            el := node as TElement;
            if (el.tagName = 'div') and (el.GetAttribute('class') = 'tvRow tvFirst hasLabel tvFirst') then
            begin
              // iterate el.childNodes here...
              ShowMessage(IntToStr(el.childNodes.length));
            end;
          end;
        end;
    finally
      HtmlDoc.Free;
    end;
  finally
    HtmlParser.Free;
  end;
end;
Answer by Sir Rufo
Use an HTML parser to work on your HTML files.
Maybe DIHtmlParser will do the job.
RegEx is not a parser, and converting from HTML to JSON is not a wise option.
Answer by Ivelin Nikolaev
One can also use a combination of the HTMLP parser with THtmlFormatter and OXml XPath parsing:
uses
  // RTL (Exception, Format, TEncoding, TFile)
  System.SysUtils,
  System.IOUtils,
  // Htmlp
  HtmlParser,
  DomCore,
  Formatter,
  // OXml
  OXmlPDOM,
  OXmlUtils;

function HtmlToXHtml(const Html: string): string;
var
  HtmlParser: THtmlParser;
  HtmlDoc: TDocument;
  Formatter: THtmlFormatter;
begin
  HtmlParser := THtmlParser.Create;
  try
    HtmlDoc := HtmlParser.ParseString(Html);
    try
      Formatter := THtmlFormatter.Create;
      try
        Result := Formatter.GetText(HtmlDoc);
      finally
        Formatter.Free;
      end;
    finally
      HtmlDoc.Free;
    end;
  finally
    HtmlParser.Free;
  end;
end;

type
  TCard = record
    Store: string;
    Quality: string;
    Quantity: string;
    Price: string;
  end;
  TCards = array of TCard;

function ParseCard(const Node: PXMLNode): TCard;
const
  StoreXPath = 'div[1]/ax';
  QualityXPath = 'div[3]';
  QuantityXPath = 'div[4]';
  PriceXPath = 'div[5]';
var
  CurrentNode: PXMLNode;
begin
  Result := Default(TCard);
  if Node.SelectNode(StoreXPath, CurrentNode) then
    Result.Store := CurrentNode.Text;
  if Node.SelectNode(QualityXPath, CurrentNode) then
    Result.Quality := CurrentNode.Text;
  if Node.SelectNode(QuantityXPath, CurrentNode) then
    Result.Quantity := CurrentNode.Text;
  if Node.SelectNode(PriceXPath, CurrentNode) then
    Result.Price := CurrentNode.Text;
end;

procedure THTMLForm.OpenButtonClick(Sender: TObject);
var
  Html: string;
  Xml: string;
  FXmlDocument: IXMLDocument;
  QueryNode: PXMLNode;
  XPath: string;
  NodeList: IXMLNodeList;
  i: Integer;
  Card: TCard;
begin
  // Convert the HTML file to XHTML so that an XML parser can load it
  Html := System.IOUtils.TFile.ReadAllText(FileNameEdit.Text, TEncoding.UTF8);
  Xml := HtmlToXHtml(Html);
  Memo.Lines.Text := Xml;

  // Parse with XPath
  FXmlDocument := CreateXMLDoc;
  FXmlDocument.WriterSettings.IndentType := itIndent;
  if not FXmlDocument.LoadFromXML(Xml) then
    raise Exception.Create('Source document is not valid');
  QueryNode := FXmlDocument.DocumentElement;
  XPath := '//div[@class="row pricetableline"]';
  NodeList := QueryNode.SelectNodes(XPath);
  // Append one line per matching row to the memo
  for i := 0 to NodeList.Count - 1 do
  begin
    Card := ParseCard(NodeList[i]);
    Memo.Lines.Text := Memo.Lines.Text + sLineBreak +
      Format('%0:s %1:s %2:s %3:s', [Card.Store, Card.Quality, Card.Quantity, Card.Price]);
  end;
  Memo.SelStart := 0;
  Memo.SelLength := 0;
end;