HTML tag parsing

Question

Asked by kobik

How can I parse the Name: and Value text from within the tags below using DIHtmlParser? I tried TCLHtmlParser from Clever Components, but it failed. My second question: can DIHtmlParser parse an individual tag, for example by looping through its sub-tags? This is a total nightmare for such a simple problem.

<div class="tvRow tvFirst hasLabel tvFirst" title="example1">
  <label class="tvLabel">Name:</label>
  <span class="tvValue">Value</span>
<div class="clear"></div></div>

<div class="tvRow tvFirst hasLabel tvFirst" title="example2">
  <label class="tvLabel">Name:</label>
  <span class="tvValue">Value</span>
<div class="clear"></div></div>

Answer 1

Answered by kobik

You could use the IHTMLDocument2 DOM to parse whatever elements you need from the HTML:

uses ActiveX, MSHTML;

const
  HTML =
  '<div class="tvRow tvFirst hasLabel tvFirst" title="example1">' +
  '<label class="tvLabel">Name:</label>' +
  '<span class="tvValue">Value</span>' +
  '<div class="clear"></div>' +
  '</div>';

procedure TForm1.Button1Click(Sender: TObject);
var
  doc: OleVariant;
  el: OleVariant;
  i: Integer;
begin
  doc := coHTMLDocument.Create as IHTMLDocument2;
  doc.write(HTML);
  doc.close;
  ShowMessage(doc.body.innerHTML);
  for i := 0 to doc.body.all.length - 1 do
  begin
    el := doc.body.all.item(i);
    if (el.tagName = 'LABEL') and (el.className = 'tvLabel') then
      ShowMessage(el.innerText);
    if (el.tagName = 'SPAN') and (el.className = 'tvValue') then
      ShowMessage(el.innerText);
  end;
end;
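
The question also asks about looping through the sub-tags of a single tag. A minimal sketch of that, assuming the same late-bound OleVariant approach as the code above: it walks every div whose class starts with tvRow and reads its direct children through the MSHTML children collection. The procedure name ShowRows and the local variable names are hypothetical, not part of the original answer.

```delphi
uses SysUtils, Dialogs, ActiveX, MSHTML;

// Hypothetical helper: shows title, label and value for every tvRow div.
procedure ShowRows(const AHtml: string);
var
  doc, row, child: OleVariant;
  i, j: Integer;
  Cls, Title, LabelText, ValueText: string;
begin
  doc := coHTMLDocument.Create as IHTMLDocument2;
  doc.write(AHtml);
  doc.close;
  for i := 0 to doc.body.all.length - 1 do
  begin
    row := doc.body.all.item(i);
    Cls := row.className;
    // className holds the whole class attribute, so match on its prefix
    if (row.tagName = 'DIV') and (Pos('tvRow', Cls) = 1) then
    begin
      Title := row.title;
      LabelText := '';
      ValueText := '';
      // children contains only the direct child elements of this row
      for j := 0 to row.children.length - 1 do
      begin
        child := row.children.item(j);
        if child.className = 'tvLabel' then
          LabelText := child.innerText
        else if child.className = 'tvValue' then
          ValueText := child.innerText;
      end;
      ShowMessage(Format('%s: %s %s', [Title, LabelText, ValueText]));
    end;
  end;
end;
```

Calling ShowRows(HTML) from the button handler should show one message per row, which keeps each label paired with the value from the same div.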

I also want to mention another very nice HTML parser I found today: htmlp (Delphi Dom HTML Parser and Converter). It is obviously not as flexible as IHTMLDocument2, but it is very easy to work with, fast, free, and supports Unicode in older Delphi versions.

Sample usage:

uses HtmlParser, DomCore;

function GetDocBody(HtmlDoc: TDocument): TElement;
var
  i: integer;
  node: TNode;
begin
  Result := nil;
  for i := 0 to HtmlDoc.documentElement.childNodes.length - 1 do
  begin
    node := HtmlDoc.documentElement.childNodes.item(i);
    if node.nodeName = 'body' then
    begin
      Result := node as TElement;
      Break;
    end;
  end;
end;

procedure THTMLForm.Button2Click(Sender: TObject);
var
  HtmlParser: THtmlParser;
  HtmlDoc: TDocument;
  i: Integer;
  body, el: TElement;
  node: TNode;
begin
  HtmlParser := THtmlParser.Create;
  try
    HtmlDoc := HtmlParser.parseString(HTML);
    try
      body := GetDocBody(HtmlDoc);
      if Assigned(body) then
        for i := 0 to body.childNodes.length - 1 do
        begin
          node := body.childNodes.item(i);
          if (node is TElement) then
          begin
            el := node as TElement;
            if (el.tagName = 'div') and (el.GetAttribute('class') = 'tvRow tvFirst hasLabel tvFirst') then
            begin
              // iterate el.childNodes here...
              ShowMessage(IntToStr(el.childNodes.length));
            end;
          end;
        end;
    finally
      HtmlDoc.Free;
    end;
  finally
    HtmlParser.Free
  end;
end;
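
The "iterate el.childNodes here" comment above could be filled in along these lines. This is only a sketch: ShowRowChildren is a hypothetical helper, and it assumes that htmlp's TNode exposes the usual DOM members firstChild and nodeValue in addition to the childNodes, tagName and GetAttribute calls already used above, and that an element's text sits in its first child text node.

```delphi
uses SysUtils, Dialogs, HtmlParser, DomCore;

// Hypothetical helper: shows the text of each element directly under a row.
procedure ShowRowChildren(row: TElement);
var
  i: Integer;
  node: TNode;
  child: TElement;
begin
  for i := 0 to row.childNodes.length - 1 do
  begin
    node := row.childNodes.item(i);
    // Skip text nodes (for example whitespace between the tags)
    if not (node is TElement) then
      Continue;
    child := node as TElement;
    // Assumption: the element's text is stored in its first child node;
    // empty elements such as <div class="clear"></div> are skipped
    if Assigned(child.firstChild) then
      ShowMessage(child.tagName + ' [' + child.GetAttribute('class') + ']: ' +
        child.firstChild.nodeValue);
  end;
end;
```

It would be called as ShowRowChildren(el) at the point where the code above counts el.childNodes.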

Answer 2

Answered by Sir Rufo

Use an HTML parser to work on your HTML files.

Maybe DIHtmlParser will do the job.

RegEx is not a parser, and converting from HTML to JSON is not a wise option.

Answer 3

Answered by Ivelin Nikolaev

One can also use a combination of the HTMLP parser with THtmlFormatter and OXml XPath parsing:

uses
  // RTL (Exception, Format, TEncoding, TFile)
  System.SysUtils,
  System.IOUtils,
  // Htmlp
  HtmlParser,
  DomCore,
  Formatter,
  // OXml
  OXmlPDOM,
  OXmlUtils;

function HtmlToXHtml(const Html: string): string;
var
  HtmlParser: THtmlParser;
  HtmlDoc: TDocument;
  Formatter: THtmlFormatter;
begin
  HtmlParser := THtmlParser.Create;
  try
    HtmlDoc := HtmlParser.ParseString(Html);
    try
      Formatter := THtmlFormatter.Create;
      try
        Result := Formatter.GetText(HtmlDoc);
      finally
        Formatter.Free;
      end;
    finally
      HtmlDoc.Free;
    end;
  finally
    HtmlParser.Free;
  end;
end;

type
  TCard = record
    Store: string;
    Quality: string;
    Quantity: string;
    Price: string;
  end;
  TCards = array of TCard;

function ParseCard(const Node: PXMLNode): TCard;
const
  StoreXPath = 'div[1]/ax';
  QualityXPath = 'div[3]';
  QuantityXPath = 'div[4]';
  PriceXPath = 'div[5]';
var
  CurrentNode: PXMLNode;
begin
  Result := Default(TCard);
  if Node.SelectNode(StoreXPath, CurrentNode) then
     Result.Store := CurrentNode.Text;
  if Node.SelectNode(QualityXPath, CurrentNode) then
     Result.Quality := CurrentNode.Text;
  if Node.SelectNode(QuantityXPath, CurrentNode) then
     Result.Quantity := CurrentNode.Text;
  if Node.SelectNode(PriceXPath, CurrentNode) then
     Result.Price := CurrentNode.Text;
end;

procedure THTMLForm.OpenButtonClick(Sender: TObject);
var
  Html: string;
  Xml: string;
  FXmlDocument: IXMLDocument;
  QueryNode: PXMLNode;
  XPath: string;
  NodeList: IXMLNodeList;
  i: Integer;
  Card: TCard;
begin
  Html := System.IOUtils.TFile.ReadAllText(FileNameEdit.Text, TEncoding.UTF8);
  Xml := HtmlToXHtml(Html);
  Memo.Lines.Text := Xml;

  // Parse with XPath
  FXMLDocument := CreateXMLDoc;
  FXMLDocument.WriterSettings.IndentType := itIndent;
  if not FXMLDocument.LoadFromXML(Xml) then
    raise Exception.Create('Source document is not valid');
  QueryNode := FXmlDocument.DocumentElement;
  XPath := '//div[@class="row pricetableline"]';
  NodeList := QueryNode.SelectNodes(XPath);
  for i := 0 to NodeList.Count -1 do
  begin
    Card := ParseCard(NodeList[i]);
    Memo.Lines.Text := Memo.Lines.Text + sLineBreak +
      Format('%0:s %1:s %2:s %3:s', [Card.Store, Card.Quality, Card.Quantity, Card.Price]);
  end;

  Memo.SelStart := 0;
  Memo.SelLength := 0;
end;
