HTML: Regular expression pattern not matching anywhere in string

Question

Asked by Salman

I am trying to match <input> type "hidden" fields using this pattern:

/<input type="hidden" name="([^"]*?)" value="([^"]*?)" />/

This is sample form data:

<input type="hidden" name="SaveRequired" value="False" /><input type="hidden" name="__VIEWSTATE1" value="1H4sIAAtzrkX7QfL5VEGj6nGi+nP" /><input type="hidden" name="__VIEWSTATE2" value="0351118MK" /><input type="hidden" name="__VIEWSTATE3" value="ZVVV91yjY" /><input type="hidden" name="__VIEWSTATE0" value="3" /><input type="hidden" name="__VIEWSTATE" value="" /><input type="hidden" name="__VIEWSTATE" value="" />

But I am not sure that the type, name, and value attributes will always appear in the same order. If the type attribute comes last, the match will fail because in my pattern it's at the start.

Question:
How can I change my pattern so it will match regardless of the positions of the attributes in the <input> tag?

P.S.: By the way, I am using the Adobe AIR-based RegEx Desktop Tool for testing regular expressions.

Answer 1

Accepted answer by Platinum Azure

Contrary to all the answers here, for what you're trying to do regex is a perfectly valid solution. This is because you are NOT trying to match balanced tags-- THAT would be impossible with regex! But you are only matching what's in one tag, and that's perfectly regular.

Here's the problem, though. You can't do it with just one regex... you need to do one match to capture an <input> tag, then do further processing on that. Note that this will only work if none of the attribute values have a > character in them, so it's not perfect, but it should suffice for sane inputs.

Here's some Perl (pseudo)code to show you what I mean:

my $html = readLargeInputFile();

my @input_tags = $html =~ m/
    (
        <input                      # Starts with "<input"
        (?=[^>]*?type="hidden")     # Use lookahead to make sure that type="hidden"
        [^>]+                       # Grab the rest of the tag...
        \/>                         # ...except for the />, which is grabbed here
    )/xgm;

# Now each member of @input_tags is something like <input type="hidden" name="SaveRequired" value="False" />

foreach my $input_tag (@input_tags)
{
  my $hash_ref = {};
  # Now extract each of the fields one at a time.

  ($hash_ref->{"name"}) = $input_tag =~ /name="([^"]*)"/;
  ($hash_ref->{"value"}) = $input_tag =~ /value="([^"]*)"/;

  # Put $hash_ref in a list or something, or otherwise process it
}

The basic principle here is, don't try to do too much with one regular expression. As you noticed, regular expressions enforce a certain amount of order. So what you need to do instead is to first match the CONTEXT of what you're trying to extract, then do submatching on the data you want.
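
As a runnable illustration of that two-step idea, here is a minimal sketch. It is not part of the original answer: the sample string is made up from the question's fields with the attribute order shuffled, and it assumes double-quoted attribute values with no > inside them.

```perl
use strict;
use warnings;

# Hypothetical sample: the question's fields with the attribute order shuffled.
my $html = join "",
    '<input type="hidden" name="SaveRequired" value="False" />',
    '<input name="__VIEWSTATE1" value="1H4sIAAtzrkX7QfL5VEGj6nGi+nP" type="hidden" />',
    '<input type="text" name="visible" value="skip me" />',
    '<input value="" type="hidden" name="__VIEWSTATE" />';

# Step 1: match the context, i.e. whole <input> tags that have
# type="hidden" somewhere before the closing >.
my @input_tags = $html =~ m{ ( <input \b (?= [^>]* \b type="hidden" ) [^>]* > ) }xg;

# Step 2: submatch on each captured tag, one attribute at a time,
# so the order of the attributes no longer matters.
my @fields;
for my $input_tag (@input_tags) {
    my ($name)  = $input_tag =~ /\bname="([^"]*)"/;
    my ($value) = $input_tag =~ /\bvalue="([^"]*)"/;
    push @fields, { name => $name, value => $value };
}

printf "%s => '%s'\n", $_->{name}, $_->{value} for @fields;
```

The text input is skipped because the lookahead cannot see past its own closing >, so it never finds type="hidden" inside that tag.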

EDIT: However, I will agree that in general, using an HTML parser is probably easier and better and you really should consider redesigning your code or re-examining your objectives. :-) But I had to post this answer as a counter to the knee-jerk reaction that parsing any subset of HTML is impossible: HTML and XML are both irregular when you consider the entire specification, but the specification of a tag is decently regular, certainly within the power of PCRE.

Answer 2

Answered by tchrist

Oh Yes You Can Use Regexes to Parse HTML!

For the task you are attempting, regexes are perfectly fine!

It is true that most people underestimate the difficulty of parsing HTML with regular expressions and therefore do so poorly.

But this is not some fundamental flaw related to computational theory. That silliness is parroted a lot around here, but don't you believe them.

So while it certainly can be done (this posting serves as an existence proof of this incontrovertible fact), that doesn't mean it should be.

You must decide for yourself whether you're up to the task of writing what amounts to a dedicated, special-purpose HTML parser out of regexes. Most people are not.

But I am.

General Regex-Based HTML Parsing Solutions

First I'll show how easy it is to parse arbitrary HTML with regexes. The full program's at the end of this posting, but the heart of the parser is:

for (;;) {
  given ($html) {
    last                    when (pos || 0) >= length;
    printf "\@%d=",              (pos || 0);
    print  "doctype "   when / \G (?&doctype)  $RX_SUBS  /xgc;
    print  "cdata "     when / \G (?&cdata)    $RX_SUBS  /xgc;
    print  "xml "       when / \G (?&xml)      $RX_SUBS  /xgc;
    print  "xhook "     when / \G (?&xhook)    $RX_SUBS  /xgc;
    print  "script "    when / \G (?&script)   $RX_SUBS  /xgc;
    print  "style "     when / \G (?&style)    $RX_SUBS  /xgc;
    print  "comment "   when / \G (?&comment)  $RX_SUBS  /xgc;
    print  "tag "       when / \G (?&tag)      $RX_SUBS  /xgc;
    print  "untag "     when / \G (?&untag)    $RX_SUBS  /xgc;
    print  "nasty "     when / \G (?&nasty)    $RX_SUBS  /xgc;
    print  "text "      when / \G (?&nontag)   $RX_SUBS  /xgc;
    default {
      die "UNCLASSIFIED: " .
        substr($_, pos || 0, (length > 65) ? 65 : length);
    }
  }
}

See how easy that is to read?

As written, it identifies each piece of HTML and tells where it found that piece. You could easily modify it to do whatever else you want with any given type of piece, or for more particular types than these.
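
The mechanism behind that loop is a \G-anchored match with the /gc flags: each pattern is tried at the current position, and a failed attempt leaves pos() where it was. The following is a minimal sketch of that technique only, not the author's program: the four token patterns are deliberately simplified stand-ins, the sample string is invented, and a plain loop replaces given/when.

```perl
use strict;
use warnings;

# Hypothetical input for illustration.
my $html = '<!-- note --><p class="x">Hi</p>';

# Simplified token patterns (assumptions, far cruder than the real lexer).
# Order matters: a comment must be tried before a generic tag.
my @token_types = (
    [ comment => qr{<!--.*?-->}s ],
    [ untag   => qr{</\w+\s*>}   ],
    [ tag     => qr{<\w+(?:\s+[\w:-]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w:-]+))?)*\s*/?>} ],
    [ text    => qr{[^<]+}       ],
);

my @tokens;
pos($html) = 0;
TOKEN: while (pos($html) < length $html) {
    my $start = pos $html;
    for my $type (@token_types) {
        my ($label, $rx) = @$type;
        # \G anchors the match at pos(); /c keeps pos() unchanged on failure.
        if ($html =~ /\G$rx/gc) {
            push @tokens, [ $start, $label ];
            next TOKEN;
        }
    }
    die "UNCLASSIFIED at offset $start";
}

print join(" ", map { "\@$_->[0]=$_->[1]" } @tokens), "\n";
```

Because every pattern consumes at least one character, the loop always advances or dies, which is the same guarantee the default branch gives in the original loop.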

I have no failing test cases (left :): I've successfully run this code on more than 100,000 HTML files, every single one I could quickly and easily get my hands on. Beyond those, I've also run it on files specifically constructed to break naïve parsers.

This is not a naïve parser.

Oh, I'm sure it isn't perfect, but I haven't managed to break it yet. I figure that even if something did, the fix would be easy to fit in because of the program's clear structure. Even regex-heavy programs should have structure.

Now that that's out of the way, let me address the OP's question.

Demo of Solving the OP's Task Using Regexes

The little html_input_rx program I include below produces the following output, so that you can see that parsing HTML with regexes works just fine for what you wish to do:

% html_input_rx Amazon.com-_Online_Shopping_for_Electronics,_Apparel,_Computers,_Books,_DVDs_\&_more.htm 
input tag #1 at character 9955:
       class => "searchSelect"
          id => "twotabsearchtextbox"
        name => "field-keywords"
        size => "50"
       style => "width:100%; background-color: #FFF;"
       title => "Search for"
        type => "text"
       value => ""

input tag #2 at character 10335:
         alt => "Go"
         src => "http://g-ecx.images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V192234675_.gif"
        type => "image"
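
The attribute listing above comes from pulling name/value pairs out of each matched tag. As a rough sketch of just that step (my own simplification, not the program's $Pull_Attr_Rx; the tag and the variable names are invented), one pattern with named captures can accept double-quoted, single-quoted, and unquoted values:

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical tag mixing single-quoted, unquoted, and double-quoted values.
my $tag = q{<input value='a "b"' type=hidden name="x">};

my $attr_rx = qr{
    \b (?<name> [\w:-]+ ) \s* = \s*
    (?: " (?<dq>   [^"]*   ) "      # double-quoted value
      | ' (?<sq>   [^']*   ) '      # single-quoted value
      |   (?<bare> [^\s>]+ )        # unquoted value
    )
}x;

my %attr;
while ($tag =~ /$attr_rx/g) {
    # Only one of the three value groups takes part in any given match.
    my $value = $+{dq} // $+{sq} // $+{bare};
    $attr{ lc $+{name} } = $value;
}

say "$_ => $attr{$_}" for sort keys %attr;
```

Collecting into a hash keyed by attribute name is what makes the order of attributes irrelevant, which was the point of the original question.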

Parse Input Tags, See No Evil Input

Here's the source for the program that produced the output above.

#!/usr/bin/env perl
#
# html_input_rx - pull out all <input> tags from (X)HTML src
#                  via simple regex processing
#
# Tom Christiansen <[email protected]>
# Sat Nov 20 10:17:31 MST 2010
#
################################################################

use 5.012;

use strict;
use autodie;
use warnings FATAL => "all";    
use subs qw{
    see_no_evil
    parse_input_tags
    input descape dequote
    load_patterns
};    
use open        ":std",
          IN => ":bytes",
         OUT => ":utf8";    
use Encode qw< encode decode >;

    ###########################################################

                        parse_input_tags 
                           see_no_evil 
                              input  

    ###########################################################

until eof(); sub parse_input_tags {
    my $_ = shift();
    our($Input_Tag_Rx, $Pull_Attr_Rx);
    my $count = 0;
    while (/$Input_Tag_Rx/pig) {
        my $input_tag = $+{TAG};
        my $place     = pos() - length ${^MATCH};
        printf "input tag #%d at character %d:\n", ++$count, $place;
        my %attr = ();
        while ($input_tag =~ /$Pull_Attr_Rx/g) {
            my ($name, $value) = @+{ qw< NAME VALUE > };
            $value = dequote($value);
            if (exists $attr{$name}) {
                printf "Discarding dup attr value '%s' on %s attr\n",
                    $attr{$name} // "<undef>", $name;
            } 
            $attr{$name} = $value;
        } 
        for my $name (sort keys %attr) {
            printf "  %10s => ", $name;
            my $value = descape $attr{$name};
            my  @Q; given ($value) {
                @Q = qw[  " "  ]  when !/'/ && !/"/;
                @Q = qw[  " "  ]  when  /'/ && !/"/;
                @Q = qw[  ' '  ]  when !/'/ &&  /"/;
                @Q = qw[ q( )  ]  when  /'/ &&  /"/;
                default { die "NOTREACHED" }
            } 
            say $Q[0], $value, $Q[1];
        } 
        print "\n";
    } 

}

sub dequote {
    my $_ = $_[0];
    s{
        (?<quote>   ["']      )
        (?<BODY>    
          (?s: (?! \k<quote> ) . ) * 
        )
        \k<quote> 
    }{$+{BODY}}six;
    return $_;
} 

sub descape {
    my $string = $_[0];
    for my $_ ($string) {
        s{
            (?<! % )
            % ( \p{Hex_Digit} {2} )
        }{
            chr hex $1;
        }gsex;
        s{
            & \043 
            ( [0-9]+ )
            (?: ; 
              | (?= [^0-9] )
            )
        }{
            chr     $1;
        }gsex;
        s{
            & \043 x
            ( \p{ASCII_HexDigit} + )
            (?: ; 
              | (?= \P{ASCII_HexDigit} )
            )
        }{
            chr hex $1;
        }gsex;

    }
    return $string;
} 

sub input { 
    our ($RX_SUBS, $Meta_Tag_Rx);
    my $_ = do { local $/; <> };  
    my $encoding = "iso-8859-1";  # web default; wish we had the HTTP headers :(
    while (/$Meta_Tag_Rx/gi) {
        my $meta = $+{META};
        next unless $meta =~ m{             $RX_SUBS
            (?= http-equiv ) 
            (?&name) 
            (?&equals) 
            (?= (?&quote)? content-type )
            (?&value)    
        }six;
        next unless $meta =~ m{             $RX_SUBS
            (?= content ) (?&name) 
                          (?&equals) 
            (?<CONTENT>   (?&value)    )
        }six;
        next unless $+{CONTENT} =~ m{       $RX_SUBS
            (?= charset ) (?&name) 
                          (?&equals) 
            (?<CHARSET>   (?&value)    )
        }six;
        if (lc $encoding ne lc $+{CHARSET}) {
            say "[RESETTING ENCODING $encoding => $+{CHARSET}]";
            $encoding = $+{CHARSET};
        }
    } 
    return decode($encoding, $_);
}

sub see_no_evil {
    my $_ = shift();

    s{ <!    DOCTYPE  .*?         > }{}sx; 
    s{ <! \[ CDATA \[ .*?    \]\] > }{}gsx; 

    s{ <script> .*?  </script> }{}gsix; 
    s{ <!--     .*?        --> }{}gsx;

    return $_;
}

sub load_patterns { 

    our $RX_SUBS = qr{ (?(DEFINE)
        (?<nv_pair>         (?&name) (?&equals) (?&value)         ) 
        (?<name>            \b (?=  \pL ) [\w\-] + (?<= \pL ) \b  )
        (?<equals>          (?&might_white)  = (?&might_white)    )
        (?<value>           (?&quoted_value) | (?&unquoted_value) )
        (?<unwhite_chunk>   (?: (?! > ) \S ) +                    )
        (?<unquoted_value>  [\w\-] *                              )
        (?<might_white>     \s *                                  )
        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )
        (?<start_tag>  < (?&might_white) )
        (?<end_tag>          
            (?&might_white)
            (?: (?&html_end_tag) 
              | (?&xhtml_end_tag) 
             )
        )
        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )
    ) }six; 

    our $Meta_Tag_Rx = qr{                          $RX_SUBS 
        (?<META> 
            (?&start_tag) meta \b
            (?:
                (?&might_white) (?&nv_pair) 
            ) +
            (?&end_tag)
        )
    }six;

    our $Pull_Attr_Rx = qr{                         $RX_SUBS
        (?<NAME>  (?&name)      )
                  (?&equals) 
        (?<VALUE> (?&value)     )
    }six;

    our $Input_Tag_Rx = qr{                         $RX_SUBS 

        (?<TAG> (?&input_tag) )

        (?(DEFINE)

            (?<input_tag>
                (?&start_tag)
                input
                (?&might_white) 
                (?&attributes) 
                (?&might_white) 
                (?&end_tag)
            )

            (?<attributes>
                (?: 
                    (?&might_white) 
                    (?&one_attribute) 
                ) *
            )

            (?<one_attribute>
                \b
                (?&legal_attribute)
                (?&might_white) = (?&might_white) 
                (?:
                    (?&quoted_value)
                  | (?&unquoted_value)
                )
            )

            (?<legal_attribute> 
                (?: (?&optional_attribute)
                  | (?&standard_attribute)
                  | (?&event_attribute)
            # for LEGAL parse only, comment out next line 
                  | (?&illegal_attribute)
                )
            )

            (?<illegal_attribute>  (?&name) )

            (?<required_attribute> (?#no required attributes) )

            (?<optional_attribute>
                (?&permitted_attribute)
              | (?&deprecated_attribute)
            )

            # NB: The white space in string literals 
            #     below DOES NOT COUNT!   It's just 
            #     there for legibility.

            (?<permitted_attribute>
                  accept
                | alt
                | bottom
                | check box
                | checked
                | disabled
                | file
                | hidden
                | image
                | max length
                | middle
                | name
                | password
                | radio
                | read only
                | reset
                | right
                | size
                | src
                | submit
                | text
                | top
                | type
                | value
            )

            (?<deprecated_attribute>
                  align
            )

            (?<standard_attribute>
                  access key
                | class
                | dir
                | ltr
                | id
                | lang
                | style
                | tab index
                | title
                | xml:lang
            )

            (?<event_attribute>
                  on blur
                | on change
                | on click
                | on dbl   click
                | on focus
                | on mouse down
                | on mouse move
                | on mouse out
                | on mouse over
                | on mouse up
                | on key   down
                | on key   press
                | on key   up
                | on select
            )
        )
    }six;

}

UNITCHECK {
    load_patterns();
} 

END {
    close(STDOUT) 
        || die "can't close stdout: $!";
}

There you go! Nothing to it! :)

Only you can judge whether your skill with regexes is up to any particular parsing task. Everyone's level of skill is different, and every new task is different. For jobs where you have a well-defined input set, regexes are obviously the right choice, because it is trivial to put some together when you have a restricted subset of HTML to deal with. Even regex beginners should be able to handle those jobs with regexes. Anything else is overkill.

However, once the HTML starts becoming less nailed down, once it starts to ramify in ways you cannot predict but which are perfectly legal, once you have to match more different sorts of things or with more complex dependencies, you will eventually reach a point where you have to work harder to effect a solution that uses regexes than you would have to using a parsing class. Where that break-even point falls depends again on your own comfort level with regexes.

So What Should I Do?

I'm not going to tell you what you must do or what you cannot do. I think that's Wrong. I just want to present you with possibilities, open your eyes a bit. You get to choose what you want to do and how you want to do it. There are no absolutes, and nobody else knows your own situation as well as you yourself do. If something seems like it's too much work, well, maybe it is. Programming should be fun, you know. If it isn't, you may be doing it wrong.

One can look at my html_input_rx program in any number of valid ways. One such is that you indeed can parse HTML with regular expressions. But another is that it is much, much, much harder than almost anyone ever thinks it is. This can easily lead to the conclusion that my program is a testament to what you should not do, because it really is too hard.

I won't disagree with that. Certainly if everything I do in my program doesn't make sense to you after some study, then you should not be attempting to use regexes for this kind of task. For specific HTML, regexes are great, but for generic HTML, they're tantamount to madness. I use parsing classes all the time, especially if it's HTML I haven't generated myself.

Regexes optimal for small HTML parsing problems, pessimal for large ones

Even if my program is taken as illustrative of why you should not use regexes for parsing general HTML (which is OK, because I kinda meant for it to be that), it still should be an eye-opener so more people break the terribly common and nasty, nasty habit of writing unreadable, unstructured, and unmaintainable patterns.

Patterns do not have to be ugly, and they do not have to be hard. If you create ugly patterns, it is a reflection on you, not them.
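
To make that concrete, here is a small sketch (my own, not taken from the program above) that lays out the pattern from the original question with /x and named captures. It assumes the same fixed attribute order as the question and only changes the layout; the one behavioural difference is that \s+ accepts any run of whitespace where the original had a single space.

```perl
use 5.010;
use strict;
use warnings;

# One tag from the question's sample data.
my $tag = '<input type="hidden" name="SaveRequired" value="False" />';

# Same idea as /<input type="hidden" name="([^"]*?)" value="([^"]*?)" \/>/,
# but spaced out and with names instead of numbered groups.
my $hidden_input_rx = qr{
    <input                          \s+
    type  = "hidden"                \s+
    name  = " (?<name>  [^"]* ) "   \s+
    value = " (?<value> [^"]* ) "   \s*
    />
}x;

my %attr;
if ($tag =~ $hidden_input_rx) {
    %attr = %+;    # copy the named captures before they go out of scope
}

say "$attr{name} = $attr{value}" if %attr;
```

The regex engine sees the same pattern either way; the whitespace and names exist purely for the next person who has to read it.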

Phenomenally Exquisite Regex Language

I've been asked to point out that my proffered solution to your problem has been written in Perl. Are you surprised? Did you not notice? Is this revelation a bombshell?

It is true that not all other tools and programming languages are quite as convenient, expressive, and powerful when it comes to regexes as Perl is. There's a big spectrum out there, with some being more suitable than others. In general, the languages that have expressed regexes as part of the core language instead of as a library are easier to work with. I've done nothing with regexes that you couldn't do in, say, PCRE, although you would structure the program differently if you were using C.

Eventually other languages will catch up with where Perl is now in terms of regexes. I say this because back when Perl started, nobody else had anything like Perl's regexes. Say anything you like, but this is where Perl clearly won: everybody copied Perl's regexes albeit at varying stages of their development. Perl pioneered almost (not quite all, but almost) everything that you have come to rely on in modern patterns today, no matter what tool or language you use. So eventually the others will catch up.

But they'll only catch up to where Perl was sometime in the past, just as it is now. Everything advances. In regexes if nothing else, where Perl leads, others follow. Where will Perl be once everybody else finally catches up to where Perl is now? I have no idea, but I know we too will have moved. Probably we'll be closer to Perl 6's style of crafting patterns.

If you like that kind of thing but would like to use it in Perl 5, you might be interested in Damian Conway's wonderful Regexp::Grammars module. It's completely awesome, and makes what I've done here in my program seem just as primitive as mine makes the patterns that people cram together without whitespace or alphabetic identifiers. Check it out!

Simple HTML Chunker

Here is the complete source to the parser I showed the centerpiece from at the beginning of this posting.

I am not suggesting that you should use this over a rigorously tested parsing class. But I am tired of people pretending that nobody can parse HTML with regexes just because they can't. You clearly can, and this program is proof of that assertion.

Sure, it isn't easy, but it is possible!

And trying to do so is a terrible waste of time, because good parsing classes exist which you should use for this task. The right answer to people trying to parse arbitrary HTML is not that it is impossible. That is a facile and disingenuous answer. The correct and honest answer is that they shouldn't attempt it because it is too much of a bother to figure out from scratch; they should not break their back striving to reïnvent a wheel that works perfectly well.

On the other hand, HTML that falls within a predictable subset is ultra-easy to parse with regexes. It's no wonder people try to use them, because for small problems, toy problems perhaps, nothing could be easier. That's why it's so important to distinguish the two tasks (specific vs. generic), as these do not necessarily demand the same approach.

I hope in the future here to see a more fair and honest treatment of questions about HTML and regexes.

Here's my HTML lexer. It doesn't try to do a validating parse; it just identifies the lexical elements. You might think of it more as an HTML chunker than an HTML parser. It isn't very forgiving of broken HTML, although it makes some very small allowances in that direction.

Even if you never parse full HTML yourself (and why should you? it's a solved problem!), this program has lots of cool regex bits that I believe a lot of people can learn a lot from. Enjoy!

#!/usr/bin/env perl
#
# chunk_HTML - a regex-based HTML chunker
#
# Tom Christiansen <[email protected]>
#   Sun Nov 21 19:16:02 MST 2010
########################################

use 5.012;

use strict;
use autodie;
use warnings qw< FATAL all >;
use open     qw< IN :bytes OUT :utf8 :std >;

MAIN: {
  $| = 1;
  lex_html(my $page = slurpy());
  exit();
}

########################################################################
sub lex_html {
    our $RX_SUBS;                                        ###############
    my  $html = shift();                                 # Am I...     #
    for (;;) {                                           # forgiven? :)#
        given ($html) {                                  ###############
            last                when (pos || 0) >= length;
            printf "\@%d=",          (pos || 0);
            print  "doctype "   when / \G (?&doctype)  $RX_SUBS  /xgc;
            print  "cdata "     when / \G (?&cdata)    $RX_SUBS  /xgc;
            print  "xml "       when / \G (?&xml)      $RX_SUBS  /xgc;
            print  "xhook "     when / \G (?&xhook)    $RX_SUBS  /xgc;
            print  "script "    when / \G (?&script)   $RX_SUBS  /xgc;
            print  "style "     when / \G (?&style)    $RX_SUBS  /xgc;
            print  "comment "   when / \G (?&comment)  $RX_SUBS  /xgc;
            print  "tag "       when / \G (?&tag)      $RX_SUBS  /xgc;
            print  "untag "     when / \G (?&untag)    $RX_SUBS  /xgc;
            print  "nasty "     when / \G (?&nasty)    $RX_SUBS  /xgc;
            print  "text "      when / \G (?&nontag)   $RX_SUBS  /xgc;
            default {
                die "UNCLASSIFIED: " .
                  substr($_, pos || 0, (length > 65) ? 65 : length);
            }
        }
    }
    say ".";
}
#####################
# Return correctly decoded contents of next complete
# file slurped in from the <ARGV> stream.
#
sub slurpy {
    our ($RX_SUBS, $Meta_Tag_Rx);
    my $_ = do { local $/; <ARGV> };   # read all input

    return unless length;

    use Encode   qw< decode >;

    my $bom = "";
    given ($_) {
        $bom = "UTF-32LE" when / ^ \xFf \xFe \0   \0   /x;  # LE
        $bom = "UTF-32BE" when / ^ \0   \0   \xFe \xFf /x;  #   BE
        $bom = "UTF-16LE" when / ^ \xFf \xFe           /x;  # le
        $bom = "UTF-16BE" when / ^ \xFe \xFf           /x;  #   be
        $bom = "UTF-8"    when / ^ \xEF \xBB \xBF      /x;  # st00pid
    }
    if ($bom) {
        say "[BOM $bom]";
        s/^...// if $bom eq "UTF-8";                        # st00pid

        # Must use UTF-(16|32) w/o -[BL]E to strip BOM.
        $bom =~ s/-[LB]E//;

        return decode($bom, $_);

        # if BOM found, don't fall through to look
        #  for embedded encoding spec
    }

    # Latin1 is web default if not otherwise specified.
    # No way to do this correctly if it was overridden
    # in the HTTP header, since we assume stream contains
    # HTML only, not also the HTTP header.
    my $encoding = "iso-8859-1";
    while (/ (?&xml) $RX_SUBS /pgx) {
        my $xml = ${^MATCH};
        next unless $xml =~ m{              $RX_SUBS
            (?= encoding )  (?&name)
                            (?&equals)
                            (?&quote) ?
            (?<ENCODING>    (?&value)       )
        }sx;
        if (lc $encoding ne lc $+{ENCODING}) {
            say "[XML ENCODING $encoding => $+{ENCODING}]";
            $encoding = $+{ENCODING};
        }
    }

    while (/$Meta_Tag_Rx/gi) {
        my $meta = $+{META};

        next unless $meta =~ m{             $RX_SUBS
            (?= http-equiv )    (?&name)
                                (?&equals)
            (?= (?&quote)? content-type )
                                (?&value)
        }six;

        next unless $meta =~ m{             $RX_SUBS
            (?= content )       (?&name)
                                (?&equals)
            (?<CONTENT>         (?&value)    )
        }six;

        next unless $+{CONTENT} =~ m{       $RX_SUBS
            (?= charset )       (?&name)
                                (?&equals)
            (?<CHARSET>         (?&value)    )
        }six;

        if (lc $encoding ne lc $+{CHARSET}) {
            say "[HTTP-EQUIV ENCODING $encoding => $+{CHARSET}]";
            $encoding = $+{CHARSET};
        }
    }

    return decode($encoding, $_);
}
########################################################################
# Make sure this function is called
# as soon as source unit has been compiled.
UNITCHECK { load_rxsubs() }

# useful regex subroutines for HTML parsing
sub load_rxsubs {

    our $RX_SUBS = qr{
      (?(DEFINE)

        (?<WS> \s *  )

        (?<any_nv_pair>     (?&name) (?&equals) (?&value)         )
        (?<name>            \b (?=  \pL ) [\w:\-] +  \b           )
        (?<equals>          (?&WS)  = (?&WS)    )
        (?<value>           (?&quoted_value) | (?&unquoted_value) )
        (?<unwhite_chunk>   (?: (?! > ) \S ) +                    )

        (?<unquoted_value>  [\w:\-] *                             )

        (?<any_quote>  ["']      )

        (?<quoted_value>
            (?<quote>   (?&any_quote)  )
            (?: (?! \k<quote> ) . ) *
            \k<quote>
        )

        (?<start_tag>       < (?&WS)      )
        (?<html_end_tag>      >           )
        (?<xhtml_end_tag>   / >           )
        (?<end_tag>
            (?&WS)
            (?: (?&html_end_tag)
              | (?&xhtml_end_tag) )
         )

        (?<tag>
            (?&start_tag)
            (?&name)
            (?:
                (?&WS)
                (?&any_nv_pair)
            ) *
            (?&end_tag)
        )

        (?<untag> </ (?&name) > )

        # starts like a tag, but has screwed up quotes inside it
        (?<nasty>
            (?&start_tag)
            (?&name)
            .*?
            (?&end_tag)
        )

        (?<nontag>    [^<] +            )

        (?<string> (?&quoted_value)     )
        (?<word>   (?&name)             )

        (?<doctype>
            <!DOCTYPE
                # please don't feed me nonHTML
                ### (?&WS) HTML
            [^>]* >
        )

        (?<cdata>   <!\[CDATA\[     .*?     \]\]    > )
        (?<script>  (?= <script ) (?&tag)   .*?     </script> )
        (?<style>   (?= <style  ) (?&tag)   .*?     </style> )
        (?<comment> <!--            .*?           --> )

        (?<xml>
            < \? xml
            (?:
                (?&WS)
                (?&any_nv_pair)
            ) *
            (?&WS)
            \? >
        )

        (?<xhook> < \? .*? \? > )

      )

    }six;

    our $Meta_Tag_Rx = qr{                          $RX_SUBS
        (?<META>
            (?&start_tag) meta \b
            (?:
                (?&WS) (?&any_nv_pair)
            ) +
            (?&end_tag)
        )
    }six;

}

# nobody *ever* remembers to do this!
END { close STDOUT }

Answer 3

Answered by meder omuraliev

1. You can write a novel like tchrist did.
2. You can use a DOM library, load the HTML, and query it with the XPath expression //input[@type="hidden"]. Or, if you don't want to use XPath, just get all the inputs and filter for the hidden ones with getAttribute.

I prefer #2.

<[ ]*input[ ]+type="hidden"[ ]*name=[A-Za-z ="_0-9+]*[ ]*[/]*>

Result:

<input type="hidden" name="SaveRequired" value="False" /><input type="hidden" name="__VIEWSTATE1" value="1H4sIAAtzrkX7QfL5VEGj6nGi+nP" /><input type="hidden" name="__VIEWSTATE2" value="0351118MK" /><input type="hidden" name="__VIEWSTATE3" value="ZVVV91yjY" />

Answer 4

Answered by David

In the spirit of Tom Christiansen's lexer solution, here's a link to Robert Cameron's seemingly forgotten 1998 article, REX: XML Shallow Parsing with Regular Expressions.

http://www.cs.sfu.ca/~cameron/REX.html

Abstract
The syntax of XML is simple enough that it is possible to parse an XML document into a list of its markup and text items using a single regular expression. Such a shallow parse of an XML document can be very useful for the construction of a variety of lightweight XML processing tools. However, complex regular expressions can be difficult to construct and even more difficult to read. Using a form of literate programming for regular expressions, this paper documents a set of XML shallow parsing expressions that can be used as a basis for simple, correct, efficient, robust and language-independent XML shallow parsing. Complete shallow parser implementations of less than 50 lines each in Perl, JavaScript and Lex/Flex are also given.

If you enjoy reading about regular expressions, Cameron's paper is fascinating. His writing is concise, thorough, and very detailed. He's not simply showing you how to construct the REX regular expression but also an approach for building up any complex regex from smaller parts.

I've been using the REX regular expression on and off for 10 years to solve the sort of problem the initial poster asked about (how do I match this particular tag but not some other very similar tag?). I've found the regex he developed to be completely reliable.

REX is particularly useful when you're focusing on lexical details of a document -- for example, when transforming one kind of text document (e.g., plain text, XML, SGML, HTML) into another, where the document may not be valid, well formed, or even parsable for most of the transformation. It lets you target islands of markup anywhere within a document without disturbing the rest of the document.
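To give a feel for what "a list of its markup and text items" means, here is a minimal sketch. It is not Cameron's REX expression, only a much cruder stand-in, and it assumes every `<` opens an item that is closed by the next `>` (so no `>` inside attribute values, and no special handling of comments or CDATA). The name `shallowParse` is made up for illustration.

```javascript
// Simplified shallow tokenizer: NOT the REX expression from the paper.
// Assumption: every '<' starts a markup item that ends at the next '>'.
function shallowParse(text) {
  // Either one markup item "<...>" or a run of text containing no '<'.
  return text.match(/<[^>]*>|[^<]+/g) || [];
}

const source = '<p>Hello <b>world</b>!</p>';
const items = shallowParse(source);
console.log(items);

// For input like this, joining the items gives back the original text,
// which is what lets you rewrite one item and leave the rest untouched.
console.log(items.join('') === source);
```

Because the items are just substrings in document order, you can edit the one tag you care about and concatenate everything back together.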

Answer 5

Answered by Suamere

While I love the contents of the rest of these answers, they didn't really answer the question directly or correctly. Even Platinum's answer was overly complicated, and also less efficient. So I felt compelled to post this.

I'm a huge proponent of regex, when used correctly. But because of stigma (and performance), I always state that well-formed XML or HTML should be handled with an XML parser. String parsing would perform even better, though readability suffers if that gets out of hand. However, that isn't the question. The question is how to match a hidden-type input tag. The answer is:

<[ ]*input[ ]*[A-Za-z ="_0-9+/]*>

Depending on your flavor, the only regex option you'd need to include is the ignorecase option.

Answer 6

Answered by Shamshirsaz.Navid

You can try this:

<[ ]*input[ ]*[A-Za-z ="_0-9+/]*[ ]*[/]>

For a closer result, you can try this:

<input  name="SaveRequired" type="hidden" value="False" /><input type="hidden" name="__VIEWSTATE1" value="1H4sIAAtzrkX7QfL5VEGj6nGi+nP" /><input type="hidden" name="__VIEWSTATE2" value="0351118MK" /><input  name="__VIEWSTATE3" type="hidden" value="ZVVV91yjY" />

You can test your regex pattern here: http://regexpal.com/

These patterns work for this:

<[ ]*input(([ ]*type="hidden"[ ]*name=[A-Za-z0-9_+"]*[ ]*value=[A-Za-z0-9_+"]*[ ]*)+)[ ]*/>|<[ ]*input(([ ]*type="hidden"[ ]*value=[A-Za-z0-9_+"]*[ ]*name=[A-Za-z0-9_+"]*[ ]*)+)[ ]*/>|<[ ]*input(([ ]*name=[A-Za-z0-9_+"]*[ ]*type="hidden"[ ]*value=[A-Za-z0-9_+"]*[ ]*)+)[ ]*/>|<[ ]*input(([ ]*value=[A-Za-z0-9_+"]*[ ]*type="hidden"[ ]*name=[A-Za-z0-9_+"]*[ ]*)+)[ ]*/>|<[ ]*input(([ ]*name=[A-Za-z0-9_+"]*[ ]*value=[A-Za-z0-9_+"]*[ ]*type="hidden"[ ]*)+)[ ]*/>|<[ ]*input(([ ]*value=[A-Za-z0-9_+"]*[ ]*name=[A-Za-z0-9_+"]*[ ]*type="hidden"[ ]*)+)[ ]*/>
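The pattern above spells out all six orderings of type, name and value by hand. A shorter route, following the accepted answer's idea of matching the tag first and processing it afterwards, might look like the sketch below. It assumes no attribute value contains a `>` and that values are quoted; `INPUT_TAG`, `ATTRIBUTE` and `hiddenInputs` are names invented for this example.

```javascript
// Step 1: find each <input ...> tag. Assumes no attribute value contains '>'.
const INPUT_TAG = /<input\b[^>]*>/gi;
// Step 2: pull name="value" pairs out of one tag (quoted values only).
const ATTRIBUTE = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;

function hiddenInputs(html) {
  const found = [];
  for (const tag of html.match(INPUT_TAG) || []) {
    const attrs = {};
    for (const m of tag.matchAll(ATTRIBUTE)) {
      // m[2] is set for double-quoted values, m[3] for single-quoted ones.
      attrs[m[1].toLowerCase()] = m[2] !== undefined ? m[2] : m[3];
    }
    // Attribute order no longer matters: we only look at the collected map.
    if ((attrs.type || '').toLowerCase() === 'hidden') {
      found.push({ name: attrs.name, value: attrs.value });
    }
  }
  return found;
}

const sample =
  '<input name="SaveRequired" type="hidden" value="False" />' +
  '<input value="3" name="__VIEWSTATE0" type="hidden" />' +
  '<input type="text" name="visible" value="x" />';
console.log(hiddenInputs(sample));
```

Adding a fourth attribute to the tags would not change this code, whereas the hand-written alternation would need every new ordering added.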

And for type, name and value in any order, you can use this:

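// $input is assumed to hold the HTML string to search.
$dom = new DOMDocument();
$dom->loadHTML($input);
$x = new DOMXPath($dom);
// Select every <input> element whose type attribute is "hidden".
$results = $x->evaluate('//input[@type="hidden"]');

foreach ($results as $item) {
    print_r($item->getAttribute('value'));
}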

or

var regex = /(<input.*?type\s?=\s?["']hidden["'].*?>)/g;
html.match(regex);

on this:

##代码##

By the way, I think you want something like this:

##代码##

It's not pretty, but it works anyway.

Test it at: http://regexpal.com/

Answer 7

Answered by HTML5 developer

I would like to use DOMDocument to extract the HTML code.

##代码##

BTW, you can test it at regex101.com, which shows the result in real time. Some rules about regular expressions: http://www.eclipse.org/tptp/home/downloads/installguide/gla_42/ref/rregexp.html

Answer 8

Answered by Nitin9791

Suppose your HTML content is stored in the string html. To get every input whose type is hidden, you can use this regular expression:

##代码##

The regex above finds <input followed by any number of characters until it reaches type="hidden" or type='hidden', followed by any number of characters until it reaches >.

The /g flag tells the regular expression to find every substring that matches the given pattern.
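Here is a small sketch of that description in action. The pattern is the one from the JavaScript snippet that appears earlier on this page, which fits the description; treating it as this answer's pattern is an assumption. The sample strings and the `tighter` variant are invented for illustration and are not from the answer.

```javascript
// Assumed to be the pattern this answer describes (it appears earlier on the page).
const regex = /(<input.*?type\s?=\s?["']hidden["'].*?>)/g;

const html =
  '<input name="a" type="hidden" value="1" />' +
  "<input type='hidden' name='b' value='2' />";
const matches = html.match(regex) || [];
console.log(matches); // one entry per hidden input in this sample

// Caveat: .*? may run past a '>' to find type="hidden" in a later tag.
const mixed = '<input type="text" name="c" /><input type="hidden" name="d" />';
const spanning = mixed.match(regex) || [];
console.log(spanning[0] === mixed); // true: both tags end up in one match

// Hypothetical tighter variant (not from the answer): stay inside one tag.
const tighter = /<input\b[^>]*?\btype\s*=\s*["']hidden["'][^>]*>/gi;
const strict = mixed.match(tighter) || [];
console.log(strict);
```

Swapping `.*?` for `[^>]*?` keeps each match inside a single tag, at the cost of assuming that no attribute value contains a `>`.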
