Close open HTML tags in a string (PHP)

Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/3810230/

Time: 2020-08-25 11:11:55  Source: igfitidea

Tags: php, regex, string

Asked by Ahmad Fouad

The situation: a function returns a string that looks something like this:

<p>This is some text and here is a <strong>bold text then the post stop here....</p>

Because the function returns a teaser (summary) of the text, it cuts the text off after a certain number of words. In this case the strong tag is left unclosed, while the whole string is still wrapped in a paragraph.

Is it possible to convert the above result/output to the following:

<p>This is some text and here is a <strong>bold text then the post stop here....</strong></p>

I do not know where to begin. I found a function on the web that does this with a regex, but it puts the closing tag after the end of the string. That will not validate, because I want all opening and closing tags to sit inside the paragraph tags. The function I found produces this, which is also wrong:

<p>This is some text and here is a <strong>bold text then the post stop here....</p></strong>

Note that the unclosed tag can be strong, italic, anything. That is why I cannot simply hard-code the closing tag inside the teaser function. Is there a pattern that can do this for me?

Answered by alexn

Here is a function I've used before, which works pretty well:

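function closetags($html) {
    // Collect opening tags, skipping a few void elements and self-closing tags such as <br />.
    preg_match_all('#<(?!(?:meta|img|br|hr|input)\b)\b([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
    $openedtags = $result[1];
    // Collect closing tags.
    preg_match_all('#</([a-z]+)>#iU', $html, $result);
    $closedtags = $result[1];
    $len_opened = count($openedtags);
    // Same number of opening and closing tags: assume nothing needs closing.
    if (count($closedtags) == $len_opened) {
        return $html;
    }
    // Walk the opening tags from last to first so inner tags are closed first.
    $openedtags = array_reverse($openedtags);
    for ($i=0; $i < $len_opened; $i++) {
        if (!in_array($openedtags[$i], $closedtags)) {
            // No closing tag left for this one: append it to the end of the string.
            $html .= '</'.$openedtags[$i].'>';
        } else {
            // Consume one matching closing tag.
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }
    return $html;
}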

Personally though, I would not do this with a regular expression but with a library such as Tidy. That would look something like the following:

$str = '<p>This is some text and here is a <strong>bold text then the post stop here....</p>';
$tidy = new Tidy();
$clean = $tidy->repairString($str, array(
    'output-xml' => true,
    'input-xml' => true
));
echo $clean;

Answered by Markus

A small modification to the original answer. The original answer closes tags correctly, but I found that my truncation could leave me with chopped-up tags. For example:

This text has some <b>in it</b>

Truncating at character 20 results in:

This text has some <

The following code builds on the answer above and fixes this.

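function truncateHTML($html, $length)
{
    $length = min($length, strlen($html));
    $head = substr($html, 0, $length);

    // The cut falls inside a tag when the last "<" before it has no ">" after it.
    $lastOpen = strrpos($head, '<');
    $lastClose = strrpos($head, '>');
    if ($lastOpen !== false && ($lastClose === false || $lastClose < $lastOpen))
    {
        // Keep the whole tag if it is complete in the source, otherwise drop the fragment.
        $pos = strpos($html, '>', $length);
        $head = ($pos !== false) ? substr($html, 0, $pos + 1) : substr($html, 0, $lastOpen);
    }
    $html = $head;

    // Collect opening tags, skipping a few void elements and self-closing tags.
    preg_match_all('#<(?!(?:meta|img|br|hr|input)\b)\b([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
    $openedtags = $result[1];

    // Collect closing tags.
    preg_match_all('#</([a-z]+)>#iU', $html, $result);
    $closedtags = $result[1];

    $len_opened = count($openedtags);

    if (count($closedtags) == $len_opened)
    {
        return $html;
    }

    // Append a closing tag for every opening tag that has none, innermost first.
    $openedtags = array_reverse($openedtags);
    for ($i=0; $i < $len_opened; $i++)
    {
        if (!in_array($openedtags[$i], $closedtags))
        {
            $html .= '</'.$openedtags[$i].'>';
        }
        else
        {
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }

    return $html;
}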


$str = "This text has <b>bold</b> in it</b>";
print "Test 1 - Truncate with no tag: " . truncateHTML($str, 5) . "<br>\n";
print "Test 2 - Truncate at start of tag: " . truncateHTML($str, 20) . "<br>\n";
print "Test 3 - Truncate in the middle of a tag: " . truncateHTML($str, 16) . "<br>\n";
print "Test 4: - Truncate with less text: " . truncateHTML($str, 300) . "<br>\n";

Hope it helps someone out there.

Answered by Andrew

This PHP method has always worked for me. It closes all unclosed HTML tags.

function closetags($html) {
    preg_match_all('#<([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
    $openedtags = $result[1];

    preg_match_all('#</([a-z]+)>#iU', $html, $result);
    $closedtags = $result[1];
    $len_opened = count($openedtags);
    if (count($closedtags) == $len_opened) {
        return $html;
    }
    $openedtags = array_reverse($openedtags);
    for ($i=0; $i < $len_opened; $i++) {
        if (!in_array($openedtags[$i], $closedtags)){
            $html .= '</'.$openedtags[$i].'>';
        } else {
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }
    return $html;
}
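
Note that the regex-based closetags() functions in this thread append missing closing tags to the end of the string, so the question's input still comes out as ...</p></strong>. Below is a minimal sketch of a stack-based alternative that closes inner tags before a mismatched closing tag. It is not from any of the original answers: the name closeTagsInPlace() and the list of void elements are assumptions made for illustration, and it assumes the string has not been cut in the middle of a tag.

```php
<?php
// Hypothetical helper (not from the original answers): closes unclosed tags
// where the nesting requires it, rather than at the end of the string.
function closeTagsInPlace(string $html): string
{
    // Void elements never take a closing tag.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];
    $stack = []; // names of currently open elements, innermost last
    $out = '';

    // Split into tags and text, keeping the tags as separate parts.
    $parts = preg_split('#(<[^>]*>)#', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    foreach ($parts as $part) {
        if (preg_match('#^</([a-z][a-z0-9]*)\s*>$#i', $part, $m)) {
            $name = strtolower($m[1]);
            if (!in_array($name, $stack, true)) {
                continue; // stray closing tag with no matching opener: drop it
            }
            // Close everything that was opened inside this element first.
            while (($top = array_pop($stack)) !== $name) {
                $out .= "</$top>";
            }
            $out .= "</$name>";
        } elseif (preg_match('#^<([a-z][a-z0-9]*)\b[^>]*>$#i', $part, $m)) {
            $name = strtolower($m[1]);
            // Track the tag unless it is void or self-closing ("<br/>").
            if (!in_array($name, $void, true) && substr($part, -2) !== '/>') {
                $stack[] = $name;
            }
            $out .= $part;
        } else {
            $out .= $part; // plain text, comments, doctype and so on
        }
    }

    // Close whatever is still open at the end, innermost first.
    while ($stack) {
        $out .= '</' . array_pop($stack) . '>';
    }
    return $out;
}

echo closeTagsInPlace('<p>This is some text and here is a <strong>bold text then the post stop here....</p>');
```

For the string from the question, this puts </strong> in front of </p>, which is the output the asker wanted. Attribute values that contain a literal ">" would confuse the simple tag splitter, so a real parser remains the safer choice for arbitrary HTML.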

Answered by Russell Dias

There are numerous other variables that need to be addressed to give a full solution, but are not covered by your question.

However, I would suggest using something like HTML Tidy, and in particular the repairFile or repairString methods.

Answered by Pavel ?íha

And what about using PHP's native DOMDocument class? It inherently parses HTML and corrects syntax errors... E.g.:

$fragment = "<article><h3>Title</h3><p>Unclosed";
$doc = new DOMDocument();
$doc->loadHTML($fragment);
$correctFragment = $doc->getElementsByTagName('body')->item(0)->C14N();
echo $correctFragment;

However, there are several disadvantages to this approach. Firstly, it wraps the original fragment within a <body> tag. You can get rid of it easily with something like str_replace() or preg_replace(), or by replacing the ...->C14N() call with a custom innerHTML() function, as suggested for example at http://php.net/manual/en/book.dom.php#89718. The second pitfall is that PHP throws an 'invalid tag in Entity' warning if HTML5 or custom tags are used (nevertheless, it still proceeds correctly).
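
Both pitfalls can be worked around in a few lines. The following is a minimal sketch, not the answer author's code: repairFragment() is a hypothetical name, the body's child nodes are serialized one by one instead of calling C14N(), and libxml_use_internal_errors() is used to keep the parser warnings from being emitted.

```php
<?php
// Hypothetical helper: repairs an HTML fragment with DOMDocument and returns
// only the children of <body>, so no wrapper tag has to be stripped afterwards.
function repairFragment(string $fragment): string
{
    // Collect libxml warnings (for example for HTML5 tags) instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    $html = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            // Serialize each child of <body> in turn (an "innerHTML" of the body).
            $html .= $doc->saveHTML($child);
        }
    }
    return $html;
}

echo repairFragment('<article><h3>Title</h3><p>Unclosed');
```

The serializer may add line breaks between block elements, and loadHTML() assumes ISO-8859-1 unless the input declares another encoding, so non-ASCII text may need an explicit charset hint.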

Answered by Karan Patel

This works for me for closing any open HTML tags in a script.

<?php
function closetags($html) {
    preg_match_all('#<([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
    $openedtags = $result[1];
    preg_match_all('#</([a-z]+)>#iU', $html, $result);
    $closedtags = $result[1];
    $len_opened = count($openedtags);
    if (count($closedtags) == $len_opened) {
        return $html;
    }
    $openedtags = array_reverse($openedtags);
    for ($i=0; $i < $len_opened; $i++) {
        if (!in_array($openedtags[$i], $closedtags)) {
            $html .= '</'.$openedtags[$i].'>';
        } else {
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }
    return $html;
}

Answered by Luca C.

If the tidy module is installed, use the PHP tidy extension:

tidy_repair_string($html)
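
Called without options, tidy_repair_string() returns a complete HTML document rather than a fragment. The following is a minimal sketch, assuming the tidy extension is loaded, that asks Tidy for the body content only. The wrapper name tidyFragment() is hypothetical; 'show-body-only' and 'wrap' are standard Tidy configuration options.

```php
<?php
// Hypothetical wrapper around the tidy extension; requires ext-tidy.
function tidyFragment(string $html): string
{
    return (string) tidy_repair_string($html, [
        'show-body-only' => true, // return only the repaired fragment, not a full document
        'wrap'           => 0,    // do not re-wrap long lines
    ], 'utf8');
}

if (function_exists('tidy_repair_string')) {
    echo tidyFragment('<p>This is some text and here is a <strong>bold text then the post stop here....</p>');
}
```

Tidy may still normalize whitespace and add a trailing newline, so the result is best compared after trimming.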

Answered by Poupoudoum

I've written this code, which does the job quite correctly...

It's old school but efficient, and I've added a flag to remove unfinished tags such as " blah blah http://stackoverfl"

public function getOpennedTags(&$string, $removeInclompleteTagEndTagIfExists = true) {

    $tags = array();
    $tagOpened = false;
    $tagName = '';
    $tagNameLogged = false;
    $closingTag = false;

    foreach (str_split($string) as $c) {
        if ($tagOpened && $c == '>') {
            $tagOpened = false;
            if ($closingTag) {
                array_pop($tags);
                $closingTag = false;
                $tagName = '';
            }
            if ($tagName) {
                array_push($tags, $tagName);
            }
        }
        if ($tagOpened && $c == ' ') {
            $tagNameLogged = true;
        }
        if ($tagOpened && $c == '/') {
            if ($tagName) {
                //orphan tag
                $tagOpened = false;
                $tagName = '';
            } else {
                //closingTag
                $closingTag = true;
            }
        }
        if ($tagOpened && !$tagNameLogged) {
            $tagName .= $c;
        }
        if (!$tagOpened && $c == '<') {
            $tagNameLogged = false;
            $tagName = '';
            $tagOpened = true;
            $closingTag = false;
        }
    }

    if ($removeInclompleteTagEndTagIfExists && $tagOpened) {
        // A tag has been cut, for example ' blabh blah <a href="sdfoefzofk', so closing the tag will not help...
        // Remove the incomplete piece of tag instead.
        $pos = strrpos($string, '<');
        $string = substr($string, 0, $pos);
    }

    return $tags;
}

Usage example:

$tagsToClose = $stringHelper->getOpennedTags($val);
$tagsToClose = array_reverse($tagsToClose);

foreach ($tagsToClose as $tag) {
    $val .= "</$tag>";
}

Answered by JoshD

Using a regular expression isn't an ideal approach for this. You should use an HTML parser instead to create a valid document object model.

As a second option, depending on what you want, you could use a regex to remove any and all HTML tags from your string before you put it in the <p> tag.
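
A minimal sketch of that second option follows. It swaps the suggested regex for PHP's built-in strip_tags(), which does the same job without a hand-written pattern; the helper name plainTeaser() is hypothetical.

```php
<?php
// Hypothetical helper: drops all markup from the teaser and wraps the
// remaining text in a single paragraph.
function plainTeaser(string $teaser): string
{
    // strip_tags() removes every tag, closed or not, and leaves the text content.
    return '<p>' . strip_tags($teaser) . '</p>';
}

echo plainTeaser('This is some text and here is a <strong>bold text');
```

This loses all inline formatting, so it only fits teasers where plain text is acceptable.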