PHP: Truncate text containing HTML, ignoring tags

Question

Asked by SamWM

I want to truncate some text (loaded from a database or text file), but it contains HTML, so the tags are counted toward the length and less text is returned. This can also leave tags unclosed or partially closed (so Tidy may not work properly, and there is still less content). How can I truncate based on the text alone (probably stopping when I reach a table, as that could cause more complex issues)?

substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26)."..."

Would result in:

Hello, my <strong>name</st...

What I would want is:

Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...

How can I do this?

While my question is for how to do it in PHP, it would be good to know how to do it in C#... either should be OK as I think I would be able to port the method over (unless it is a built in method).

Also note that I have included an HTML entity, &acute;, which would have to be considered as a single character (rather than 7 characters as in this example).

strip_tags is a fallback, but I would lose formatting and links, and it would still have the problem with HTML entities.

Answer 1

Answered by Søren Løvborg

Assuming you are using valid XHTML, it's simple to parse the HTML and make sure tags are handled properly. You simply need to track which tags have been opened so far, and make sure to close them again "on your way out".

<?php
header('Content-type: text/plain; charset=utf-8');

function printTruncated($maxLength, $html, $isUtf8=true)
{
    $printedLength = 0;
    $position = 0;
    $tags = array();

    // For UTF-8, we need to count multibyte sequences as one character.
    // Tag names may contain digits (e.g. h1), so allow them after the first letter.
    $re = $isUtf8
        ? '{</?([a-z][a-z0-9]*)[^>]*>|&#?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}'
        : '{</?([a-z][a-z0-9]*)[^>]*>|&#?[a-zA-Z0-9]+;}';

    while ($printedLength < $maxLength && preg_match($re, $html, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($tag, $tagPosition) = $match[0];

        // Print text leading up to the tag.
        $str = substr($html, $position, $tagPosition - $position);
        if ($printedLength + strlen($str) > $maxLength)
        {
            print(substr($str, 0, $maxLength - $printedLength));
            $printedLength = $maxLength;
            break;
        }

        print($str);
        $printedLength += strlen($str);
        if ($printedLength >= $maxLength) break;

        if ($tag[0] == '&' || ord($tag) >= 0x80)
        {
            // Pass the entity or UTF-8 multibyte sequence through unchanged.
            print($tag);
            $printedLength++;
        }
        else
        {
            // Handle the tag.
            $tagName = $match[1][0];
            if ($tag[1] == '/')
            {
                // This is a closing tag.

                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // check that tags are properly nested.

                print($tag);
            }
            else if ($tag[strlen($tag) - 2] == '/')
            {
                // Self-closing tag.
                print($tag);
            }
            else
            {
                // Opening tag.
                print($tag);
                $tags[] = $tagName;
            }
        }

        // Continue after the tag.
        $position = $tagPosition + strlen($tag);
    }

    // Print any remaining text.
    if ($printedLength < $maxLength && $position < strlen($html))
        print(substr($html, $position, $maxLength - $printedLength));

    // Close any open tags.
    while (!empty($tags))
        printf('</%s>', array_pop($tags));
}


printTruncated(10, '<b>&lt;Hello&gt;</b> <img src="world.png" alt="" /> world!'); print("\n");

printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>'); print("\n");

printTruncated(10, "<em><b>Hello</b>&#20;w\xC3\xB8rld!</em>"); print("\n");

Encoding note: The above code assumes the XHTML is UTF-8 encoded. ASCII-compatible single-byte encodings (such as Latin-1) are also supported; just pass false as the third argument. Other multibyte encodings are not supported, though you may hack in support by using mb_convert_encoding to convert to UTF-8 before calling the function, then converting back again in every print statement.

(You should always be using UTF-8, though.)
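
The conversion round trip described in the encoding note could look roughly like the sketch below. The helper name truncateInEncoding and the $truncate callable are assumptions made for illustration and are not part of the answer; the callable stands in for any truncation routine that takes and returns UTF-8.

```php
<?php
// Hypothetical wrapper, not part of the original answer.
// $truncate is any callable that accepts UTF-8 HTML and returns UTF-8 HTML.
function truncateInEncoding(callable $truncate, string $html, string $encoding): string
{
    // Convert the input to UTF-8 so the truncation logic can count characters.
    $utf8 = mb_convert_encoding($html, 'UTF-8', $encoding);

    $result = $truncate($utf8);

    // Convert the truncated result back to the caller's encoding.
    return mb_convert_encoding($result, $encoding, 'UTF-8');
}

// Usage with a stand-in truncation step (plain text only, for brevity).
$firstFive = function (string $utf8): string {
    return mb_substr($utf8, 0, 5, 'UTF-8');
};

$input = mb_convert_encoding("h\xC3\xA9llo w\xC3\xB6rld", 'UTF-16LE', 'UTF-8');
$output = truncateInEncoding($firstFive, $input, 'UTF-16LE');

echo mb_convert_encoding($output, 'UTF-8', 'UTF-16LE'), "\n";
```

For a function that prints, such as printTruncated, the callable could capture the output with ob_start() and ob_get_clean() before the result is converted back.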

Edit: Updated to handle character entities and UTF-8. Fixed bug where the function would print one character too many, if that character was a character entity.

Answer 2

Answered by alockwood05

I've written a function that truncates HTML just as you suggest, but instead of printing the result it keeps it all in a string variable. It handles HTML entities as well.

 /**
     *  function to truncate and then clean up end of the HTML,
     *  truncates by counting characters outside of HTML tags
     *  
     *  @author alex lockwood, alex dot lockwood at websightdesign
     *  
     *  @param string $str the string to truncate
     *  @param int $len the number of characters
     *  @param string $end the end string for truncation
     *  @return string $truncated_html
     *  
     *  **/
        public static function truncateHTML($str, $len, $end = '&hellip;'){
            //find all tags
            $tagPattern = '/(<\/?)([\w]*)(\s*[^>]*)>?|&[\w#]+;/i';  //match html tags and entities
            preg_match_all($tagPattern, $str, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER );
            //WSDDebug::dump($matches); exit; 
            $i = 0;
            $openTags = array();   // stack of currently open tag names
            $warnings = array();   // mismatched-tag notices (collected, not returned)
            //loop through each found tag that is within the $len, add those characters to the len,
            //also track open and closed tags
            // $matches[$i][0] = the whole tag string  --the only applicable field for html entities
            // IF its not matching an &htmlentity; the following apply
            // $matches[$i][1] = the start of the tag either '<' or '</'
            // $matches[$i][2] = the tag name
            // $matches[$i][3] = the end of the tag
            //$matches[$i][$j][0] = the string
            //$matches[$i][$j][1] = the str offset

            // check that the match exists before reading its offset
            while(!empty($matches[$i]) && $matches[$i][0][1] < $len){

                $len = $len + strlen($matches[$i][0][0]);
                if(substr($matches[$i][0][0],0,1) == '&' )
                    $len = $len-1;


                //if $matches[$i][2] is undefined then its an html entity, want to ignore those for tag counting
                //ignore empty/singleton tags for tag counting
                if(!empty($matches[$i][2][0]) && !in_array($matches[$i][2][0],array('br','img','hr', 'input', 'param', 'link'))){
                    //double check 
                    if(substr($matches[$i][3][0],-1) !='/' && substr($matches[$i][1][0],-1) !='/')
                        $openTags[] = $matches[$i][2][0];
                    elseif(end($openTags) == $matches[$i][2][0]){
                        array_pop($openTags);
                    }else{
                        $warnings[] = "html has some tags mismatched in it:  $str";
                    }
                }


                $i++;

            }

            // build the closing tags; the name must match the variable appended below
            $closeTagString = '';

            if (!empty($openTags)){
                $openTags = array_reverse($openTags);
                foreach ($openTags as $t){
                    $closeTagString .="</".$t . ">"; 
                }
            }

            // default: nothing to cut, return the string unchanged
            $truncated_html = $str;

            if(strlen($str)>$len){
                // Finds the first space at or after the adjusted length
                $lastWord = strpos($str, ' ', $len);
                if ($lastWord !== false) {
                    //truncate at that word boundary
                    $str = substr($str, 0, $lastWord);
                    //finds last character
                    $last_character = (substr($str, -1, 1));
                    //add the end text
                    $truncated_html = ($last_character == '.' ? $str : ($last_character == ',' ? substr($str, 0, -1) : $str) . $end);
                    //restore any open tags
                    $truncated_html .= $closeTagString;
                }
            }


            return $truncated_html; 
        }

Answer 3

Answered by Kornel

100% accurate, but pretty difficult approach:

Iterate characters using DOM
Use DOM methods to remove remaining elements
Serialize the DOM
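
The three DOM steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: truncateHtmlDom and truncateNode are hypothetical names, the input is assumed to be a reasonably well-formed UTF-8 fragment, and the dom and mbstring extensions are assumed to be available.

```php
<?php
// Sketch of the DOM approach. truncateHtmlDom and truncateNode are
// hypothetical names; the input is assumed to be UTF-8.

function truncateNode(DOMNode $node, int &$remaining): void
{
    if ($node instanceof DOMText) {
        // Entities were decoded by the parser, so each counts as one character.
        $length = mb_strlen($node->data, 'UTF-8');
        if ($length > $remaining) {
            $node->data = mb_substr($node->data, 0, $remaining, 'UTF-8');
            $remaining = 0;
        } else {
            $remaining -= $length;
        }
        return;
    }

    // Copy the child list first: removing nodes changes the live NodeList.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);   // past the limit: drop the rest
        } else {
            truncateNode($child, $remaining);
        }
    }
}

function truncateHtmlDom(string $html, int $maxLength): string
{
    // Turn non-ASCII characters into numeric entities so the parser does not
    // have to guess the encoding of the fragment.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $previous = libxml_use_internal_errors(true);   // silence parser warnings
    $dom = new DOMDocument();
    $dom->loadHTML('<div>' . $ascii . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper is the first <div> in document order.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $remaining = $maxLength;
    truncateNode($wrapper, $remaining);

    // Serialize only the wrapper's children, not the html/body scaffolding.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo truncateHtmlDom('Hello, my <strong>name</strong> is <em>Sam</em>.', 12), "\n";
```

Because the parser builds the tree, tags are always closed on output; the cost is that the serialized markup may differ cosmetically from the input (for example in how entities are written).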

Easy brute-force approach:

Split the string into tags (not elements) and text fragments using preg_split('/(<tag>)/') with PREG_SPLIT_DELIM_CAPTURE.
Measure the text length you want (the text will be every second element from the split; you might use html_entity_decode() to help measure accurately).
Cut the string (trim &[^\s;]+$ at the end to get rid of a possibly chopped entity).
Fix it with HTML Tidy.
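
A rough sketch of the split, measure and cut steps is shown below. The function name cutHtmlByTextLength is made up for illustration, and the input is assumed to be valid UTF-8. One deliberate deviation from the steps above: instead of cutting the raw string and trimming a chopped entity afterwards, the sketch counts each entity as a single token, so an entity is never split. The Tidy repair is left as an optional last step because the tidy extension may not be installed.

```php
<?php
// Hypothetical helper illustrating the split, measure and cut steps.
// The result may still contain unclosed tags and needs a repair pass.
// Input is assumed to be valid UTF-8.
function cutHtmlByTextLength(string $html, int $maxLength): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the tags are kept in the result:
    // even indexes hold text fragments, odd indexes hold tags.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    $remaining = $maxLength;

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            $out .= $part;            // a tag: copy it without counting it
            continue;
        }

        // Split the text into tokens: a whole entity or a single character.
        preg_match_all('/&#?[a-zA-Z0-9]+;|./su', $part, $m);
        $tokens = $m[0];

        if (count($tokens) <= $remaining) {
            $out .= $part;
            $remaining -= count($tokens);
            continue;
        }

        // The limit falls inside this fragment: keep whole tokens only.
        $out .= implode('', array_slice($tokens, 0, $remaining));
        break;
    }

    return $out;
}

$html = "Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.";
$cut = cutHtmlByTextLength($html, 26);
echo $cut, "...\n";

// Optional repair step, only if the tidy extension is available.
if (function_exists('tidy_repair_string')) {
    echo tidy_repair_string($cut, ['show-body-only' => true, 'wrap' => 0], 'utf8'), "\n";
}
```

When the cut lands inside an element, as with a limit of 12 in this example, the returned string ends with an open tag, which is exactly the case the repair step is meant to handle.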

Answer 4

Answered by Stefan Gehrig

The following is a simple state-machine parser which handles your test case successfully. It fails on nested tags, though, as it doesn't track the tags themselves. It also chokes on entities within HTML tags (e.g. in an href attribute of an <a> tag). So it cannot be considered a 100% solution to this problem, but because it's easy to understand it could be the basis for a more advanced function.

function substr_html($string, $length)
{
    $count = 0;
    /*
     * $state = 0 - normal text
     * $state = 1 - in HTML tag
     * $state = 2 - in HTML entity
     */
    $state = 0;    
    for ($i = 0; $i < strlen($string); $i++) {
        $char = $string[$i];
        if ($char == '<') {
            $state = 1;
        } else if ($char == '&') {
            $state = 2;
            $count++;
        } else if ($char == ';') {
            $state = 0;
        } else if ($char == '>') {
            $state = 0;
        } else if ($state === 0) {
            $count++;
        }

        if ($count === $length) {
            return substr($string, 0, $i + 1);
        }
    }
    return $string;
}

Answer 5

Answered by periklis

I used a nice function found at http://alanwhipple.com/2011/05/25/php-truncate-string-preserving-html-tags-words, apparently taken from CakePHP

Answer 6

Answered by Andrey Nagikh

Another light change to Søren Løvborg's printTruncated function, making it UTF-8 compatible (needs mbstring) and making it return a string instead of printing one. I think that's more useful. And my code doesn't use output buffering like Bounce's variant, just one more variable.

UPD: to make it work properly with UTF-8 characters in tag attributes you need the mb_preg_match function, listed below.

Many thanks to Søren Løvborg for that function, it's very good.

/* Truncate HTML, close opened tags
*
* @param int, maxlength of the string
* @param string, html       
* @return $html
*/

function htmlTruncate($maxLength, $html)
{
    mb_internal_encoding("UTF-8");
    $printedLength = 0;
    $position = 0;
    $tags = array();
    $out = "";

    while ($printedLength < $maxLength && mb_preg_match('{</?([a-z][a-z0-9]*)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($tag, $tagPosition) = $match[0];

        // Print text leading up to the tag.
        $str = mb_substr($html, $position, $tagPosition - $position);
        if ($printedLength + mb_strlen($str) > $maxLength)
        {
            $out .= mb_substr($str, 0, $maxLength - $printedLength);
            $printedLength = $maxLength;
            break;
        }

        $out .= $str;
        $printedLength += mb_strlen($str);

        if ($tag[0] == '&')
        {
            // Handle the entity.
            $out .= $tag;
            $printedLength++;
        }
        else
        {
            // Handle the tag.
            $tagName = $match[1][0];
            if ($tag[1] == '/')
            {
                // This is a closing tag.

                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // check that tags are properly nested.

                $out .= $tag;
            }
            else if (mb_substr($tag, -2, 1) == '/')
            {
                // Self-closing tag.
                $out .= $tag;
            }
            else
            {
                // Opening tag.
                $out .= $tag;
                $tags[] = $tagName;
            }
        }

        // Continue after the tag.
        $position = $tagPosition + mb_strlen($tag);
    }

    // Print any remaining text.
    if ($printedLength < $maxLength && $position < mb_strlen($html))
        $out .= mb_substr($html, $position, $maxLength - $printedLength);

    // Close any open tags.
    while (!empty($tags))
        $out .= sprintf('</%s>', array_pop($tags));

    return $out;
}

function mb_preg_match(
    $ps_pattern,
    $ps_subject,
    &$pa_matches,
    $pn_flags = 0,
    $pn_offset = 0,
    $ps_encoding = NULL
) {
    // WARNING! - All this function does is to correct offsets, nothing else:
    // (code is independent of PREG_PATTERN_ORDER / PREG_SET_ORDER)

    if (is_null($ps_encoding)) $ps_encoding = mb_internal_encoding();

    $pn_offset = strlen(mb_substr($ps_subject, 0, $pn_offset, $ps_encoding));
    $ret = preg_match($ps_pattern, $ps_subject, $pa_matches, $pn_flags, $pn_offset);

    if ($ret && ($pn_flags & PREG_OFFSET_CAPTURE))
        foreach($pa_matches as &$ha_match) {
                $ha_match[1] = mb_strlen(substr($ps_subject, 0, $ha_match[1]), $ps_encoding);
        }

    return $ret;
}

Answer 7

Answered by gpilotino

You can use tidy as well:

function truncate_html($html, $max_length) {   
  return tidy_repair_string(substr($html, 0, $max_length),
     array('wrap' => 0, 'show-body-only' => TRUE), 'utf8'); 
}

Answer 8

Answered by DavidJ

The CakePHP framework has an HTML-aware truncate() function in the Text Helper that works for me. See Text. MIT license. Link to source (provided by @Quentin).

Answer 9

Answered by DavidJ

You could use DOMDocument in this case with a nasty regex hack; the worst that would happen is a warning if there's a broken tag:

$dom = new DOMDocument();
$dom->loadHTML(substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26));
$html = preg_replace("/\<\/?(body|html|p)>/", "", $dom->saveHTML());
echo $html;

Should give output: Hello, my <strong>name</strong>.

Answer 10

Answered by Bounce

I've made light changes to Søren Løvborg's printTruncated function, making it UTF-8 compatible:

   /* Truncate HTML, close opened tags
    *
    * @param int, maxlength of the string
    * @param string, html       
    * @return $html
    */  
    function html_truncate($maxLength, $html){

        mb_internal_encoding("UTF-8");

        $printedLength = 0;
        $position = 0;
        $tags = array();

        ob_start();

        while ($printedLength < $maxLength && preg_match('{</?([a-z][a-z0-9]*)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position)){

            list($tag, $tagPosition) = $match[0];

            // Print text leading up to the tag.
            $str = mb_strcut($html, $position, $tagPosition - $position);

            if ($printedLength + mb_strlen($str) > $maxLength){
                // The limit is in characters, so cut by characters here, not bytes.
                print(mb_substr($str, 0, $maxLength - $printedLength));
                $printedLength = $maxLength;
                break;
            }

            print($str);
            $printedLength += mb_strlen($str);

            if ($tag[0] == '&'){
                // Handle the entity.
                print($tag);
                $printedLength++;
            }
            else{
                // Handle the tag.
                $tagName = $match[1][0];
                if ($tag[1] == '/'){
                    // This is a closing tag.

                    $openingTag = array_pop($tags);
                    assert($openingTag == $tagName); // check that tags are properly nested.

                    print($tag);
                }
                else if (substr($tag, -2, 1) == '/'){
                    // Self-closing tag.
                    print($tag);
                }
                else{
                    // Opening tag.
                    print($tag);
                    $tags[] = $tagName;
                }
            }

            // Continue after the tag.
            // preg_match offsets are byte offsets, so advance by the byte length.
            $position = $tagPosition + strlen($tag);
        }

        // Print any remaining text.
        // $position is a byte offset; the remaining budget is in characters.
        if ($printedLength < $maxLength && $position < strlen($html))
            print(mb_substr(substr($html, $position), 0, $maxLength - $printedLength));

        // Close any open tags.
        while (!empty($tags))
             printf('</%s>', array_pop($tags));


        $bufferOuput = ob_get_contents();

        ob_end_clean();         

        $html = $bufferOuput;   

        return $html;   

    }
