在 C++ 字符串中转义 XML/HTML 的最有效方法?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/5665231/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-28 18:38:26  来源:igfitidea点击:

Most efficient way to escape XML/HTML in C++ string?

c++algorithmstringstl

提问by paperjam

I can't believe this question hasn't been asked before. I have a string that needs to be inserted into an HTML file but it may contain special HTML characters. I want to replace these with the appropriate HTML representation.

我不敢相信这个问题以前没有人问过。我有一个字符串需要插入到 HTML 文件中,但它可能包含特殊的 HTML 字符。我想用适当的 HTML 表示替换这些。

The code below works but is pretty verbose and ugly. Performance is not critical for my application but I guess there are scalability problems here also. How can I improve this? I guess this is a job for STL algorithms or some esoteric Boost function, but the code below is the best I can come up with myself.

下面的代码有效,但非常冗长和丑陋。性能对我的应用程序并不重要,但我想这里也存在可扩展性问题。我该如何改进?我想这是 STL 算法或一些深奥的 Boost 函数的工作,但下面的代码是我自己能想到的最好的代码。

void escape(std::string *data)
{
    std::string::size_type pos = 0;
    for (;;)
    {
        pos = data->find_first_of("\"&<>", pos);
        if (pos == std::string::npos) break;
        std::string replacement;
        switch ((*data)[pos])
        {
        case '\"': replacement = "&quot;"; break;   
        case '&':  replacement = "&amp;";  break;   
        case '<':  replacement = "&lt;";   break;   
        case '>':  replacement = "&gt;";   break;   
        default: ;
        }
        data->replace(pos, 1, replacement);
        pos += replacement.size();
    };
}

回答by Giovanni Funchal

Instead of just replacing in the original string, you can do copying with on-the-fly replacement which avoids having to move characters in the string. This will have much better complexity and cache behavior, so I'd expect a huge improvement. Or you can use boost::spirit::xml encodeor http://code.google.com/p/pugixml/.

除了替换原始字符串之外,您还可以使用即时替换进行复制,从而避免移动字符串中的字符。这将有更好的复杂性和缓存行为,所以我期待一个巨大的改进。或者您可以使用boost::spirit::xml 编码http://code.google.com/p/pugixml/

void encode(std::string& data) {
    std::string buffer;
    buffer.reserve(data.size());
    for(size_t pos = 0; pos != data.size(); ++pos) {
        switch(data[pos]) {
            case '&':  buffer.append("&amp;");       break;
            case '\"': buffer.append("&quot;");      break;
            case '\'': buffer.append("&apos;");      break;
            case '<':  buffer.append("&lt;");        break;
            case '>':  buffer.append("&gt;");        break;
            default:   buffer.append(&data[pos], 1); break;
        }
    }
    data.swap(buffer);
}

EDIT:A small improvement can be achieved by using an heuristic to determine the size of the buffer. Replace the buffer.reserveline with data.size()*1.1(10%) or something similar depending of how much replacements are expected.

编辑:通过使用启发式来确定缓冲区的大小可以实现一个小的改进。buffer.reservedata.size()*1.1(10%) 或类似的东西替换该行,具体取决于预期的替换次数。

回答by paperjam

void escape(std::string *data)
{
    using boost::algorithm::replace_all;
    replace_all(*data, "&",  "&amp;");
    replace_all(*data, "\"", "&quot;");
    replace_all(*data, "\'", "&apos;");
    replace_all(*data, "<",  "&lt;");
    replace_all(*data, ">",  "&gt;");
}

Could win the prize for least verbose?

能赢得最不冗长的奖吗?

回答by Aswath

Here is a simple ~30 line C program that does the trick in a rather good manner. Here I am assuming that the temp_str will have allocated memory enough to have the additional escaped characters.

这是一个简单的 ~30 行 C 程序,它以相当好的方式完成了这个技巧。在这里,我假设 temp_str 将分配足够的内存以容纳额外的转义字符。

void toExpatEscape(char *temp_str)
{
    const char cEscapeChars[6]={'&','\'','\"','>','<','
#ifndef _HTML_PREPROCESS_H_
#define _HTML_PREPROCESS_H_

#include <string>

class HtmlPreprocess
{
public:
    HtmlPreprocess();
    ~HtmlPreprocess();

    static void htmlspecialchars(
        const std::string & in,
        std::string & out
        );
};

#endif // _HTML_PREPROCESS_H_
'}; const char * const pEscapedSeqTable[] = { "&amp;", "&apos;", "&quot;", "&gt;", "&lt;", }; unsigned int i, j, k, nRef = 0, nEscapeCharsLen = strlen(cEscapeChars), str_len = strlen(temp_str); int nShifts = 0; for (i=0; i<str_len; i++) { for(nRef=0; nRef<nEscapeCharsLen; nRef++) { if(temp_str[i] == cEscapeChars[nRef]) { if((nShifts = strlen(pEscapedSeqTable[nRef]) - 1) > 0) { memmove(temp_str+i+nShifts, temp_str+i, str_len-i+nShifts); for(j=i,k=0; j<=i+nShifts,k<=nShifts; j++,k++) temp_str[j] = pEscapedSeqTable[nRef][k]; str_len += nShifts; } } } } temp_str[str_len] = '
#include "HtmlPreprocess.h"


HtmlPreprocess::HtmlPreprocess()
{
}


HtmlPreprocess::~HtmlPreprocess()
{
}


const unsigned char map_char_to_final_size[] = 
{
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   6,   1,   1,   1,   5,   6,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   4,   1,   4,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1
};


const unsigned char map_char_to_index[] = 
{
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   2,      0xFF,   0xFF,   0xFF,   0,      1,      0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   4,      0xFF,   3,      0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF
};


void HtmlPreprocess::htmlspecialchars(
    const std::string & in,
    std::string & out
    )
{
    const char * lp_in_stored = &in[0];
    size_t in_size = in.size();

    const char * lp_in = lp_in_stored;
    size_t final_size = 0;
    for (size_t i = 0; i < in_size; i++)
        final_size += map_char_to_final_size[*lp_in++];

    out.resize(final_size);

    lp_in = lp_in_stored;
    char * lp_out = &out[0];

    for (size_t i = 0; i < in_size; i++)
    {
        char current_char = *lp_in++;
        unsigned char next_action = map_char_to_index[current_char];

        switch (next_action){
        case 0:
            *lp_out++ = '&';
            *lp_out++ = 'a';
            *lp_out++ = 'm';
            *lp_out++ = 'p';
            *lp_out++ = ';';
            break;
        case 1:
            *lp_out++ = '&';
            *lp_out++ = 'a';
            *lp_out++ = 'p';
            *lp_out++ = 'o';
            *lp_out++ = 's';
            *lp_out++ = ';';
            break;
        case 2:
            *lp_out++ = '&';
            *lp_out++ = 'q';
            *lp_out++ = 'u';
            *lp_out++ = 'o';
            *lp_out++ = 't';
            *lp_out++ = ';';
            break;
        case 3:
            *lp_out++ = '&';
            *lp_out++ = 'g';
            *lp_out++ = 't';
            *lp_out++ = ';';
            break;
        case 4:
            *lp_out++ = '&';
            *lp_out++ = 'l';
            *lp_out++ = 't';
            *lp_out++ = ';';
            break;
        default:
            *lp_out++ = current_char;
        }
    }
}
'; }

回答by HostageBrain

My tests showed thisanswer gave the best performance from offered (not surprising it has the most rate).
I've implemented same algorithm for my project (I really want good performance & memory usage) - my tests showed my implementation has ~2.6-3.25 better speed performace. Also I don't like previous best offered algorithm bcs of bad memory usage - you will have extra memory usage as when apply 1.1 multiplier 'heuristic', as when .append() lead to resize.
So, leave my code here - maybe somebody find it useful.

HtmlPreprocess.h:

我的测试表明,这个答案给出了所提供的最佳性能(毫不奇怪,它的速率最高)。
我已经为我的项目实现了相同的算法(我真的想要良好的性能和内存使用) - 我的测试表明我的实现有 ~2.6-3.25 更好的速度性能。此外,我不喜欢以前最好的算法 bcs 的内存使用情况不佳 - 当应用 1.1 乘数“启发式”时,您将有额外的内存使用量,因为 .append() 导致调整大小。
所以,把我的代码留在这里——也许有人觉得它很有用。

HtmlPreprocess.h:

#include <algorithm>

namespace xml {

    // Helper for null-terminated ASCII strings (no end of string iterator).
    template<typename InIter, typename OutIter>
    OutIter copy_asciiz ( InIter begin, OutIter out )
    {
        while ( *begin != '
#include <iterator>
#include <string>

namespace xml {

    // Get escaped version of "content".
    std::string escape ( const std::string& content )
    {
        std::string result;
        result.reserve(content.size());
        escape(content.begin(), content.end(), std::back_inserter(result));
        return (result);
    }

    // Escape data on the fly, using "constant" memory.
    void escape ( std::istream& in, std::ostream& out )
    {
        escape(std::istreambuf_iterator<char>(in),
            std::istreambuf_iterator<char>(),
            std::ostreambuf_iterator<char>(out));
    }

}
' ) { *out++ = *begin++; } return (out); } // XML escaping in it's general form. Note that 'out' is expected // to an "infinite" sequence. template<typename InIter, typename OutIter> OutIter escape ( InIter begin, InIter end, OutIter out ) { static const char bad[] = "&<>"; static const char* rep[] = {"&amp;", "&lt;", "&gt;"}; static const std::size_t n = sizeof(bad)/sizeof(bad[0]); for ( ; (begin != end); ++begin ) { // Find which replacement to use. const std::size_t i = std::distance(bad, std::find(bad, bad+n, *begin)); // No need for escaping. if ( i == n ) { *out++ = *begin; } // Escape the character. else { out = copy_asciiz(rep[i], out); } } return (out); } }

HtmlPreprocess.cpp:

HtmlPreprocess.cpp:

#include <iostream>

int main ( int, char ** )
{
    std::cout << xml::escape("<foo>bar & qux</foo>") << std::endl;
}

回答by André Caron

I'd honestly go with a more generic version using iterators, such that you can "stream" the encoding. Consider the following implementation:

老实说,我会使用迭代器使用更通用的版本,这样您就可以“流式传输”编码。考虑以下实现:

const unsigned char calcFinalSize[] =
{
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   6,   1,   1,   1,   5,   6,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   4,   1,   4,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1
};

void escapeXml(std::string & in)
{
    const char* dataIn = in.data();
    size_t sizeIn = in.size();

    const char* dataInCurrent = dataIn;
    const char* dataInEnd = dataIn + sizeIn;
    size_t outSize = 0;
    while (dataInCurrent < dataInEnd)
    {
        outSize += calcFinalSize[static_cast<uint8_t>(*dataInCurrent)];
        dataInCurrent++;
    }


    if (outSize == sizeIn)
    {
        return;
    }
    std::string out;
    out.resize(outSize);

    dataInCurrent = dataIn;
    char* dataOut = &out[0];
    while (dataInCurrent < dataInEnd)
    {
        switch (*dataInCurrent) {
        case '&':
            memcpy(dataOut, "&amp;", sizeof("&amp;") - 1);
            dataOut += sizeof("&amp;") - 1;
            break;
        case '\'':
            memcpy(dataOut, "&apos;", sizeof("&apos;") - 1);
            dataOut += sizeof("&apos;") - 1;
            break;
        case '\"':
            memcpy(dataOut, "&quot;", sizeof("&quot;") - 1);
            dataOut += sizeof("&quot;") - 1;
            break;
        case '>':
            memcpy(dataOut, "&gt;", sizeof("&gt;") - 1);
            dataOut += sizeof("&gt;") - 1;
            break;
        case '<':
            memcpy(dataOut, "&lt;", sizeof("&lt;") - 1);
            dataOut += sizeof("&lt;") - 1;
            break;
        default:
            *dataOut++ = *dataInCurrent;
        }
        dataInCurrent++;
    }
    in.swap(out);
}

Then, you can simplify the average case using a few overloads:

然后,您可以使用一些重载来简化平均情况:

template<class Str>
Str encode_char_entities(const Str &s)
{
    // Don't do anything for empty strings.
    if(s.empty()) return s;

    typedef typename Str::value_type Ch;

    Str r;
    // To properly round-trip spaces and not uglify the XML beyond
    // recognition, we have to encode them IF the text contains only spaces.
    Str sp(1, Ch(' '));
    if(s.find_first_not_of(sp) == Str::npos) {
        // The first will suffice.
        r = detail::widen<Str>("&#32;");
        r += Str(s.size() - 1, Ch(' '));
    } else {
        typename Str::const_iterator end = s.end();
        for (typename Str::const_iterator it = s.begin(); it != end; ++it)
        {
            switch (*it)
            {
                case Ch('<'): r += detail::widen<Str>("&lt;"); break;
                case Ch('>'): r += detail::widen<Str>("&gt;"); break;
                case Ch('&'): r += detail::widen<Str>("&amp;"); break;
                case Ch('"'): r += detail::widen<Str>("&quot;"); break;
                case Ch('\''): r += detail::widen<Str>("&apos;"); break;
                default: r += *it; break;
            }
        }
    }
    return r;
}

Finally, test the whole lot:

最后,测试整个批次:

 std::string& rep(std::string &s, std::string from, std::string to)
    {
      int pos = -1;
      while ( (pos = s.find(from, pos+1) ) != string::npos)
        s.erase(pos, from.length()).insert(pos, to);

      return s;
    }

回答by Anthony Johnson

If you're going for processing speed, then it seems to me that the best would be to have a second string that you build as you go, copying from the first string to the second string, and then appending the html escapes as you encounter them. Since I assume that the replace method involves first a memory move, followed by a copy into the replaced position, it's going to be very slow for large strings. If you have a second string to build using .append(), it will avoid the memory move.

如果您要提高处理速度,那么在我看来,最好的方法是创建第二个字符串,从第一个字符串复制到第二个字符串,然后在遇到时附加 html 转义符他们。由于我假设替换方法首先涉及内存移动,然后复制到替换位置,因此对于大字符串,它会非常慢。如果您要使用 .append() 构建第二个字符串,它将避免内存移动。

As far was code "cleanness", I think that's about as pretty as you're going to get. You could create an array of characters and their replacements, and then search the array, but that would probably be slower and not much cleaner anyway.

就代码“清洁度”而言,我认为这与您将要获得的一样漂亮。您可以创建一个字符及其替换数组,然后搜索该数组,但这可能会更慢,而且无论如何也不会更清晰。

回答by PatrickF

I profiled 3 solutions with Visual Studio 2017. Input were 10 000 000 strings of size 5-20 with a probability of 9,4% that a char needs to be escaped.

我使用 Visual Studio 2017 分析了 3 个解决方案。输入是 10 000 000 个大小为 5-20 的字符串,字符需要转义的概率为 9.4%。

  1. Solution from Giovanni Funchal
  2. Solution from HostageBrain
  3. Solution is mine
  1. Giovanni Funchal 的解决方案
  2. HostageBrain 的解决方案
  3. 解决方案是我的

The result:

结果:

  1. needs 1.675 seconds
  2. needs 0.769 seconds
  3. needs 0.368 seconds
  1. 需要 1.675 秒
  2. 需要 0.769 秒
  3. 需要 0.368 秒

In mine Solution, the final size is precalculated and a copy of string data is done, only when needed. So the heap memory allocations should be minimal.

在我的解决方案中,只有在需要时才预先计算最终大小并完成字符串数据的副本。所以堆内存分配应该是最小的。

rep(s, "&", "&quot;");
rep(s, "\"", "&quot;");

Edit: Replaced "&quote;"with "&quot;". Old solution was overwriting memory, because the look-up table contained a length of 6 for "&quote;".

编辑:替换"&quote;""&quot;". 旧的解决方案是覆盖内存,因为查找表包含 6 的长度"&quote;"

回答by topkara

You can use the boost::property_tree::xml_parser::encode_char_entitiesif you don't want to write it yourself.

boost::property_tree::xml_parser::encode_char_entities如果您不想自己编写,可以使用。

For reference, here's the code in boost 1.64.0:

作为参考,这里是代码boost 1.64.0

```

``

rep(s, "HTML","xxxx");

```

``

回答by etipici

Or with just stl :

或者只使用 stl :

##代码##

Usage:

用法:

##代码##

or:

或者:

##代码##