C/C++ URL decoding library

Question

Asked by michael

I am developing a C/C++ program on Linux. Is there a C/C++ library that decodes URLs?

I am looking for a library that converts "http%3A%2F%2F" to "http://", or "a+t+%26+t" to "a t & t".

Thank you.

Answer 1

Answered by ThomasH

I actually used Saul's function in an analysis program I was writing (analyzing millions of URL-encoded strings). It works, but at that scale it slowed my program down horribly, so I decided to write a faster version. This one is thousands of times faster when compiled with GCC and the -O2 option. It can also use the same buffer for output and input (e.g. urldecode2(buf, buf) works if the original string is in buf and is to be overwritten by its decoded counterpart).

Edit: It doesn't take the buffer size as an input because the buffer is assumed to be large enough. This is safe because the output is never longer than the input, so either use the same buffer for the output or allocate one of at least strlen(input) + 1 bytes (the extra byte is for the null terminator), e.g.:

char *output = malloc(strlen(input)+1);
urldecode2(output, input);
printf("Decoded string: %s\n", output);

Edit 2: An anonymous user attempted to edit this answer so that it translates the '+' character to ' '. I think it probably should do that; it wasn't something I needed for my application, but I've added it below.

Here's the routine:

#include <stdlib.h>
#include <ctype.h>

void urldecode2(char *dst, const char *src)
{
        char a, b;
        while (*src) {
                /* a "%XY" escape: both characters after '%' must be hex digits */
                if ((*src == '%') &&
                    ((a = src[1]) && (b = src[2])) &&
                    (isxdigit((unsigned char)a) && isxdigit((unsigned char)b))) {
                        /* fold lowercase to uppercase, then map to 0..15 */
                        if (a >= 'a')
                                a -= 'a'-'A';
                        if (a >= 'A')
                                a -= ('A' - 10);
                        else
                                a -= '0';
                        if (b >= 'a')
                                b -= 'a'-'A';
                        if (b >= 'A')
                                b -= ('A' - 10);
                        else
                                b -= '0';
                        *dst++ = 16*a+b;
                        src+=3;
                } else if (*src == '+') {
                        /* '+' encodes a space in form-encoded data */
                        *dst++ = ' ';
                        src++;
                } else {
                        *dst++ = *src++;
                }
        }
        *dst++ = '\0';
}
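
urldecode2 trusts the caller to provide a large enough destination. Where that is a concern, a size-checked variant is one option. The following is only a minimal sketch under that assumption, not part of the original answer; the names urldecode_n and hex_value and the return convention are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: value of one hex digit.
 * The caller must already have checked the character with isxdigit(). */
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower(c) - 'a' + 10;
}

/* Hypothetical size-checked decoder.
 * Writes at most dst_size bytes, terminator included.
 * Returns the decoded length, or -1 if dst is too small; in that case
 * dst holds a terminated, truncated result. dst may be the same buffer
 * as src, because writes never get ahead of reads. */
static long urldecode_n(char *dst, size_t dst_size, const char *src)
{
    size_t n = 0;

    if (dst_size == 0)
        return -1;

    while (*src) {
        unsigned char c;

        if (src[0] == '%' &&
            isxdigit((unsigned char)src[1]) &&
            isxdigit((unsigned char)src[2])) {
            c = (unsigned char)(16 * hex_value((unsigned char)src[1]) +
                                hex_value((unsigned char)src[2]));
            src += 3;
        } else if (*src == '+') {
            c = ' ';
            src++;
        } else {
            c = (unsigned char)*src++;
        }

        /* keep one byte free for the terminator */
        if (n + 1 >= dst_size) {
            dst[n] = '\0';
            return -1;
        }
        dst[n++] = (char)c;
    }
    dst[n] = '\0';
    return (long)n;
}
```

For example, with char out[32], the call urldecode_n(out, sizeof out, "a+t+%26+t") stores "a t & t" in out and returns 7.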

Answer 2

Answered by chmike

Here is a C decoder for a percent-encoded string. It returns -1 if the encoding is invalid and 0 otherwise. The decoded string is stored in out. I'm quite sure this is the fastest code of the answers given so far.

#include <stddef.h>

int percent_decode(char* out, const char* in)
{
    /* Maps a character to its hex value, or -1 if it is not a hex digit.
       signed char keeps -1 negative where plain char is unsigned. */
    static const signed char tbl[256] = {
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
         0, 1, 2, 3, 4, 5, 6, 7,  8, 9,-1,-1,-1,-1,-1,-1,
        -1,10,11,12,13,14,15,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,10,11,12,13,14,15,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
        -1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1
    };
    char c, *beg=out;
    signed char v1, v2;
    if(in != NULL) {
        while((c=*in++) != '\0') {
            if(c == '%') {
                /* both characters after '%' must be hex digits */
                if((v1=tbl[(unsigned char)*in++])<0 ||
                   (v2=tbl[(unsigned char)*in++])<0) {
                    *beg = '\0';
                    return -1;
                }
                c = (v1<<4)|v2;
            }
            *out++ = c;
        }
    }
    *out = '\0';
    return 0;
}
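
A 256-entry lookup table like the one in percent_decode is fast, but it is tedious to type and easy to get wrong. As a minimal sketch (not part of the original answer, and assuming an ASCII-compatible character set), the same mapping can be filled in at run time. The names hex_table, build_hex_table and decode_hex_pair are hypothetical.

```c
#include <stddef.h>

/* Hypothetical table: hex_table[c] is the value of hex digit c, or -1.
 * signed char keeps -1 negative even where plain char is unsigned. */
static signed char hex_table[256];

/* Fill hex_table once, before the first lookup.
 * Assumes 'A'..'F' and 'a'..'f' are contiguous, as they are in ASCII. */
static void build_hex_table(void)
{
    size_t i;

    for (i = 0; i < 256; ++i)
        hex_table[i] = -1;
    for (i = 0; i < 10; ++i)
        hex_table['0' + i] = (signed char)i;
    for (i = 0; i < 6; ++i) {
        hex_table['A' + i] = (signed char)(10 + i);
        hex_table['a' + i] = (signed char)(10 + i);
    }
}

/* Decode the two hex digits that follow a '%'.
 * Returns 0..255, or -1 if either character is not a hex digit.
 * The second character is only read when the first one is valid,
 * so a string that ends right after '%' is not overrun. */
static int decode_hex_pair(const char *p)
{
    int hi = hex_table[(unsigned char)p[0]];
    int lo;

    if (hi < 0)
        return -1;
    lo = hex_table[(unsigned char)p[1]];
    if (lo < 0)
        return -1;
    return (hi << 4) | lo;
}
```

After a single call to build_hex_table(), decode_hex_pair("3A") yields 0x3A and decode_hex_pair("zz") yields -1.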

Answer 3

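Answered by Saul

This function I've just whipped up is very lightweight and should do what you want. Note that I haven't written it to strict URI standards (I used what I know off the top of my head). It's buffer-safe and doesn't overflow as far as I can see; adapt as you see fit:

#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

// Note: this is C++ (it uses bool and the enum name as a type)
void urldecode(char *pszDecodedOut, size_t nBufferSize, const char *pszEncodedIn)
{
    memset(pszDecodedOut, 0, nBufferSize);

    enum DecodeState_e
    {
        STATE_SEARCH = 0, ///< searching for a percent sign to convert
        STATE_CONVERTING, ///< convert the two following characters from hex
    };

    DecodeState_e state = STATE_SEARCH;

    // Compute the length once instead of calling strlen() on every iteration
    const size_t nLength = strlen(pszEncodedIn);

    for(size_t i = 0; i < nLength; ++i)
    {
        switch(state)
        {
        case STATE_SEARCH:
            {
                if(pszEncodedIn[i] != '%')
                {
                    // Ensure there is room for one more character and the terminator
                    assert(strlen(pszDecodedOut) + 1 < nBufferSize);
                    strncat(pszDecodedOut, &pszEncodedIn[i], 1);
                    break;
                }

                // We are now converting
                state = STATE_CONVERTING;
            }
            break;

        case STATE_CONVERTING:
            {
                // Conversion complete (i.e. don't convert again next iter)
                state = STATE_SEARCH;

                // Create a buffer to hold the hex. For example, if %20, this
                // buffer would hold 20 (in ASCII)
                char pszTempNumBuf[3] = {0};
                strncpy(pszTempNumBuf, &pszEncodedIn[i], 2);

                // Ensure both characters are hexadecimal
                bool bBothDigits = true;

                for(int j = 0; j < 2; ++j)
                {
                    if(!isxdigit((unsigned char)pszTempNumBuf[j]))
                        bBothDigits = false;
                }

                if(!bBothDigits)
                    break;

                // Convert two hexadecimal characters into one character
                unsigned int nAsciiCharacter;
                sscanf(pszTempNumBuf, "%x", &nAsciiCharacter);

                // Ensure we aren't going to overflow
                assert(strlen(pszDecodedOut) + 1 < nBufferSize);

                // Concatenate this character onto the output; going through a
                // char keeps this independent of byte order
                const char chDecoded = (char)nAsciiCharacter;
                strncat(pszDecodedOut, &chDecoded, 1);

                // Skip the next character
                i++;
            }
            break;
        }
    }
}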

Answer 4

Answered by unwind

The ever-excellent glib has some URI functions, including scheme extraction, escaping and un-escaping.

Answer 5

Answered by Cristian Adam

The uriparser library is small and lightweight.

Answer 6

Answered by piotr

Try urlcpp: https://github.com/larroy/urlcpp. It's a C++ module that you can easily integrate into your project; it depends on boost::regex.

Answer 7

Answered by carmellose

I'd suggest curl and libcurl. It's widely used and should do the trick for you. Just check their website.

Answer 8

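Answered by grufo

Thanks to @ThomasH for his answer. I'd like to propose a better formatting here…

And… since the decoded URI component is never longer than the same encoded URI component, it is always possible to implode it within the same array of characters (a.k.a. "string"). So I'll propose two possibilities here:

#include <stdio.h>
#include <ctype.h>

/* Decodes sSource into sDest and returns the decoded length.
   Note: sSource is modified while decoding. */
int decodeURIComponent (char *sSource, char *sDest) {
    int nLength;
    for (nLength = 0; *sSource; nLength++) {
        if (*sSource == '%' && sSource[1] && sSource[2] &&
            isxdigit((unsigned char)sSource[1]) && isxdigit((unsigned char)sSource[2])) {
            sSource[1] -= sSource[1] <= '9' ? '0' : (sSource[1] <= 'F' ? 'A' : 'a')-10;
            sSource[2] -= sSource[2] <= '9' ? '0' : (sSource[2] <= 'F' ? 'A' : 'a')-10;
            sDest[nLength] = 16 * sSource[1] + sSource[2];
            sSource += 3;
            continue;
        }
        sDest[nLength] = *sSource++;
    }
    sDest[nLength] = '\0';
    return nLength;
}

/* In-place variant: the decoded text overwrites the encoded text */
#define implodeURIComponent(url) decodeURIComponent(url, url)

And, finally…:

int main () {

    char sMyUrl[] = "http%3a%2F%2ffoo+bar%2fabcd";

    int nNewLength = implodeURIComponent(sMyUrl);

    /* Let's print: "http://foo+bar/abcd\nLength: 19" */
    printf("%s\nLength: %d\n", sMyUrl, nNewLength);

    return 0;

}

Ste*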

Answer 9

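Answered by Marc Heimann

Came across this 8-year-old question as I was looking for the same thing. Based on the previous answers, I also wrote my own version, which is independent of any library, easy to understand and probably fast (no benchmark). I tested the code with gcc; it should decode until the end or until an invalid character (not tested). Just remember to free the allocated space.

#include <stdlib.h>
#include <string.h>

/* Indexed by (character - '0'); covers '0'..'9' and 'A'..'F' */
const char ascii_hex_4bit[23] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0, 0, 10, 11, 12, 13, 14, 15};

static inline char to_upper(char c)
{
    if ((c >= 'a') && (c <= 'z')) return c ^ 0x20;
    return c;
}

char *url_decode(const char *str)
{
    size_t i, j, len = strlen(str);
    char c, d, url_hex;
    /* +2: the loop body runs once even for an empty string,
       so up to len + 2 bytes are written */
    char *decoded = malloc(len + 2);

    if (decoded == NULL) return NULL;

    i = 0;
    j = 0;
    do
    {
        c = str[i];
        d = 0;

        if (c == '%')
        {
            url_hex = to_upper(str[++i]);
            if (((url_hex >= '0') && (url_hex <= '9')) || ((url_hex >= 'A') && (url_hex <= 'F')))
            {
                d = ascii_hex_4bit[url_hex - 48] << 4;

                url_hex = to_upper(str[++i]);
                if (((url_hex >= '0') && (url_hex <= '9')) || ((url_hex >= 'A') && (url_hex <= 'F')))
                {
                    d |= ascii_hex_4bit[url_hex - 48];
                }
                else
                {
                    d = 0;
                }
            }
        }
        else if (c == '+')
        {
            d = ' ';
        }
        else if ((c == '*') || (c == '-') || (c == '.') || ((c >= '0') && (c <= '9')) ||
        ((c >= 'A') && (c <= 'Z')) || (c == '_') || ((c >= 'a') && (c <= 'z')))
        {
            d = c;
        }

        /* d stays 0 for any invalid character, which ends the loop */
        decoded[j++] = d;
        ++i;
    } while ((i < len) && (d != 0));

    decoded[j] = 0;
    return decoded;
}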

Answer 10

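Answered by hnrayer

#include <ctype.h>
#include <string.h>
#include <libavutil/mem.h> /* av_malloc comes from FFmpeg's libavutil */

/**
 * Locale-independent conversion of ASCII characters to lowercase.
 */
int av_tolower(int c)
{
    if (c >= 'A' && c <= 'Z')
        c ^= 0x20;
    return c;
}
/**
 * Decodes an URL from its percent-encoded form back into normal
 * representation. This function returns the decoded URL in a string.
 * The URL to be decoded does not necessarily have to be encoded but
 * in that case the original string is duplicated.
 *
 * @param url a string to be decoded.
 * @return new string with the URL decoded or NULL if decoding failed.
 * Note that the returned string should be explicitly freed when not
 * used anymore.
 */
char *urldecode(const char *url)
{
    int s = 0, d = 0, url_len = 0;
    char c;
    char *dest = NULL;

    if (!url)
        return NULL;

    /* url_len counts the terminator, so the loop below copies it too */
    url_len = strlen(url) + 1;
    dest = av_malloc(url_len);

    if (!dest)
        return NULL;

    while (s < url_len) {
        c = url[s++];

        if (c == '%' && s + 2 < url_len) {
            char c2 = url[s++];
            char c3 = url[s++];
            if (isxdigit((unsigned char)c2) && isxdigit((unsigned char)c3)) {
                c2 = av_tolower(c2);
                c3 = av_tolower(c3);

                if (c2 <= '9')
                    c2 = c2 - '0';
                else
                    c2 = c2 - 'a' + 10;

                if (c3 <= '9')
                    c3 = c3 - '0';
                else
                    c3 = c3 - 'a' + 10;

                dest[d++] = 16 * c2 + c3;

            } else { /* %zz or something other invalid */
                dest[d++] = c;
                dest[d++] = c2;
                dest[d++] = c3;
            }
        } else if (c == '+') {
            dest[d++] = ' ';
        } else {
            dest[d++] = c;
        }

    }

    return dest;
}

by
www.elesos.com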
