Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/7979567/

PHP: Convert any string to UTF-8 without knowing the original character set, or at least try

php utf-8 character-encoding

Asked by Grim...

I have an application that deals with clients from all over the world, and, naturally, I want everything going into my databases to be UTF-8 encoded.

The main problem for me is that I don't know what encoding the source of any string is going to be - it could be from a text box (using <form accept-charset="utf-8"> is only useful if the user actually submitted the form), or it could be from an uploaded text file, so I really have no control over the input.

What I need is a function or class that makes sure the stuff going into my database is, as far as possible, UTF-8 encoded. I've tried iconv(mb_detect_encoding($text), "UTF-8", $text); but that has problems (if the input is 'fiancée' it returns 'fianc'). I've tried a lot of things =/

For file uploads, I like the idea of asking the end user to specify the encoding they use and showing them previews of what the output will look like, but this doesn't help against nasty hackers (in fact, it could make their life a little easier).

I've read the other SO questions on the subject, but they seem to all have subtle differences like "I need to parse RSS feeds" or "I scrape data from websites" (or, indeed, "You can't").

But there must be something that at least has a good try!

Answered by Jeff Day

What you're asking for is extremely hard. If possible, getting the user to specify the encoding is the best. Preventing an attack shouldn't be much easier or harder that way.

However, you could try doing this:

iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);

Setting the third argument (strict) to true might help you get a better result.
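
As a rough illustration of the strict-mode call, the sketch below wraps it in a helper that also handles the failure cases. The function name toUtf8OrNull and the default candidate list are assumptions made for this example, not part of the answer.

```php
<?php
// Hypothetical helper (not from the answer): convert $text to UTF-8 using
// strict detection, or return null when no candidate encoding fits.
function toUtf8OrNull(string $text, array $candidates = ['UTF-8', 'ISO-8859-1']): ?string
{
    // With strict = true, only an encoding in which the whole string is
    // valid can be reported; otherwise the result is false.
    $detected = mb_detect_encoding($text, $candidates, true);
    if ($detected === false) {
        return null;
    }

    // iconv() returns false on failure, so map that to null as well.
    $converted = iconv($detected, 'UTF-8', $text);
    return $converted === false ? null : $converted;
}

// "fianc\xE9e" is 'fiancée' in ISO-8859-1; it is not valid UTF-8.
var_dump(bin2hex((string) toUtf8OrNull("fianc\xE9e")));
```

Checking for false before calling iconv() matters here: passing a failed detection straight into iconv() is one way to end up with broken or truncated output.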

Answered by Oroboros102

In motherland Russia we have 4 popular encodings, so your question is in great demand here.

You cannot detect an encoding from the char codes of the symbols alone, because code pages overlap. Some code pages for different languages even overlap completely. So, we need another approach.

The only way to work with unknown encodings is to work with probabilities. So, we do not try to answer the question "what is the encoding of this text?"; we try to work out "what is the most likely encoding of this text?".

One guy on a popular Russian tech blog came up with this approach:

Build the probability range of char codes for every encoding you want to support. You can build it from some big texts in your language (e.g. some fiction: use Shakespeare for English and Tolstoy for Russian, lol). You will get something like this:

    encoding_1:
    190 => 0.095249209893009,
    222 => 0.095249209893009,
    ...
    encoding_2:
    239 => 0.095249209893009,
    207 => 0.095249209893009,
    ...
    encoding_N:
    charcode => probability

Next, take the text in the unknown encoding, and for every encoding in your "probability dictionary" look up the frequency of every symbol that occurs in the text. Sum the probabilities of the symbols. The encoding with the highest rating is likely the winner. Bigger texts give better results.
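
The two steps above could be sketched roughly as follows. This is only an illustration under several assumptions: the function names are made up, only bytes above 0x7F are counted, the sample sentence is far too short for real use, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Sketch only: hypothetical names, toy-sized sample text.

/** Relative frequency of every non-ASCII byte in a sample already encoded in the target encoding. */
function buildByteFrequencies(string $encodedSample): array
{
    $total = strlen($encodedSample);
    $table = [];
    // count_chars(..., 1) lists only the byte values that actually occur.
    foreach (count_chars($encodedSample, 1) as $byte => $count) {
        if ($byte >= 0x80) { // skip ASCII: it is shared by the encodings compared here
            $table[$byte] = $count / $total;
        }
    }
    return $table;
}

/** Sum the table probabilities of the bytes in $unknown; the highest total wins. */
function guessEncoding(string $unknown, array $tables): ?string
{
    $counts = count_chars($unknown, 1);
    $best = null;
    $bestScore = 0.0;
    foreach ($tables as $encoding => $table) {
        $score = 0.0;
        foreach ($counts as $byte => $count) {
            $score += $count * ($table[$byte] ?? 0.0);
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $encoding;
        }
    }
    return $best; // null when no table matched any byte
}

// Build one table per supported encoding from the same UTF-8 sample.
$sample = 'Все счастливые семьи похожи друг на друга';
$tables = [];
foreach (['Windows-1251', 'KOI8-R'] as $encoding) {
    $tables[$encoding] = buildByteFrequencies(mb_convert_encoding($sample, $encoding, 'UTF-8'));
}

// Simulate input in an unknown encoding and score it.
$unknown = mb_convert_encoding('друг семьи', 'KOI8-R', 'UTF-8');
var_dump(guessEncoding($unknown, $tables));
```

A real table would be built from a large corpus per language, and the result is still only the most likely candidate, never a certainty.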

If you are interested, I can gladly help you with this task. We can greatly increase the accuracy by building a probability list for pairs of char codes.

Btw, mb_detect_encoding certainly does not work. Yes, not at all. Please take a look at the mb_detect_encoding source code in "ext/mbstring/libmbfl/mbfl/mbfl_ident.c".

Answered by Alexey Gerasimov

You've probably tried this too, but why not just use the mb_convert_encoding function? It can attempt to auto-detect the character set of the text provided when you pass it a list of candidate encodings.

Also, I tried to run:

$text = "fiancée";
echo mb_convert_encoding($text, "UTF-8");
echo "<br/><br/>";
echo iconv(mb_detect_encoding($text), "UTF-8", $text);

and the results are the same for both. How do you see that your text is truncated to 'fianc'? Is it in the DB or in a browser?
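
One thing worth keeping in mind with a test like this: the bytes inside the string literal depend on how the PHP source file itself was saved. The sketch below sidesteps that by writing the bytes explicitly; the variable names are only illustrative.

```php
<?php
// The same visible word as two different byte strings.
$latin1 = "fianc\xE9e";      // 'é' as the single byte 0xE9 (ISO-8859-1)
$utf8   = "fianc\xC3\xA9e";  // 'é' as the two bytes 0xC3 0xA9 (UTF-8)

echo bin2hex($latin1), "\n"; // 6669616e63e965
echo bin2hex($utf8), "\n";   // 6669616e63c3a965

// Converting with the correct source encoding yields identical UTF-8 bytes.
var_dump(mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1') === $utf8); // bool(true)
```

Dumping the hex like this, before and after conversion, helps tell apart a conversion problem from a display problem in the browser or the database client.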

Answered by matthiasmullie

There is no completely accurate way to identify the charset of a string. There are ways to try to guess it. One of these, and probably the best currently available in PHP, is mb_detect_encoding(). This will scan your string and look for occurrences of byte sequences unique to certain charsets. Depending on your string, there may be no such distinguishing occurrences.

Take ISO-8859-1 vs ISO-8859-15 (http://en.wikipedia.org/wiki/ISO/IEC_8859-15#Changes_from_ISO-8859-1).

There's only a handful of different characters, and to make it worse, they're represented by the same bytes. Given a string without knowing its encoding, there is no way to detect whether byte 0xA4 is supposed to signify ¤ or € in your string, so there is no way to know its exact charset.
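
To make the ambiguity concrete, here is a small sketch that converts the same single byte under both assumptions. Both conversions succeed, so nothing in the byte itself says which one is right.

```php
<?php
$byte = "\xA4";

// Interpreted as ISO-8859-1, 0xA4 is U+00A4 (currency sign).
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
// Interpreted as ISO-8859-15, 0xA4 is U+20AC (euro sign).
$asLatin9 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-15');

echo bin2hex($asLatin1), "\n"; // c2a4
echo bin2hex($asLatin9), "\n"; // e282ac

// The byte is valid in both encodings, so a validity check cannot decide.
var_dump(mb_check_encoding($byte, 'ISO-8859-1'), mb_check_encoding($byte, 'ISO-8859-15'));
```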

(Note: you could add a human factor, or an even more advanced scanning technique (e.g. what Oroboros102 suggests), to try to figure out from the surrounding context whether the character should be ¤ or €, though this seems like a bridge too far.)

There are more distinguishable differences between e.g. UTF-8 and ISO-8859-1, so it's still worth trying to figure it out when you're unsure, though you cannot, and should never, rely on it being correct.

Interesting read: http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html#how-do-i-determine-the-charset-encoding-of-a-string

There are other ways of ensuring the correct charset though. Concerning forms, try to enforce UTF-8 as much as possible (check out snowman to make sure your submission will be UTF-8 in every browser: http://intertwingly.net/blog/2010/07/29/Rails-and-Snowmen). That being done, at least you can be sure that every text submitted through your forms is UTF-8. Concerning uploaded files, try running the unix 'file -i' command on it through e.g. exec() (if possible on your server) to aid the detection (using the document's BOM). Concerning scraped data, you could read the HTTP headers, which usually specify the charset. When parsing XML files, see if the XML meta-data contain a charset definition.
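
Two of these declared-charset sources can be read with plain string functions. The helpers below are a minimal sketch: the names charsetFromContentType and encodingFromBom are hypothetical, the regular expression is deliberately loose, and UTF-32 byte order marks are not handled.

```php
<?php
// Hypothetical helpers; names are illustrative only.

/** Extract the charset parameter from a Content-Type header value, if present. */
function charsetFromContentType(string $header): ?string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $header, $m) === 1) {
        return strtoupper($m[1]);
    }
    return null;
}

/** Recognize a UTF-8 or UTF-16 byte order mark at the start of some file contents. */
function encodingFromBom(string $bytes): ?string
{
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE'; // note: a UTF-32LE BOM starts with the same two bytes
    }
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }
    return null;
}

var_dump(charsetFromContentType('text/html; charset=iso-8859-1'));
var_dump(encodingFromBom("\xEF\xBB\xBFhello"));
```

A declared charset is still only a claim by the sender, so validating the bytes against it before converting remains a good idea.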

Rather than trying to automagically guess the charset, you should first try to ensure a certain charset yourself where possible, or try to grab a definition from the source you're getting the data from (if applicable), before resorting to detection.

Answered by Anthony Rutledge

There are some really good answers and attempts to answer your question here. I am not an encoding master, but I understand your desire to have a pure UTF-8 stack all the way through to your database. I have been using MySQL's utf8mb4 encoding for tables, fields, and connections.

My situation boiled down to "I just want my sanitizers, validators, business logic, and prepared statements to deal with UTF-8 when data comes from HTML forms, or e-mail registration links." So, in my simple way, I started off with this idea:

  1. Attempt to detect encoding: $encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];
  2. If encoding cannot be detected, throw new RuntimeException
  3. If input is UTF-8, carry on.
  4. Else, if it is ISO-8859-1 or ASCII:

    a. Attempt conversion to UTF-8 (wait, not finished)

    b. Detect the encoding of the converted value

    c. If the reported encoding and converted value are both UTF-8, carry on.

    d. Else, throw new RuntimeException

From my abstract class Sanitizer

    // True when the detected encoding is UTF-8 and the value survives a
    // UTF-8 -> ISO-8859-1 -> UTF-8 round trip. Note: utf8_encode() and
    // utf8_decode() are deprecated as of PHP 8.2.
    private function isUTF8($encoding, $value)
    {
        return (($encoding === 'UTF-8') && (utf8_encode(utf8_decode($value)) === $value));
    }

    // Converts $value to UTF-8 in place, or nulls it and throws if that fails.
    private function utf8tify(&$value)
    {
        $encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];

        mb_internal_encoding('UTF-8');
        mb_substitute_character(0xfffd); // REPLACEMENT CHARACTER
        mb_detect_order($encodings);

        // Strict detection: false when none of the listed encodings fits.
        $stringEncoding = mb_detect_encoding($value, $encodings, true);

        if (!$stringEncoding) {
            $value = null;
            throw new \RuntimeException("Unable to identify character encoding in sanitizer.");
        }

        if ($this->isUTF8($stringEncoding, $value)) {
            return;
        }

        // Not UTF-8 yet: convert, then verify the converted value.
        $value = mb_convert_encoding($value, 'UTF-8', $stringEncoding);
        $stringEncoding = mb_detect_encoding($value, $encodings, true);

        if (!$this->isUTF8($stringEncoding, $value)) {
            $value = null;
            throw new \RuntimeException("Unable to convert character encoding from ISO-8859-1, or ASCII, to UTF-8 in Sanitizer.");
        }
    }

One could make an argument that I should separate encoding concerns from my abstract Sanitizer class and simply inject an Encoder object into a concrete child instance of Sanitizer. However, the main problem with my approach is that, without more knowledge, I simply reject encoding types that I do not want (and I am relying on PHP mb_* functions). Without further study, I cannot know if that hurts some populations or not (or if I am losing out on important information). So, I need to learn more. I found this article.

What every programmer absolutely, positively needs to know about encodings and character sets to work with text

Moreover, what happens when encrypted data is added to my email registration links (using OpenSSL or mcrypt)? Could this interfere with decoding? What about Windows-1252? What about security implications? The use of utf8_decode() and utf8_encode() in Sanitizer::isUTF8 is dubious.
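
One concrete reason the round trip is shaky: utf8_decode() replaces every character that has no ISO-8859-1 equivalent with '?', so valid UTF-8 containing, say, a euro sign or Cyrillic text would not survive it. A possible alternative, sketched here with a hypothetical function name, is to check well-formedness directly.

```php
<?php
// Hypothetical stand-in for the check in Sanitizer::isUTF8.
function isValidUtf8(string $value): bool
{
    // mb_check_encoding() verifies that the bytes form well-formed UTF-8,
    // without converting anything.
    return mb_check_encoding($value, 'UTF-8');
}

var_dump(isValidUtf8("\xE2\x82\xAC")); // euro sign encoded as UTF-8
var_dump(isValidUtf8("\xE9"));         // lone ISO-8859-1 byte, not UTF-8
```

This only answers "is it well-formed UTF-8?"; it says nothing about whether the text was meant to be UTF-8 in the first place.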

People have pointed out shortcomings in the PHP mb_* functions. I never took time to investigate iconv, but if it works better than the mb_* functions, let me know.

Answered by hakre

The main problem for me is that I don't know what encoding the source of any string is going to be - it could be from a text box (using <form accept-charset="utf-8"> is only useful if the user is actually submitted the form), or it could be from an uploaded text file, so I really have no control over the input.

I don't think it's a problem. An application knows the source of the input. If it's from a form, use UTF-8 encoding in your case. That works. Just verify that the data provided is correctly encoded (validation). Keep in mind that not all databases support UTF-8 in its full range.
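
The validation step could look something like the sketch below. The function name allValidUtf8 is an assumption, and only values (not array keys) are checked; it relies on preg_match() with the /u modifier failing on malformed UTF-8.

```php
<?php
// Hypothetical validator: true when every string value in $input
// (nested arrays allowed) is well-formed UTF-8.
function allValidUtf8(array $input): bool
{
    foreach ($input as $value) {
        if (is_array($value)) {
            if (!allValidUtf8($value)) {
                return false;
            }
        } elseif (is_string($value) && preg_match('//u', $value) !== 1) {
            // With /u, preg_match() returns false for a malformed UTF-8 subject.
            return false;
        }
    }
    return true;
}

// Typical use would be rejecting the request early, for example:
// if (!allValidUtf8($_POST)) { http_response_code(400); exit; }
var_dump(allValidUtf8(['name' => "fianc\xC3\xA9e", 'tags' => ['a', 'b']]));
var_dump(allValidUtf8(['name' => "fianc\xE9e"]));
```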

If it's a file, you won't save it UTF-8 encoded in the database but in binary form. When you output the file again, use binary output as well; then this is totally transparent.

Your idea that a user can tell the encoding is nice, but he/she can tell anyway after downloading the file, as it's binary.

So I must admit I don't see the specific issue you raise with your question. But maybe you can add some more details about what your problem is.

Answered by wutz

If you're willing to "take this to the console", I'd recommend enca. Unlike the rather simplistic mb_detect_encoding, it uses "a mixture of parsing, statistical analysis, guessing and black magic to determine their encodings" (lol - see man page). However, you usually have to pass the language of the input file if you want to detect such country-specific encodings. (However, mb_detect_encoding essentially has the same requirement, as the encoding would have to appear "in the right place" in the list of passed encodings for it to be detectable at all.)
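
Calling enca from PHP might look roughly like this. It is a sketch under assumptions: the wrapper name encaGuess is made up, the enca binary must be installed and exec() enabled, and the -L (language) and -i (iconv-style name) options are used as described in the enca man page, so check them against your installed version.

```php
<?php
// Hypothetical wrapper around the enca command line tool.
// Returns null when enca is missing, exec() is disabled, or detection fails.
function encaGuess(string $path, string $language = 'none'): ?string
{
    if (!function_exists('exec')) {
        return null;
    }

    // -L sets the language hint, -i asks for the iconv-style charset name.
    $cmd = sprintf(
        'enca -L %s -i %s 2>/dev/null',
        escapeshellarg($language),
        escapeshellarg($path)
    );

    $output = [];
    $status = 1;
    exec($cmd, $output, $status);

    if ($status !== 0 || !isset($output[0])) {
        return null;
    }

    $name = trim($output[0]);
    return $name === '' ? null : $name;
}
```

Passing a real language such as 'russian' instead of 'none' is what lets enca tell the country-specific single-byte encodings apart.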

enca also came up here: How to find encoding of a file in Unix via script(s)

Answered by Parris Varney

You could set up a set of metrics to try to guess which encoding is being used. Again, not perfect, but could catch some of the misses from mb_detect_encoding().
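
As one possible metric, the sketch below converts the input under each candidate encoding and counts characters that rarely show up in real text (C1 control characters and U+FFFD). The function names and the choice of metric are assumptions for illustration; a real scorer would combine several such signals.

```php
<?php
// Hypothetical metric: lower is better.
function suspicionScore(string $utf8Text): int
{
    // C1 controls (U+0080..U+009F) and U+FFFD usually signal a wrong guess.
    $n = preg_match_all('/[\x{0080}-\x{009F}\x{FFFD}]/u', $utf8Text);
    return $n === false ? PHP_INT_MAX : $n;
}

/** Convert $bytes under each candidate and keep the least suspicious result. */
function bestGuessToUtf8(string $bytes, array $candidates): ?string
{
    $best = null;
    $bestScore = PHP_INT_MAX;
    foreach ($candidates as $encoding) {
        if (!mb_check_encoding($bytes, $encoding)) {
            continue; // not even valid in this encoding
        }
        $converted = mb_convert_encoding($bytes, 'UTF-8', $encoding);
        $score = suspicionScore($converted);
        if ($score < $bestScore) {
            $bestScore = $score;
            $best = $converted;
        }
    }
    return $best;
}

// 0x93 and 0x94 are curly quotes in Windows-1252 but control characters in ISO-8859-1.
$result = bestGuessToUtf8("\x93quoted\x94", ['UTF-8', 'ISO-8859-1', 'Windows-1252']);
echo bin2hex($result), "\n";
```

On a tie the earlier candidate wins, so the order of the list still acts as a prior.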

Answered by Quel Pino

It seems that your question is pretty well answered, but I have an approach that may simplify your case:

I had a similar issue trying to return string data from MySQL, even after configuring both the database and PHP to return strings formatted as UTF-8. The error only showed up when actually returning them from the database.

Finally, surfing the web, I found a really easy way to deal with it:

Given that you can save all those types of string data in your MySQL database in different formats and collations, all you need to do is set the connection character set to UTF-8, right in your PHP connection file, like this:

$connection = new mysqli($server, $user, $pass, $db);
$connection->set_charset("utf8");

Which means that first you save the data in any format or collation, and you convert it only when it is returned to your PHP file.

Hope it was helpful!

Answered by Pedro Estevão

If the text is retrieved from a MySQL database, you may try adding this after the DB connection.

mysqli_set_charset($con, "utf8");

https://www.php.net/manual/en/mysqli.set-charset.php
