UTF-8 all the way through
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow.
Original URL: http://stackoverflow.com/questions/279170/
Asked by mercutio
I'm setting up a new server and want to support UTF-8 fully in my web application. I have tried this in the past on existing servers and always seem to end up having to fall back to ISO-8859-1.
Where exactly do I need to set the encoding/charsets? I'm aware that I need to configure Apache, MySQL, and PHP to do this — is there some standard checklist I can follow, or perhaps troubleshoot where the mismatches occur?
This is for a new Linux server, running MySQL 5, PHP 5, and Apache 2.
Accepted answer by chazomaticus
Data Storage:
Specify the utf8mb4 character set on all tables and text columns in your database. This makes MySQL physically store and retrieve values encoded natively in UTF-8. Note that MySQL will implicitly use utf8mb4 encoding if a utf8mb4_* collation is specified (without any explicit character set).
In older versions of MySQL (< 5.5.3), you'll unfortunately be forced to use simply utf8, which only supports a subset of Unicode characters. I wish I were kidding.
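To make the utf8 versus utf8mb4 difference concrete: MySQL's utf8 stores at most three bytes per character, so it cannot hold code points above U+FFFF (most emoji, for example). The sketch below is one way to spot such text before it reaches a legacy utf8 column. The helper name needsUtf8mb4 is made up for illustration and is not part of PHP or MySQL.

```php
<?php
// Hypothetical helper (not a PHP built-in): reports whether $text contains
// any code point above U+FFFF, i.e. one that needs four bytes in UTF-8.
function needsUtf8mb4(string $text): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match() returns false for invalid UTF-8, which also yields false here.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

$emoji = "\u{1F600}";   // grinning face, four bytes in UTF-8
$accent = "caf\u{E9}";  // "café", é takes two bytes

var_dump(needsUtf8mb4($emoji));   // bool(true)
var_dump(needsUtf8mb4($accent));  // bool(false)
```

A check like this only flags the data. The actual fix is still to declare the column as utf8mb4.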
Data Access:
In your application code (e.g. PHP), in whatever DB access method you use, you'll need to set the connection charset to utf8mb4. This way, MySQL does no conversion from its native UTF-8 when it hands data off to your application and vice versa.
Some drivers provide their own mechanism for configuring the connection character set, which both updates its own internal state and informs MySQL of the encoding to be used on the connection. This is usually the preferred approach. In PHP:
- If you're using the PDO abstraction layer with PHP ≥ 5.3.6, you can specify charset in the DSN:
  $dbh = new PDO('mysql:charset=utf8mb4');
- If you're using mysqli, you can call set_charset():
  $mysqli->set_charset('utf8mb4');       // object oriented style
  mysqli_set_charset($link, 'utf8mb4');  // procedural style
- If you're stuck with plain mysql but happen to be running PHP ≥ 5.2.3, you can call mysql_set_charset.
If the driver does not provide its own mechanism for setting the connection character set, you may have to issue a query to tell MySQL how your application expects data on the connection to be encoded: SET NAMES 'utf8mb4'.
The same consideration regarding utf8mb4/utf8 applies as above.
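As a rough sketch of the PDO variant, the snippet below assembles a full DSN with the charset pinned to utf8mb4. The helper buildMysqlDsn, the host localhost, the database name example_db, and the credentials are all placeholders chosen for illustration. The connection line is commented out because it needs a reachable server.

```php
<?php
// Hypothetical helper: builds a MySQL DSN that pins the connection charset.
function buildMysqlDsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

$dsn = buildMysqlDsn('localhost', 'example_db');
echo $dsn, PHP_EOL; // mysql:host=localhost;dbname=example_db;charset=utf8mb4

// Assumed credentials; uncomment once a real server is available.
// $dbh = new PDO($dsn, 'username', 'password', [
//     PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
// ]);
```

Keeping the charset in the DSN means the driver and the server agree on the encoding from the first query, which is the point made above about driver-level configuration.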
Output:
If your application transmits text to other systems, they will also need to be informed of the character encoding. With web applications, the browser must be informed of the encoding in which data is sent (through HTTP response headers or HTML metadata).
In PHP, you can use the default_charset php.ini option, or manually issue the Content-Type MIME header yourself, which is just more work but has the same effect.
When encoding the output using json_encode(), add JSON_UNESCAPED_UNICODE as a second parameter.
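A small illustration of what the json_encode() flag changes, using a made-up payload. Both outputs are valid JSON. The flag only controls whether non-ASCII characters are written as \uXXXX escapes or as raw UTF-8 bytes.

```php
<?php
// Illustrative payload; "\u{FC}" is the PHP 7+ escape for ü.
$data = ['city' => "Z\u{FC}rich"];

$escaped = json_encode($data);                         // non-ASCII becomes \uXXXX
$raw     = json_encode($data, JSON_UNESCAPED_UNICODE); // non-ASCII kept as UTF-8

echo $escaped, PHP_EOL; // {"city":"Z\u00fcrich"}
echo $raw, PHP_EOL;     // {"city":"Zürich"}
```

When sending the raw form over HTTP, the response still needs a matching charset declaration, for example header('Content-Type: application/json; charset=utf-8').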
Input:
Unfortunately, you should verify every received string as being valid UTF-8 before you try to store it or use it anywhere. PHP's mb_check_encoding() does the trick, but you have to use it religiously. There's really no way around this, as malicious clients can submit data in whatever encoding they want, and I haven't found a trick to get PHP to do this for you reliably.
From my reading of the current HTML spec, the following sub-bullets are not necessary or even valid anymore for modern HTML. My understanding is that browsers will work with and submit data in the character set specified for the document. However, if you're targeting older versions of HTML (XHTML, HTML4, etc.), these points may still be useful:
- For HTML before HTML5 only: you want all data sent to you by browsers to be in UTF-8. Unfortunately, the only way to reliably do this is to add the accept-charset attribute to all your <form> tags: <form ... accept-charset="UTF-8">.
- For HTML before HTML5 only: note that the W3C HTML spec says that clients "should" default to sending forms back to the server in whatever charset the server served, but this is apparently only a recommendation, hence the need for being explicit on every single <form> tag.
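One possible shape for the validation step described above, assuming the mbstring extension is loaded. The wrapper name requireUtf8 is invented for this sketch. The only real API used is mb_check_encoding().

```php
<?php
// Hypothetical wrapper: returns the input unchanged when it is valid UTF-8,
// or null so the caller can reject the request.
function requireUtf8(string $input): ?string
{
    return mb_check_encoding($input, 'UTF-8') ? $input : null;
}

$good = "caf\xC3\xA9"; // "café" encoded as UTF-8
$bad  = "caf\xE9";     // "café" encoded as ISO-8859-1, not valid UTF-8

var_dump(requireUtf8($good) !== null); // bool(true)
var_dump(requireUtf8($bad) !== null);  // bool(false)
```

In a real application this would run on every value taken from $_GET, $_POST, cookies, and headers before the value is stored or processed.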
Other Code Considerations:
Obviously enough, all files you'll be serving (PHP, HTML, JavaScript, etc.) should be encoded in valid UTF-8.
You need to make sure that every time you process a UTF-8 string, you do so safely. This is, unfortunately, the hard part. You'll probably want to make extensive use of PHP's mbstring extension.
PHP's built-in string operations are not by default UTF-8 safe. There are some things you can safely do with normal PHP string operations (like concatenation), but for most things you should use the equivalent mbstring function.
To know what you're doing (read: not mess it up), you really need to know UTF-8 and how it works on the lowest possible level. Check out any of the links from utf8.com for some good resources to learn everything you need to know.
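To show what "not UTF-8 safe" means in practice, here is a minimal comparison of byte-oriented and mbstring functions on the same string. It assumes mbstring is available. The sample word is arbitrary.

```php
<?php
$word = "na\u{EF}ve"; // "naïve": 5 characters, 6 bytes in UTF-8

$byteLength = strlen($word);               // counts bytes
$charLength = mb_strlen($word, 'UTF-8');   // counts code points

$byteCut = substr($word, 0, 3);            // cuts through the middle of "ï"
$charCut = mb_substr($word, 0, 3, 'UTF-8'); // keeps "ï" whole

var_dump($byteLength);                            // int(6)
var_dump($charLength);                            // int(5)
var_dump(mb_check_encoding($byteCut, 'UTF-8'));   // bool(false)
var_dump($charCut === "na\u{EF}");                // bool(true)
```

Concatenation stays safe because joining two valid UTF-8 byte sequences always produces another valid one. Anything that counts or slices needs the mb_ variant.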
Answer by JW.
In PHP, you'll need to either use the multibyte functions, or turn on mbstring.func_overload. That way things like strlen will work if you have characters that take more than one byte.
You'll also need to identify the character set of your responses. You can either use AddDefaultCharset, as above, or write PHP code that returns the header. (Or you can add a META tag to your HTML documents.)
Answer by chroder
In addition to setting default_charset in php.ini, you can send the correct charset using header() from within your code, before any output:
header('Content-Type: text/html; charset=utf-8');
Working with Unicode in PHP is easy as long as you realize that most of the string functions don't work with Unicode, and some might mangle strings completely. PHP considers "characters" to be 1 byte long. Sometimes this is okay (for example, explode() only looks for a byte sequence and uses it as a separator -- so it doesn't matter what actual characters you look for). But other times, when the function is actually designed to work on characters, PHP has no idea that your text has multi-byte characters that are found with Unicode.
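The contrast between a byte-safe function and a character-oriented one can be sketched as follows. explode() is the example named above. strrev() is picked here purely as an illustration of a function that reorders bytes, and mb_check_encoding() (from mbstring) is used only to inspect the result.

```php
<?php
$csv = "caf\u{E9},na\u{EF}ve"; // "café,naïve"

// explode() only searches for the separator bytes, so multi-byte
// characters on either side pass through untouched.
$parts = explode(',', $csv);

// strrev() reverses bytes, not characters, so the two bytes of "é"
// end up in the wrong order and the result is no longer valid UTF-8.
$reversed   = strrev($parts[0]);
$stillValid = mb_check_encoding($reversed, 'UTF-8');

var_dump($parts === ["caf\u{E9}", "na\u{EF}ve"]); // bool(true)
var_dump($stillValid);                            // bool(false)
```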
A good library to check into is phputf8. This rewrites all of the "bad" functions so you can safely work on UTF8 strings. There are extensions like the mbstring extension that try to do this for you, too, but I prefer using the library because it's more portable (but I write mass-market products, so that's important for me). But phputf8 can use mbstring behind the scenes, anyway, to increase performance.
Answer by jalf
Unicode support in PHP is still a huge mess. While it's capable of converting an ISO8859 string (which it uses internally) to utf8, it lacks the capability to work with unicode strings natively, which means all the string processing functions will mangle and corrupt your strings. So you have to either use a separate library for proper utf8 support, or rewrite all the string handling functions yourself.
The easy part is just specifying the charset in HTTP headers and in the database and such, but none of that matters if your PHP code doesn't output valid UTF8. That's the hard part, and PHP gives you virtually no help there. (I think PHP6 is supposed to fix the worst of this, but that's still a while away)
Answer by mercator
I'd like to add one thing to chazomaticus' excellent answer:
Don't forget the META tag either (like this, or the HTML4 or XHTML version of it):
<meta charset="utf-8">
That seems trivial, but IE7 has given me problems with that before.
I was doing everything right; the database, database connection and Content-Type HTTP header were all set to UTF-8, and it worked fine in all other browsers, but Internet Explorer still insisted on using the "Western European" encoding.
It turned out the page was missing the META tag. Adding that solved the problem.
Edit:
The W3C actually has a rather large section dedicated to I18N. They have a number of articles related to this issue – describing the HTTP, (X)HTML and CSS side of things:
- FAQ: Changing (X)HTML page encoding to UTF-8
- Declaring character encodings in HTML
- Tutorial: Character sets & encodings in XHTML, HTML and CSS
- Setting the HTTP charset parameter
They recommend using both the HTTP header and HTML meta tag (or XML declaration in case of XHTML served as XML).
Answer by commonpike
The top answer is excellent. Here is what I had to do on a regular Debian/PHP/MySQL setup:
// storage
// debian. apparently already utf-8
// retrieval
// the mysql database was stored in utf-8,
// but apparently php was requesting iso. this worked:
// ***notice "utf8", without dash, this is a mysql encoding***
mysql_set_charset('utf8');
// delivery
// php.ini did not have a default charset,
// (it was commented out, shared host) and
// no http encoding was specified in the apache headers.
// this made apache send out a utf-8 header
// (and perhaps made php actually send out utf-8)
// ***notice "utf-8", with dash, this is a php encoding***
ini_set('default_charset','utf-8');
// submission
// this worked in all major browsers once apache
// was sending out the utf-8 header. i didnt add
// the accept-charset attribute.
// processing
// changed a few commands in php, like substr,
// to mb_substr
That was all!
Answer by JDelage
In my case, I was using mb_split, which uses regex. Therefore I also had to manually make sure the regex encoding was utf-8 by doing mb_regex_encoding('UTF-8');
As a side note, I also discovered by running mb_internal_encoding() that the internal encoding wasn't utf-8, and I changed that by running mb_internal_encoding("UTF-8");
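Put together, the two settings mentioned in this answer might be applied once at startup, roughly like this. The sketch assumes mbstring was built with its regex functions enabled. The sample sentence is arbitrary.

```php
<?php
// Set both encodings explicitly instead of relying on php.ini defaults.
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

// mb_split() takes a regex pattern without delimiters.
$sentence = "na\u{EF}ve caf\u{E9}  \u{65E5}\u{672C}\u{8A9E}"; // "naïve café  日本語"
$words = mb_split('\s+', $sentence);

print_r($words); // three elements: naïve, café, 日本語
```

Calling mb_internal_encoding() and mb_regex_encoding() with no argument returns the current value, which is a quick way to check what a given server is actually using.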
Answer by Jim W.
I found an issue with someone using PDO and the answer was to use this for the PDO connection string:
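$pdo = new PDO(
    'mysql:host=mysql.example.com;dbname=example_db',
    "username",
    "password",
    // Runs SET NAMES right after connecting. On MySQL >= 5.5.3, utf8mb4
    // (see the accepted answer) covers all of Unicode, unlike utf8.
    array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8"));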
The site I took this from is down, but I was able to get it using the Google cache, luckily.
Answer by Miguel Stevens
I recently discovered that using strtolower() can cause issues where the data is truncated after a special character.
The solution was to use
mb_strtolower($string, 'UTF-8');
mb_ uses MultiByte. It supports more characters but in general is a little slower.
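A quick check of the mb_strtolower() call on accented input, assuming mbstring is loaded. The sample text is made up.

```php
<?php
$shout = "\u{C9}COLE SUP\u{C9}RIEURE"; // "ÉCOLE SUPÉRIEURE"

// With an explicit encoding, accented capitals are lowercased as
// characters and the result stays valid UTF-8.
$lower = mb_strtolower($shout, 'UTF-8');

echo $lower, PHP_EOL; // école supérieure
```

Passing the encoding explicitly, as in the answer, avoids depending on whatever mb_internal_encoding() happens to be set to on the server.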
Answer by Jimmy Kane
First of all, if you are on PHP < 5.3, then no. You've got a ton of problems to tackle.
I am surprised that no one has mentioned the intl library, the one that has good support for Unicode, graphemes, string operations, localisation and many more, see below.
I will quote some information about Unicode support in PHP from Elizabeth Smith's slides at PHPBenelux'14
INTL
Good:
- Wrapper around ICU library
- Standardised locales, set locale per script
- Number formatting
- Currency formatting
- Message formatting (replaces gettext)
- Calendars, dates, timezone and time
- Transliterator
- Spoofchecker
- Resource bundles
- Convertors
- IDN support
- Graphemes
- Collation
- Iterators
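As one illustration of the grapheme support listed above, the following compares three ways of measuring the same string. It assumes both the intl and mbstring extensions are installed. The string is an "e" followed by a combining accent, chosen because it is one visible character made of two code points.

```php
<?php
$text = "e\u{301}"; // "e" + U+0301 COMBINING ACUTE ACCENT, displayed as "é"

$bytes      = strlen($text);             // bytes in the UTF-8 encoding
$codePoints = mb_strlen($text, 'UTF-8'); // Unicode code points
$graphemes  = grapheme_strlen($text);    // user-perceived characters (intl)

var_dump($bytes);      // int(3)
var_dump($codePoints); // int(2)
var_dump($graphemes);  // int(1)
```

Which count is the right one depends on the task: storage limits care about bytes, while cursor movement and truncation for display usually care about graphemes.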
Bad:
- Does not support zend_multibyte
- Does not support HTTP input output conversion
- Does not support function overloading
mbstring
- Enables zend_multibyte support
- Supports transparent HTTP in/out encoding
- Provides some wrappers for functionality such as strtoupper
ICONV
- Primarily for charset conversion
- Output buffer handler
- mime encoding functionality
- conversion
- some string helpers (len, substr, strpos, strrpos)
- Stream Filter
stream_filter_append($fp, 'convert.iconv.ISO-2022-JP/EUC-JP')
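For the plain conversion case in the list above, a minimal sketch with the iconv() function looks like this. It assumes the iconv extension is enabled, and the ISO-8859-1 input is an arbitrary example of legacy data.

```php
<?php
$latin1 = "caf\xE9"; // "café" as ISO-8859-1: é is the single byte 0xE9

// iconv(from, to, string) converts between character sets.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

var_dump(strlen($latin1)); // int(4)
var_dump(strlen($utf8));   // int(5), because é is now two bytes
```

The stream filter shown above applies the same kind of conversion to data as it is read from or written to a file handle, which helps when a file is too large to convert in one string.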
DATABASES
- mysql: set the charset and collation on tables, and the charset on the connection (not the collation). Also, don't use the mysql extension; use mysqli or PDO
- postgresql: pg_set_client_encoding
- sqlite(3): Make sure it was compiled with unicode and intl support
Some other Gotchas
- You cannot use unicode filenames with PHP and windows unless you use a 3rd-party extension.
- Send everything in ASCII if you are using exec, proc_open and other command line calls
- Plain text is not plain text, files have encodings
- You can convert files on the fly with the iconv filter
I'll update this answer if things change, features are added, and so on.