Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/6987929/

Date: 2020-08-26 01:45:15  Source: igfitidea

Preparing PHP application to use with UTF-8

Tags: php, unicode, utf-8, character-encoding, webserver

Asked by Sfisioza

UTF-8 is the de facto standard for web applications now, but it is not the default encoding for PHP (until 6.0). Most servers are set up for the ISO-8859-1 encoding by default.

How can I override the default settings in .htaccess to be sure that everything goes well for UTF-8, locale, etc.? Are there any relevant options for the web server or the Unix OS?

Is there a comprehensive list of those settings, e.g. mbstring options, iconv settings, locale, etc., that I should set up for each multi-language project? Is there a predefined .htaccess I could use as an example?

(In my particular case I need a setup for English, Dutch and Russian. The server is in Ukraine.)

Answered by takeshin

Some useful options to have in .htaccess:

########################################
# Locale settings
########################################

# See: http://php.net/manual/en/timezones.php
php_value date.timezone "Europe/Amsterdam"

SetEnv   LC_ALL  nl_NL.UTF-8

########################################
# Set up UTF-8 encoding
########################################

AddDefaultCharset UTF-8
AddCharset UTF-8 .php

php_value default_charset "UTF-8"

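# Note: the iconv.* settings, mbstring.internal_encoding and mbstring.http_output
# are deprecated since PHP 5.6, where default_charset covers them.
php_value iconv.input_encoding "UTF-8"
php_value iconv.internal_encoding "UTF-8"
php_value iconv.output_encoding "UTF-8"

php_value mbstring.internal_encoding UTF-8
php_value mbstring.http_output UTF-8
# Boolean settings take php_flag rather than php_value.
php_flag mbstring.encoding_translation On
# mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0.
php_value mbstring.func_overload 6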

# See also php functions:
# mysql_set_charset
# mysql_client_encoding

# database settings
#CREATE DATABASE db_name
#   CHARACTER SET utf8
#   DEFAULT CHARACTER SET utf8
#   COLLATE utf8_general_ci
#   DEFAULT COLLATE utf8_general_ci
#   ;
#
#ALTER DATABASE db_name
#   CHARACTER SET utf8
#   DEFAULT CHARACTER SET utf8
#   COLLATE utf8_general_ci
#   DEFAULT COLLATE utf8_general_ci
#   ;

#ALTER TABLE tbl_name
#   DEFAULT CHARACTER SET utf8
#   COLLATE utf8_general_ci
#   ;
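
Where .htaccess overrides are not allowed, some of the same settings can be applied from PHP at the start of each request. The following is only a minimal sketch: the function name bootstrap_utf8() and its default time zone are assumptions made for illustration, not part of the answer above.

```php
<?php
// Hypothetical bootstrap helper; the name and the default are illustrative.
function bootstrap_utf8(string $timezone = 'Europe/Amsterdam'): void
{
    // Runtime counterparts of "php_value default_charset" and
    // "php_value date.timezone" from the .htaccess above.
    ini_set('default_charset', 'UTF-8');
    date_default_timezone_set($timezone);

    // Only touch mbstring if the extension is loaded.
    if (function_exists('mb_internal_encoding')) {
        mb_internal_encoding('UTF-8');
    }
}

bootstrap_utf8();
echo ini_get('default_charset'), "\n";
```

ini_set() only affects the current request, so a call like this would have to run before any output is produced, for example from a shared include file.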

Answered by hakre

You're right, UTF-8 is a good choice for web applications.

Encoding is meta-information about the data that gets processed. As long as you know the encoding of the (binary) data, you know what you're dealing with. You start to get lost if you don't know the encoding. I often call this a chain: if the encoding chain is broken, the data will be broken. This is true both for displaying data and for security.

As a rule of thumb, PHP is binary; it is you and the context that specify the encoding (e.g. how you save your PHP source files).

So let's tackle a short (and incomplete) list:

The OS

Environment variables might tell you about the locale in use and the encoding. File systems, for example, have their own encoding for the names of files and directories. I'm not an expert on this subject; normally we try to name our files in English so that only characters in the US-ASCII range are used, which is safe for the extended Latin charsets like ISO-8859-1 in your case as well as for UTF-8.

Keep this in mind when you save files that your users upload: filter file names down to basic letters, digits and punctuation (a-z, A-Z, 0-9, ., -, _) and you'll have nearly no hassle. You can even make them all lowercase for visual consistency.

If you feel that this degrades usability and the file system does not offer the Unicode range of characters that UTF-8 does, you can fall back to a simple encoding like rawurlencode (percent-encoding, triplets) and offer files for download by resolving that name to the file on disk.
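
Both approaches to uploaded file names can be sketched in a few lines of PHP. The helper names safe_filename() and encoded_filename() are hypothetical, as is the 'file' fallback for names that end up empty; treat this as an illustration rather than a complete upload handler.

```php
<?php
// Hypothetical helpers, not part of the original answer.

// Reduce an uploaded file name to a-z, 0-9, ".", "-" and "_".
function safe_filename(string $name): string
{
    // Replace every run of other bytes (including slashes and any
    // non-ASCII bytes) with a single underscore.
    $name = preg_replace('/[^A-Za-z0-9._-]+/', '_', $name);
    $name = strtolower($name);
    // Strip leading and trailing dots, and avoid an empty result.
    $name = trim($name, '.');
    return $name === '' ? 'file' : $name;
}

// Alternative: keep the full name but store it percent-encoded.
function encoded_filename(string $name): string
{
    return rawurlencode($name);
}

echo safe_filename('My File (1).TXT'), "\n";
echo encoded_filename('rapport één.pdf'), "\n";
```

The percent-encoded form is reversible with rawurldecode(), so the original name can still be shown to the user when the file is offered for download.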

Normally you just need to deal with what you have. Start asking a common sysadmin or programmer about character encoding and most will tell you that they are not really interested. Naturally that's subjective, but if you need someone to configure something for you, this can make a difference.

HTML

This is largely independent of PHP; it is about the output your scripts produce, so it is still part of your job.

The rule of thumb is: specify it. If you don't specify the encoding (for HTML files, CSS files, JavaScript files), don't expect it to work reliably. So just do it. Encoding is a chain: if there are many components, ensure that each one knows about its encoding. Otherwise browsers can only guess. UTF-8 is a good choice, but our job is to take care that it is stated precisely and well defined.

PHP Settings

As a general rule of thumb, start by reading the php.ini file that ships with the PHP package of your Linux distro. It comes with readable documentation in its comments and further links. Some settings that come to mind:

Strings

  • Strings - By default, strings in PHP are binary. As long as you use them with binary-safe functions, you get what you expect. Since PHP 5.2.1 you can cast strings explicitly to binary strings, for forward compatibility with the planned PHP 6 Unicode support: $binary = (binary) $string; or $binary = b"binary string";.
  • mb_internal_encoding() - Gets or sets the internal encoding; the matching INI setting is mbstring.internal_encoding. The internal encoding is the character encoding name used for the HTTP input character encoding conversion, HTTP output character encoding conversion, and the default character encoding for string functions defined by the mbstring module.
  • iconv_set_encoding() - The comparable function for the iconv extension. See also the iconv configuration settings.
  • Various: Some functions that deal with character sequences allow you to specify a charset encoding, for example htmlspecialchars. Make use of these parameters and check the docs for their default value. Often it is ISO-8859-1, but you're looking for UTF-8. Other functions like html_entity_decode use UTF-8 by default. Some, like htmlspecialchars_decode, do not specify a charset at all, so you need to read the PHP source code for a concrete understanding of how the function deals with the (binary) string.
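
As a rough illustration of the points in this list, the snippet below passes the encoding explicitly instead of relying on defaults. It assumes the mbstring extension is loaded and that the source file itself is saved as UTF-8; the sample strings are made up.

```php
<?php
$text = 'Één <b>café</b>'; // assumes this source file is saved as UTF-8

// Pass the charset explicitly instead of relying on the default.
echo htmlspecialchars($text, ENT_QUOTES, 'UTF-8'), "\n";

// strlen() counts bytes; mb_strlen() counts characters in the given encoding.
echo strlen($text), ' bytes, ', mb_strlen($text, 'UTF-8'), " characters\n";

// Convert legacy ISO-8859-1 input once, at the boundary of the application.
$latin1 = "caf\xE9";
echo mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1'), "\n";
```

Naming the encoding in every call is more verbose than setting it once globally, but it keeps the code independent of whatever the server happens to be configured with.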

To answer your question: which settings and parameters you need always depends on the components you use. For the general ones like the browser or the web server, it is possible to recommend settings that configure them for UTF-8. For everything else, it depends. The most important thing is to look for the encoding, to make sure you know it, and to configure or specify it where you can. Often it is documented. As long as you don't need to write portable code, this is much simpler, because you control the environment or only need to deal with one specific environment. Write code defensively with encoding in mind and you should be fine.

Answered by Karolis

  1. All your files have to be saved in UTF-8 (without BOM) using your code editor.
  2. The web server may be configured to send inappropriate headers, so it's recommended to override them at the application level. For instance:

    header('Content-Type: text/html; charset=utf-8');
    
  3. Add HTML meta content-type:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    
  4. Use htmlspecialchars() instead of htmlentities(), because the former is enough in UTF-8 and the latter is incompatible with UTF-8 by default.

  5. Avoid PHP's standard string functions where possible, because many of them are not UTF-8 aware. Look for their counterparts in Multibyte String (mbstring) or other libraries. (Don't forget to set the default charset for the library before using it, because the library supports many encodings and UTF-8 is just one of them.)
  6. For regular expressions, use the u modifier. For example (the non-ASCII character of the original example was lost in extraction; \p{L}, any Unicode letter, stands in for it here):

    preg_match('/\p{L}{3,5}/u', $string, $matches);
    

    The same modifier also gives a reliable way to check whether a given string is valid UTF-8:

    if (@preg_match('//u', $string) === false) {
        // NOT valid!
    } else {
        // Valid!
    }
    
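  7. If you use a database, always set the appropriate connection encoding right after the connection is made. Example for MySQL (the old mysql_* extension shown here was removed in PHP 7.0; mysqli_set_charset() is its mysqli counterpart):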

    mysql_set_charset('utf8', $link);
    

    Also check that the columns in the database use UTF-8. It's not always needed, but recommended.
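
Items 5 and 6 can be tried out with a short script. This is a sketch only: is_valid_utf8() is a hypothetical wrapper name, the sample word is arbitrary, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper wrapping the validity check from item 6.
function is_valid_utf8(string $s): bool
{
    // With the u modifier, preg_match() returns false when the subject
    // is not valid UTF-8; the @ silences the warning older versions emit.
    return @preg_match('//u', $s) !== false;
}

$word = "\xC5\xBEodis"; // "žodis" written as explicit UTF-8 bytes

var_dump(is_valid_utf8($word));        // well-formed UTF-8
var_dump(is_valid_utf8("\xE9t\xE9"));  // ISO-8859-1 bytes, malformed as UTF-8

// Byte-oriented function versus its multibyte counterpart (item 5).
var_dump(strlen($word), mb_strlen($word, 'UTF-8'));
echo mb_strtoupper($word, 'UTF-8'), "\n";
```

Running such a check on all external input, before it reaches the database or a template, is one way to keep the encoding chain intact.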

Answered by TMS

Basically, I do three things to work correctly with the Czech language:

1) define locale in PHP:

setlocale(LC_COLLATE, "cs_CZ");
setlocale(LC_CTYPE, "cs_CZ");

so you would use something like:

setlocale(LC_ALL, "en_US.utf8");
setlocale(LC_ALL, "nl_NL.utf8");

depending on the currently selected language.
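
One way to switch the locale per language is to keep a small map of candidates and let setlocale() pick the first one that is installed. The LOCALES map and the use_language() function below are assumptions for illustration; which locale names actually exist depends on the server.

```php
<?php
// Hypothetical mapping from application language codes to locale candidates.
// Locale names differ between systems, so two spellings are listed for each.
const LOCALES = [
    'en' => ['en_US.UTF-8', 'en_US.utf8'],
    'nl' => ['nl_NL.UTF-8', 'nl_NL.utf8'],
    'ru' => ['ru_RU.UTF-8', 'ru_RU.utf8'],
];

function use_language(string $lang)
{
    // Fall back to English for unknown language codes.
    $candidates = LOCALES[$lang] ?? LOCALES['en'];
    // setlocale() tries each candidate in order and returns the name it set,
    // or false if none of them is installed on the server.
    return setlocale(LC_ALL, $candidates);
}

$active = use_language('nl');
echo $active === false ? "locale not installed\n" : "using $active\n";
```

Checking the return value matters here, because setlocale() fails silently when the requested locale has not been generated on the system.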

2) define charset for the database:

mysql_query("set names latin2 collate latin2_czech_cs");

3) define the charset of PHP/HTML code:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">

I don't use any .htaccess settings. You can adapt this to your case: for the locale, use something like en_US.utf8 (depending on the currently selected language), and for the charset, use utf-8 instead of latin2/iso-8859-2, and it should work well.

Answered by djdy

Try one of the following:

 AddDefaultCharset UTF-8
 AddCharset UTF-8 .php