Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/8674121/

Date: 2020-08-26 05:14:29  Source: igfitidea

Arabic Character Encoding Issue: UTF-8 versus Windows-1256

php database utf-8 character-encoding

Asked by ThisLanham

Quick Background: I inherited a large SQL dump file containing a combination of English and Arabic text, and (I think) it was originally exported using 'latin1'. I changed all occurrences of 'latin1' to 'utf8' prior to importing the file. The Arabic text didn't appear correctly in phpMyAdmin (which I guess is normal), but when I loaded the text into a web page with the following...

<meta http-equiv='Content-Type' content='text/html; charset=windows-1256'/> 

...everything looked good and the Arabic text displayed perfectly.

Problem: My client is really really really picky and doesn't want to change his...


<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

...to the 'Windows-1256' equivalent. I didn't think this would be a problem, but when I changed the charset value to 'UTF-8', all of the Arabic characters appeared as diamonds with question marks. Shouldn't UTF-8 display Arabic text correctly?

Here are a few notes about my database configuration:


  • Database charset is 'utf8'
  • Database connection collation is 'utf8_general_ci'
  • All databases, tables, and applicable fields have been collated as 'utf8_general_ci'

I've been scouring Stack Overflow and other forums for anything that relates to my issue. I've found similar problems, but none of the solutions seem to work for my specific situation. Hope someone can help!

Answered by Jukka K. Korpela

If the document looks right when declared as windows-1256 encoded, then it most probably is windows-1256 encoded. So it was apparently not exported using latin1, which would have been impossible, since latin1 has no Arabic letters.

If this is just about a single file, then the simplest way is to convert it from windows-1256 encoding to utf-8 encoding, using e.g. Notepad++. (Open the file in it, change the encoding, via File format menu, to Arabic, windows-1256. Then select Convert to UTF-8 in the File format menu and do File → Save.)

Windows-1256 and UTF-8 are completely different encodings, so data gets all messed up if you declare windows-1256 data as UTF-8 or vice versa. Only ASCII characters, such as English letters, have the same representation in both encodings.
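
To make the difference concrete, here is a minimal PHP sketch (not part of the original answer) that prints the bytes of one Arabic letter in each encoding. It assumes the iconv extension is available and that it recognizes the name 'WINDOWS-1256'.

```php
<?php
// U+0627 ARABIC LETTER ALEF, written here as its UTF-8 byte sequence.
$utf8 = "\xD8\xA7";

// The same letter re-encoded as Windows-1256 (needs the iconv extension).
$cp1256 = iconv('UTF-8', 'WINDOWS-1256', $utf8);

echo bin2hex($utf8), "\n";    // d8a7 (two bytes in UTF-8)
echo bin2hex($cp1256), "\n";  // c7   (one byte in Windows-1256)

// ASCII text is byte-for-byte identical in both encodings.
var_dump(iconv('UTF-8', 'WINDOWS-1256', 'Hello') === 'Hello');  // bool(true)
```

The single byte C7 is not a complete UTF-8 sequence on its own, which is why a page of Windows-1256 bytes breaks when it is declared as UTF-8.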

Answered by Michael Dillon

I think you need to go back to square one. It sounds like you have a database dump in Win-1256 encoding and you want to work with it in UTF-8 from now on. It also sounds like you are using PHP but you have lots of irrelevant tags on your question and are missing the most important one, PHP.

First, you need to convert the text dump into UTF-8, and you should be able to do that with PHP. Chances are that your conversion script will have two steps: first read the Win-1256 bytes and decode them into internal Unicode text strings, then encode the Unicode text strings into UTF-8 bytes for output to a new text file.
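
A rough sketch of such a conversion script follows. It is only one possible approach, not the answerer's code. PHP strings are plain byte strings, so there is no separate internal Unicode step here; iconv() does the decode and re-encode in a single call. The function name and both file names are made up for illustration, and the sketch assumes the iconv extension is installed.

```php
<?php
/**
 * Convert a Windows-1256 byte string to UTF-8.
 * iconv() decodes and re-encodes in one call; it returns false on failure.
 */
function cp1256ToUtf8(string $bytes): string
{
    $utf8 = iconv('WINDOWS-1256', 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException('Could not convert input from Windows-1256');
    }
    return $utf8;
}

// Hypothetical file names. This reads the whole dump into memory,
// so a very large dump would be better processed line by line.
$in  = 'dump-cp1256.sql';
$out = 'dump-utf8.sql';
if (is_file($in)) {
    file_put_contents($out, cp1256ToUtf8(file_get_contents($in)));
}
```

The English parts of the dump (SQL keywords, table names) pass through unchanged, since ASCII bytes are the same in both encodings.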

Once you have done that, redo the database import as you did before, but now you have correctly encoded the input data as UTF-8.

After that it should be as simple as reading the database and rendering a web page with the correct UTF-8 encoding.
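
As an illustration of that last step, the sketch below reads rows over a UTF-8 connection and renders them into a page declared as UTF-8. The table `articles`, the column `title`, and both function names are assumptions made up for this example; nothing in the question describes the actual schema. It uses 'utf8' to match the question's database settings.

```php
<?php
/**
 * Fetch titles over a UTF-8 connection.
 * Hypothetical schema: table `articles` with a text column `title`.
 */
function fetchTitles(mysqli $db): array
{
    // Make the client/server connection use UTF-8 before querying.
    $db->set_charset('utf8');
    $titles = [];
    $result = $db->query('SELECT title FROM articles');
    while ($row = $result->fetch_assoc()) {
        $titles[] = $row['title'];
    }
    return $titles;
}

/** Build a UTF-8 HTML page from strings that are already UTF-8. */
function renderPage(array $titles): string
{
    $items = '';
    foreach ($titles as $title) {
        $items .= '<li>' . htmlspecialchars($title, ENT_QUOTES, 'UTF-8') . "</li>\n";
    }
    return "<!DOCTYPE html>\n<html>\n<head>\n"
        . "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"/>\n"
        . "</head>\n<body dir=\"rtl\">\n<ul>\n{$items}</ul>\n</body>\n</html>\n";
}

// In a real page, also send the matching HTTP header before any output:
// header('Content-Type: text/html; charset=UTF-8');

// Demo without a database: the Arabic word for "Arabic" as UTF-8 bytes.
echo renderPage(["\xD8\xA7\xD9\x84\xD8\xB9\xD8\xB1\xD8\xA8\xD9\x8A\xD8\xA9"]);
```

The point of the sketch is that every layer (stored data, connection, page declaration) agrees on UTF-8, so no conversion happens at display time.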

P.S. It is actually possible to reencode the data every time you display it, but that does not solve the problem of having a database full of incorrectly encoded data.

Answered by ikegami

We can't find the error in your code if you don't show us your code, so we're very limited in how we can help you.

You told the browser to interpret the document as being UTF-8 rather than Windows-1256, but did you actually change the encoding used from Windows-1256 to UTF-8?

For example,

$ cat a.pl
use strict;
use warnings;
use feature qw( say );
use charnames ':full';

my $enc = $ARGV[0] or die;
binmode STDOUT, ":encoding($enc)";

print <<"__EOI__";
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=$enc">
<title>Foo!</title>
</head>
<body dir="rtl">
\N{ARABIC LETTER ALEF}\N{ARABIC LETTER LAM}\N{ARABIC LETTER AIN}\N{ARABIC LETTER REH}\N{ARABIC LETTER BEH}\N{ARABIC LETTER YEH}\N{ARABIC LETTER TEH MARBUTA}
</body>
</html>
__EOI__

$ perl a.pl UTF-8 > utf8.html

$ perl a.pl Windows-1256 > cp1256.html
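
For readers working in PHP rather than Perl, the same point (the declared charset must match the bytes actually sent) can be checked directly. The sketch below is an editorial addition, and the helper name isValidUtf8 is made up. It tests whether a byte string is well-formed UTF-8; browsers typically show the diamond-with-question-mark replacement character for byte sequences that are not.

```php
<?php
/** Return true if $bytes is a well-formed UTF-8 sequence. */
function isValidUtf8(string $bytes): bool
{
    // With the /u modifier, PCRE refuses subjects that are not valid UTF-8.
    return preg_match('//u', $bytes) === 1;
}

$utf8   = "\xD8\xA7\xD9\x84";  // ALEF LAM encoded as UTF-8
$cp1256 = "\xC7\xE1";          // ALEF LAM encoded as Windows-1256

var_dump(isValidUtf8($utf8));    // bool(true)
var_dump(isValidUtf8($cp1256));  // bool(false)
```

Running a check like this on the text coming out of the database would show whether the data was really converted or only relabeled.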

Answered by mostafa khansa

In order to display Arabic characters correctly, you need to convert your PHP file to UTF-8 without BOM. This happened to me: Arabic characters were displayed as diamonds, but converting the file to UTF-8 without BOM solved the problem.
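
If an editor option for "UTF-8 without BOM" is not at hand, the byte order mark can also be removed programmatically. This is a minimal sketch, assuming the file is already UTF-8; the function name and the file name are hypothetical. A BOM matters in PHP files because any bytes before the opening `<?php` tag are sent to the browser as output.

```php
<?php
/** Remove a leading UTF-8 byte order mark (EF BB BF), if present. */
function stripUtf8Bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($text, $bom, 3) === 0 ? substr($text, 3) : $text;
}

// Hypothetical file name, for illustration only; the file is rewritten in place.
$file = 'page.php';
if (is_file($file)) {
    file_put_contents($file, stripUtf8Bom(file_get_contents($file)));
}
```

Text without a BOM is returned unchanged, so the function is safe to run on files that are already clean.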
