MySQL: Converting iso-8859-1 data to UTF-8 in UTF8 and Latin1 tables

Notice: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/19497066/

Date: 2020-08-31 19:11:11  Source: igfitidea

Tags: mysql, database, utf-8, character-encoding, percona

Asked by David

Problem Summary:

While converting a site backed by a MySQL database from latin1 to utf8, some special characters do not display correctly, even though all charsets have been set to utf8 system-wide.

Problem Details:

This is a common problem. But I seem to have an added complexity.

Years ago, an oblivious developer (me) put a site together with MySQL. Some tables were set up with latin1_swedish_ci and others with utf8_general_ci. All input/display was done via pages with the iso-8859-1 charset.

Now I have the task of turning all this data into utf-8 and thus finally making the encoding uniform. However, I'm having issues with a number of special characters in both kinds of table (e.g. ü). The characters don't display correctly on a UTF-8 page; they display as ?. Instead, when viewing the data in a utf8 table in MySQL Query Browser, a correctly entered utf8'd 'u' displays as some special characters, while an incorrectly entered latin1 'u' displays as it should appear on the page. But on the page it doesn't.
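
That symptom is consistent with bytes being interpreted in the wrong encoding. A minimal Python sketch of both directions follows; it is not from the original post, and the variable names are illustrative only:

```python
# Illustrative only: one character stored as bytes in two encodings,
# then interpreted with the wrong one.
char = "ü"

utf8_bytes = char.encode("utf-8")      # two bytes: C3 BC
latin1_bytes = char.encode("latin-1")  # one byte:  FC

# UTF-8 bytes read as Latin-1 show up as two unrelated characters.
utf8_read_as_latin1 = utf8_bytes.decode("latin-1")

# A lone Latin-1 byte is not valid UTF-8; a lenient decoder substitutes U+FFFD.
latin1_read_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(utf8_bytes.hex(), latin1_bytes.hex())
print(repr(utf8_read_as_latin1), repr(latin1_read_as_utf8))
```

The first case matches "displays as some special characters" in a Latin-1 viewer, and the second matches the replacement glyph on a UTF-8 page.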

I've tried a number of things:

  1. Percona script: https://github.com/rlowe/mysql_convert_charset
  2. Converting the column to binary and then to utf8
  3. Converting utf8 tables to latin1 and then repeating the above process

Nothing seems to cure the data.

Dumping the entire database and importing it again isn't really a viable option, as it's a huge database now and downtime is restricted.

UPDATE (22-Oct-2013)

I've taken @deceze's suggestions and reviewed all my content encoding areas as per http://kunststube.net/frontback/. I did find a few places in which I was still passing/encoding data in latin1, so I've now changed it all over to UTF-8. However, the data is still displaying incorrectly in a particular field. In a table which is in utf8 (no columns have implicit encoding), field1 is in latin1. I can confirm this by running the following, which displays the text correctly:

SELECT CONVERT(CAST(CONVERT(field1 USING latin1) AS BINARY) USING utf8) FROM my_table WHERE id = 1;

This will convert Hahnem??hle to Hahnemühle.
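
The same round trip can be sketched outside the database. This Python sketch assumes the two "?" characters in the garbled value were originally the characters Ã and ¼ (the page shows only "?", so that is a guess), and it uses Python's latin-1 codec, which may differ from MySQL's latin1 for bytes 0x80 to 0x9F. The function name is hypothetical:

```python
def undo_latin1_misread(text: str) -> str:
    """Re-encode the text as Latin-1 to recover the raw bytes, then decode them as UTF-8."""
    return text.encode("latin-1").decode("utf-8")


# Assumed garbled form: the UTF-8 bytes of 'ü' (C3 BC) shown as two Latin-1 characters.
garbled = "Hahnem\u00c3\u00bchle"
print(undo_latin1_misread(garbled))
```

This mirrors the nested CONVERT/CAST: the inner step recovers the bytes, and the outer step reinterprets them as UTF-8.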

In field2, it appears the data is in a different (unknown) encoding. The query above, when used on field2, converts Hahnem???hle to Hahnem?hle. I've gone through all the charsets listed at http://dev.mysql.com/doc/refman/5.5/en/charset-charsets.html, substituting each one for latin1, but none seems to produce the correct data.
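
One possibility, which the post does not confirm, is that field2 was mis-converted more than once, so a single pass only peels off one layer. The Python sketch below tries several candidate codecs on a sample string and keeps those whose round trip succeeds. The helper name, the candidate list and the sample value are all assumptions for illustration:

```python
def try_source_encodings(garbled: str, candidates: list[str]) -> dict[str, str]:
    """Return {codec: repaired_text} for every codec whose round trip succeeds."""
    results = {}
    for codec in candidates:
        try:
            results[codec] = garbled.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this codec cannot explain the stored characters
    return results


# Hypothetical sample: 'ü' encoded as UTF-8 twice, each time misread as cp1252.
sample = "ü".encode("utf-8").decode("cp1252").encode("utf-8").decode("cp1252")

first_pass = try_source_encodings(sample, ["latin-1", "cp1252"])
second_pass = try_source_encodings(first_pass["cp1252"], ["latin-1", "cp1252"])
print(first_pass, second_pass)
```

In this sample only cp1252 survives the first pass, and a second pass is needed to reach the intended character, which is one way a single CONVERT could still return garbled text.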

Answered by Gigamegs

You can try mysqldump to convert from ISO-8859-1 to utf-8:

mysqldump --user=username --password=password --default-character-set=latin1 --skip-set-charset dbname > dump.sql
chgrep latin1 utf8 dump.sql    # or, if you prefer (BSD/macOS sed syntax): sed -i "" 's/latin1/utf8/g' dump.sql
mysql --user=username --password=password --execute="DROP DATABASE dbname; CREATE DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci;"
mysql --user=username --password=password --default-character-set=utf8 dbname < dump.sql
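
Note that a blanket replace of latin1 also rewrites any occurrence of that word inside row data. A more conservative sketch in Python, assuming the dump uses the usual "CHARSET=", "CHARACTER SET" and "COLLATE" declarations and puts each INSERT statement on a single line, might rewrite only the declarations. The function and pattern names are hypothetical, and utf8_general_ci is taken from the CREATE DATABASE command above:

```python
import re

# Charset declarations such as "DEFAULT CHARSET=latin1" or "CHARACTER SET latin1".
CHARSET_DECL = re.compile(r"\b(CHARSET\s*=\s*|CHARACTER SET\s+)latin1\b", re.IGNORECASE)
# Collation declarations such as "COLLATE latin1_swedish_ci" or "COLLATE=latin1_bin".
COLLATE_DECL = re.compile(r"\b(COLLATE\s*=?\s*)latin1_\w+", re.IGNORECASE)


def retarget_charset(line: str) -> str:
    """Rewrite latin1 declarations to utf8, leaving INSERT (row data) lines untouched."""
    if line.lstrip().upper().startswith("INSERT "):
        return line
    line = CHARSET_DECL.sub(r"\g<1>utf8", line)
    return COLLATE_DECL.sub(r"\g<1>utf8_general_ci", line)


print(retarget_charset(") ENGINE=InnoDB DEFAULT CHARSET=latin1;"))
```

Applied line by line to dump.sql, this would take the place of the chgrep/sed step; the other commands stay as given in the answer.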

Answered by deceze

Setting a column to latin1 and others to utf8 is perfectly fine in MySQL. There's no problem to be solved here as such. This charset parameter just influences how the data is stored internally. Which of course also means that you cannot store, for example, "漢字" in a latin1 column. But assuming you're just storing "Latin-1 characters" in there, that's fine.

MySQL has something commonly called the connection encoding. It tells MySQL which encoding the text you send from PHP (or elsewhere) is in, and which encoding you'd like back when retrieving data from MySQL. The column charset, the "input connection encoding" and the "output connection encoding" can all be different things; MySQL converts between them on the fly as needed.
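
To make that conversion concrete, here is a rough Python model of the server side. It is only a simulation: the function name is hypothetical, and Python's latin-1 codec stands in for MySQL's latin1. It shows what lands in a column when the connection encoding is declared correctly and when it is mislabeled:

```python
def store_via_connection(client_bytes: bytes, connection_charset: str, column_charset: str) -> bytes:
    """Model the server: decode with the declared connection charset, re-encode for the column."""
    return client_bytes.decode(connection_charset).encode(column_charset)


payload = "ü".encode("utf-8")  # what a UTF-8 page actually sends

utf8_column = store_via_connection(payload, "utf-8", "utf-8")      # declared correctly
latin1_column = store_via_connection(payload, "utf-8", "latin-1")  # converted for a latin1 column
mislabeled = store_via_connection(payload, "latin-1", "utf-8")     # connection wrongly declared latin1

print(utf8_column.hex())    # c3bc
print(latin1_column.hex())  # fc
print(mislabeled.hex())     # c383c2bc
```

With a correct declaration, either column charset stores the character properly; the mislabeled connection produces double-encoded bytes even though the column itself is utf8.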

So, assuming you've used the correct connection encodings so far and data is stored properly in your database and you've not tried to store non-Latin-1 characters in Latin-1 columns, all you need to do to update your column charsets to UTF-8 is:

ALTER TABLE table MODIFY column TEXT [...] CHARACTER SET utf8;

Answered by rob

You may get rid of the "glyph" characters (?) by applying utf8_encode to the string before displaying it in your page.
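
PHP's utf8_encode converts ISO-8859-1 to UTF-8, so it only helps when the string really is ISO-8859-1. The Python sketch below uses a stand-in function with the same name (an illustration, not the PHP implementation) to show both the helpful case and the double-encoding that results when the input is already UTF-8:

```python
def utf8_encode(raw: bytes) -> bytes:
    """Stand-in for PHP's utf8_encode: treat the input as ISO-8859-1 and emit UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


latin1_value = "Hahnemühle".encode("latin-1")
already_utf8 = "Hahnemühle".encode("utf-8")

fixed = utf8_encode(latin1_value).decode("utf-8")
double_encoded = utf8_encode(already_utf8).decode("utf-8")

print(fixed)
print(double_encoded)
```

The first value comes out intact, while the second shows two stray characters in place of the ü, so applying the function to every column without checking its actual encoding could damage correctly stored data.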
