How to detect UTF-8 characters in a Latin1-encoded column - MySQL

Question

Asked by dinie

I am about to undertake the tedious and gotcha-laden task of converting a database from Latin1 to UTF-8.

At this point I simply want to check what sort of data I have stored in my tables, as that will determine what approach I should use to convert the data.

Specifically, I want to check whether I have UTF-8 characters in the Latin1 columns. What would be the best way to do this? If only a few rows are affected, I can just fix them manually.

Option 1. Perform a MySQL dump and use Perl to search for UTF-8 characters?

Option 2. Use MySQL CHAR_LENGTH to find rows with multi-byte characters? e.g. SELECT name FROM clients WHERE LENGTH(name) != CHAR_LENGTH(name); Is this enough?

At the moment I have switched my MySQL client encoding to UTF-8.

Answer 1

Answered by tadman

Character encoding, like time zones, is a constant source of problems.

What you can do is look for any "high-ASCII" characters (bytes 0x80 to 0xFF), as these are either LATIN1 accented characters or symbols, or part of a UTF-8 multi-byte character. Telling the difference isn't going to be easy unless you cheat a bit.

To figure out which encoding is correct, you just SELECT two different versions and compare them visually. Here's an example:

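-- Show each suspect value decoded both ways so the two readings can be compared.
SELECT CONVERT(CONVERT(name USING BINARY) USING latin1) AS latin1,
       CONVERT(CONVERT(name USING BINARY) USING utf8) AS utf8
FROM users
-- Match only rows that contain at least one byte in the range 0x80-0xFF.
WHERE CONVERT(name USING BINARY) RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']');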

This is made unusually complicated because the MySQL regexp engine seems to ignore things like \x80, which makes it necessary to use the UNHEX() method instead.

This produces results like this:

latin1                utf8
----------------------------------------
BjÃ¶rn                Björn
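
For data that has already been dumped out of the database, the same comparison can be sketched in Python. This is only an illustration under assumptions: the sample value, the `has_high_bytes` and `both_readings` helper names are made up here, and Python's latin-1 codec is ISO-8859-1, whereas MySQL's latin1 is closer to Windows-1252, so bytes 0x80 to 0x9F may render differently.

```python
import re

# Assumption: the column holds the UTF-8 bytes of "Björn" (ö is C3 B6).
raw = "Björn".encode("utf-8")

# Same test as the RLIKE comparison: any byte in the range 0x80-0xFF.
HIGH_BYTE = re.compile(rb"[\x80-\xff]")


def has_high_bytes(data):
    """Return True if the byte string contains anything outside 7-bit ASCII."""
    return HIGH_BYTE.search(data) is not None


def both_readings(data):
    """Decode the same bytes as Latin-1 and as UTF-8 for side-by-side comparison."""
    as_latin1 = data.decode("latin-1")  # never fails: every byte maps to a character
    as_utf8 = data.decode("utf-8", errors="replace")  # invalid sequences become U+FFFD
    return as_latin1, as_utf8


if __name__ == "__main__":
    if has_high_bytes(raw):
        latin1_view, utf8_view = both_readings(raw)
        print(f"{latin1_view:<20}{utf8_view}")
```

Whichever column reads naturally tells you how the bytes were meant to be interpreted.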

Answer 2

Answered by deceze

Since your question is not completely clear, let's assume some scenarios:

Hitherto wrong connection: You've been connecting to your database incorrectly using the latin1 encoding, but have stored UTF-8 data in the database (the encoding of the column is irrelevant in this case). This is the case I described here. In this case, it's easy to fix: dump the database contents to a file through a latin1 connection. This translates the incorrectly stored data back into correctly encoded UTF-8, by the same round trip that has made it appear to work so far (read the aforelinked article for the gory details). You can then reimport the data into the database through a correctly set utf8 connection, and it will be stored as it should be.
Hitherto wrong column encoding: UTF-8 data was inserted into a latin1 column through a utf8 connection. In that case forget it, the data is gone. Any non-latin1 character will have been replaced by a ?.
Hitherto everything fine, henceforth added support for UTF-8: You have Latin-1 data correctly stored in a latin1 column, inserted through a latin1 connection, but want to expand that to also allow UTF-8 data. In that case just change the column encoding to utf8. MySQL will convert the existing data for you. Then just make sure your database connection is set to utf8 when you insert UTF-8 data.
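
The byte-level effect of the three scenarios can be imitated in Python. This is a rough model only, not what the MySQL server literally does: the sample strings and variable names are invented for illustration, and Python's latin-1 codec (ISO-8859-1) stands in for MySQL's latin1, which is closer to Windows-1252.

```python
original = "Björn"

# Scenario 1: UTF-8 bytes sent over a connection declared as latin1.
# Each byte is taken to be one Latin-1 character, so the text looks mangled.
stored_text = original.encode("utf-8").decode("latin-1")

# Dumping through a latin1 connection hands the same bytes back unchanged.
# Those bytes are valid UTF-8 and can be re-imported over a utf8 connection.
dumped = stored_text.encode("latin-1")
recovered = dumped.decode("utf-8")

# Scenario 2: characters with no Latin-1 equivalent are converted on insert
# and come out as '?'; the original text cannot be recovered afterwards.
lossy = "Bjørn 日本".encode("latin-1", errors="replace")

# Scenario 3: genuine Latin-1 data is re-encoded as UTF-8 without loss.
latin1_bytes = original.encode("latin-1")
converted = latin1_bytes.decode("latin-1").encode("utf-8")

if __name__ == "__main__":
    print(stored_text, recovered, lossy, converted)
```

Note that in the second scenario only the characters outside Latin-1 are lost; the ø survives because Latin-1 can represent it.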

Answer 3

Answered by Patrick James McDougle

There is a script on GitHub to help with this sort of thing.

Answer 4

Answered by tripleee

I would create a dump of the database and grep for all valid UTF8 sequences. Where to take it from there depends on what you get. There are multiple questions on SO about identifying invalid UTF8; you can basically just reverse the logic.

Edit: So basically, any field consisting entirely of 7-bit ASCII is safe, and any field containing an invalid UTF-8 sequence can be assumed to be Latin-1. The remaining data should be inspected. If you are lucky, a handful of obvious substitutions will fix the absolute majority (replace a two-character sequence such as Ã¶ with the Latin-1 character ö, etc).
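
The three-way split described in the edit can be sketched as a small Python function applied to each field of a dump. The `classify` name and its return labels are hypothetical, and the sketch assumes each field is available as raw bytes.

```python
def classify(field):
    """Sort a raw byte string into one of the three buckets described above."""
    if field.isascii():
        # Pure 7-bit ASCII reads the same in Latin-1 and UTF-8: safe.
        return "ascii"
    try:
        field.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so it can be assumed to be Latin-1.
        return "latin1"
    # Valid UTF-8 containing high bytes: probably UTF-8, but could be an
    # unlucky Latin-1 string, so it needs a human look.
    return "inspect"


if __name__ == "__main__":
    for sample in (b"Bjorn", b"Bj\xf6rn", b"Bj\xc3\xb6rn"):
        print(sample, classify(sample))
```

Only the rows landing in the last bucket need manual review, which keeps the inspection workload proportional to the genuinely ambiguous data.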
