Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/9764898/

Date: 2020-08-31 12:36:05  Source: igfitidea

MySQL: character encoding used by SELECT INTO?

Tags: mysql, utf-8, character-encoding

Asked by David Wolever

I'm trying to export some data from a MySQL database, but weird and wonderful things are happening to unicode in that table.

I will focus on one character, the left smartquote: “

When I use SELECT from the console, it is printed without issue:

mysql> SELECT text FROM posts;
+-------+
| text  |
+-------+
| “foo” |
+-------+

This means the data are being sent to my terminal as utf-8[0] (which is correct).

However, when I use SELECT * FROM posts INTO OUTFILE '/tmp/x.csv' …;, the output file is not correctly encoded:

$ cat /tmp/x.csv
â€œfooâ€

Specifically, the “ is encoded with seven (7!) bytes: \xc3\xa2\xe2\x82\xac\xc5\x93.

What encoding is this? Or how could I tell MySQL to use a less unreasonable encoding?

Also, some miscellaneous facts:

  • SELECT @@character_set_database returns latin1
  • The text column is a VARCHAR(42):
    mysql> DESCRIBE posts;
    +-------+-------------+------+-----+---------+-------+
    | Field | Type        | Null | Key | Default | Extra |
    +-------+-------------+------+-----+---------+-------+
    | text  | varchar(42) | NO   | MUL |         |       |
    +-------+-------------+------+-----+---------+-------+
    
  • “ encoded as utf-8 yields \xe2\x80\x9c
  • \xe2\x80\x9c decoded as latin1 then re-encoded as utf-8 yields \xc3\xa2\xc2\x80\xc2\x9c (6 bytes).
  • Another data point: … (utf-8: \xe2\x80\xa6) is encoded to \xc3\xa2\xe2\x82\xac\xc2\xa6

[0]: as smart quotes aren't included in any 8-bit encoding, and my terminal correctly renders utf-8 characters.

Accepted answer by taavi

Many programs/standards (including MySQL) assume that "latin1" means "cp1252", so the 0x80 byte is interpreted as a Euro symbol, which is where that \xe2\x82\xac bit (U+20AC) in the middle comes from.
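
The byte sequences quoted in the question can be reproduced outside MySQL. The following is a minimal Python sketch, assuming the damage is a plain "UTF-8 bytes misread as a single-byte charset, then re-encoded as UTF-8" round trip; the helper name `double_encode` is made up for illustration and is not part of MySQL or any library.

```python
def double_encode(text: str, misread_as: str) -> bytes:
    """Encode text as UTF-8, misread those bytes as `misread_as`,
    then encode the misread characters as UTF-8 again."""
    return text.encode("utf-8").decode(misread_as).encode("utf-8")


left_quote = "\u201c"  # the left smart quote from the question
ellipsis = "\u2026"    # the second data point from the question

# Strict ISO-8859-1: 0x80 and 0x9c become C1 control characters (6 bytes).
print(double_encode(left_quote, "latin-1"))  # b'\xc3\xa2\xc2\x80\xc2\x9c'

# cp1252: 0x80 becomes the Euro sign, 0x9c becomes "œ" (7 bytes).
print(double_encode(left_quote, "cp1252"))   # b'\xc3\xa2\xe2\x82\xac\xc5\x93'
print(double_encode(ellipsis, "cp1252"))     # b'\xc3\xa2\xe2\x82\xac\xc2\xa6'
```

Only the cp1252 variant matches the seven bytes seen in the exported file, which is consistent with the explanation above.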

When I try this, it works properly (but note how I put data in, and the variables set on the db server):

mysql> set names utf8; -- http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
mysql> create table sq (c varchar(10)) character set utf8;
mysql> show create table sq\G
*************************** 1. row ***************************
       Table: sq
Create Table: CREATE TABLE `sq` (
  `c` varchar(10) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.19 sec)

mysql> insert into sq values (unhex('E2809C'));
Query OK, 1 row affected (0.00 sec)

mysql> select hex(c), c from sq;
+--------+------+
| hex(c) | c    |
+--------+------+
| E2809C | “  |
+--------+------+
1 row in set (0.00 sec)

mysql> select * from sq into outfile '/tmp/x.csv';
Query OK, 1 row affected (0.02 sec)

mysql> show variables like "%char%";
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | utf8                       | 
| character_set_filesystem | binary                     | 
| character_set_results    | utf8                       | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

And from the shell:

/tmp$ hexdump -C x.csv
00000000  e2 80 9c 0a                                       |....|
00000004

Hopefully there's a useful tidbit in there…

Answer by mvd

Newer versions of MySQL have an option to set the character set in the outfile clause:

SELECT col1,col2,col3 
FROM table1 
INTO OUTFILE '/tmp/out.txt' 
CHARACTER SET utf8
FIELDS TERMINATED BY ','

Answer by dhruvbird

I've found that this works well.

SELECT convert(col_name USING latin1) FROM posts INTO OUTFILE '/tmp/x.csv' …;

Answer by paczor

As you can see, my MySQL database uses latin1 and the system is utf-8.

mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | latin1 |
| character_set_connection | latin1 |
| character_set_database   | latin1 |
| character_set_filesystem | binary |
| character_set_results    | latin1 |
| character_set_server     | latin1 |
| character_set_system     | utf8   |
+--------------------------+--------+
7 rows in set (0.00 sec)

Every time I tried to export a table I got a strangely encoded CSV file. So I put:

mysql_query("SET NAMES CP1252");
header('Content-Type: text/csv; charset=cp1252');
header('Content-Disposition: attachment;filename=output.csv');

in my export script.

Then I have pure UTF-8 output.

Answer by deceze

To specifically address your question "What is this?", you have answered it yourself:

I suspect this is because “Column values are dumped using the binary character set. In effect, there is no character set conversion.” - dev.mysql.com/doc/refman/5.0/en/select-into.html

That is the way MySQL stores utf8-encoded data internally. It's a terribly inefficient variation of Unicode storage, apparently using a full three bytes for most characters, and not supporting four-byte UTF-8 sequences.

As for how to convert it to real UTF-8 using INTO OUTFILE... I don't know. Using other mysqldump methods will do it though.
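
If a file has already been exported with the seven-byte sequences from the question, one possible after-the-fact repair is to undo the round trip in a script. This is only a sketch, assuming the damage is exactly "UTF-8 bytes read as cp1252, then written as UTF-8" as the accepted answer suggests; the function name `unmangle` is hypothetical. The latin-1 fallback covers the few byte values that cp1252 leaves undefined and that, on this assumption, were passed through as control characters.

```python
def unmangle(text: str) -> str:
    """Reverse a UTF-8 -> cp1252 -> UTF-8 double encoding.

    `text` is the damaged string, already decoded from the file as UTF-8.
    """
    raw = bytearray()
    for ch in text:
        try:
            # Map each character back to the single byte it was misread from.
            raw += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Assumption: bytes undefined in cp1252 (such as 0x9d) came
            # through as the control characters U+0080..U+009F.
            raw += ch.encode("latin-1")
    # The recovered bytes are the original UTF-8 data.
    return raw.decode("utf-8")


damaged = b"\xc3\xa2\xe2\x82\xac\xc5\x93".decode("utf-8")
print(unmangle(damaged) == "\u201c")  # True
```

Text that was never double-encoded may raise an error or decode to something else, so it is worth checking a sample before converting a whole file.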

Answer by singpolyma

You can execute MySQL queries using the CLI tool (I believe even with an output format so it prints out CSV) and redirect to a file. Should do charset conversion and still give you access to do joins, etc.
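
A rough sketch of that approach, driven from Python: it runs the `mysql` command-line client with an explicit connection character set and redirects its output to a file. The function names, database name, and output path are placeholders, and credentials are assumed to come from an option file such as ~/.my.cnf. Note that `--batch` prints tab-separated rows rather than true CSV.

```python
import subprocess


def build_export_command(database: str, query: str, charset: str = "utf8") -> list:
    """Build the argument list for the mysql command-line client."""
    return [
        "mysql",
        f"--default-character-set={charset}",  # charset for the client connection
        "--batch",                             # tab-separated rows, no table borders
        f"--execute={query}",                  # run this statement and exit
        database,
    ]


def export_to_file(database: str, query: str, out_path: str, charset: str = "utf8") -> None:
    """Run the query through the CLI client and write its stdout to out_path."""
    with open(out_path, "wb") as out:
        subprocess.run(
            build_export_command(database, query, charset),
            stdout=out,
            check=True,  # raise if the client exits with an error
        )


# Example call (hypothetical database name and path):
# export_to_file("mydb", "SELECT text FROM posts", "/tmp/x.tsv")
```

Because the rows pass through the client connection, the server converts them to the requested character set on the way out, which is the conversion this answer relies on.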

Answer by Hardeep

Try SET CHARACTER SET <blah> before your select, where <blah> is utf8, latin1, etc. See: http://dev.mysql.com/doc/refman/5.6/en/charset-connection.html

Or SET NAMES utf8; might work...

Answer by Burhan Khalid

You need to issue charset utf8 at the MySQL prompt before running the SELECT. This tells the server which encoding to use for the results.
