Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/4249745/

Time: 2020-09-20 00:24:20  Source: igfitidea

Does PostgreSQL varchar count using Unicode character length or ASCII character length?

Tags: postgresql, unicode

Asked by Ben Lopatin

I tried importing a database dump from a SQL file, and the insert failed when inserting the string Mér into a field defined as varying(3). I didn't capture the exact error, but it pointed to that specific value and the varying(3) constraint.

Given that I considered this unimportant to what I was doing at the time, I just changed the value to Mer, it worked, and I moved on.

Does the length limit on a varying field count the bytes of the string? What really boggles my mind is that this was dumped from another PostgreSQL database, so it doesn't make sense that the constraint could have allowed the value to be written in the first place.

Answered by araqnid

The length limit imposed by varchar(N) types, and the length calculated by the length function, is in characters, not bytes. So 'abcdef'::char(3) is truncated to 'abc', and 'a€cdef'::char(3) is truncated to 'a€c', even in a database encoded as UTF-8, where 'a€c' is encoded using 5 bytes.
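
The gap between characters and bytes can be checked outside the database. The following is a rough shell sketch, not part of the original answer: `byte_count` is a hypothetical helper, and the octal escapes are assumed to spell out the UTF-8 encoding of the three characters a, €, c.

```shell
#!/bin/sh
# byte_count is a hypothetical helper, not a PostgreSQL tool.
# It expands printf octal escapes in its argument and counts the resulting bytes.
byte_count() {
    # wc -c counts bytes in every locale; tr strips padding some wc builds emit.
    printf "$1" | wc -c | tr -d ' '
}

# 'abc': three ASCII characters, one byte each.
byte_count 'abc'
# 'a€c' in UTF-8: the euro sign is the three bytes 0xE2 0x82 0xAC.
byte_count 'a\342\202\254c'
```

Both strings are three characters long, so both satisfy a limit of 3 when the database counts characters, even though the second one occupies more bytes.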

If restoring a dump file complained that 'Mér' would not go into a varchar(3) column, that suggests you were restoring a UTF-8 encoded dump file into a SQL_ASCII database.

For example, I did this in a UTF-8 database:

create schema so4249745;
create table so4249745.t(key varchar(3) primary key);
insert into so4249745.t values('Mér');

And then dumped this and tried to load it into a SQL_ASCII database:

pg_dump -f dump.sql --schema=so4249745 --table=t
createdb -E SQL_ASCII -T template0 enctest
psql -f dump.sql enctest

And sure enough:

psql:dump.sql:34: ERROR:  value too long for type character varying(3)
CONTEXT:  COPY t, line 1, column key: "Mér"

By contrast, if I create the database enctest as encoding LATIN1 or UTF8, it loads fine.

This problem arises from the combination of dumping a database that uses a multi-byte character encoding and restoring it into a SQL_ASCII database. Using SQL_ASCII basically disables the transcoding of client data to server data and assumes one byte per character, leaving clients responsible for using the right character map. The dump file stores the string as UTF-8, which is four bytes, so a SQL_ASCII database sees four characters and regards the value as violating the constraint. It then prints the value, which my terminal reassembles as three characters.
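
The byte-wise check described above can be imitated in a few lines of shell. This is only an illustrative sketch under stated assumptions: `byte_count` and `fits_bytewise` are hypothetical helpers, the octal escapes stand in for the raw bytes found in the dump file, and PostgreSQL's real check happens inside the server.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; PostgreSQL performs this check internally.
byte_count() {
    # Expand printf octal escapes and count the resulting bytes.
    printf "$1" | wc -c | tr -d ' '
}

# Mimics a length check that treats every byte as one character:
# the value fits only if its byte count is within the declared limit.
fits_bytewise() {
    if [ "$(byte_count "$1")" -le "$2" ]; then
        echo "fits"
    else
        echo "too long"
    fi
}

# 'Mér' as UTF-8 bytes (é = 0xC3 0xA9): four bytes against a limit of 3.
fits_bytewise 'M\303\251r' 3
# 'Mér' as LATIN1 bytes (é = 0xE9): three bytes against a limit of 3.
fits_bytewise 'M\351r' 3
```

The first call reports the value as too long, matching the error from the SQL_ASCII restore. The second shows that the same text stored in a single-byte encoding stays within the limit.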

Answered by vasquez

It depends on which encoding you chose when you created the database. createdb -E UNICODE creates a Unicode DB that should also accept multibyte characters and count each of them as one character.

You can use

psql -l

to see which encoding was used. This page has a table including information about how many bytes per character are used.