PostgreSQL COPY encoding: how do I do it?

Question

Asked by Blnpwr

I am importing a .txt file that contains IMDb information (such as movie name, movie ID, actors, directors, rating, votes, etc.) using the COPY statement. I am using 64-bit Ubuntu. The problem is that some actors have names with special characters, such as Jonas Åkerlund, and PostgreSQL throws an error:

ERROR: missing data for column "actors"
CONTEXT: COPY movies, line 3060: "tt0283003 Spun 2002 6.8 30801 101 mins. Jonas Ã"
********** Error **********
ERROR: missing data for column "actors"
SQL state: 22P04
Context: COPY movies, line 3060: "tt0283003 Spun 2002 6.8 30801 101 mins. Jonas Ã"

My copy statement looks like this:

COPY movie FROM '/home/max/Schreibtisch/imdb_top100t.txt' (DELIMITER E'\t', FORMAT CSV, NULL '');

I don't know exactly how to use the collation statement. Could you help me, please? As always, thank you.

Answer 1

Answered by Nick Barnes

Collation only determines how strings are sorted. The important thing when loading and saving them is the encoding.

By default, Postgres uses your client_encoding setting for COPY commands; if it doesn't match the encoding of the file, you'll run into problems like this.

You can see from the message that while trying to read the "Å", Postgres first read an "Ã", and then encountered some kind of error. The UTF8 byte sequence for "Å" is C3 85. C3 happens to be "Ã" in the LATIN1 codepage, while 85 is undefined*. So it's highly likely that the file is UTF8, but being read as if it were LATIN1.
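The byte-level reasoning can be checked outside the database. The following is a small Python sketch, not part of the original answer, that encodes "Å" as UTF8 and then decodes the same bytes as LATIN1. Note that Python's latin-1 codec maps byte 85 straight to U+0085; this only illustrates the mismatch and is not a statement about how Postgres converts the data internally.

```python
# The character from the actor's name, encoded the way a UTF8 file stores it.
utf8_bytes = "Å".encode("utf-8")
print(utf8_bytes.hex(" "))  # c3 85

# Decoding those same two bytes as LATIN1 yields two characters instead of one.
misread = utf8_bytes.decode("latin-1")
print([f"U+{ord(ch):04X}" for ch in misread])  # ['U+00C3', 'U+0085']
```

U+00C3 is "Ã", which matches the last visible character in the error message.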

It should be as simple as specifying the appropriate encoding in the COPY command:

COPY movie FROM '/home/max/Schreibtisch/imdb_top100t.txt'
(DELIMITER E'\t', FORMAT CSV, NULL '', ENCODING 'UTF8');
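If you are unsure which encoding the file really uses, one rough check before running COPY is to see whether its bytes decode cleanly as UTF8. The sketch below assumes Python is available; the helper name looks_like_utf8 is made up for illustration. A file that passes is not guaranteed to be UTF8 (pure ASCII passes too), but a file that fails is certainly not valid UTF8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# The same name stored in the two encodings discussed above.
sample_utf8 = "Jonas Åkerlund".encode("utf-8")
sample_latin1 = "Jonas Åkerlund".encode("latin-1")

print(looks_like_utf8(sample_utf8))    # True
print(looks_like_utf8(sample_latin1))  # False
```

For a real file, read it in binary mode, for example open(path, "rb").read(), and pass the result to the helper.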

*I believe Postgres actually maps these "gaps" in LATIN1 to the corresponding Unicode code points. 85 becomes U+0085, a.k.a. "NEXT LINE", which explains why it was treated as a CSV row terminator.
