Original source: http://stackoverflow.com/questions/30916853/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Postgresql COPY encoding, how to?
Asked by Blnpwr
I am importing a .txt file that contains IMDb information (such as movie name, movie ID, actors, directors, rating votes, etc.) using the COPY statement. I am using Ubuntu 64-bit. The problem is that some actors have names with non-ASCII characters, such as Jonas Åkerlund. On such a row PostgreSQL throws an error:
ERROR: missing data for column "actors"
CONTEXT: COPY movies, line 3060: "tt0283003 Spun 2002 6.8 30801 101 mins. Jonas Ã"
********** Error **********
ERROR: missing data for column "actors"
SQL state: 22P04
Context: COPY movies, line 3060: "tt0283003 Spun 2002 6.8 30801 101 mins. Jonas Ã"
My copy statement looks like this:
COPY movie FROM '/home/max/Schreibtisch/imdb_top100t.txt' (DELIMITER E'\t', FORMAT CSV, NULL '');
I do not know exactly how to use the collation statement. Could you please help me? As always, thank you.
Answered by Nick Barnes
Collation only determines how strings are sorted. The important thing when loading and saving them is the encoding.
By default, Postgres uses your client_encoding setting for COPY commands; if it doesn't match the encoding of the file, you'll run into problems like this.
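Before changing any settings, it can help to confirm what the file actually contains (in psql, SHOW client_encoding; reports the setting on the Postgres side). The following is a minimal sketch, not part of the original answer: the helper name first_non_utf8_line is hypothetical, and it only checks whether every line of a file is valid UTF-8.

```python
from typing import Optional


def first_non_utf8_line(path: str) -> Optional[int]:
    """Return the 1-based number of the first line that is not valid UTF-8.

    Returns None when the whole file decodes cleanly as UTF-8.
    (Hypothetical helper for illustration; the name is an assumption.)
    """
    # Open in binary mode so Python does no decoding of its own.
    with open(path, "rb") as handle:
        # Iterating a binary file splits on b"\n", which never occurs
        # inside a multi-byte UTF-8 sequence.
        for number, raw_line in enumerate(handle, start=1):
            try:
                raw_line.decode("utf-8")
            except UnicodeDecodeError:
                return number
    return None
```

If this returns None for a file that contains non-ASCII names, the file is very likely UTF-8. A line number instead points at the first row that would need a different encoding, such as LATIN1.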
You can see from the message that while trying to read the "Å", Postgres first read an "Ã", and then encountered some kind of error. The UTF8 byte sequence for "Å" is C3 85. C3 happens to be "Ã" in the LATIN1 codepage, while 85 is undefined*. So it's highly likely that the file is UTF8, but being read as if it were LATIN1.
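The byte-level reasoning above can be reproduced outside the database. This is a small illustrative sketch using Python's own codecs (not Postgres), with the actor's name from the question as sample data:

```python
name = "Jonas Åkerlund"

# Encode the name the way a UTF-8 file stores it on disk.
raw = name.encode("utf-8")
print(raw.hex(" "))  # the pair "c3 85" in the output is the letter Å

# Decode the same bytes with the wrong encoding, as a LATIN1 reader would.
misread = raw.decode("latin-1")
print(repr(misread))  # 'Jonas Ã\x85kerlund'

# Decoding with the correct encoding recovers the original text.
print(raw.decode("utf-8"))
```

The misread string shows the "Ã" seen in the error message, followed by the stray character produced by byte 85.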
It should be as simple as specifying the appropriate encoding in the COPY command:
COPY movie FROM '/home/max/Schreibtisch/imdb_top100t.txt'
(DELIMITER E'\t', FORMAT CSV, NULL '', ENCODING 'UTF8');
*I believe Postgres actually maps these "gaps" in LATIN1 to the corresponding Unicode code points. 85 becomes U+0085, a.k.a. "NEXT LINE", which explains why it was treated as a CSV row terminator.
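As a loose analogy for this footnote, Python's latin-1 codec also maps byte 85 to U+0085, and Python's str.splitlines treats that character as a line boundary. This sketch only shows Python's behaviour; it is an assumption that Postgres handles the character in a comparable way.

```python
# Byte 0x85 has no printable LATIN1 character; Python's latin-1 codec
# maps it straight to the code point with the same number.
gap_char = bytes([0x85]).decode("latin-1")
print(f"U+{ord(gap_char):04X}")  # U+0085

# U+0085 (NEXT LINE) counts as a line break for str.splitlines,
# so the misread name is cut in two right after the "Ã".
pieces = "Jonas Ã\u0085kerlund".splitlines()
print(pieces)  # ['Jonas Ã', 'kerlund']
```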