PostgreSQL: invalid byte sequence for encoding "UTF8"

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/4867272/

invalid byte sequence for encoding "UTF8"

postgresql import

Asked by mpen

I'm trying to import some data into my database. So I've created a temporary table,

create temporary table tmp(pc varchar(10), lat decimal(18,12), lon decimal(18,12), city varchar(100), prov varchar(2));

And now I'm trying to import the data,

 copy tmp from '/home/mark/Desktop/Canada.csv' delimiter ',' csv

But then I get the error,

ERROR:  invalid byte sequence for encoding "UTF8": 0xc92c

How do I fix that? Do I need to change the encoding of my entire database (if so, how?) or can I change just the encoding of my tmp table? Or should I attempt to change the encoding of the file?

Accepted answer by Mike Sherrill 'Cat Recall'

If you need to store UTF8 data in your database, you need a database that accepts UTF8. You can check the encoding of your database in pgAdmin. Just right-click the database, and select "Properties".

But that error seems to be telling you there's some invalid UTF8 data in your source file. That means that the copy utility has detected or guessed that you're feeding it a UTF8 file.
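
As an illustration (not part of the original answer), here is a minimal Python sketch of why the two bytes in the error message, 0xc9 and 0x2c, are rejected. It assumes those bytes appear back to back in the CSV file: 0xC9 announces a two-byte UTF-8 sequence, but 0x2C is a plain comma (the CSV delimiter), not a continuation byte.

```python
# The two bytes reported in the error message: 0xc9 followed by 0x2c.
bad = bytes([0xC9, 0x2C])

# As UTF-8 this is invalid: 0xC9 starts a two-byte sequence, but 0x2C
# (an ASCII comma) is not a continuation byte (those are 0x80-0xBF).
try:
    bad.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# As Latin-1 the same bytes are an accented capital letter and a comma.
as_latin1 = bad.decode("latin-1")

print(valid_utf8)
print(as_latin1)
```

This suggests the file is in a single-byte encoding such as Latin-1 or Windows-1252 rather than UTF-8, although the bytes alone cannot tell you which one.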

If you're running under some variant of Unix, you can check the encoding (more or less) with the file utility.

$ file yourfilename
yourfilename: UTF-8 Unicode English text

(I think that will work in the terminal on Macs, too.) I'm not sure how to do that under Windows.

If you use that same utility on a file that came from Windows systems (that is, a file that's not encoded in UTF8), it will probably show something like this:

$ file yourfilename
yourfilename: ASCII text, with CRLF line terminators
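
Where the file utility is unavailable or its guess is inconclusive, a strict UTF-8 decode gives a definite answer. The following Python sketch is an illustration under assumptions: the function name and the sample rows are made up, and in practice the bytes would come from reading the real file in binary mode.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start  # zero-based offset of the offending byte
    return None

# Hypothetical sample rows; for a real file use open(path, "rb").read().
sample_ok = "Montr\u00e9al,QC\n".encode("utf-8")
sample_bad = b"A1A 1A1,MONTR\xc9AL,QC\n"  # 0xC9 is Latin-1, not UTF-8

print(first_invalid_utf8_offset(sample_ok))
print(first_invalid_utf8_offset(sample_bad))
```

A result of None means the data is valid UTF-8 (plain ASCII also qualifies); any number is the position where the import would fail.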

If things stay weird, you might try to convert your input data to a known encoding, to change your client's encoding, or both. (We're really stretching the limits of my knowledge about encodings.)

You can use the iconv utility to change the encoding of the input data.

iconv -f original_charset -t utf-8 originalfile > newfile
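
If iconv is not at hand, the same conversion can be sketched in Python. This is an illustration, not the answer's method: the function name, the file names and the choice of latin-1 as the source encoding are all assumptions, and you still have to know (or guess) the original charset yourself.

```python
import os
import tempfile

def transcode_to_utf8(src_path, dst_path, source_encoding):
    """Rewrite src_path as UTF-8, assuming it is encoded in source_encoding."""
    # newline="" keeps line endings (e.g. CRLF) exactly as they are in the input.
    with open(src_path, "r", encoding=source_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)

# Demo on a throwaway file with one hypothetical row in Latin-1.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.csv")
    converted = os.path.join(tmp, "converted.csv")
    with open(original, "wb") as f:
        f.write(b"H2X 1Y4,MONTR\xc9AL,QC\r\n")
    transcode_to_utf8(original, converted, "latin-1")
    with open(converted, "rb") as f:
        result = f.read()

print(result)
```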

You can change psql (the client) encoding following the instructions on Character Set Support. On that page, search for the phrase "To enable automatic character set conversion".

Answer by Nobu

psql=# copy tmp from '/path/to/file.csv' with delimiter ',' csv header encoding 'windows-1251';

Adding the encoding option worked in my case.

Answer by mpen

Apparently I can just set the encoding on the fly,

 set client_encoding to 'latin1'

And then re-run the query. Not sure what encoding I should be using though.

latin1 made the characters legible, but most of the accented characters were in upper-case where they shouldn't have been. I assumed this was due to a bad encoding, but I think it's actually the data that was just bad. I ended up keeping the latin1 encoding, but pre-processed the data to fix the casing issues.

Answer by Abdellah Alaoui

If you are OK with discarding nonconvertible characters, you can use the -c flag

iconv -c -t utf8 filename.csv > filename.utf8.csv

and then copy the converted file into your table
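
To make the cost of discarding concrete, here is a small Python sketch. It is an analogy to what dropping unconvertible input does, not a reimplementation of iconv, and the sample bytes are made up: a Latin-1 encoded word is treated as UTF-8 once with the bad byte dropped, once with it marked, and once decoded with the encoding it was actually written in.

```python
# Latin-1 bytes for "MONTRÉAL", mistakenly treated as UTF-8.
raw = b"MONTR\xc9AL"

# Dropping the undecodable byte silently loses the accented letter.
dropped = raw.decode("utf-8", errors="ignore")

# Replacing it with U+FFFD at least leaves a visible marker in the data.
marked = raw.decode("utf-8", errors="replace")

# Decoding with the right source encoding loses nothing.
recovered = raw.decode("latin-1")

print(dropped, marked, recovered)
```

Discarding gets the import to run, but a city name with a missing letter is easy to overlook later, so it is worth checking how many characters were actually dropped.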

Answer by Yuri Levinsky

This error means that the encoding of the records in the file differs from the encoding of the connection. In this case iconv may return an error, sometimes even with the //IGNORE flag:

iconv -f ASCII -t utf-8//IGNORE < b.txt > /a.txt

iconv: illegal input sequence at position (some number)

The trick is to find the incorrect characters and replace them. To do it on Linux, use the "vim" editor:

Run vim (your text file), press the "ESC" key and type ":goto (number returned by iconv)"

To find non-ASCII characters you may use the following command:

grep --color='auto' -P "[\x80-\xFF]"
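
The same search can be sketched in Python, which also reports where each byte sits. This is an illustration only: the function name and the sample rows are assumptions, and the offsets are zero-based, so compare them carefully with the position iconv prints.

```python
def non_ascii_positions(data: bytes):
    """Return (line_number, byte_offset, byte_value) for every byte >= 0x80."""
    hits = []
    line_number = 1
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            hits.append((line_number, offset, byte))
        elif byte == 0x0A:  # newline: following bytes belong to the next line
            line_number += 1
    return hits

# Hypothetical sample; for a real file use open(path, "rb").read().
sample = b"K1A 0B1,OTTAWA,ON\nH2X 1Y4,MONTR\xc9AL,QC\n"

for line_number, offset, byte in non_ascii_positions(sample):
    print(f"line {line_number}, byte offset {offset}: 0x{byte:02x}")
```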

If you remove the incorrect characters, check whether you still need to convert your file: the problem is probably already solved.

Answer by BobG

It depends on what type of machine/encoding generated your import file.

If you're getting it from an English or Western European version of Windows, your best bet is probably setting it to 'WIN1252'. If you are getting it from a different source, consult the list of character encodings here:

http://www.postgresql.org/docs/8.3/static/multibyte.html

If you're getting it from a Mac, you may have to run it through the "iconv" utility first to convert it from MacRoman to UTF-8.
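
The reason the source machine matters can be shown with one character. In this Python sketch (an illustration, using Python's cp1252 and mac_roman codecs as stand-ins for WIN1252 and MacRoman), the letter é is stored as a different byte in each encoding, and reading the Mac byte as Windows-1252 yields a different, wrong letter instead of an error.

```python
e_acute = "\u00e9"  # the letter é

windows_bytes = e_acute.encode("cp1252")
mac_bytes = e_acute.encode("mac_roman")

# Reading the MacRoman byte with the wrong single-byte encoding does not
# fail; it quietly produces a different character.
misread = mac_bytes.decode("cp1252")

print(windows_bytes, mac_bytes, misread)
```

Because single-byte encodings rarely raise errors when mixed up, a successful import is not proof that the right encoding was chosen; spot-check a few accented values afterwards.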

Answer by Vishal Chhatwani

Well I was facing the same problem. And what solved my problem is this:

In Excel, click Save As. From "Save as type", choose .csv. Click Tools, then choose Web Options from the drop-down list. Under the Encoding tab, save the document as Unicode (UTF-8). Click OK. Save the file. Done!

Answer by Ramesh R

Follow the steps below to solve this issue in pgAdmin:

  1. SET client_encoding = 'ISO_8859_5';

  2. COPY tablename(column names) FROM 'D:/DB_BAK/csvfilename.csv' WITH DELIMITER ',' CSV ;

Answer by Ed Doerr

I had the same problem, and found a nice solution here: http://blog.e-shell.org/134

This is caused by a mismatch in your database encodings, surely because the database from where you got the SQL dump was encoded as SQL_ASCII while the new one is encoded as UTF8. .. Recode is a small tool from the GNU project that let you change on-the-fly the encoding of a given file.

So I just recoded the dumpfile before playing it back:

postgres> gunzip -c /var/backups/pgall_b1.zip | recode iso-8859-1..u8 | psql test

In Debian or Ubuntu systems, recode can be installed via package.
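
Where recode is not installed, the decompress-and-recode step can be sketched in Python. This is an illustration under assumptions: the function name and the sample dump line are made up, iso-8859-1 is assumed as the source encoding as in the command above, and the whole dump is read into memory, which is only reasonable for small files.

```python
import gzip
import io

def recode_gzip_to_utf8(gz_bytes, source_encoding="iso-8859-1"):
    """Decompress gzip data and re-encode its text from source_encoding to UTF-8."""
    # newline="" leaves the line endings of the dump untouched.
    with gzip.open(io.BytesIO(gz_bytes), "rt",
                   encoding=source_encoding, newline="") as src:
        return src.read().encode("utf-8")

# Hypothetical one-line dump containing a Latin-1 encoded é (0xE9).
dump = b"INSERT INTO city VALUES ('Montr\xe9al');\n"
recoded = recode_gzip_to_utf8(gzip.compress(dump))

print(recoded)
```

The recoded bytes can then be written to a file and fed to psql in place of the original dump.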

Answer by Richard Greenwood

You can use sed to replace the backslash character with, for example, a pipe character.

sed -i -- 's/\\/|/g' filename.txt