Python pandas to_csv: ascii can't encode character

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/41228697/

Date: 2020-08-20 00:38:23  Source: igfitidea

pandas to_csv: ascii can't encode character

Tags: python, pandas, unicode, utf-8

Asked by ale19

I'm trying to read and write a dataframe to a pipe-delimited file. Some of the characters are non-Roman letters (´ and other accented characters). But it breaks when I try to write out the accents as ASCII.


df = pd.read_csv('filename.txt',sep='|', encoding='utf-8')
<do stuff>
newdf.to_csv('output.txt', sep='|', index=False, encoding='ascii')

-------

  File "<ipython-input-63-ae528ab37b8f>", line 21, in <module>
    newdf.to_csv(filename,sep='|',index=False, encoding='ascii')

  File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1344, in to_csv
    formatter.save()

  File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1551, in save
    self._save()

  File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1652, in _save
    self._save_chunk(start_i, end_i)

  File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1678, in _save_chunk
    lib.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer)

  File "pandas\lib.pyx", line 1075, in pandas.lib.write_csv_rows (pandas\lib.c:19767)

UnicodeEncodeError: 'ascii' codec can't encode character '\xb4' in position 7: ordinal not in range(128)
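For context (not part of the original question), '\xb4' is U+00B4 ACUTE ACCENT, which has no ASCII representation; a two-line check reproduces exactly the failure the traceback reports:

```python
import unicodedata

ch = "\xb4"
print(unicodedata.name(ch))  # ACUTE ACCENT

try:
    ch.encode("ascii")
except UnicodeEncodeError as exc:
    # The same failure pandas hits internally when encoding='ascii'
    print(exc)
```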


If I change to_csv to have utf-8 encoding, then I can't read the file in properly:


newdf.to_csv('output.txt',sep='|',index=False,encoding='utf-8')
pd.read_csv('output.txt', sep='|')

> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 2: invalid start byte
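For reference, a round trip where both to_csv and read_csv agree on utf-8 does preserve accented characters. The sketch below uses made-up data and a temporary file:

```python
import os
import tempfile

import pandas as pd

# Hypothetical data containing non-ASCII characters
df = pd.DataFrame({"text": ["café", "naïve", "´"]})

path = os.path.join(tempfile.mkdtemp(), "output.txt")
df.to_csv(path, sep="|", index=False, encoding="utf-8")

# Pass the same encoding on the way back in
roundtrip = pd.read_csv(path, sep="|", encoding="utf-8")
print(roundtrip["text"].tolist())  # ['café', 'naïve', '´']
```

One possible reading of the error above (an assumption, since the question doesn't show how the file was produced): byte 0xb4 is not a valid UTF-8 start byte but is ´ in Latin-1, so the file being read may actually be Latin-1-encoded, in which case `encoding='latin-1'` on read_csv would be worth trying.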

My goal is to have a pipe-delimited file that retains the accents and special characters.


Also, is there an easy way to figure out which line read_csv is breaking on? Right now I don't know how to get it to show me the bad character(s).


Accepted answer by AlexG

You have some characters that are not ASCII and therefore cannot be encoded as you are trying to do. I would just use utf-8 as suggested in a comment.


To check which lines are causing the issue you can try something like this:


def is_not_ascii(string):
    # True if any character falls outside the 7-bit ASCII range
    return string is not None and any(ord(s) >= 128 for s in string)

df[df[col].apply(is_not_ascii)]

You'll need to specify the column col you are testing.

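As a sketch with made-up data, the filter above returns only the rows whose values contain non-ASCII characters:

```python
import pandas as pd

def is_not_ascii(string):
    # True if any character falls outside the 7-bit ASCII range
    return string is not None and any(ord(s) >= 128 for s in string)

df = pd.DataFrame({"city": ["Paris", "Málaga", "Lyon", "Köln"]})
bad_rows = df[df["city"].apply(is_not_ascii)]
print(bad_rows["city"].tolist())  # ['Málaga', 'Köln']
```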

Answer by Ohad Zadok

Check the answer here


It's a much simpler solution:


newdf.to_csv("C:/tweetDF", sep='\t', encoding = 'utf-8')

Answer by Edward Weinert

Another solution is to use the string encode/decode functions with the 'ignore' option, but note that it removes the non-ASCII characters:


df['text'] = df['text'].apply(lambda x: x.encode('ascii', 'ignore').decode('ascii'))

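A caution worth making explicit: 'ignore' silently drops accented letters rather than transliterating them, so data is lost. A sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"text": ["café", "naïve résumé"]})
# Bytes outside ASCII are discarded entirely, not replaced
df["text"] = df["text"].apply(lambda x: x.encode("ascii", "ignore").decode("ascii"))
print(df["text"].tolist())  # ['caf', 'nave rsum']
```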