How to read a dataframe of encoded strings from csv in python

Note: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/15610083/

Time: 2020-09-13 20:44:27  Source: igfitidea

Tags: python, utf-8, pandas

Asked by fabrizio_ff

Suppose I read an html website and I get a list of names, such as: 'Amiel, Henri-Frédéric'.

In order to get the list of names I decode the html using the following code:

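import urllib  # Python 2; in Python 3 urlopen lives in urllib.request

# t is presumably an HTML parser instance created earlier (its definition is not shown)
f = urllib.urlopen("http://xxx.htm")
html = f.read()               # raw bytes from the server
html = html.decode('utf8')    # bytes -> unicode
t.feed(html)
t.close()
lista = t.data                # names collected by the parser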

At this point, the variable lista contains a list of names like:

此时,变量 lista 包含一个名称列表,例如:

[u'Abatantuono, Diego', ... , u'Amiel, Henri-Frédéric']
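
The question never shows how `t` is defined, so here is a minimal sketch of what such a parser might look like. It is written for Python 3 (the question uses Python 2), the class name `NameCollector` and the `<li>` markup are assumptions made up for illustration, and the HTML is an inline string rather than a real page.

```python
from html.parser import HTMLParser


class NameCollector(HTMLParser):
    """Hypothetical parser: collects the text of every <li> element."""

    def __init__(self):
        super().__init__()
        self.data = []          # collected names, like t.data in the question
        self._in_item = False   # True while we are inside an <li> element

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        # Keep only non-empty text found inside <li> elements.
        if self._in_item and data.strip():
            self.data.append(data.strip())


# Stand-in for f.read(): UTF-8 bytes as they would arrive from the server.
raw = "<ul><li>Abatantuono, Diego</li><li>Amiel, Henri-Frédéric</li></ul>".encode("utf8")

t = NameCollector()
t.feed(raw.decode("utf8"))  # decode once at the boundary, then work with text
t.close()
lista = t.data
print(lista)
```

The point of the sketch is only that decoding happens once, right after reading the bytes, and everything after that is plain text.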

Now I would like to:

  1. put these names inside a DataFrame;
  2. save the DataFrame in a csv file;
  3. read the csv in Python through a DataFrame

For simplicity, let's consider just the name above to complete steps 1 to 3. I would use the following code:

import pandas as pd

name = u'Amiel, Henri-Fr\xe9d\xe9ric'
name = name.encode('utf8')   # unicode -> UTF-8 byte string (Python 2 str)
array = [name]
df = pd.DataFrame({'Names': array})
df.to_csv('names')
uni = pd.read_csv('names')
uni  # trying to read the csv file in a DataFrame

At this point I get the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 67: invalid continuation byte      
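
To make the message a bit more concrete: in UTF-8, "é" is the two bytes 0xc3 0xa9, while a lone 0xe9 is how Latin-1 encodes it. The sketch below (Python 3, standard library only) reproduces the same kind of error by handing Latin-1 bytes to the UTF-8 decoder. It is only an illustration of what the message means and does not claim to be the exact code path that raised it in the question.

```python
name = "Amiel, Henri-Fr\xe9d\xe9ric"

utf8_bytes = name.encode("utf-8")      # each é becomes the two bytes 0xc3 0xa9
latin1_bytes = name.encode("latin-1")  # each é becomes the single byte 0xe9

print(repr(utf8_bytes))
print(repr(latin1_bytes))

# Decoding with the codec that produced the bytes gives the text back.
assert utf8_bytes.decode("utf-8") == name

# Decoding Latin-1 bytes as UTF-8 fails: 0xe9 announces a multi-byte
# sequence, but the next byte ('d') is not a valid continuation byte.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```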

If I replace the last line of the above code with:

print uni

I can read the DataFrame but I don't think it is the right way to handle this issue.

I have read many questions posted by other users on this topic, but I haven't managed to solve this one.

Answered by root

Both the to_csv method and the read_csv function take an encoding argument. Use it, and work with unicode internally. If you don't, trying to encode/decode inside your program will come back to bite you.

import pandas as pd

name = u'Amiel, Henri-Fr\xe9d\xe9ric'
array = [name]
df = pd.DataFrame({'Names':array})
df.to_csv('names', encoding='utf-8')
uni = pd.read_csv('names', index_col = [0], encoding='utf-8')
print uni  # for me it works with or without print

out:

                   Names
0  Amiel, Henri-Frédéric
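
For readers on Python 3, where `str` is already unicode, the same round trip can be sketched as below. The helper name `round_trip` and the temporary file are assumptions added for illustration; the only pandas pieces used are the `encoding` arguments of `to_csv` and `read_csv` mentioned in the answer.

```python
import os
import tempfile

import pandas as pd


def round_trip(names):
    """Write names to a UTF-8 csv file and read them back as a DataFrame."""
    df = pd.DataFrame({"Names": names})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "names.csv")
        # Keep the names as text; let pandas encode on write...
        df.to_csv(path, encoding="utf-8")
        # ...and decode on read, using the same encoding.
        return pd.read_csv(path, index_col=0, encoding="utf-8")


uni = round_trip(["Amiel, Henri-Fr\xe9d\xe9ric"])
print(uni)
```

Note that there is no manual `.encode()` call anywhere: the text stays unicode inside the program and the encoding is named only at the file boundary.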