Original question: http://stackoverflow.com/questions/36462852/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to read UTF-8 files with Pandas?
Asked by Istvan
I have a UTF-8 file with Twitter data and I am trying to read it into a pandas data frame, but the string columns only come back with the 'object' dtype instead of as unicode strings:
# file 1459966468_324.csv
#1459966468_324.csv: UTF-8 Unicode English text
df = pd.read_csv('1459966468_324.csv', dtype={'text': unicode})
df.dtypes
text object
Airline object
name object
retweet_count float64
sentiment object
tweet_location object
dtype: object
What is the right way of reading and coercing UTF-8 data into unicode with Pandas?
This does not solve the problem:
df = pd.read_csv('1459966468_324.csv', encoding = 'utf8')
df.apply(lambda x: pd.lib.infer_dtype(x.values))
Text file is here: https://raw.githubusercontent.com/l1x/nlp/master/1459966468_324.csv
Answered by Sam
As the other poster mentioned, you might try:
df = pd.read_csv('1459966468_324.csv', encoding='utf8')
However, this can still show 'object' when you print the dtypes. To confirm that the values are unicode, try this line after reading the CSV:
df.apply(lambda x: pd.lib.infer_dtype(x.values))
Example output:
args unicode
date datetime64
host unicode
kwargs unicode
operation unicode
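The pd.lib namespace used above is not exposed in recent pandas releases; the public equivalent is pd.api.types.infer_dtype. The following is a minimal sketch of the same check on a current pandas, assuming Python 3. The in-memory CSV and its column values are made up for illustration and stand in for the real file:

```python
import io

import pandas as pd

# Hypothetical UTF-8 encoded CSV bytes standing in for the real file.
raw = "text,retweet_count\nhéllo wörld,1.0\nflight delayed ✈,0.0\n".encode("utf-8")

df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")

# infer_dtype looks at the actual values in each column rather than
# at the storage dtype that df.dtypes reports.
inferred = df.apply(lambda col: pd.api.types.infer_dtype(col))
print(inferred)
```

On Python 3 the label reported for text columns is "string" rather than "unicode", since there is no separate unicode type.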
Answered by Stefan
Use the encoding keyword with the appropriate parameter:
df = pd.read_csv('1459966468_324.csv', encoding='utf8')
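To illustrate why the encoding has to match the file, here is a small sketch, assuming Python 3 and a current pandas. The byte string is a made-up example, not the file from the question. It is encoded as Latin-1, so decoding it as UTF-8 is expected to fail, while the matching encoding reads it cleanly:

```python
import io

import pandas as pd

# Hypothetical data: "café" encoded as Latin-1 is not valid UTF-8.
latin1_bytes = "text\ncafé\n".encode("latin-1")

try:
    pd.read_csv(io.BytesIO(latin1_bytes), encoding="utf-8")
    wrong_encoding_failed = False
except UnicodeDecodeError:
    # The mismatched encoding is rejected instead of silently garbling text.
    wrong_encoding_failed = True

# With the encoding that matches the bytes, the text is decoded correctly.
df_ok = pd.read_csv(io.BytesIO(latin1_bytes), encoding="latin-1")
print(wrong_encoding_failed, df_ok.loc[0, "text"])
```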
Answered by ptrj
Pandas stores strings as Python objects, which is why the column dtype is reported as object. In Python 3, all strings are unicode by default. So if you use Python 3, your data is already unicode (don't be misled by the object dtype).
If you have Python 2, then use df = pd.read_csv('your_file', encoding='utf8'). Then try, for example, pd.lib.infer_dtype(df.iloc[0,0]) (I am assuming the first column consists of strings).
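The point about Python 3 can be checked directly by looking at the type of an individual cell rather than at the column dtype. This is a minimal sketch, assuming Python 3; the CSV content and column names are invented for illustration:

```python
import io

import pandas as pd

# Hypothetical UTF-8 CSV with non-ASCII text in the first column.
raw = "text,name\nnaïve café,Zoë\n日本語のツイート,Istvan\n".encode("utf-8")

df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")

# The column dtype is a storage detail; each cell is an ordinary Python str.
value = df.iloc[0, 0]
print(type(value))

# Check every value in the column, not just the first one.
all_str = bool(df["text"].map(lambda v: isinstance(v, str)).all())
print(all_str)
```

If both checks pass, the text is already unicode and no further coercion is needed.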