Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/36462852/

Time: 2020-08-19 17:55:32  Source: igfitidea

How to read UTF-8 files with Pandas?

python csv pandas utf-8

Asked by Istvan

I have a UTF-8 file with Twitter data and I am trying to read it into a pandas DataFrame, but I only get the 'object' dtype instead of unicode strings:

# $ file 1459966468_324.csv
# 1459966468_324.csv: UTF-8 Unicode English text
import pandas as pd

# `unicode` is the Python 2 built-in text type
df = pd.read_csv('1459966468_324.csv', dtype={'text': unicode})
df.dtypes
text               object
Airline            object
name               object
retweet_count     float64
sentiment          object
tweet_location     object
dtype: object

What is the right way of reading and coercing UTF-8 data into unicode with Pandas?

This does not solve the problem:

df = pd.read_csv('1459966468_324.csv', encoding = 'utf8')
df.apply(lambda x: pd.lib.infer_dtype(x.values))

Text file is here: https://raw.githubusercontent.com/l1x/nlp/master/1459966468_324.csv

Answered by Sam

As the other poster mentioned, you might try:

df = pd.read_csv('1459966468_324.csv', encoding='utf8')

However, this may still show 'object' when you print the dtypes. To confirm that the values are unicode strings, run this line after reading the CSV:

df.apply(lambda x: pd.lib.infer_dtype(x.values))

Example output:

args            unicode
date         datetime64
host            unicode
kwargs          unicode
operation       unicode
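
The check above relies on `pd.lib`, which, as far as I know, is not exposed in recent pandas releases; the public helper is `pd.api.types.infer_dtype`. Here is a minimal sketch of the same check, assuming Python 3 and a reasonably recent pandas. The small in-memory CSV and its rows are made up for illustration and stand in for the real file.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the real CSV file.
csv_text = "text,retweet_count\nhéllo wörld,1.0\nзаголовок,2.0\n"

df = pd.read_csv(io.StringIO(csv_text))

# Infer the kind of values held in each column, independent of the column dtype.
inferred = df.apply(lambda col: pd.api.types.infer_dtype(col, skipna=True))
print(inferred)
```

On Python 3 the text column should be reported as `string` rather than `unicode`, since Python 3 has no separate `unicode` type.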

Answered by Stefan

Use the encoding keyword with the appropriate value:

df = pd.read_csv('1459966468_324.csv', encoding='utf8')
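
To see the encoding keyword at work without the original file, the sketch below writes a tiny UTF-8 CSV to a temporary directory and reads it back. It assumes Python 3; the file name `sample.csv` and the rows are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample rows containing non-ASCII characters.
rows = "text,name\ncafé au lait,Zoë\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write(rows)

    # 'utf-8' and 'utf8' are aliases for the same codec.
    df = pd.read_csv(path, encoding="utf-8")

# The non-ASCII characters survive the round trip unchanged.
print(df.loc[0, "text"] == "café au lait")
```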

Answered by ptrj

Pandas stores strings in columns of dtype object. In Python 3, all strings are unicode by default. So if you use Python 3, your data is already unicode (don't be misled by the object dtype).

If you have Python 2, then use df = pd.read_csv('your_file', encoding='utf8'). Then try, for example, pd.lib.infer_dtype(df.iloc[0,0]) (I guess the first column consists of strings).
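
The Python 3 case can be checked directly by looking at the elements rather than the column dtype. This is a minimal sketch assuming Python 3; the sample row is made up, and the last step assumes pandas 1.0 or later, where the dedicated `"string"` dtype is available.

```python
import io

import pandas as pd

# Hypothetical one-row CSV standing in for the real file.
df = pd.read_csv(io.StringIO("text,retweet_count\nnaïve café,3\n"))

# Each element of the text column is an ordinary Python 3 str (unicode).
first_value = df["text"].iloc[0]
print(type(first_value))
print(all(isinstance(v, str) for v in df["text"]))

# Optional: convert to pandas' dedicated string dtype (assumes pandas >= 1.0).
text_as_string = df["text"].astype("string")
print(text_as_string.dtype)
```

The conversion in the last step is optional; it only changes how pandas labels and handles the column, not the text itself.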
