Python: How to read a UTF-8 file with Pandas?

Question

Asked by Istvan

I have a UTF-8 file with Twitter data and I am trying to read it into a Pandas data frame, but the string columns only show up with dtype 'object' instead of as unicode strings:

# file 1459966468_324.csv
#1459966468_324.csv: UTF-8 Unicode English text
df = pd.read_csv('1459966468_324.csv', dtype={'text': unicode})
df.dtypes
text               object
Airline            object
name               object
retweet_count     float64
sentiment          object
tweet_location     object
dtype: object

What is the right way of reading and coercing UTF-8 data into unicode with Pandas?

This does not solve the problem:

df = pd.read_csv('1459966468_324.csv', encoding = 'utf8')
df.apply(lambda x: pd.lib.infer_dtype(x.values))

Text file is here: https://raw.githubusercontent.com/l1x/nlp/master/1459966468_324.csv

Answer 1

Answered by Sam

As the other poster mentioned, you might try:

df = pd.read_csv('1459966468_324.csv', encoding='utf8')

However, this can still show 'object' when you print the dtypes. To confirm that the values are unicode strings, try this line after reading the CSV:

df.apply(lambda x: pd.lib.infer_dtype(x.values))

Example output:

args            unicode
date         datetime64
host            unicode
kwargs          unicode
operation       unicode
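
On current pandas releases pd.lib is no longer available; the public equivalent is pd.api.types.infer_dtype. The following is a minimal sketch of the same check, assuming Python 3 and a recent pandas. The in-memory sample data and its column names are made up for illustration and stand in for the CSV from the question. On Python 3 the label reported for text is 'string' rather than 'unicode'.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the CSV file from the question.
raw = "text,retweet_count\nça va très bien,1.0\nnaïve café,2.0\n".encode("utf-8")

df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")

# Inspect the actual Python values of each column. Passing a plain list
# makes infer_dtype look at the values themselves, not at the column dtype.
inferred = {name: pd.api.types.infer_dtype(col.tolist()) for name, col in df.items()}

for name, kind in inferred.items():
    print(name, kind)
```

A column whose values are all text is reported as 'string' here, even when df.dtypes prints 'object' for it.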

Answer 2

Answered by Stefan

Use the encoding keyword with the appropriate parameter:

df = pd.read_csv('1459966468_324.csv', encoding='utf8')
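
To illustrate why the encoding argument matters, here is a small sketch, assuming Python 3. It writes a throwaway UTF-8 file (the file name and contents are invented for the example) and reads it back twice, once with the matching encoding and once with a deliberately wrong one.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file; it stands in for the CSV from the question.
    path = os.path.join(tmp, "tweets.csv")
    with open(path, "w", encoding="utf-8", newline="") as fh:
        fh.write("text,retweet_count\nqué pena ✈,3\n")

    # Matching encoding: the bytes are decoded into the original text.
    good = pd.read_csv(path, encoding="utf-8").loc[0, "text"]

    # Wrong encoding: latin-1 accepts any byte, so there is no error,
    # but the multi-byte characters come back garbled.
    bad = pd.read_csv(path, encoding="latin-1").loc[0, "text"]

print(good)
print(good == bad)
```

The wrong encoding fails silently here, which is why it is worth passing encoding explicitly when the source encoding is known.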

Answer 3

Answered by ptrj

Pandas stores strings in columns of dtype object. In Python 3, all strings are unicode by default. So if you use Python 3, your data is already unicode (don't be misled by the dtype object).

If you have Python 2, then use df = pd.read_csv('your_file', encoding='utf8'). Then try, for example, pd.lib.infer_dtype(df.iloc[0,0]) (I guess the first column consists of strings).
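
The Python 3 point can be checked directly by looking at the type of a single cell instead of the column dtype. This is a minimal sketch, assuming Python 3; the sample row is invented for the example, and only the column names are borrowed from the question.

```python
import io

import pandas as pd

# Hypothetical one-row sample with non-ASCII text.
raw = "text,Airline\n¡Qué vuelo tan bueno!,Delta\n".encode("utf-8")

df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")

# The column dtype is how pandas stores the data; the individual values
# are ordinary Python str objects, which are unicode in Python 3.
first_value = df.iloc[0, 0]
print(df.dtypes)
print(type(first_value))
```

Depending on the pandas version, df.dtypes may report object or a dedicated string dtype for these columns. Either way, the cell value is a str.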
