pandas read_csv import gives mixed type for a column

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/25530836/



python, pandas

Asked by lessthanl0l

I have a csv file that contains 130,000 rows. After reading in the file using pandas' read_csv function, one of the columns ("CallGuid") has mixed object types.


I did:


df = pd.read_csv("data.csv")

Then I have this:


In [10]: df["CallGuid"][32767]
Out[10]: 4129237051L    

In [11]: df["CallGuid"][32768]
Out[11]: u'4129259051'

All rows <= 32767 are of type long and all rows > 32767 are unicode


Why is this?

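A minimal sketch for confirming the mix of types, assuming the same data.csv and the CallGuid column from the question:

import pandas as pd

df = pd.read_csv("data.csv")

# Count how many values of each Python type the column ended up with;
# a cleanly parsed integer column would report a single type here.
print(df["CallGuid"].map(type).value_counts())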

Answered by paulo.filip3

As others have pointed out, your data could be malformed, like having quotes or something...


Just try doing:


import pandas as pd
import numpy as np

df = pd.read_csv("data.csv", dtype={"CallGuid": np.int64})

It's also more memory efficient, since pandas doesn't have to guess the data types.
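If the explicit dtype cast raises because some values really are malformed, here is a minimal sketch for locating those rows, assuming a pandas version that provides pd.to_numeric (this is not part of the original answer):

import pandas as pd

df = pd.read_csv("data.csv")

# Coerce the column to numeric; values that cannot be parsed become NaN.
as_numbers = pd.to_numeric(df["CallGuid"], errors="coerce")

# Rows where the conversion failed hold the malformed values.
print(df[as_numbers.isnull()])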


Answered by WNG

OK, I just experienced the same problem, with the same symptom: df[column][n] changed type after n > 32767


I did indeed have a problem in my data, but not at line 32767 at all.


Finding and modifying those few problematic lines solved my problem. I managed to localize the problematic lines by using the following extremely dirty routine:


import pandas as pd

# Read the file in 10,000-row chunks and print the dtype pandas
# infers for the column in each chunk.
chunks = pd.read_csv('data.csv', chunksize=10000)
for i, chunk in enumerate(chunks):
    print("{} {}".format(i, chunk["Custom Dimension 02"].dtype))

I ran this and obtained:


0 int64
1 int64
2 int64
3 int64
4 int64
5 int64
6 object
7 int64
8 object
9 int64
10 int64

This told me that there was (at least) one problematic line between rows 60000 and 69999, and another between rows 80000 and 89999.


To localize them more precisely, you can use a smaller chunksize and print only the indices of the chunks that do not have the correct data type.

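A minimal sketch of that refinement, assuming a chunk size of 1000, that int64 is the expected dtype, and the same "Custom Dimension 02" column:

import pandas as pd
import numpy as np

# Re-read the file in small chunks and report only the chunks whose
# column did not come out as the expected int64 dtype.
chunks = pd.read_csv("data.csv", chunksize=1000)
for i, chunk in enumerate(chunks):
    if chunk["Custom Dimension 02"].dtype != np.int64:
        # Rows i*1000 .. i*1000 + len(chunk) - 1 contain at least one bad value.
        print("chunk {} (rows {}-{}) has dtype {}".format(
            i, i * 1000, i * 1000 + len(chunk) - 1,
            chunk["Custom Dimension 02"].dtype))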