pandas read_csv import gives mixed type for a column

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/25530836/



python, pandas

Asked by lessthanl0l

I have a csv file that contains 130,000 rows. After reading in the file using pandas' read_csv function, one of the columns ("CallGuid") has mixed object types.


I did:


df = pd.read_csv("data.csv")

Then I have this:


In [10]: df["CallGuid"][32767]
Out[10]: 4129237051L    

In [11]: df["CallGuid"][32768]
Out[11]: u'4129259051'

All rows <= 32767 are of type long and all rows > 32767 are unicode


Why is this?

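A minimal sketch for confirming the mix of types, assuming the same data.csv and the CallGuid column from the question:

import pandas as pd

df = pd.read_csv("data.csv")

# Count how many values of each Python type the column ended up with;
# a cleanly parsed integer column would report a single type here.
print(df["CallGuid"].map(type).value_counts())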

Answered by paulo.filip3

As others have pointed out, your data could be malformed, like having quotes or something...


Just try doing:


import pandas as pd
import numpy as np

df = pd.read_csv("data.csv", dtype={"CallGuid": np.int64})

It's also more memory efficient, since pandas doesn't have to guess the data types.
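If the explicit dtype cast raises because some values really are malformed, here is a minimal sketch for locating those rows, assuming a pandas version that provides pd.to_numeric (this is not part of the original answer):

import pandas as pd

df = pd.read_csv("data.csv")

# Coerce the column to numeric; values that cannot be parsed become NaN.
as_numbers = pd.to_numeric(df["CallGuid"], errors="coerce")

# Rows where the conversion failed hold the malformed values.
print(df[as_numbers.isnull()])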


Answered by WNG

OK, I just experienced the same problem, with the same symptom: df[column][n] changed type after n > 32767


I did indeed have a problem in my data, but not at line 32767 at all.


Finding and modifying those few problematic lines solved my problem. I managed to localize the problematic lines by using the following extremely dirty routine:


import pandas as pd

# Read the file in 10,000-row chunks and print the dtype pandas
# infers for the column in each chunk.
chunks = pd.read_csv('data.csv', chunksize=10000)
for i, chunk in enumerate(chunks):
    print("{} {}".format(i, chunk["Custom Dimension 02"].dtype))

I ran this and obtained:


0 int64
1 int64
2 int64
3 int64
4 int64
5 int64
6 object
7 int64
8 object
9 int64
10 int64

This told me that there was (at least) one problematic line between rows 60000 and 69999, and another between rows 80000 and 89999.


To localize them more precisely, you can use a smaller chunksize and print only the indices of the chunks that do not have the correct data type.

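A minimal sketch of that refinement, assuming a chunk size of 1000, that int64 is the expected dtype, and the same "Custom Dimension 02" column:

import pandas as pd
import numpy as np

# Re-read the file in small chunks and report only the chunks whose
# column did not come out as the expected int64 dtype.
chunks = pd.read_csv("data.csv", chunksize=1000)
for i, chunk in enumerate(chunks):
    if chunk["Custom Dimension 02"].dtype != np.int64:
        # Rows i*1000 .. i*1000 + len(chunk) - 1 contain at least one bad value.
        print("chunk {} (rows {}-{}) has dtype {}".format(
            i, i * 1000, i * 1000 + len(chunk) - 1,
            chunk["Custom Dimension 02"].dtype))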