Notice: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/39666308/

Time: 2020-09-14 02:05:00  Source: igfitidea

pd.read_csv by default treats integers like floats

Tags: python, csv, pandas, integer

Asked by codingknob

I have a CSV that looks like this (headers = first row):

name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01

When I run:

df = pd.read_csv('file.csv')

Columns a and b have a .0 attached to the end, like so:

df.head()

name,a,a1,b,b1
arnold,300311.0,arnld01,300311.0,arnld01
sam,300713.0,sam01,300713.0,sam01

Columns a and b are integers or blanks, so why does pd.read_csv() treat them like floats, and how do I ensure they are read as integers?

Answered by Andy

As root mentioned in the comments, this is a limitation of pandas (and NumPy). NaN is a float, and the empty values in your CSV are read as NaN, so any column that contains a blank is stored as float.
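A minimal sketch of why this happens, assuming Python 3 and a reasonably recent pandas. The two inline CSV strings are made up for illustration. A column with no blanks is read as an integer dtype, while a single blank cell forces the whole column to float64:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data: the second CSV has one blank cell in column "a".
complete = "name,a\narnold,300311\nsam,300713\n"
with_blank = "name,a\narnold,300311\nsam,300713\ntest,\n"

df_complete = pd.read_csv(StringIO(complete))
df_blank = pd.read_csv(StringIO(with_blank))

print(df_complete["a"].dtype)      # integer dtype: no missing values
print(df_blank["a"].dtype)         # float64: the blank became NaN
print(df_blank["a"].isna().sum())  # number of blanks read as NaN
```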

This is listed in the gotchas section of the pandas documentation as well.

You can work around this in a few ways.

For the examples below I used the following to import the data. Note that I added a row with an empty value in columns a and b:

import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was "from StringIO import StringIO"

data = """name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
test,,test01,,test01"""

df = pd.read_csv(StringIO(data), sep=",")

Drop NaN rows

Your first option is to drop rows that contain a NaN value. The downside is that you lose the entire row. After loading your data into a DataFrame, run this:

df.dropna(inplace=True)
df.a = df.a.astype(int)
df.b = df.b.astype(int)

This drops every row that contains a NaN from the DataFrame, then converts columns a and b to int:

>>> df.dtypes
name    object
a        int32
a1      object
b        int32
b1      object
dtype: object

>>> df
     name       a       a1       b       b1
0  arnold  300311  arnld01  300311  arnld01
1     sam  300713    sam01  300713    sam01
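If only some columns matter, a possible variation is to restrict the drop to those columns with the `subset` parameter of `dropna`, so rows with blanks elsewhere are kept. This is a sketch rather than part of the original answer; the sample data and the extra `note` column are assumptions added for illustration:

```python
from io import StringIO

import pandas as pd

# Hypothetical data: "note" is blank in the first row, "a" and "b" in the last.
data = """name,a,b,note
arnold,300311,300311,
sam,300713,300713,ok
test,,,ok"""

df = pd.read_csv(StringIO(data))

# Drop only rows where a or b is missing; a blank "note" does not matter.
cleaned = df.dropna(subset=["a", "b"]).copy()
cleaned[["a", "b"]] = cleaned[["a", "b"]].astype(int)

print(cleaned.dtypes)
print(cleaned)
```

The `.copy()` keeps the conversion from touching the original `df`, which still holds the float columns.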

Fill NaN with placeholder data

This option replaces all your NaN values with a throwaway value. You need to choose that value yourself; for this test, I used -999999. This allows you to keep the rest of the data, convert it to int, and make it obvious which data is invalid. You'll be able to filter these rows out if you do calculations on the columns later.

df.fillna(-999999, inplace=True)
df.a = df.a.astype(int)
df.b = df.b.astype(int)

This produces a dataframe like so:

>>> df.dtypes
name    object
a        int32
a1      object
b        int32
b1      object
dtype: object

>>> df
     name       a       a1       b       b1
0  arnold  300311  arnld01  300311  arnld01
1     sam  300713    sam01  300713    sam01
2    test -999999   test01 -999999   test01
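Filtering the placeholder rows out before a calculation might look like the sketch below. It reuses the sample data and the -999999 sentinel from this answer; the constant name `PLACEHOLDER` and the choice to fill only columns a and b are assumptions for illustration:

```python
from io import StringIO

import pandas as pd

PLACEHOLDER = -999999  # same sentinel value as used in the answer

data = """name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
test,,test01,,test01"""

df = pd.read_csv(StringIO(data))

# Fill and convert only the numeric columns of interest.
df[["a", "b"]] = df[["a", "b"]].fillna(PLACEHOLDER).astype(int)

# Exclude placeholder rows before computing anything on column a.
valid = df[df["a"] != PLACEHOLDER]
print(valid["a"].mean())
```

Forgetting the filter would silently pull the mean far below the real values, which is the main risk of a sentinel approach.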

Leave the float values

Finally, another choice is to leave the float values (and NaN) as they are and not worry about the non-integer data type.
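Not part of the original answer: pandas 0.24 and later also provide a nullable integer extension dtype, `Int64` (capital I), which can hold missing values. A minimal sketch, assuming such a pandas version, that requests it directly in `read_csv`:

```python
from io import StringIO

import pandas as pd

data = """name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
test,,test01,,test01"""

# "Int64" (capital I) is the nullable integer dtype; blanks become <NA>.
df = pd.read_csv(StringIO(data), dtype={"a": "Int64", "b": "Int64"})

print(df.dtypes)
print(df)
```

This keeps every row and avoids a sentinel value, at the cost of using an extension dtype that some downstream NumPy code may not expect.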

Answered by user2515138

Converting float to integer values using pandas read_csv (working)

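import pandas as pd

# Importing the dataset
dataset = pd.read_csv('WorldWarWeather_Data.csv')
X = dataset.iloc[:, 3:11].values  # feature columns as a NumPy array
y = dataset.iloc[:, 2].values     # target column as a NumPy array

# Note: this assumes the selected columns contain no missing values;
# astype(int) on a NumPy array does not handle NaN safely.
X = X.astype(int)
y = y.astype(int)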
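The conversion above is only safe when the selected columns have no missing values. A small sketch of a guard, using a hypothetical helper `to_int_if_complete` (not a pandas function) and made-up data:

```python
import numpy as np
import pandas as pd


def to_int_if_complete(series: pd.Series) -> pd.Series:
    """Hypothetical helper: cast to int only when no value is missing."""
    if series.isna().any():
        return series  # leave as float; NaN cannot be stored as int
    return series.astype(int)


complete = pd.Series([1.0, 2.0, 3.0])
with_gap = pd.Series([1.0, np.nan, 3.0])

print(to_int_if_complete(complete).dtype)  # integer dtype
print(to_int_if_complete(with_gap).dtype)  # still float64
```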