pandas pd.read_csv by default treats integers like floats
Note: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/39666308/
Asked by codingknob
I have a CSV file that looks like this (the first row is the header):
name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
When I run:
df = pd.read_csv('file.csv')
Columns a and b have a .0 attached to the end, like so:
df.head()
name,a,a1,b,b1
arnold,300311.0,arnld01,300311.0,arnld01
sam,300713.0,sam01,300713.0,sam01
Columns a and b are integers or blanks, so why does pd.read_csv() treat them like floats, and how do I ensure they are integers on the read?
Answered by Andy
As root mentioned in the comments, this is a limitation of Pandas (and NumPy). NaN is a float, and the empty values in your CSV are read as NaN, so a column that contains any of them is stored as float for every row.
This is listed in the gotchas of pandas as well.
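To see the effect in isolation, here is a minimal sketch. The two inline CSV strings are made up for illustration; the only difference between them is one blank cell in column a.

```python
import io

import pandas as pd

# Hypothetical data: the same column, once with a blank cell and once without.
with_blank = "name,a\narnold,300311\nsam,300713\ntest,\n"
no_blank = "name,a\narnold,300311\nsam,300713\n"

df_blank = pd.read_csv(io.StringIO(with_blank))
df_full = pd.read_csv(io.StringIO(no_blank))

# The blank cell becomes NaN, which forces the whole column to float.
print(df_blank["a"].dtype)
# With every cell filled in, the column is read as an integer type.
print(df_full["a"].dtype)
```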
You can work around this in a few ways.
For the examples below I used the following to import the data. Note that I added a row with an empty value in columns a and b:
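import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`

# Sample data: the last row has empty values in columns a and b
data = """name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
test,,test01,,test01"""
df = pd.read_csv(StringIO(data), sep=",")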
Drop NaN rows
Your first option is to drop rows that contain this NaN value. The downside of this is that you lose the entire row. After getting your data into a dataframe, run this:
df.dropna(inplace=True)
df.a = df.a.astype(int)
df.b = df.b.astype(int)
This drops every row that contains a NaN from the dataframe, then converts column a and column b to int:
>>> df.dtypes
name    object
a        int32
a1      object
b        int32
b1      object
dtype: object
>>> df
     name       a       a1       b       b1
0  arnold  300311  arnld01  300311  arnld01
1     sam  300713    sam01  300713    sam01
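Because a bare dropna() discards a row when any column is missing, it can throw away more than intended. The sketch below restricts the check to columns a and b using the subset parameter. The extra row named nob1, which has an empty b1 only, is a hypothetical addition to the sample data.

```python
import io

import pandas as pd

# Sample data from the answer plus one hypothetical row ("nob1") with an empty b1.
data = """name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
test,,test01,,test01
nob1,300999,nob101,300999,"""
df = pd.read_csv(io.StringIO(data))

# Drop a row only when a or b is missing; a gap in b1 alone is tolerated.
cleaned = df.dropna(subset=["a", "b"]).astype({"a": int, "b": int})
print(cleaned)
```

Here the test row is removed, while the nob1 row survives with integer values in a and b.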
Fill NaN with placeholder data
This option replaces all your NaN values with a throwaway value. That value is something you need to determine. For this test, I made it -999999. This allows us to keep the rest of the data, convert it to an int, and make it obvious which data is invalid. You'll be able to filter these rows out if you are making calculations based on the columns later.
df.fillna(-999999, inplace=True)
df.a = df.a.astype(int)
df.b = df.b.astype(int)
This produces a dataframe like so:
>>> df.dtypes
name    object
a        int32
a1      object
b        int32
b1      object
dtype: object
>>> df
     name        a       a1        b       b1
0  arnold   300311  arnld01   300311  arnld01
1     sam   300713    sam01   300713    sam01
2    test  -999999   test01  -999999   test01
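The filtering step mentioned above could look roughly like this. It is a sketch under two assumptions: the constant name SENTINEL is my own, and the fill is limited to columns a and b by passing a dict to fillna, so text columns are left alone.

```python
import io

import pandas as pd

# Placeholder value from the answer; choose one that cannot occur in real data.
SENTINEL = -999999

data = """name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
test,,test01,,test01"""
df = pd.read_csv(io.StringIO(data))

# Fill only the numeric columns, then cast them to int.
df = df.fillna({"a": SENTINEL, "b": SENTINEL}).astype({"a": int, "b": int})

# Exclude placeholder rows before computing anything from column a.
valid = df[df["a"] != SENTINEL]
print(valid["a"].mean())
```

Without the filter, the placeholder would be averaged in like any other number and silently distort the result.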
Leave the float values
Finally, another choice is to leave the float values (and NaN) and not worry about the non-integer data type.
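This answer predates the nullable integer dtype that pandas added in version 0.24. On a recent pandas it may be worth trying the sketch below, which asks read_csv for the "Int64" dtype (capital I) so that blanks are stored as missing values while the rest stay integers. This is not part of the original answer, and behavior on older versions is not guaranteed.

```python
import io

import pandas as pd

data = """name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
test,,test01,,test01"""

# "Int64" is pandas' nullable integer dtype, distinct from NumPy's "int64".
df = pd.read_csv(io.StringIO(data), dtype={"a": "Int64", "b": "Int64"})

print(df.dtypes)
print(df)
```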
Answered by user2515138
Converting float values to integers using Pandas read_csv (working)
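import pandas as pd

# Importing the dataset
dataset = pd.read_csv('WorldWarWeather_Data.csv')
# Columns 3 to 10 as the feature array, column 2 as the target array
X = dataset.iloc[:, 3:11].values
y = dataset.iloc[:, 2].values
# Cast to int (assumes these columns contain no missing values)
X = X.astype(int)
y = y.astype(int)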
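Since WorldWarWeather_Data.csv is not available here, the following sketch applies the same pattern to a small inline CSV. The column names and values are hypothetical, and every cell is filled in, which is the situation in which a plain astype(int) is a safe cast.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV file; no cell is empty.
csv_text = "id,label,x1,x2\n1,0.0,1.0,2.0\n2,1.0,3.0,4.0\n"
dataset = pd.read_csv(io.StringIO(csv_text))

# Pull the feature and target columns out as NumPy arrays (float at this point).
X = dataset.iloc[:, 2:4].values
y = dataset.iloc[:, 1].values

# Cast to integers after reading; the fractional part, if any, is truncated.
X = X.astype(int)
y = y.astype(int)
print(X)
print(y)
```

Note that this converts after the read rather than during it, and it does not address columns that contain blanks.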