Python: Convert a Pandas column containing NaNs to dtype `int`
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license, link to the original, and attribute the content to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/21287624/
Convert Pandas column containing NaNs to dtype `int`
Asked by Zhubarb
I read data from a .csv file to a Pandas dataframe as below. For one of the columns, namely id, I want to specify the column type as int. The problem is the id series has missing/empty values.
When I try to cast the id column to integer while reading the .csv, I get:
df= pd.read_csv("data.csv", dtype={'id': int})
error: Integer column has NA values
Alternatively, I tried to convert the column type after reading as below, but this time I get:
df= pd.read_csv("data.csv")
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer
How can I tackle this?
Accepted answer by Andy Hayden
The lack of a NaN representation in integer columns is a pandas "gotcha".
The usual workaround is to simply use floats.
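A minimal sketch of that workaround, assuming the data.csv file and id column from the question; float64 can represent NaN, so the cast succeeds:

import pandas as pd

df = pd.read_csv("data.csv", dtype={'id': float})  # float64 can hold NaN, int64 cannot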
Answer by gboffi
If you can modify your stored data, use a sentinel value for the missing id. A common use case, inferred from the column name, is that id is an integer strictly greater than zero, so you could use 0 as a sentinel value, which lets you write
if row['id']:
    regular_process(row)
else:
    special_process(row)
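A minimal sketch of applying that sentinel when loading the data (data.csv is the file from the question, and 0 is the sentinel value assumed above):

import pandas as pd

df = pd.read_csv("data.csv")
df['id'] = df['id'].fillna(0).astype(int)  # 0 marks a missing id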
Answer by Justin Malinchak
Assuming your DateColumn, formatted like 3312018.0, should be converted to 03/31/2018 as a string, and some records are missing or 0:
df['DateColumn'] = df['DateColumn'].astype(int)    # drop the trailing .0
df['DateColumn'] = df['DateColumn'].astype(str)    # treat the digits as text
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.zfill(8))    # pad to 8 digits (MMDDYYYY)
df.loc[df['DateColumn'] == '00000000', 'DateColumn'] = '01011980'  # missing/0 dates get a placeholder
df['DateColumn'] = pd.to_datetime(df['DateColumn'], format="%m%d%Y")
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.strftime('%m/%d/%Y'))
Answer by hibernado
My use case is munging data prior to loading into a DB table:
df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)
Remove NaNs, convert to int, convert to str, and then reinsert NaNs.
It's not pretty but it gets the job done!
Answer by Neuneck
I ran into this issue working with pyspark. As this is a Python frontend for code running on a JVM, it requires type safety, and using float instead of int is not an option. I worked around the issue by wrapping the pandas pd.read_csv in a function that will fill user-defined columns with user-defined fill values before casting them to the required type. Here is what I ended up using:
import pandas as pd

def custom_read_csv(file_path, custom_dtype=None, fill_values=None, **kwargs):
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    else:
        assert 'dtype' not in kwargs.keys()
        df = pd.read_csv(file_path, dtype={}, **kwargs)
        for col, typ in custom_dtype.items():
            if fill_values is None or col not in fill_values.keys():
                fill_val = -1
            else:
                fill_val = fill_values[col]
            df[col] = df[col].fillna(fill_val).astype(typ)
        return df
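A usage sketch; the file name, column name, and fill value here are illustrative, not from the original answer:

df = custom_read_csv("data.csv",
                     custom_dtype={'id': int},
                     fill_values={'id': 0})  # missing ids become 0 before the cast to int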
Answer by kamran kausar
First remove the rows that contain NaN. Then do the integer conversion on the remaining rows. At last, insert the removed rows again. Hope it will work.
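A minimal sketch of that idea, using the id column from the question. Note that once the NaN rows are concatenated back, pandas upcasts the column again, so the integer dtype only holds for the intermediate frame:

import pandas as pd

df = pd.read_csv("data.csv")
missing = df[df['id'].isna()]               # set aside the rows with NaN ids
present = df.dropna(subset=['id']).copy()   # rows that can be converted
present['id'] = present['id'].astype(int)   # integer conversion on the remaining rows
df = pd.concat([present, missing]).sort_index()  # reinsert the removed rows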
Answer by elomage
You could use .dropna() if it is OK to drop the rows with the NaN values.
df = df.dropna(subset=['id'])
Alternatively, use .fillna() and .astype() to replace the NaN with values and convert them to int.
I ran into this problem when processing a CSV file with large integers, while some of them were missing (NaN). Using float as the type was not an option, because I might lose precision.
My solution was to use str as the intermediate type. Then you can convert the string to int as you please later in the code. I replaced NaN with 0, but you could choose any value.
df = pd.read_csv(filename, dtype={'id':str})
df["id"] = df["id"].fillna("0").astype(int)
For illustration, here is an example of how floats may lose precision:
s = "12345678901234567890"
f = float(s)
i = int(f)
i2 = int(s)
print(f, i, i2)
And the output is:
1.2345678901234567e+19 12345678901234567168 12345678901234567890
Answer by jmenglund
If you absolutely want to combine integers and NaNs in a column, you can use the 'object' data type:
df['col'] = (
    df['col'].fillna(0)
             .astype(int)
             .astype(object)
             .where(df['col'].notnull())
)
This will replace NaNs with an integer (doesn't matter which), convert to int, convert to object and finally reinsert NaNs.
Answer by Corbin
Most solutions here tell you how to use a placeholder integer to represent nulls. That approach isn't helpful if you can't be certain that the placeholder integer won't show up in your source data, though. My method will format floats without their decimal values and convert nulls to None. The result is an object dtype that will look like an integer field with null values when loaded into a CSV.
keep_df[col] = keep_df[col].apply(lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x)))
Answer by jezrael
In version 0.24+, pandas gained the ability to hold integer dtypes with missing values.
Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers, and will not be inferred; you must explicitly pass the dtype into array() or Series:
arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
pd.Series(arr)
0      1
1      2
2    NaN
dtype: Int64
To convert a column to nullable integers, use:
df['myCol'] = df['myCol'].astype('Int64')
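In recent pandas versions you can also request the nullable dtype directly when reading the file; a small sketch, assuming the data.csv file and id column from the question:

df = pd.read_csv("data.csv", dtype={'id': 'Int64'})  # capital "I" selects the nullable integer dtype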

