Python Pandas - Filling NaN in categorical data
Original source: http://stackoverflow.com/questions/32718639/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Pandas - filling NaNs in Categorical data
Asked by deega
I am trying to fill missing values (NaN) using the code below:
NAN_SUBSTITUTION_VALUE = 1
g = g.fillna(NAN_SUBSTITUTION_VALUE)
but I am getting the following error:
ValueError: fill value must be in categories.
Could anybody please shed some light on this error?
Answered by pacholik
Once you create categorical data, you can only insert values that are already among its categories.
>>> df
   ID  value
0   0     20
1   1     43
2   2     45
>>> df["cat"] = df["value"].astype("category")
>>> df
   ID  value cat
0   0     20  20
1   1     43  43
2   2     45  45
>>> df.loc[1, "cat"] = np.nan
>>> df
   ID  value  cat
0   0     20   20
1   1     43  NaN
2   2     45   45
>>> df.fillna(1)
ValueError: fill value must be in categories
>>> df.fillna(43)
   ID  value cat
0   0     20  20
1   1     43  43
2   2     45  45
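To make the rule above concrete, here is a minimal sketch that checks whether a fill value is already a category before calling fillna. The helper name can_fill is my own, not part of pandas, and the data just mirrors the session above. The exception type seems to depend on the pandas version (ValueError in the thread, TypeError in more recent releases as far as I can tell), so the sketch catches both.

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the session above.
df = pd.DataFrame({"ID": [0, 1, 2], "value": [20, 43, 45]})
df["cat"] = df["value"].astype("category")
df.loc[1, "cat"] = np.nan  # the category 43 still exists, only the value is gone


def can_fill(series, fill_value):
    """Return True if fill_value is already one of the series' categories."""
    return fill_value in series.cat.categories


print(can_fill(df["cat"], 1))   # False
print(can_fill(df["cat"], 43))  # True

# Filling with a value outside the categories is rejected.
try:
    df["cat"].fillna(1)
    raised = False
except (TypeError, ValueError) as exc:
    raised = True
    print(type(exc).__name__, exc)

# Filling with an existing category works.
restored = df["cat"].fillna(43)
```

Note that setting a cell to NaN does not remove its category, which is why 43 is still accepted here.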
Answered by Gunnar Cheng
Add the category before you fill:
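# register the fill value as a category first
g = g.cat.add_categories([1])
# fillna returns a new object, so assign the result back
g = g.fillna(1)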
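One thing to watch with this approach: add_categories raises a ValueError if the category is already present, so calling it unconditionally can fail on a second run. Below is a small sketch of a guarded version. The helper fill_with_new_category and the sample series are assumptions of mine, standing in for the asker's g.

```python
import numpy as np
import pandas as pd

NAN_SUBSTITUTION_VALUE = 1

# Hypothetical stand-in for the asker's g: integer categories with one missing value.
g = pd.Series([20, 43, 45]).astype("category")
g.iloc[1] = np.nan


def fill_with_new_category(series, value):
    """Fill NaN in a categorical series, adding value as a category if needed."""
    if value not in series.cat.categories:
        # Only add the category when it is missing; adding an existing one raises.
        series = series.cat.add_categories([value])
    return series.fillna(value)


filled = fill_with_new_category(g, NAN_SUBSTITUTION_VALUE)
print(filled.tolist())
print(list(filled.cat.categories))
```

The original g is left untouched, since both add_categories and fillna return new objects here.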
Answered by bluenote10
Your question leaves out the important detail of what g is, in particular that it has a categorical dtype. I assume it is something like this:
g = pd.Series(["A", "B", "C", np.nan], dtype="category")
The problem you are experiencing is that fillna requires a value that already exists as a category. For instance, g.fillna("A") would work, but g.fillna("D") fails. To fill the series with a new value you can do:
g_without_nan = g.cat.add_categories("D").fillna("D")
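If you have several categorical columns in a DataFrame, the same idea can be applied column by column. This is only a sketch: the frame, its column names and the use of "D" as a shared placeholder are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two categorical columns and one numeric column.
df = pd.DataFrame({
    "letter": pd.Series(["A", "B", "C", np.nan], dtype="category"),
    "shirt": pd.Series(["S", np.nan, "L", "S"], dtype="category"),
    "count": [1.0, 2.0, np.nan, 4.0],
})

placeholder = "D"

# Touch only the categorical columns; numeric NaN values are left alone.
for col in df.select_dtypes(include="category").columns:
    if placeholder not in df[col].cat.categories:
        df[col] = df[col].cat.add_categories(placeholder)
    df[col] = df[col].fillna(placeholder)

print(df)
```

The columns keep their categorical dtype, with "D" appended to each column's categories.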
Answered by Victor Zuanazzi
Sometimes you may want to replace the NaNs with values already present in your dataset. In that case you can use this:
import numpy as np

# df is an existing DataFrame and field is the name of the column to fill
# collect the values actually present in the column, dropping NaN and empty strings
observed = df[field].dropna()
observed = observed[observed != ""].to_numpy()
# replace every missing value of df[field] with a randomly drawn observed value
missing = df[field].isna()
df.loc[missing, field] = np.random.choice(observed, size=missing.sum())
It works quite efficiently.
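For completeness, here is a self-contained sketch of the same idea on a categorical Series, using a seeded NumPy Generator so the result is repeatable. The function name fill_from_observed and the sample data are my own assumptions. Since every drawn value is already a category, no categories need to be added.

```python
import numpy as np
import pandas as pd


def fill_from_observed(series, rng):
    """Return a copy of series with NaN replaced by randomly drawn observed values."""
    observed = series.dropna().to_numpy()
    missing = series.isna()
    filled = series.copy()
    # Drawing from the observed values (with repeats) follows their relative frequencies.
    filled.loc[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled


rng = np.random.default_rng(0)  # fixed seed, only to make the sketch repeatable
s = pd.Series(["A", "B", "A", np.nan, np.nan], dtype="category")
filled = fill_from_observed(s, rng)

print(filled.tolist())
print(list(filled.cat.categories))
```

Because values are drawn from the observed entries rather than from the list of categories, frequent values are more likely to be picked, which roughly preserves the column's distribution.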