Pandas Fillna Mode
Original question: http://stackoverflow.com/questions/42789324/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors on Stack Overflow.
Asked by Jim
I have a data set with a column named Native Country that contains around 30,000 records. Some values are missing, represented by NaN, so I thought I would fill them with the mode() value. I wrote something like this:
data['Native Country'].fillna(data['Native Country'].mode(), inplace=True)
However when I do a count of missing values:
for col_name in data.columns:
    print("column:", col_name, ".Missing:", sum(data[col_name].isnull()))
It still reports the same number of NaN values for the column Native Country.
Answer by zipa
Just take the first element of the Series that mode() returns:
data['Native Country'].fillna(data['Native Country'].mode()[0], inplace=True)
Or you can do the same with assignment:
data['Native Country'] = data['Native Country'].fillna(data['Native Country'].mode()[0])
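As a minimal sketch of the assignment form, the snippet below builds a small made-up frame (the values are hypothetical, not the asker's data) and checks the missing count afterwards. The assignment form is also the safer of the two in newer pandas versions, where calling fillna(..., inplace=True) on a selected column may not update the parent frame.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the asker's data set.
data = pd.DataFrame({
    "Native Country": ["United-States", np.nan, "Mexico", "United-States", np.nan]
})

# mode() returns a Series; [0] picks its first (here: only) value.
mode_value = data["Native Country"].mode()[0]

# Assign the filled column back instead of relying on inplace=True.
data["Native Country"] = data["Native Country"].fillna(mode_value)

missing_after = int(data["Native Country"].isnull().sum())
print("mode:", mode_value, "missing after fill:", missing_after)
```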
Answer by simone
Be careful, NaN may be the mode of your dataframe: in this case, you are replacing NaN with another NaN.
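Whether NaN can come out as the mode depends on how mode() is called. The sketch below, on a made-up numeric column, assumes a pandas version recent enough to have the dropna parameter of Series.mode (NaN is ignored by default there). It also shows that a column containing only NaN yields an empty result, so indexing it with [0] would fail.

```python
import numpy as np
import pandas as pd

# Hypothetical column where NaN is the most frequent entry.
col = pd.Series([np.nan, np.nan, np.nan, 1.0, 2.0, 1.0])

default_mode = col.mode()               # NaN is ignored by default
mode_with_nan = col.mode(dropna=False)  # NaN is counted like any other value

# A column with no observed values at all has no mode.
all_missing_mode = pd.Series([np.nan, np.nan]).mode()

print(default_mode.tolist())
print(mode_with_nan.tolist())
print(all_missing_mode.empty)
```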
Answer by Audris Lo?melis
If we fill in the missing values with fillna(df['colX'].mode()), since the result of mode() is a Series, it will only fill in the first couple of rows for the matching indices. At least if done as below:
fill_mode = lambda col: col.fillna(col.mode())
df.apply(fill_mode, axis=0)
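The index matching described above can be seen on a small made-up column. This is only an illustrative sketch: the Series that mode() returns is indexed 0, 1, ..., so fillna can only fill a missing row whose index label also appears in that result.

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing values at index 0, 3 and 6.
col = pd.Series([np.nan, 21, 21, np.nan, 99, 99, np.nan])

modes = col.mode()          # two modes, carried at index labels 0 and 1
filled = col.fillna(modes)  # fills only NaN rows whose label is 0 or 1

still_missing = int(filled.isnull().sum())
print(modes.tolist())
print(filled.tolist())
print("still missing:", still_missing)
```

Only the NaN at index 0 is filled here; the ones at index 3 and 6 have no matching label in the mode result and are left untouched. If the row at index 0 is not missing in the first place, nothing changes at all, which matches what the question reports.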
However, by simply taking the first value of the Series, fillna(df['colX'].mode()[0]), I think we risk introducing unintended bias in the data. If the sample is multimodal, taking just the first mode value makes the already biased imputation method worse. For example, taking only 0 if we have [0, 21, 99] as the equally most frequent values, or filling missing values with False when True and False values are equally frequent in a given column.
I don't have a clear-cut solution here. If using the mode is a necessity, one approach could be to assign each missing entry a random value drawn from all the modes (the equally most frequent values).
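One way the random assignment idea might look is sketched below. The helper name fill_with_random_mode and its seed parameter are hypothetical, introduced only for illustration, and drawing uniformly among the modes is an assumption about how "random" should be interpreted rather than an established method.

```python
import numpy as np
import pandas as pd

def fill_with_random_mode(col, seed=None):
    """Hypothetical helper: fill NaN with values drawn uniformly from the modes."""
    modes = col.mode()
    if modes.empty:
        # Every value is missing, so there is nothing to impute from.
        return col.copy()
    rng = np.random.default_rng(seed)
    mask = col.isnull()
    # One independent draw per missing entry.
    draws = rng.choice(modes.to_numpy(), size=int(mask.sum()))
    filled = col.copy()
    filled.loc[mask] = draws
    return filled

# Made-up multimodal column: 0, 21 and 99 are equally frequent.
col = pd.Series([0, 0, 21, 21, 99, 99, np.nan, np.nan, np.nan])
filled = fill_with_random_mode(col, seed=0)
print(filled.tolist())
```

Passing a seed keeps the imputation reproducible. The filled values are still not real observations, so the caveat about bias from mode imputation continues to apply.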