Python Pandas fillna with mode

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/42789324/

Time: 2020-08-19 22:09:29  Source: igfitidea

Pandas Fillna Mode

python pandas fillna

Asked by Jim

I have a data set with a column named Native Country that contains around 30000 records. Some values are missing, represented by NaN, so I thought I would fill them with the mode() value. I wrote something like this:

data['Native Country'].fillna(data['Native Country'].mode(), inplace=True)

However, when I count the missing values:

for col_name in data.columns:
    print("column:", col_name, ".Missing:", sum(data[col_name].isnull()))
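The per-column loop above can also be written as a single vectorized call. This is a minimal sketch on a small made-up frame (the sample values and the Age column are assumptions, not the asker's data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data set
data = pd.DataFrame({
    "Native Country": ["United-States", np.nan, "Mexico", np.nan],
    "Age": [39, 50, np.nan, 28],
})

# isnull() marks the missing cells; sum() counts them per column
missing_per_column = data.isnull().sum()
print(missing_per_column)
```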

It still reports the same number of NaN values for the Native Country column.

Answered by zipa

Just take the first element of the Series:

data['Native Country'].fillna(data['Native Country'].mode()[0], inplace=True)

Or you can do the same with assignment:

data['Native Country'] = data['Native Country'].fillna(data['Native Country'].mode()[0])
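To see why the original call left the NaN count unchanged, here is a minimal sketch on a made-up column (the sample values are assumptions). mode() returns a Series, and fillna aligns a Series argument on index labels, so only a NaN at label 0 could be filled. Passing the first element as a scalar fills every NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the NaN values sit at index labels 2 and 4
data = pd.DataFrame({
    "Native Country": ["United-States", "Mexico", np.nan, "United-States", np.nan]
})
col = data["Native Country"]

# mode() returns a Series (here with the single index label 0), not a scalar
mode_series = col.mode()

# fillna aligns on index labels, so the NaN values at labels 2 and 4 are untouched
unchanged = col.fillna(mode_series)

# Taking the first mode gives a scalar, which fills every NaN
filled = col.fillna(mode_series.iloc[0])

# Assigning the result back updates the DataFrame itself
data["Native Country"] = filled
print(unchanged.isnull().sum(), filled.isnull().sum())
```

Here .iloc[0] is used instead of [0]; the two are equivalent for the default index that mode() returns. The assignment form also avoids relying on inplace=True acting on a column selection, which may not update the parent DataFrame in newer pandas versions.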

Answered by simone

Be careful: NaN may be the mode of your column. In that case, you are replacing NaN with another NaN.
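A short sketch of when this caveat applies, on made-up data. In recent pandas versions, mode() skips NaN by default (dropna=True), so NaN is returned as the mode only when dropna=False is passed. A column that is entirely NaN yields an empty mode Series instead, so indexing its first element would fail. The helper fill_with_first_mode is a hypothetical name introduced for illustration.

```python
import numpy as np
import pandas as pd

col = pd.Series(["Mexico", np.nan, np.nan, np.nan, "Cuba"])

# Default dropna=True: NaN is ignored when counting frequencies
default_mode = col.mode()

# dropna=False: NaN is counted too, and here it is the most frequent value
mode_with_nan = col.mode(dropna=False)

def fill_with_first_mode(column):
    """Fill NaN with the first mode; leave the column as-is if no mode exists."""
    modes = column.mode()
    if modes.empty:  # happens when every value is NaN
        return column
    return column.fillna(modes.iloc[0])

all_missing = pd.Series([np.nan, np.nan])
print(fill_with_first_mode(col).isnull().sum())
print(fill_with_first_mode(all_missing).isnull().sum())
```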

Answered by Audris Lo?melis

If we fill in the missing values with fillna(df['colX'].mode()), then, because the result of mode() is a Series, only the first few rows are filled: those whose index labels match the index of the mode Series. At least that is what happens when done as below:

fill_mode = lambda col: col.fillna(col.mode())
df.apply(fill_mode, axis=0)
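The partial fill can be reproduced with a small made-up column (the values are assumptions). The column has two modes, so mode() returns a Series with index labels 0 and 1, and only the NaN values at those labels are replaced:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two equally frequent values and three NaN entries
df = pd.DataFrame({"colX": [np.nan, np.nan, 1.0, 1.0, 2.0, 2.0, np.nan]})

fill_mode = lambda col: col.fillna(col.mode())
result = df.apply(fill_mode, axis=0)

# Labels 0 and 1 are filled from the mode Series; label 6 stays NaN
print(result)
```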

However, by simply taking the first value of the Series with fillna(df['colX'].mode()[0]), I think we risk introducing unintended bias into the data. If the sample is multimodal, taking just the first mode makes an already biased imputation method worse. For example, we would take only 0 if [0, 21, 99] are the equally most frequent values, or fill missing values with False when True and False are equally frequent in a given column.

I don't have a clear-cut solution here. If using the mode is a necessity, one approach could be to assign a value chosen at random from all the tied modes.
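One way the random-assignment idea might look, as a sketch rather than a recommended imputation method. The function name fill_with_random_mode and its seed parameter are hypothetical, and each missing cell independently gets one of the tied modes:

```python
import numpy as np
import pandas as pd

def fill_with_random_mode(col, seed=None):
    """Fill each NaN with a value drawn at random from the column's modes."""
    modes = col.mode()
    if modes.empty:  # no mode exists when every value is NaN
        return col
    rng = np.random.default_rng(seed)
    missing = col.isnull()
    filled = col.copy()
    # Draw one mode per missing cell so ties are spread across the fills
    filled.loc[missing] = rng.choice(modes.to_numpy(), size=int(missing.sum()))
    return filled

# 0, 21 and 99 are equally frequent, as in the example above
col = pd.Series([0, 0, 21, 21, 99, 99, np.nan, np.nan, np.nan])
filled = fill_with_random_mode(col, seed=0)
print(filled)
```

Passing a seed makes the fill reproducible. Without one, repeated runs can assign different modes to the same rows.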