Python pandas fillna with mode

Question

Asked by Jim

I have a data set with a column named Native Country that contains around 30000 records. Some values are missing, represented by NaN, so I thought I would fill them with the mode() value. I wrote something like this:

data['Native Country'].fillna(data['Native Country'].mode(), inplace=True)

However when I do a count of missing values:

for col_name in data.columns:
    print("column:", col_name, ".Missing:", data[col_name].isnull().sum())

It still reports the same number of NaN values for the column Native Country.

Answer 1

Answered by zipa

mode() returns a Series, so just take its first element:

data['Native Country'].fillna(data['Native Country'].mode()[0], inplace=True)

Or you can do the same with assignment:

data['Native Country'] = data['Native Country'].fillna(data['Native Country'].mode()[0])
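
As a quick check of the assignment form, here is a minimal sketch on made-up data. The toy values and the names missing_before, most_common and missing_after are assumptions for illustration, not part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the question's data set.
data = pd.DataFrame(
    {"Native Country": ["US", np.nan, "US", "India", np.nan, "Mexico"]}
)

missing_before = int(data["Native Country"].isna().sum())

# mode() returns a Series; [0] picks its first entry as a scalar.
most_common = data["Native Country"].mode()[0]

# Assign the result back instead of relying on inplace=True.
data["Native Country"] = data["Native Country"].fillna(most_common)

missing_after = int(data["Native Country"].isna().sum())
print(missing_before, missing_after)
```

The sketch assigns the result back because, depending on the pandas version and its copy-on-write settings, calling fillna(..., inplace=True) on a column selection may act on a temporary object and leave data unchanged.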

Answer 2

Answered by simone

Be careful: NaN may itself be the most frequent value in your column. In that case, you are replacing NaN with another NaN.
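
A small sketch of when this can happen, on made-up data. In current pandas, mode() ignores NaN by default (dropna=True), so the problem shows up when dropna=False is passed. A related case is a column that is entirely NaN, where mode() comes back empty. The variable names below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column in which NaN is the most frequent entry.
s = pd.Series(["US", "US", np.nan, np.nan, np.nan, "India"], dtype="object")

mode_default = s.mode()               # NaN is ignored by default
mode_with_nan = s.mode(dropna=False)  # NaN is counted like any other value

# Filling with a NaN mode changes nothing.
still_missing = int(s.fillna(mode_with_nan[0]).isna().sum())

# A column with no non-missing values has no mode at all, so guard before indexing.
all_missing = pd.Series([np.nan, np.nan])
modes = all_missing.mode()
if not modes.empty:
    all_missing = all_missing.fillna(modes.iloc[0])
```

Checking modes.empty before taking the first element avoids an error on columns that contain no observed values.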

Answer 3

Answered by Audris Lo?melis

If we fill in the missing values with fillna(df['colX'].mode()), the result of mode() is a Series, so fillna aligns it on the index and fills only the first few rows whose index labels match. At least, that is what happens when done as below:

fill_mode = lambda col: col.fillna(col.mode())
df.apply(fill_mode, axis=0)
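
To see the index alignment in isolation, here is a minimal sketch on a made-up column with three tied modes. The data and the names col, modes and filled are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 0, 21 and 99 each appear twice; NaN at labels 0, 1, 8, 9.
col = pd.Series([np.nan, np.nan, 0, 0, 21, 21, 99, 99, np.nan, np.nan])

# mode() returns the tied values as a Series labelled 0, 1, 2.
modes = col.mode()

# fillna with a Series matches on index labels: the NaN at label 0 takes
# modes[0], the NaN at label 1 takes modes[1]; labels 8 and 9 have no match.
filled = col.fillna(modes)

remaining = int(filled.isna().sum())
print(filled.tolist(), remaining)
```

Only the missing values whose labels happen to coincide with the labels of the mode Series are filled; the rest stay NaN.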

However, by simply taking the first value of the Series with fillna(df['colX'].mode()[0]), I think we risk introducing unintended bias in the data. If the sample is multimodal, taking just the first mode value makes an already biased imputation method worse. For example, we would take only 0 if [0, 21, 99] are the equally most frequent values, or fill missing values with False when True and False are equally frequent in a given column.

I don't have a clear-cut solution here. If using the mode is a necessity, assigning a random value drawn from all the tied modes could be one approach.
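
One possible sketch of that random-assignment idea follows. The helper fill_with_random_mode, the seed, and the toy data are all assumptions for illustration. The sketch only spreads the filled values across tied modes; it does not remove the bias of mode imputation itself.

```python
import numpy as np
import pandas as pd

def fill_with_random_mode(col, rng):
    """Fill NaNs by sampling uniformly from all modes of col (hypothetical helper)."""
    modes = col.mode()
    missing = col.isna()
    # Nothing to do if there is no mode (all NaN) or nothing is missing.
    if modes.empty or not missing.any():
        return col.copy()
    # Draw one mode value per missing entry.
    draws = rng.choice(modes.to_numpy(), size=int(missing.sum()))
    filled = col.copy()
    filled.loc[missing] = draws
    return filled

# Seed fixed only to make the sketch reproducible.
rng = np.random.default_rng(seed=0)

# Hypothetical column where 0, 21 and 99 are equally frequent.
col = pd.Series([0, 0, 21, 21, 99, 99, np.nan, np.nan, np.nan, np.nan])
filled = fill_with_random_mode(col, rng)
```

It could be applied per column with df.apply(lambda c: fill_with_random_mode(c, rng), axis=0), mirroring the fill_mode lambda above.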
