pandas 如何删除Python中缺失值过多的列

Question

提问by HHH

I'm working on a machine learning problem in which there are many missing values in the features. There are 100's of features and I would like to remove those features that have too many missing values (it can be features with more than 80% missing values). How can I do that in Python.

我正在研究一个机器学习问题，其中特征中有许多缺失值。有 100 个特征，我想删除那些缺失值过多的特征（可能是缺失值超过 80% 的特征）。我怎样才能在 Python 中做到这一点。

p.s. my data is a Pandas dataframe.

ps 我的数据是 Pandas 数据框。

Answer 1

回答by MaxU

Demo:

演示：

Setup:

设置：

In [105]: df = pd.DataFrame(np.random.choice([2,np.nan], (20, 5), p=[0.2, 0.8]), columns=list('abcde'))

In [106]: df
Out[106]:
      a    b    c    d    e
0   NaN  2.0  NaN  NaN  NaN
1   NaN  NaN  2.0  NaN  2.0
2   NaN  2.0  NaN  NaN  NaN
3   NaN  NaN  NaN  NaN  2.0
4   NaN  2.0  2.0  NaN  NaN
5   NaN  NaN  NaN  NaN  NaN
6   NaN  2.0  NaN  NaN  NaN
7   2.0  2.0  NaN  NaN  NaN
8   2.0  2.0  NaN  NaN  NaN
9   NaN  NaN  NaN  NaN  NaN
10  NaN  2.0  2.0  NaN  2.0
11  NaN  NaN  NaN  2.0  NaN
12  2.0  NaN  NaN  2.0  NaN
13  NaN  NaN  NaN  2.0  NaN
14  NaN  NaN  NaN  2.0  2.0
15  NaN  NaN  NaN  NaN  NaN
16  NaN  2.0  NaN  NaN  NaN
17  2.0  NaN  NaN  NaN  2.0
18  NaN  NaN  NaN  2.0  NaN
19  NaN  2.0  NaN  2.0  NaN

In [107]: df.isnull().mean()
Out[107]:
a    0.80
b    0.55
c    0.85
d    0.70
e    0.75
dtype: float64

Solution:

解决方案：

In [108]: df.columns[df.isnull().mean() < 0.8]
Out[108]: Index(['b', 'd', 'e'], dtype='object')

In [109]: df[df.columns[df.isnull().mean() < 0.8]]
Out[109]:
      b    d    e
0   2.0  NaN  NaN
1   NaN  NaN  2.0
2   2.0  NaN  NaN
3   NaN  NaN  2.0
4   2.0  NaN  NaN
5   NaN  NaN  NaN
6   2.0  NaN  NaN
7   2.0  NaN  NaN
8   2.0  NaN  NaN
9   NaN  NaN  NaN
10  2.0  NaN  2.0
11  NaN  2.0  NaN
12  NaN  2.0  NaN
13  NaN  2.0  NaN
14  NaN  2.0  2.0
15  NaN  NaN  NaN
16  2.0  NaN  NaN
17  NaN  NaN  2.0
18  NaN  2.0  NaN
19  2.0  2.0  NaN

Answer 2

回答by Anton vBR

Following MaxU example this is the option for filtering rows:

以下 MaxU 示例是用于过滤行的选项：

df = pd.DataFrame(np.random.choice([2,np.nan], (5,10), p=[0.2, 0.8]), columns=list('abcdefghij'))

a   b   c   d   e   f   g   h   i   j
0   NaN NaN NaN NaN NaN 2.0 NaN NaN NaN 2.0
1   NaN 2.0 NaN 2.0 NaN NaN 2.0 NaN NaN 2.0
2   NaN NaN 2.0 NaN 2.0 NaN 2.0 2.0 NaN NaN
3   NaN NaN NaN NaN NaN 2.0 NaN NaN NaN 2.0
4   2.0 2.0 2.0 NaN NaN NaN NaN NaN NaN NaN

Rows

行

df.loc[df.isnull().mean(axis=1).lt(0.8)]

    a   b   c   d   e   f   g   h   i   j
1   NaN 2.0 NaN 2.0 NaN NaN 2.0 NaN NaN 2.0
2   NaN NaN 2.0 NaN 2.0 NaN 2.0 2.0 NaN NaN
4   2.0 2.0 2.0 NaN NaN NaN NaN NaN NaN NaN

Answer 3

回答by singmotor

You can use pandas dropna.

您可以使用Pandas dropna。

limitPer = len(yourdf) * .80
yourdf = yourdf.dropna(thresh=limitPer,axis=1)

Answer 4

回答by vealkind

To generalize within pandas you can do the following to calculate the percent of values in a column with missing values. From those columns you can filter out the features with more than 80% NULL values and then drop those columns from the DataFrame.

要在 Pandas 中进行概括，您可以执行以下操作来计算具有缺失值的列中值的百分比。从这些列中，您可以过滤掉超过 80% NULL 值的特征，然后从 DataFrame 中删除这些列。

pct_null = df.isnull().sum() / len(df)
missing_features = pct_null[pct_null > 0.80].index
df.drop(missing_features, axis=1, inplace=True)

Answer 5

回答by Suhas_Pote

Here is a simple function which you can use directly by passing dataframe and threshold

这是一个简单的函数，您可以通过传递数据框和阈值来直接使用它

def rmissingvaluecol(dff,threshold):
    l = []
    l = list(dff.drop(dff.loc[:,list((100*(dff.isnull().sum()/len(dff.index))>=threshold))].columns, 1).columns.values)
    print("# Columns having more than %s percent missing values:"%threshold,(dff.shape[1] - len(l)))
    print("Columns:\n",list(set(list((dff.columns.values))) - set(l)))
    return l


rmissingvaluecol(df,80) #Here threshold is 80% which means we are going to drop columns having more than 80% of missing values

#output
'''
# Columns having more than 60 percent missing values: 2
Columns:
 ['id', 'location']
'''

Now create new dataframe excluding these columns

现在创建不包括这些列的新数据框

l = rmissingvaluecol(df,49)
df1 = df[l]

Bonus step

奖励步骤

You can find the percentage of missing values for each column (optional)

您可以找到每列缺失值的百分比（可选）

def missing(dff):
    print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))

missing(df)

#output
'''
id          83.33
location    83.33
owner       16.67
pets        16.67
dtype: float64
'''

Answer 6

回答by BP34500

The fastest way to find the sum of NaN or the percentage by columns is :

按列查找 NaN 的总和或百分比的最快方法是：

for the sum : df.isna().sum()
for the percentage : df.isna().mean()

总和：df.isna().sum()
对于百分比：df.isna().mean()

Answer 7

回答by mannem srinivas

def show_null_columns(data,agg,threshold):
    if agg=='sum':
       null_cols=data.isnull().sum()
    elif agg=='mean':
       null_cols=data.isnull().mean()
    columns=data.columns
    null_dic={}
    for col,x in zip(columns,null_cols):
        if x>=threshold:
            null_dic[col]=x
    return null_dic
null_dic=show_null_columns(train,'mean',0.8)
train2=train.drop(null_dic.keys(),axis=1)

pandas 如何删除Python中缺失值过多的列

提问by HHH

回答by MaxU

回答by Anton vBR

回答by singmotor

回答by vealkind

回答by Suhas_Pote

Here is a simple function which you can use directly by passing dataframe and threshold

这是一个简单的函数，您可以通过传递数据框和阈值来直接使用它

Bonus step

奖励步骤

回答by BP34500

回答by mannem srinivas

相关推荐

最近更新

标签

pandas 如何删除Python中缺失值过多的列

提问by HHH

回答by MaxU

回答by Anton vBR

回答by singmotor

回答by vealkind

回答by Suhas_Pote

Here is a simple function which you can use directly by passing dataframe and threshold

这是一个简单的函数，您可以通过传递数据框和阈值来直接使用它

Bonus step

奖励步骤

回答by BP34500

回答by mannem srinivas

相关推荐

如何使用多个 numpy 1d 数组创建一个 Pandas DataFrame？

如何使用 Pandas 中的 groupby 计算绝对和？

警告！***HDF5 库版本不匹配错误*** python pandas windows

pandas 如何更改熊猫的时间格式？

相关推荐

最近更新

标签

警告！HDF5 库版本不匹配错误 python pandas windows