pandas 如何删除Python中缺失值过多的列

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/45515031/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 04:11:16  来源:igfitidea点击:

how to remove columns with too many missing values in Python

pythonpandasdataframescikit-learnmissing-data

提问by HHH

I'm working on a machine learning problem in which there are many missing values in the features. There are 100's of features and I would like to remove those features that have too many missing values (it can be features with more than 80% missing values). How can I do that in Python.

我正在研究一个机器学习问题,其中特征中有许多缺失值。有 100 个特征,我想删除那些缺失值过多的特征(可能是缺失值超过 80% 的特征)。我怎样才能在 Python 中做到这一点。

p.s. my data is a Pandas dataframe.

ps 我的数据是 Pandas 数据框。

回答by MaxU

Demo:

演示:

Setup:

设置:

In [105]: df = pd.DataFrame(np.random.choice([2,np.nan], (20, 5), p=[0.2, 0.8]), columns=list('abcde'))

In [106]: df
Out[106]:
      a    b    c    d    e
0   NaN  2.0  NaN  NaN  NaN
1   NaN  NaN  2.0  NaN  2.0
2   NaN  2.0  NaN  NaN  NaN
3   NaN  NaN  NaN  NaN  2.0
4   NaN  2.0  2.0  NaN  NaN
5   NaN  NaN  NaN  NaN  NaN
6   NaN  2.0  NaN  NaN  NaN
7   2.0  2.0  NaN  NaN  NaN
8   2.0  2.0  NaN  NaN  NaN
9   NaN  NaN  NaN  NaN  NaN
10  NaN  2.0  2.0  NaN  2.0
11  NaN  NaN  NaN  2.0  NaN
12  2.0  NaN  NaN  2.0  NaN
13  NaN  NaN  NaN  2.0  NaN
14  NaN  NaN  NaN  2.0  2.0
15  NaN  NaN  NaN  NaN  NaN
16  NaN  2.0  NaN  NaN  NaN
17  2.0  NaN  NaN  NaN  2.0
18  NaN  NaN  NaN  2.0  NaN
19  NaN  2.0  NaN  2.0  NaN

In [107]: df.isnull().mean()
Out[107]:
a    0.80
b    0.55
c    0.85
d    0.70
e    0.75
dtype: float64

Solution:

解决方案:

In [108]: df.columns[df.isnull().mean() < 0.8]
Out[108]: Index(['b', 'd', 'e'], dtype='object')

In [109]: df[df.columns[df.isnull().mean() < 0.8]]
Out[109]:
      b    d    e
0   2.0  NaN  NaN
1   NaN  NaN  2.0
2   2.0  NaN  NaN
3   NaN  NaN  2.0
4   2.0  NaN  NaN
5   NaN  NaN  NaN
6   2.0  NaN  NaN
7   2.0  NaN  NaN
8   2.0  NaN  NaN
9   NaN  NaN  NaN
10  2.0  NaN  2.0
11  NaN  2.0  NaN
12  NaN  2.0  NaN
13  NaN  2.0  NaN
14  NaN  2.0  2.0
15  NaN  NaN  NaN
16  2.0  NaN  NaN
17  NaN  NaN  2.0
18  NaN  2.0  NaN
19  2.0  2.0  NaN

回答by Anton vBR

Following MaxU example this is the option for filtering rows:

以下 MaxU 示例是用于过滤行的选项:

df = pd.DataFrame(np.random.choice([2,np.nan], (5,10), p=[0.2, 0.8]), columns=list('abcdefghij'))

a   b   c   d   e   f   g   h   i   j
0   NaN NaN NaN NaN NaN 2.0 NaN NaN NaN 2.0
1   NaN 2.0 NaN 2.0 NaN NaN 2.0 NaN NaN 2.0
2   NaN NaN 2.0 NaN 2.0 NaN 2.0 2.0 NaN NaN
3   NaN NaN NaN NaN NaN 2.0 NaN NaN NaN 2.0
4   2.0 2.0 2.0 NaN NaN NaN NaN NaN NaN NaN

Rows

df.loc[df.isnull().mean(axis=1).lt(0.8)]

    a   b   c   d   e   f   g   h   i   j
1   NaN 2.0 NaN 2.0 NaN NaN 2.0 NaN NaN 2.0
2   NaN NaN 2.0 NaN 2.0 NaN 2.0 2.0 NaN NaN
4   2.0 2.0 2.0 NaN NaN NaN NaN NaN NaN NaN

回答by singmotor

You can use pandas dropna.

您可以使用Pandas dropna。

limitPer = len(yourdf) * .80
yourdf = yourdf.dropna(thresh=limitPer,axis=1)

回答by vealkind

To generalize within pandas you can do the following to calculate the percent of values in a column with missing values. From those columns you can filter out the features with more than 80% NULL values and then drop those columns from the DataFrame.

要在 Pandas 中进行概括,您可以执行以下操作来计算具有缺失值的列中值的百分比。从这些列中,您可以过滤掉超过 80% NULL 值的特征,然后从 DataFrame 中删除这些列。

pct_null = df.isnull().sum() / len(df)
missing_features = pct_null[pct_null > 0.80].index
df.drop(missing_features, axis=1, inplace=True)

回答by Suhas_Pote

Here is a simple function which you can use directly by passing dataframe and threshold

这是一个简单的函数,您可以通过传递数据框和阈值来直接使用它

def rmissingvaluecol(dff,threshold):
    l = []
    l = list(dff.drop(dff.loc[:,list((100*(dff.isnull().sum()/len(dff.index))>=threshold))].columns, 1).columns.values)
    print("# Columns having more than %s percent missing values:"%threshold,(dff.shape[1] - len(l)))
    print("Columns:\n",list(set(list((dff.columns.values))) - set(l)))
    return l


rmissingvaluecol(df,80) #Here threshold is 80% which means we are going to drop columns having more than 80% of missing values

#output
'''
# Columns having more than 60 percent missing values: 2
Columns:
 ['id', 'location']
'''

Now create new dataframe excluding these columns

现在创建不包括这些列的新数据框

l = rmissingvaluecol(df,49)
df1 = df[l]

Bonus step

奖励步骤

You can find the percentage of missing values for each column (optional)

您可以找到每列缺失值的百分比(可选)

def missing(dff):
    print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))

missing(df)

#output
'''
id          83.33
location    83.33
owner       16.67
pets        16.67
dtype: float64
'''

回答by BP34500

The fastest way to find the sum of NaN or the percentage by columns is :

按列查找 NaN 的总和或百分比的最快方法是:

  • for the sum : df.isna().sum()
  • for the percentage : df.isna().mean()
  • 总和:df.isna().sum()
  • 对于百分比:df.isna().mean()

回答by mannem srinivas

def show_null_columns(data,agg,threshold):
    if agg=='sum':
       null_cols=data.isnull().sum()
    elif agg=='mean':
       null_cols=data.isnull().mean()
    columns=data.columns
    null_dic={}
    for col,x in zip(columns,null_cols):
        if x>=threshold:
            null_dic[col]=x
    return null_dic
null_dic=show_null_columns(train,'mean',0.8)
train2=train.drop(null_dic.keys(),axis=1)