pandas 如何删除Python中缺失值过多的列
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/45515031/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
how to remove columns with too many missing values in Python
提问by HHH
I'm working on a machine learning problem in which there are many missing values in the features. There are 100's of features and I would like to remove those features that have too many missing values (it can be features with more than 80% missing values). How can I do that in Python.
我正在研究一个机器学习问题,其中特征中有许多缺失值。有 100 个特征,我想删除那些缺失值过多的特征(可能是缺失值超过 80% 的特征)。我怎样才能在 Python 中做到这一点。
p.s. my data is a Pandas dataframe.
ps 我的数据是 Pandas 数据框。
回答by MaxU
Demo:
演示:
Setup:
设置:
In [105]: df = pd.DataFrame(np.random.choice([2,np.nan], (20, 5), p=[0.2, 0.8]), columns=list('abcde'))
In [106]: df
Out[106]:
a b c d e
0 NaN 2.0 NaN NaN NaN
1 NaN NaN 2.0 NaN 2.0
2 NaN 2.0 NaN NaN NaN
3 NaN NaN NaN NaN 2.0
4 NaN 2.0 2.0 NaN NaN
5 NaN NaN NaN NaN NaN
6 NaN 2.0 NaN NaN NaN
7 2.0 2.0 NaN NaN NaN
8 2.0 2.0 NaN NaN NaN
9 NaN NaN NaN NaN NaN
10 NaN 2.0 2.0 NaN 2.0
11 NaN NaN NaN 2.0 NaN
12 2.0 NaN NaN 2.0 NaN
13 NaN NaN NaN 2.0 NaN
14 NaN NaN NaN 2.0 2.0
15 NaN NaN NaN NaN NaN
16 NaN 2.0 NaN NaN NaN
17 2.0 NaN NaN NaN 2.0
18 NaN NaN NaN 2.0 NaN
19 NaN 2.0 NaN 2.0 NaN
In [107]: df.isnull().mean()
Out[107]:
a 0.80
b 0.55
c 0.85
d 0.70
e 0.75
dtype: float64
Solution:
解决方案:
In [108]: df.columns[df.isnull().mean() < 0.8]
Out[108]: Index(['b', 'd', 'e'], dtype='object')
In [109]: df[df.columns[df.isnull().mean() < 0.8]]
Out[109]:
b d e
0 2.0 NaN NaN
1 NaN NaN 2.0
2 2.0 NaN NaN
3 NaN NaN 2.0
4 2.0 NaN NaN
5 NaN NaN NaN
6 2.0 NaN NaN
7 2.0 NaN NaN
8 2.0 NaN NaN
9 NaN NaN NaN
10 2.0 NaN 2.0
11 NaN 2.0 NaN
12 NaN 2.0 NaN
13 NaN 2.0 NaN
14 NaN 2.0 2.0
15 NaN NaN NaN
16 2.0 NaN NaN
17 NaN NaN 2.0
18 NaN 2.0 NaN
19 2.0 2.0 NaN
回答by Anton vBR
Following MaxU example this is the option for filtering rows:
以下 MaxU 示例是用于过滤行的选项:
df = pd.DataFrame(np.random.choice([2,np.nan], (5,10), p=[0.2, 0.8]), columns=list('abcdefghij'))
a b c d e f g h i j
0 NaN NaN NaN NaN NaN 2.0 NaN NaN NaN 2.0
1 NaN 2.0 NaN 2.0 NaN NaN 2.0 NaN NaN 2.0
2 NaN NaN 2.0 NaN 2.0 NaN 2.0 2.0 NaN NaN
3 NaN NaN NaN NaN NaN 2.0 NaN NaN NaN 2.0
4 2.0 2.0 2.0 NaN NaN NaN NaN NaN NaN NaN
Rows
行
df.loc[df.isnull().mean(axis=1).lt(0.8)]
a b c d e f g h i j
1 NaN 2.0 NaN 2.0 NaN NaN 2.0 NaN NaN 2.0
2 NaN NaN 2.0 NaN 2.0 NaN 2.0 2.0 NaN NaN
4 2.0 2.0 2.0 NaN NaN NaN NaN NaN NaN NaN
回答by singmotor
You can use pandas dropna.
您可以使用Pandas dropna。
limitPer = len(yourdf) * .80
yourdf = yourdf.dropna(thresh=limitPer,axis=1)
回答by vealkind
To generalize within pandas you can do the following to calculate the percent of values in a column with missing values. From those columns you can filter out the features with more than 80% NULL values and then drop those columns from the DataFrame.
要在 Pandas 中进行概括,您可以执行以下操作来计算具有缺失值的列中值的百分比。从这些列中,您可以过滤掉超过 80% NULL 值的特征,然后从 DataFrame 中删除这些列。
pct_null = df.isnull().sum() / len(df)
missing_features = pct_null[pct_null > 0.80].index
df.drop(missing_features, axis=1, inplace=True)
回答by Suhas_Pote
Here is a simple function which you can use directly by passing dataframe and threshold
这是一个简单的函数,您可以通过传递数据框和阈值来直接使用它
def rmissingvaluecol(dff,threshold):
l = []
l = list(dff.drop(dff.loc[:,list((100*(dff.isnull().sum()/len(dff.index))>=threshold))].columns, 1).columns.values)
print("# Columns having more than %s percent missing values:"%threshold,(dff.shape[1] - len(l)))
print("Columns:\n",list(set(list((dff.columns.values))) - set(l)))
return l
rmissingvaluecol(df,80) #Here threshold is 80% which means we are going to drop columns having more than 80% of missing values
#output
'''
# Columns having more than 60 percent missing values: 2
Columns:
['id', 'location']
'''
Now create new dataframe excluding these columns
现在创建不包括这些列的新数据框
l = rmissingvaluecol(df,49)
df1 = df[l]
Bonus step
奖励步骤
You can find the percentage of missing values for each column (optional)
您可以找到每列缺失值的百分比(可选)
def missing(dff):
print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))
missing(df)
#output
'''
id 83.33
location 83.33
owner 16.67
pets 16.67
dtype: float64
'''
回答by BP34500
The fastest way to find the sum of NaN or the percentage by columns is :
按列查找 NaN 的总和或百分比的最快方法是:
- for the sum : df.isna().sum()
- for the percentage : df.isna().mean()
- 总和:df.isna().sum()
- 对于百分比:df.isna().mean()
回答by mannem srinivas
def show_null_columns(data,agg,threshold):
if agg=='sum':
null_cols=data.isnull().sum()
elif agg=='mean':
null_cols=data.isnull().mean()
columns=data.columns
null_dic={}
for col,x in zip(columns,null_cols):
if x>=threshold:
null_dic[col]=x
return null_dic
null_dic=show_null_columns(train,'mean',0.8)
train2=train.drop(null_dic.keys(),axis=1)