Drop Columns with more than 60 Percent of "empty" Values in Pandas
Source: http://stackoverflow.com/questions/49791246/
This content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by Krypt
I have a dataframe like this:
import pandas as pd
data = {
'c1': ['Test1','Test2','NULL','Test3',' ','Test4','Test4','Test1',"Test3"],
'c2': [' ','Test1',' ','NULL',' ','NULL','NULL','NULL','NULL'],
'c3': [0,0,0,0,0,1,5,0,0],
'c4': ['NULL', 'Test2', 'Test1','Test1', 'Test2', 'Test2','Test1','Test1','Test2']
}
df = pd.DataFrame(data)
df
The dataframe looks like this:
      c1     c2  c3     c4
0  Test1          0   NULL
1  Test2  Test1   0  Test2
2   NULL          0  Test1
3  Test3   NULL   0  Test1
4                 0  Test2
5  Test4   NULL   1  Test2
6  Test4   NULL   5  Test1
7  Test1   NULL   0  Test1
8  Test3   NULL   0  Test2
I want to drop all columns in which more than 60% of the values are "empty". In my case "empty" means values such as ' ', 'NULL' or 0. Some columns hold strings (c1, c2, c4) and one holds integers (c3).
The result should be a dataframe with columns c1 and c4 only.
      c1     c4
0  Test1   NULL
1  Test2  Test2
2   NULL  Test1
3  Test3  Test1
4         Test2
5  Test4  Test2
6  Test4  Test1
7  Test1  Test1
8  Test3  Test2
I have no idea how to approach this. The only thing that comes to mind is something like
df.loc[:, (df != 0).any(axis=0)]
to delete all columns in which every value is 0, 'NULL' and so on.
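For reference, here is a minimal sketch of what that expression does on the sample data. The names `mask` and `kept` are illustrative and not part of the question.

```python
import pandas as pd

data = {
    'c1': ['Test1', 'Test2', 'NULL', 'Test3', ' ', 'Test4', 'Test4', 'Test1', 'Test3'],
    'c2': [' ', 'Test1', ' ', 'NULL', ' ', 'NULL', 'NULL', 'NULL', 'NULL'],
    'c3': [0, 0, 0, 0, 0, 1, 5, 0, 0],
    'c4': ['NULL', 'Test2', 'Test1', 'Test1', 'Test2', 'Test2', 'Test1', 'Test1', 'Test2'],
}
df = pd.DataFrame(data)

# True for every column that holds at least one value different from 0.
# Strings such as ' ' and 'NULL' are never equal to 0, so they count as non-empty.
mask = (df != 0).any(axis=0)

# Only a column made up entirely of 0 would be dropped; the sample has none.
kept = df.loc[:, mask]
print(list(kept.columns))  # ['c1', 'c2', 'c3', 'c4']
```

So the expression knows nothing about ' ' or 'NULL', and it has no notion of a percentage threshold.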
Answered by jezrael
Use DataFrame.isin to check for all of the "empty" formats, then take the mean of the resulting boolean mask to get the fraction of empty values in each column. Compare that fraction with the threshold and filter the columns by boolean indexing with loc:
print (df.isin([' ','NULL',0]))
c1 c2 c3 c4
0 False True True True
1 False False True False
2 True True True False
3 False True True False
4 True True True False
5 False True False False
6 False True False False
7 False True True False
8 False True True False
print (df.isin([' ','NULL',0]).mean())
c1 0.222222
c2 0.888889
c3 0.777778
c4 0.111111
dtype: float64
df = df.loc[:, df.isin([' ','NULL',0]).mean() < .6]
print (df)
      c1     c4
0  Test1   NULL
1  Test2  Test2
2   NULL  Test1
3  Test3  Test1
4         Test2
5  Test4  Test2
6  Test4  Test1
7  Test1  Test1
8  Test3  Test2
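The same steps can be wrapped in a small helper. This is only a sketch: the function name `drop_mostly_empty` and its parameters `empty_values` and `max_empty` are hypothetical, not part of pandas or of the answer. It also assumes that a column with exactly 60% empty values should be kept, so it compares with `<=` rather than `<`.

```python
import pandas as pd


def drop_mostly_empty(frame, empty_values=(' ', 'NULL', 0), max_empty=0.6):
    """Return a copy of frame without the columns whose share of
    empty values is greater than max_empty (hypothetical helper)."""
    # Boolean mask of empty cells; its column mean is the empty share.
    empty_share = frame.isin(list(empty_values)).mean()
    # Keep columns at or below the limit (assumption: exactly 60% is kept).
    return frame.loc[:, empty_share <= max_empty].copy()


data = {
    'c1': ['Test1', 'Test2', 'NULL', 'Test3', ' ', 'Test4', 'Test4', 'Test1', 'Test3'],
    'c2': [' ', 'Test1', ' ', 'NULL', ' ', 'NULL', 'NULL', 'NULL', 'NULL'],
    'c3': [0, 0, 0, 0, 0, 1, 5, 0, 0],
    'c4': ['NULL', 'Test2', 'Test1', 'Test1', 'Test2', 'Test2', 'Test1', 'Test1', 'Test2'],
}
sample = pd.DataFrame(data)

result = drop_mostly_empty(sample)
print(list(result.columns))  # ['c1', 'c4']
```

Because the helper returns a new frame, the original one is left unchanged, which makes it easy to try different values of `max_empty`.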
Answered by shivsn
You can drop the columns using the thresh parameter of dropna, after replacing the "empty" values with NaN:
In [57]: import numpy as np
In [58]: df = df.replace([0, ' ', 'NULL'], np.nan)
In [59]: df
Out[59]:
c1 c2 c3 c4
0 Test1 NaN NaN NaN
1 Test2 Test1 NaN Test2
2 NaN NaN NaN Test1
3 Test3 NaN NaN Test1
4 NaN NaN NaN Test2
5 Test4 NaN 1.0 Test2
6 Test4 NaN 5.0 Test1
7 Test1 NaN NaN Test1
8 Test3 NaN NaN Test2
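In [60]: df.dropna(thresh=int(np.ceil(df.shape[0] * 0.4)), axis=1)  # thresh = minimum number of non-NaN values; "more than 60% empty" means "fewer than 40% filled"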
Out[60]:
c1 c4
0 Test1 NaN
1 Test2 Test2
2 NaN Test1
3 Test3 Test1
4 NaN Test2
5 Test4 Test2
6 Test4 Test1
7 Test1 Test1
8 Test3 Test2
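Note that this approach leaves NaN in place of the original ' ', 'NULL' and 0 values. If those should be preserved, one option is to use the NaN-marked frame only to decide which columns to keep. The following is a sketch under that assumption. The names `max_empty`, `min_filled`, `marked` and `keep` are illustrative.

```python
import math

import numpy as np
import pandas as pd

data = {
    'c1': ['Test1', 'Test2', 'NULL', 'Test3', ' ', 'Test4', 'Test4', 'Test1', 'Test3'],
    'c2': [' ', 'Test1', ' ', 'NULL', ' ', 'NULL', 'NULL', 'NULL', 'NULL'],
    'c3': [0, 0, 0, 0, 0, 1, 5, 0, 0],
    'c4': ['NULL', 'Test2', 'Test1', 'Test1', 'Test2', 'Test2', 'Test1', 'Test1', 'Test2'],
}
df = pd.DataFrame(data)

max_empty = 0.6  # drop a column when more than 60% of its values are empty

# thresh is the minimum number of non-NaN values a column needs to survive,
# so "at most 60% empty" becomes "at least 40% filled" (4 of 9 rows here).
min_filled = math.ceil(len(df) * (1 - max_empty))

# Mark the empty values as NaN in a temporary frame only.
marked = df.replace([0, ' ', 'NULL'], np.nan)
keep = marked.dropna(thresh=min_filled, axis=1).columns

# Select from the original frame so ' ', 'NULL' and 0 stay untouched.
result = df[keep]
print(list(result.columns))  # ['c1', 'c4']
```

Here `thresh` is passed without `how`, since the two arguments express competing rules for when to drop and recent pandas versions reject the combination.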