pandas dataframe drop columns by number of nan

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/30923324/

Date: 2020-09-13 23:29:37  Source: igfitidea

Tags: python, pandas

Asked by pyan

I have a dataframe in which some columns contain NaN. I'd like to drop the columns that contain a certain number of NaN values or more. For example, in the following code, I'd like to drop any column with 2 or more NaN. In this case, column 'C' will be dropped and only 'A' and 'B' will be kept. How can I implement this?

import pandas as pd
import numpy as np

dff = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))
dff.iloc[3,0] = np.nan    # one NaN in column 'A'
dff.iloc[6,1] = np.nan    # one NaN in column 'B'
dff.iloc[5:8,2] = np.nan  # three NaN in column 'C'

print(dff)

Answered by EdChum

dropna has a thresh parameter, the minimum number of non-NaN values a column must have to be kept. You just need to pass the length of your df minus the number of NaN values you are willing to allow:

In [13]:

dff.dropna(thresh=len(dff) - 2, axis=1)
Out[13]:
          A         B
0  0.517199 -0.806304
1 -0.643074  0.229602
2  0.656728  0.535155
3       NaN -0.162345
4 -0.309663 -0.783539
5  1.244725 -0.274514
6 -0.254232       NaN
7 -1.242430  0.228660
8 -0.311874 -0.448886
9 -0.984453 -0.755416

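So the above drops any column that has fewer than len(dff) - 2 non-NaN values, where len(dff) is the number of rows. In other words, a column may contain up to 2 NaN values and still be kept. To drop columns with 2 or more NaN, as the question asks, pass thresh=len(dff) - 1 instead; for this example the result is the same, since 'A' and 'B' each have only one NaN.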
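
The effect of thresh can be checked on a small deterministic frame. This is only a sketch: the frame `df` and its contents are made up for illustration (one, two and three NaN in columns 'A', 'B' and 'C'), and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Illustrative frame with 10 rows: A has 1 NaN, B has 2 NaN, C has 3 NaN.
df = pd.DataFrame({
    "A": np.arange(10, dtype=float),
    "B": np.arange(10, dtype=float),
    "C": np.arange(10, dtype=float),
})
df.iloc[3, 0] = np.nan
df.iloc[[2, 6], 1] = np.nan
df.iloc[5:8, 2] = np.nan

# thresh is the minimum number of non-NaN values a column needs to be kept.
at_most_two_nan = df.dropna(thresh=len(df) - 2, axis=1)  # tolerates up to 2 NaN
at_most_one_nan = df.dropna(thresh=len(df) - 1, axis=1)  # tolerates up to 1 NaN

print(list(at_most_two_nan.columns))
print(list(at_most_one_nan.columns))
```

Column 'B', with exactly 2 NaN, is the one that distinguishes the two thresholds: it survives len(df) - 2 but not len(df) - 1.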

Answered by Alexander

You can use a conditional list comprehension:

>>> dff[[c for c in dff if dff[c].isnull().sum() < 2]]
          A         B
0 -0.819004  0.919190
1  0.922164  0.088111
2  0.188150  0.847099
3       NaN -0.053563
4  1.327250 -0.376076
5  3.724980  0.292757
6 -0.319342       NaN
7 -1.051529  0.389843
8 -0.805542 -0.018347
9 -0.816261 -1.627026
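
The same selection can probably also be written without an explicit loop, by counting NaN per column once and using the result as a boolean mask. This is a sketch of that variant, not part of the original answer; the frame `df` is filled with ones purely so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Illustrative frame: constant values, with NaN placed as in the question.
df = pd.DataFrame(np.ones((10, 3)), columns=list("ABC"))
df.iloc[3, 0] = np.nan
df.iloc[6, 1] = np.nan
df.iloc[5:8, 2] = np.nan

nan_counts = df.isnull().sum()      # Series indexed by column name
kept = df.loc[:, nan_counts < 2]    # boolean mask applied to the columns

print(nan_counts.to_dict())
print(list(kept.columns))
```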

Answered by stellasia

Here is a possible solution:

s = dff.isnull().sum()  # count the number of NaN in each column
print(s)
# A    1
# B    1
# C    3
# dtype: int64

for col in list(dff.columns):  # iterate over a copy, since columns are deleted in the loop
    if s[col] >= 2:
        del dff[col]

Or

for c in list(dff.columns):  # iterate over a copy, since columns are dropped in the loop
    if dff[c].isnull().sum() >= 2:
        dff.drop(c, axis=1, inplace=True)

Answered by NAGARAJ

I recommend the drop method. This is an alternative solution:

dff.drop(dff.columns[dff.isnull().sum() >= 2], axis=1)  # drop columns with 2 or more NaN; returns a new DataFrame
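
A drop-based approach can be sketched as two steps: select the labels of the columns with too many NaN, then pass those labels to drop. The names `df`, `to_drop` and `result` below are illustrative, and the frame is filled with ones only to keep the output reproducible.

```python
import numpy as np
import pandas as pd

# Illustrative frame: constant values, with NaN placed as in the question.
df = pd.DataFrame(np.ones((10, 3)), columns=list("ABC"))
df.iloc[3, 0] = np.nan
df.iloc[6, 1] = np.nan
df.iloc[5:8, 2] = np.nan

# Labels of the columns holding 2 or more NaN values.
to_drop = df.columns[df.isnull().sum() >= 2]

# Without inplace=True, drop returns a new DataFrame.
result = df.drop(to_drop, axis=1)

print(list(to_drop))
print(list(result.columns))
print(list(df.columns))  # the original frame still has all three columns
```

Because drop is not called with inplace=True here, the result has to be assigned to a name to be kept.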

Answered by Ashish

Say you have to drop columns having more than 70% null values.

data.drop(data.columns[data.isnull().mean() > 0.7], axis=1)  # isnull().mean() is the fraction of NaN per column
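
A percentage-based rule like this one can be tried out on a small frame where the NaN fractions are known in advance. The column names and values below are hypothetical, chosen only so that one column exceeds the 70% limit.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the NaN share of each column is given in the comments.
data = pd.DataFrame({
    "mostly_missing": [np.nan] * 8 + [1.0, 2.0],  # 80% NaN
    "half_missing": [np.nan] * 5 + [1.0] * 5,     # 50% NaN
    "complete": np.arange(10, dtype=float),       # 0% NaN
})

nan_fraction = data.isnull().mean()          # fraction of NaN per column
result = data.loc[:, nan_fraction <= 0.7]    # keep columns with at most 70% NaN

print(nan_fraction.to_dict())
print(list(result.columns))
```

Taking the mean of the boolean isnull() frame gives the fraction directly, so no division by the number of rows is needed.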