Pandas: select all columns without NaN

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/47414848/

Pandas select all columns without NaN

Tags: python, pandas, select, null

Asked by Hristo Stoychev

I have a DF with 200 columns. Most of them contain NaNs. I would like to select all columns with no NaNs, or at least with the minimum number of NaNs. I've tried dropping with a threshold or with notnull(), but without success. Any ideas?

df.dropna(thresh=2, inplace=True)
df_notnull = df[df.notnull()]

Example DF:

col1  col2  col3
23    45    NaN
54    39    NaN
NaN   45    76
87    32    NaN

The output should look like:

df.dropna(axis=1, thresh=2)

col1  col2
23    45
54    39
NaN   45
87    32
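
A self-contained reproduction of the example above (a sketch, assuming the sample values shown):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [23, 54, np.nan, 87],
                   'col2': [45, 39, 45, 32],
                   'col3': [np.nan, np.nan, 76, np.nan]})

# thresh=2 keeps every column that has at least two non-NaN values,
# so col3 (with a single non-NaN value) is dropped
df.dropna(axis=1, thresh=2)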

Answered by Vaishali

You can keep the columns that are not entirely NaN using

df = df[df.columns[~df.isnull().all()]]

Or

null_cols = df.columns[df.isnull().all()]
df.drop(null_cols, axis = 1, inplace = True)

If you wish to remove columns based on a certain percentage of NaNs, say columns with more than 90% of their values null:

cols_to_delete = df.columns[df.isnull().sum()/len(df) > .90]
df.drop(cols_to_delete, axis = 1, inplace = True)
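
An equivalent way to express the same cut-off (a sketch; isnull().mean() gives the per-column fraction of missing values directly):

cols_to_delete = df.columns[df.isnull().mean() > 0.90]
df = df.drop(cols_to_delete, axis=1)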

Answered by lmolina

I assume that you want to get all the columns without any NaN. If that's the case, you can first get the names of the columns without any NaN using ~col.isnull().any(), then use those columns.

I can think of the following code:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [23, 54, np.nan, 87],
    'col2': [45, 39, 45, 32],
    'col3': [np.nan, np.nan, 76, np.nan],
})

# This function checks whether the number of null values in the column
# exceeds the threshold
def has_nan(col, threshold=0):
    return col.isnull().sum() > threshold

# Then you apply the "complement" of the function to get the columns with
# no NaN.

df.loc[:, ~df.apply(has_nan)]

# ... or pass the threshold as parameter, if needed
df.loc[:, ~df.apply(has_nan, args=(2,))]
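
For the sample frame above, the default call should keep only col2, while passing a threshold of 2 should also keep col1 (one NaN) and still drop col3 (three NaNs) — a quick check under that assumption:

df.loc[:, ~df.apply(has_nan)]             # keeps only col2
df.loc[:, ~df.apply(has_nan, args=(2,))]  # keeps col1 and col2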

Answered by noname

df[df.columns[~df.isnull().any()]] will give you a DataFrame with only the columns that have no null values, and should be the solution.

df[df.columns[~df.isnull().all()]] only removes the columns that have nothing but null values and leaves columns with even one non-null value.

df.isnull() will return a dataframe of booleans with the same shape as df. These bools will be True if the particular value is null and False if it isn't.

df.isnull().any() will return True for all columns that contain even one null. This is where I'm diverging from the accepted answer, as df.isnull().all() will not flag columns that still contain even one non-null value!

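A quick demonstration of the difference on the question's sample frame (a sketch, assuming the values shown there):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [23, 54, np.nan, 87],
                   'col2': [45, 39, 45, 32],
                   'col3': [np.nan, np.nan, 76, np.nan]})

df.isnull().any()   # col1 True, col2 False, col3 True
df.isnull().all()   # all False, because even col3 still holds the single value 76

df[df.columns[~df.isnull().any()]]   # keeps only col2
df[df.columns[~df.isnull().all()]]   # keeps all three columns, since none is entirely NaN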

Answered by Alex

You should try df_notnull = df.dropna(how='all'). This will keep only the rows that are not entirely null.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html

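Note that dropna(how='all') operates on rows by default; for the column-wise version the question asks about, the same call with axis=1 should work (a sketch):

df_notnull_rows = df.dropna(how='all')          # drop rows in which every value is NaN
df_notnull_cols = df.dropna(axis=1, how='all')  # drop columns in which every value is NaN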

Answered by MJP

null_series = df.isnull().sum() # The number of missing values from each column in your dataframe
full_col_series = null_series[null_series == 0] # Will keep only the columns with no missing values

df = df[full_col_series.index]
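
A compact equivalent (a sketch, assuming the goal is to keep only the columns with zero missing values):

df = df.dropna(axis=1, how='any')   # drops every column that contains at least one NaN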

Answered by Suhas_Pote

Here is a simple function which you can use directly by passing a dataframe and a threshold:

df
'''
     pets   location     owner     id
0     cat  San_Diego     Champ  123.0
1     dog        NaN       Ron    NaN
2     cat        NaN     Brick    NaN
3  monkey        NaN     Champ    NaN
4  monkey        NaN  Veronica    NaN
5     dog        NaN      John    NaN
'''


def rmissingvaluecol(dff, threshold):
    # Percentage of missing values per column
    missing_pct = 100 * dff.isnull().sum() / len(dff.index)
    # Keep the columns with less than `threshold` percent missing values
    keep = list(dff.columns[missing_pct < threshold])
    print("# Columns having more than %s percent missing values:" % threshold, dff.shape[1] - len(keep))
    print("Columns:\n", list(set(dff.columns) - set(keep)))
    return keep


rmissingvaluecol(df,1) #Here threshold is 1% which means we are going to drop columns having more than 1% of missing values

#output
'''
# Columns having more than 1 percent missing values: 2
Columns:
 ['id', 'location']
'''

Now create a new dataframe excluding these columns:

l = rmissingvaluecol(df,1)
df1 = df[l]

PS: You can change the threshold as per your requirement.

Bonus step

You can find the percentage of missing values for each column (optional)

def missing(dff):
    print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))

missing(df)

#output
'''
id          83.33
location    83.33
owner        0.00
pets         0.00
dtype: float64
'''

Answered by Tanweer Mahdi Hasan

This worked quite well for me and is probably tailored to your needs as well!

def nan_weed(df, thresh):
    # Keep the columns whose NaN count does not exceed the threshold
    ind = []
    for col in df.columns:
        if df[col].isnull().sum() <= thresh:
            ind.append(col)
    return df[ind]
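
Applied to the question's sample frame, this should behave as follows (a usage sketch, assuming df holds the values shown in the question and the function above):

nan_weed(df, 0)   # keeps only col2, the column with no NaN at all
nan_weed(df, 1)   # keeps col1 and col2; col1 has a single NaN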