Pandas select all columns without NaN
Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/47414848/
Asked by Hristo Stoychev
I have a DataFrame with 200 columns. Most of them contain NaNs. I would like to select all columns with no NaNs, or at least with the minimum number of NaNs. I've tried dropping them with a threshold or with notnull(), but without success. Any ideas?
df.dropna(thresh=2, inplace=True)
df_notnull = df[df.notnull()]
DF for example:
col1 col2 col3
23 45 NaN
54 39 NaN
NaN 45 76
87 32 NaN
The output should look like:
df.dropna(axis=1, thresh=2)
col1 col2
23 45
54 39
NaN 45
87 32
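The steps above can be reproduced with a minimal runnable sketch, rebuilding the example frame from the question:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    "col1": [23, 54, np.nan, 87],
    "col2": [45, 39, 45, 32],
    "col3": [np.nan, np.nan, 76, np.nan],
})

# thresh=2 keeps only the columns with at least 2 non-NaN values,
# so col3 (a single non-NaN value) is dropped
result = df.dropna(axis=1, thresh=2)
print(list(result.columns))  # ['col1', 'col2']
```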
Answered by Vaishali
You can build a DataFrame keeping only the columns that are not entirely NaN using
df = df[df.columns[~df.isnull().all()]]
Or
null_cols = df.columns[df.isnull().all()]
df.drop(null_cols, axis = 1, inplace = True)
If you wish to remove columns based on a certain percentage of NaNs, say columns with more than 90% of their data null:
cols_to_delete = df.columns[df.isnull().sum()/len(df) > .90]
df.drop(cols_to_delete, axis = 1, inplace = True)
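As a sketch of the percentage-based variant, using the small example frame from the question (the cutoff is lowered from 90% to 50% here so the toy data actually triggers it):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [23, 54, np.nan, 87],          # 25% null
    "col2": [45, 39, 45, 32],              # 0% null
    "col3": [np.nan, np.nan, 76, np.nan],  # 75% null
})

# Drop every column where more than 50% of the values are null
cols_to_delete = df.columns[df.isnull().sum() / len(df) > 0.50]
df = df.drop(cols_to_delete, axis=1)
print(list(df.columns))  # ['col1', 'col2']
```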
Answered by lmolina
I assume that you want to get all the columns without any NaN. If that's the case, you can first get the names of the columns without any NaN using ~col.isnull().any(), then use them to select your columns.
I can think of the following code:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [23, 54, np.nan, 87],
    'col2': [45, 39, 45, 32],
    'col3': [np.nan, np.nan, 76, np.nan],
})
# This function will check if there is a null value in the column
def has_nan(col, threshold=0):
return col.isnull().sum() > threshold
# Then you apply the "complement" of the function to keep the columns
# with no NaN.
df.loc[:, ~df.apply(has_nan)]
# ... or pass the threshold as parameter, if needed
df.loc[:, ~df.apply(has_nan, args=(2,))]
Answered by noname
df[df.columns[~df.isnull().any()]]
will give you a DataFrame with only the columns that have no null values, and should be the solution.
df[df.columns[~df.isnull().all()]]
only removes the columns that have nothing but null values and leaves columns with even one non-null value.
df.isnull()
will return a dataframe of booleans with the same shape as df. These bools will be True if the particular value is null and False if it isn't.
df.isnull().any()
will return True for all columns with even one null. This is where I'm diverging from the accepted answer, as df.isnull().all()
will not flag columns that have even one non-null value!
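The any()/all() contrast above can be seen on a small frame with one partially null, one fully populated, and one entirely null column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mixed": [1, np.nan, 3],  # one null value
    "full":  [1, 2, 3],       # no null values
    "empty": [np.nan] * 3,    # nothing but null values
})

# ~any(): keeps only the columns with no nulls at all
no_nulls = list(df.columns[~df.isnull().any()])
print(no_nulls)  # ['full']

# ~all(): keeps every column with at least one non-null value
not_all_null = list(df.columns[~df.isnull().all()])
print(not_all_null)  # ['mixed', 'full']
```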
Answered by Alex
you should try df_notnull = df.dropna(how='all')
This keeps only the rows that have at least one non-null value. Note that it filters rows, not columns; pass axis=1 as well to drop the all-null columns instead.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
Answered by MJP
null_series = df.isnull().sum() # The number of missing values from each column in your dataframe
full_col_series = null_series[null_series == 0] # Will keep only the columns with no missing values
df = df[full_col_series.index]
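Applied to the example frame from the question, this approach keeps only the fully populated column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [23, 54, np.nan, 87],
    "col2": [45, 39, 45, 32],
    "col3": [np.nan, np.nan, 76, np.nan],
})

null_series = df.isnull().sum()                  # missing-value count per column
full_col_series = null_series[null_series == 0]  # columns with zero missing values
df = df[full_col_series.index]
print(list(df.columns))  # ['col2']
```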
Answered by Suhas_Pote
Here is a simple function which you can use directly by passing a dataframe and a threshold:
df
'''
pets location owner id
0 cat San_Diego Champ 123.0
1 dog NaN Ron NaN
2 cat NaN Brick NaN
3 monkey NaN Champ NaN
4 monkey NaN Veronica NaN
5 dog NaN John NaN
'''
def rmissingvaluecol(dff, threshold):
    # Percentage of missing values in each column
    pct_missing = 100 * dff.isnull().sum() / len(dff.index)
    # Keep only the columns whose missing percentage is below the threshold
    l = list(dff.columns[pct_missing < threshold])
    print("# Columns having more than %s percent missing values:" % threshold, (dff.shape[1] - len(l)))
    print("Columns:\n", list(set(dff.columns) - set(l)))
    return l
rmissingvaluecol(df,1) #Here threshold is 1% which means we are going to drop columns having more than 1% of missing values
#output
'''
# Columns having more than 1 percent missing values: 2
Columns:
['id', 'location']
'''
Now create a new dataframe excluding these columns:
l = rmissingvaluecol(df,1)
df1 = df[l]
PS: You can change the threshold as per your requirement.
Bonus step
You can find the percentage of missing values for each column (optional)
def missing(dff):
print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))
missing(df)
#output
'''
id 83.33
location 83.33
owner 0.00
pets 0.00
dtype: float64
'''
Answered by Tanweer Mahdi Hasan
This worked quite well for me and is probably tailored to your needs as well!
def nan_weed(df, thresh):
    # Keep the columns that have at most `thresh` NaN values
    ind = []
    for col in df.columns:
        if df[col].isnull().sum() <= thresh:
            ind.append(col)
    return df[ind]