Pandas select all columns without NaN
Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/47414848/
Asked by Hristo Stoychev
I have a DataFrame with 200 columns. Most of them contain NaNs. I would like to select all columns with no NaNs, or at least with the minimum number of NaNs. I've tried dropping them with a threshold or with notnull(), but without success. Any ideas?
df.dropna(thresh=2, inplace=True)
df_notnull = df[df.notnull()]
DF for example:
col1 col2 col3
23 45 NaN
54 39 NaN
NaN 45 76
87 32 NaN
The output should look like:
df.dropna(axis=1, thresh=2)
col1 col2
23 45
54 39
NaN 45
87 32
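The steps above can be reproduced with a minimal runnable sketch, rebuilding the example frame from the question:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    "col1": [23, 54, np.nan, 87],
    "col2": [45, 39, 45, 32],
    "col3": [np.nan, np.nan, 76, np.nan],
})

# thresh=2 keeps only the columns with at least 2 non-NaN values,
# so col3 (a single non-NaN value) is dropped
result = df.dropna(axis=1, thresh=2)
print(list(result.columns))  # ['col1', 'col2']
```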
Answered by Vaishali
You can build a DataFrame keeping only the columns that are not entirely NaN using
df = df[df.columns[~df.isnull().all()]]
Or
null_cols = df.columns[df.isnull().all()]
df.drop(null_cols, axis = 1, inplace = True)
If you wish to remove columns based on a certain percentage of NaNs, say columns with more than 90% of their data null:
cols_to_delete = df.columns[df.isnull().sum()/len(df) > .90]
df.drop(cols_to_delete, axis = 1, inplace = True)
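As a sketch of the percentage-based variant, using the small example frame from the question (the cutoff is lowered from 90% to 50% here so the toy data actually triggers it):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [23, 54, np.nan, 87],          # 25% null
    "col2": [45, 39, 45, 32],              # 0% null
    "col3": [np.nan, np.nan, 76, np.nan],  # 75% null
})

# Drop every column where more than 50% of the values are null
cols_to_delete = df.columns[df.isnull().sum() / len(df) > 0.50]
df = df.drop(cols_to_delete, axis=1)
print(list(df.columns))  # ['col1', 'col2']
```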
Answered by lmolina
I assume that you want to get all the columns without any NaN. If that's the case, you can first get the names of the columns without any NaN using ~col.isnull().any(), then use them to select your columns.
I can think of the following code:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [23, 54, np.nan, 87],
    'col2': [45, 39, 45, 32],
    'col3': [np.nan, np.nan, 76, np.nan],
})
# This function will check if there is a null value in the column
def has_nan(col, threshold=0):
return col.isnull().sum() > threshold
# Then you apply the "complement" of the function to keep the columns
# with no NaN.
df.loc[:, ~df.apply(has_nan)]
# ... or pass the threshold as parameter, if needed
df.loc[:, ~df.apply(has_nan, args=(2,))]
Answered by noname
df[df.columns[~df.isnull().any()]]
will give you a DataFrame with only the columns that have no null values, and should be the solution.
df[df.columns[~df.isnull().all()]]
only removes the columns that have nothing but null values and leaves columns with even one non-null value.
df.isnull()
will return a dataframe of booleans with the same shape as df. These bools will be True if the particular value is null and False if it isn't.
df.isnull().any()
will return True for all columns with even one null. This is where I'm diverging from the accepted answer, as df.isnull().all()
will not flag columns that have even one non-null value!
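The any()/all() contrast above can be seen on a small frame with one partially null, one fully populated, and one entirely null column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mixed": [1, np.nan, 3],  # one null value
    "full":  [1, 2, 3],       # no null values
    "empty": [np.nan] * 3,    # nothing but null values
})

# ~any(): keeps only the columns with no nulls at all
no_nulls = list(df.columns[~df.isnull().any()])
print(no_nulls)  # ['full']

# ~all(): keeps every column with at least one non-null value
not_all_null = list(df.columns[~df.isnull().all()])
print(not_all_null)  # ['mixed', 'full']
```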
Answered by Alex
you should try df_notnull = df.dropna(how='all')
This keeps only the rows that have at least one non-null value. Note that it filters rows, not columns; pass axis=1 as well to drop the all-null columns instead.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
Answered by MJP
null_series = df.isnull().sum() # The number of missing values from each column in your dataframe
full_col_series = null_series[null_series == 0] # Will keep only the columns with no missing values
df = df[full_col_series.index]
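Applied to the example frame from the question, this approach keeps only the fully populated column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [23, 54, np.nan, 87],
    "col2": [45, 39, 45, 32],
    "col3": [np.nan, np.nan, 76, np.nan],
})

null_series = df.isnull().sum()                  # missing-value count per column
full_col_series = null_series[null_series == 0]  # columns with zero missing values
df = df[full_col_series.index]
print(list(df.columns))  # ['col2']
```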
Answered by Suhas_Pote
Here is a simple function which you can use directly by passing a dataframe and a threshold:
df
'''
pets location owner id
0 cat San_Diego Champ 123.0
1 dog NaN Ron NaN
2 cat NaN Brick NaN
3 monkey NaN Champ NaN
4 monkey NaN Veronica NaN
5 dog NaN John NaN
'''
def rmissingvaluecol(dff, threshold):
    # Percentage of missing values in each column
    pct_missing = 100 * dff.isnull().sum() / len(dff.index)
    # Keep only the columns whose missing percentage is below the threshold
    l = list(dff.columns[pct_missing < threshold])
    print("# Columns having more than %s percent missing values:" % threshold, (dff.shape[1] - len(l)))
    print("Columns:\n", list(set(dff.columns) - set(l)))
    return l
rmissingvaluecol(df,1) #Here threshold is 1% which means we are going to drop columns having more than 1% of missing values
#output
'''
# Columns having more than 1 percent missing values: 2
Columns:
['id', 'location']
'''
Now create a new dataframe excluding these columns:
l = rmissingvaluecol(df,1)
df1 = df[l]
PS: You can change the threshold as per your requirement.
Bonus step
You can find the percentage of missing values for each column (optional)
def missing(dff):
print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))
missing(df)
#output
'''
id 83.33
location 83.33
owner 0.00
pets 0.00
dtype: float64
'''
Answered by Tanweer Mahdi Hasan
This worked quite well for me and is probably tailored to your needs as well!
def nan_weed(df, thresh):
    # Keep the columns that have at most `thresh` NaN values
    ind = []
    for col in df.columns:
        if df[col].isnull().sum() <= thresh:
            ind.append(col)
    return df[ind]