Python: How to delete columns in a pandas DataFrame based on a condition?
Disclaimer: this page is an English rendering of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/31614804/
How to delete a column in pandas dataframe based on a condition?
Asked by Fedorenko Kristina
I have a pandas DataFrame with many NaN values in it.
How can I drop the columns for which number_of_na_values > 2000?
I tried to do it like this:
toRemove = set()
naNumbersPerColumn = df.isnull().sum()  # NaN count for each column
for i in naNumbersPerColumn.index:
    if naNumbersPerColumn[i] > 2000:
        toRemove.add(i)
for i in toRemove:
    df.drop(i, axis=1, inplace=True)
Is there a more elegant way to do it?
Accepted answer by n8yoder
Here's another alternative that keeps only the columns containing at most the specified number of NaNs:
max_number_of_nas = 3000
df = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nas)]
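If the number of rows is known, the same filter can also be expressed with the built-in dropna, which keeps the columns that have at least thresh non-missing values. A minimal sketch of an equivalent call, assuming the df and max_number_of_nas defined above:
# a column survives dropna exactly when its NaN count is <= max_number_of_nas
df = df.dropna(axis=1, thresh=len(df) - max_number_of_nas)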
In my tests this seems to be slightly faster than the drop-columns method suggested by Jianxun Li, at least in the cases I tried:
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10000,5), columns=list('ABCDE'))
df[df < 0] = np.nan
max_number_of_nans = 5010
%timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
>> 1000 loops, best of 3: 1.76 ms per loop
%timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
>> 100 loops, best of 3: 2.04 ms per loop
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
df[df < 0] = np.nan
max_number_of_nans = 5
%timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
>> 1000 loops, best of 3: 662 μs per loop
%timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
>> 1000 loops, best of 3: 1.08 ms per loop
Answered by Jianxun Li
Same logic, but with everything in one line.
import pandas as pd
import numpy as np
# artificial data
# ====================================
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10,5), columns=list('ABCDE'))
df[df < 0] = np.nan
A B C D E
0 1.7641 0.4002 0.9787 2.2409 1.8676
1 NaN 0.9501 NaN NaN 0.4106
2 0.1440 1.4543 0.7610 0.1217 0.4439
3 0.3337 1.4941 NaN 0.3131 NaN
4 NaN 0.6536 0.8644 NaN 2.2698
5 NaN 0.0458 NaN 1.5328 1.4694
6 0.1549 0.3782 NaN NaN NaN
7 0.1563 1.2303 1.2024 NaN NaN
8 NaN NaN NaN 1.9508 NaN
9 NaN NaN 0.7775 NaN NaN
# processing: drop columns with no. of NaN > 3
# ====================================
df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > 3)], axis=1)
Out[183]:
B
0 0.4002
1 0.9501
2 1.4543
3 1.4941
4 0.6536
5 0.0458
6 0.3782
7 1.2303
8 NaN
9 NaN
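On newer pandas versions (0.21 or later, an assumption beyond what the answer shows), the same drop can be written without apply, using a boolean mask over the columns and the columns keyword; a sketch of an equivalent call:
# True for every column whose NaN count exceeds 3
mask = df.isnull().sum() > 3
df = df.drop(columns=df.columns[mask])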