Original question: http://stackoverflow.com/questions/31799187/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Drop columns with low standard deviation in Pandas Dataframe
Asked by Ashkan
Is there any way of doing this without writing a for loop?
Suppose we have the following data:
import pandas as pd

d = {'A': {-1: 0.19052041339798062,
0: -0.0052531481871952871,
1: -0.0022017467720961644,
2: -0.051109629013311737,
3: 0.18569441222621336},
'B': {-1: 0.029181417300734112,
0: -0.0031021862533310743,
1: -0.014358516787430284,
2: 0.0046386615308068877,
3: 0.056676322314857898},
'C': {-1: 0.071883343375205785,
0: -0.011930096520251999,
1: -0.011836365865654104,
2: -0.0033930358388315237,
3: 0.11812543193496111},
'D': {-1: 0.17670604006475121,
0: -0.088756293654161142,
1: -0.093383245649534194,
2: 0.095649943383654359,
3: 0.51030339029516592},
'E': {-1: 0.30273513342295627,
0: -0.30640233455497284,
1: -0.32698263145105921,
2: 0.60257484810641992,
3: 0.36859978928328413},
'F': {-1: 0.25328469046380131,
0: -0.063890702001567143,
1: -0.10007720832198815,
2: 0.08153164759036724,
3: 0.36606175240021183},
'G': {-1: 0.28764606940509913,
0: -0.11022209861109525,
1: -0.1264164305949009,
2: 0.17030074112227081,
3: 0.30100292424380881}}
df = pd.DataFrame(d)
I know I can get the std values with std_vals = df.std(), which gives the following result, and then use these values to drop the columns one by one.
In[]:
pd.DataFrame(d).std()
Out[]:
A 0.115374
B 0.028435
C 0.059394
D 0.247617
E 0.421117
F 0.200776
G 0.209710
dtype: float64
However, I don't know how to use Pandas indexing to drop the columns with low std values directly.
Is there a way to do this, or do I need to loop over each column?
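One detail worth keeping in mind when choosing a threshold: DataFrame.std() computes the sample standard deviation (ddof=1) by default, whereas NumPy's np.std defaults to the population version (ddof=0). A minimal sketch on a made-up frame (the names toy, x and y are illustrative and do not come from the question):

```python
import numpy as np
import pandas as pd

# Made-up data, only for illustration.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 2.0, 2.0, 2.0]})

sample_std = toy.std()            # pandas default: ddof=1 (divide by n - 1)
population_std = toy.std(ddof=0)  # divide by n, matching np.std's default

print(sample_std)
print(population_std)
print(np.std(toy["x"].to_numpy()))  # same value as population_std["x"]
```

The two conventions differ noticeably on short columns such as the five-row example in the question, so a cutoff tuned against one may not carry over to the other.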
Answered by maxymoo
You can use the loc method of a dataframe to select certain columns based on a Boolean indexer. Create the indexer like this (this uses NumPy array broadcasting to apply the condition to each column):
df.std() > 0.3
Out[84]:
A False
B False
C False
D False
E True
F False
G False
dtype: bool
Then call loc with : in the first position to indicate that you want to return all rows:
df.loc[:, df.std() > .3]
Out[85]:
E
-1 0.302735
0 -0.306402
1 -0.326983
2 0.602575
3 0.368600
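The same selection can be wrapped in a small function so the standard deviations are computed once and the mask is named. This is only a sketch: keep_high_std_columns is a hypothetical helper, not part of pandas, and the toy frame is made up and assumed to contain numeric columns only.

```python
import pandas as pd

def keep_high_std_columns(frame, threshold):
    """Return only the columns whose sample std exceeds threshold (hypothetical helper)."""
    stds = frame.std()       # one std value per column
    mask = stds > threshold  # Boolean Series indexed by column name
    return frame.loc[:, mask]

# Made-up data, only for illustration.
toy = pd.DataFrame({
    "flat": [1.0, 1.0, 1.0, 1.0],
    "small": [1.0, 1.1, 0.9, 1.0],
    "wide": [0.0, 5.0, -5.0, 10.0],
})

result = keep_high_std_columns(toy, threshold=0.5)
print(list(result.columns))
```

Selecting with loc returns a new object and leaves the original frame untouched, so the result has to be assigned to a name if it is needed later.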
Answered by Jianxun Li
To drop columns, you need their names.
threshold = 0.2
df.drop(df.std()[df.std() < threshold].index.values, axis=1)
D E F G
-1 0.1767 0.3027 0.2533 0.2876
0 -0.0888 -0.3064 -0.0639 -0.1102
1 -0.0934 -0.3270 -0.1001 -0.1264
2 0.0956 0.6026 0.0815 0.1703
3 0.5103 0.3686 0.3661 0.3010
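A variant of the same idea, sketched under the assumption of an all-numeric frame: compute the standard deviations once, keep the labels of the low-std columns, and pass them to drop through its columns keyword instead of axis=1. The names toy, stds, low_std_cols and trimmed are illustrative.

```python
import pandas as pd

# Made-up data, only for illustration.
toy = pd.DataFrame({
    "flat": [1.0, 1.0, 1.0, 1.0],
    "small": [1.0, 1.1, 0.9, 1.0],
    "wide": [0.0, 5.0, -5.0, 10.0],
})

threshold = 0.5
stds = toy.std()                             # computed once and reused
low_std_cols = stds[stds < threshold].index  # labels of the columns to remove
trimmed = toy.drop(columns=low_std_cols)     # returns a new frame; toy is unchanged

print(list(trimmed.columns))
```

Dropping with < threshold and selecting with > threshold are not exact complements: a column whose std equals the threshold, or is NaN, is kept by the drop version but excluded by the selection version.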

