pandas: quickly drop dataframe columns with only one distinct value
Notice: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original: http://stackoverflow.com/questions/33144813/
quickly drop dataframe columns with only one distinct value
Asked by Alexis Eggermont
Is there a faster way to drop columns that only contain one distinct value than the code below?
cols=df.columns.tolist()
for col in cols:
    if len(set(df[col].tolist()))<2:
        df=df.drop(col, axis=1)
This is really quite slow for large dataframes. Logically, this counts every distinct value in each column, when in fact it could stop counting as soon as it reaches 2 different values.
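The early-stopping idea described above can be sketched in plain Python as follows. This is only an illustration, not a benchmarked solution: `has_multiple_values` is a hypothetical helper name, and the sketch assumes the column values are hashable and contain no NaN (iterating a float column can yield NaN objects that a set treats as distinct).

```python
import pandas as pd


def has_multiple_values(series):
    """Return True as soon as a second distinct value is seen (hypothetical helper)."""
    seen = set()
    for value in series:
        seen.add(value)
        if len(seen) > 1:
            # Stop scanning: the column is known not to be constant.
            return True
    return False


df = pd.DataFrame({"a": [1, 1, 1], "b": [1, 2, 1], "c": ["x", "x", "y"]})

# Keep only the columns that have at least two distinct values.
keep = [col for col in df.columns if has_multiple_values(df[col])]
result = df[keep]
print(list(result.columns))  # ['b', 'c']
```

Whether a Python-level loop like this beats the vectorized `unique()` call depends on how early the second value appears in each column, so it is worth timing on real data.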
Answered by Anand S Kumar
You can use the Series.unique() method to find all the unique elements in a column, and drop every column whose .unique() returns only 1 element. Example -
for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)
A method that does not do inplace dropping -
res = df
for col in df.columns:
    if len(df[col].unique()) == 1:
        res = res.drop(col,axis=1)
Demo -
In [154]: df = pd.DataFrame([[1,2,3],[1,3,3],[1,2,3]])
In [155]: for col in df.columns:
   .....:     if len(df[col].unique()) == 1:
   .....:         df.drop(col,inplace=True,axis=1)
   .....:
In [156]: df
Out[156]:
   1
0  2
1  3
2  2
Timing results -
In [166]: %paste
def func1(df):
    res = df
    for col in df.columns:
        if len(df[col].unique()) == 1:
            res = res.drop(col,axis=1)
    return res
## -- End pasted text --
In [172]: df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})
In [178]: %timeit func1(df)
1000 loops, best of 3: 1.05 ms per loop
In [180]: %timeit df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
100 loops, best of 3: 8.81 ms per loop
In [181]: %timeit df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
100 loops, best of 3: 5.81 ms per loop
The fastest method still seems to be the one that uses unique and loops through the columns.
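To repeat a comparison like this outside IPython, something along these lines could be used. It is a rough sketch with the standard `timeit` module; the function names `drop_with_unique` and `drop_with_nunique` are made up for illustration, and the timings will differ by machine, pandas version, and dataframe shape.

```python
import timeit

import numpy as np
import pandas as pd


def drop_with_unique(df):
    # Loop over the columns and keep those with more than one unique value.
    keep = [c for c in df.columns if len(df[c].unique()) > 1]
    return df[keep]


def drop_with_nunique(df):
    # Vectorized variant; dropna=False counts NaN as a value, like unique() does.
    return df.loc[:, df.nunique(dropna=False) > 1]


df = pd.DataFrame({'a': 1, 'b': np.arange(5), 'c': [0, 0, 2, 2, 2]})

for func in (drop_with_unique, drop_with_nunique):
    runs = 100
    seconds = timeit.timeit(lambda: func(df), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e3:.3f} ms per call")
```

A five-row frame mostly measures call overhead, so it is worth re-running with a dataframe of realistic size before drawing conclusions.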
Answered by kait
One step:
df = df[[c for c
         in list(df)
         if len(df[c].unique()) > 1]]
Two steps:
Create a list of column names that have more than 1 distinct value.
keep = [c for c
        in list(df)
        if len(df[c].unique()) > 1]
Drop the columns that are not in 'keep'
df = df[keep]
Note: this step can also be done using a list of columns to drop:
drop_cols = [c for c
             in list(df)
             if df[c].nunique() <= 1]
df = df.drop(columns=drop_cols)
Answered by jz0410
df.loc[:,df.apply(pd.Series.nunique) != 1]
For example
In:
df = pd.DataFrame({'A': [10, 20, np.nan, 30], 'B': [10, np.nan, 10, 10]})
df.loc[:,df.apply(pd.Series.nunique) != 1]
Out:
A
0 10
1 20
2 NaN
3 30
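In the example above, column B is dropped even though it contains both 10 and NaN, because `nunique` ignores NaN by default. The sketch below reuses the same data to show how the `dropna` parameter of `DataFrame.nunique` changes the result; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, np.nan, 30], 'B': [10, np.nan, 10, 10]})

# Default: NaN is not counted, so B has 1 distinct value and is dropped.
ignoring_nan = df.loc[:, df.nunique() != 1]

# dropna=False: NaN counts as a value, so B has 2 distinct values and is kept.
counting_nan = df.loc[:, df.nunique(dropna=False) != 1]

print(list(ignoring_nan.columns))  # ['A']
print(list(counting_nan.columns))  # ['A', 'B']
```

Which behaviour is wanted depends on whether a missing value should count as "different" for the data at hand.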
Answered by EdChum
You can create a mask of your df by calling apply with value_counts. For a column with a single distinct value this produces NaN for all rows except one. You can then call dropna column-wise and pass the param thresh=2, so that a column must have 2 or more non-NaN values to be kept:
In [329]:
df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})
df
Out[329]:
a b c
0 1 0 0
1 1 1 0
2 1 2 2
3 1 3 2
4 1 4 2
In [342]:
df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
Out[342]:
b c
0 0 0
1 1 0
2 2 2
3 3 2
4 4 2
Output from the boolean conditions:
In [344]:
df.apply(pd.Series.value_counts)
Out[344]:
a b c
0 NaN 1 2
1 5 1 NaN
2 NaN 1 3
3 NaN 1 NaN
4 NaN 1 NaN
In [345]:
df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
Out[345]:
b c
0 1 2
1 1 NaN
2 1 3
3 1 NaN
4 1 NaN
Answered by shantanuo
None of the solutions worked in my use case because I got this error (my dataframe contains list items):
TypeError: unhashable type: 'list'
The solution that worked for me is this:
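ndf = df.describe(include="all").T
# Use ["unique"] rather than attribute access to select the describe() column explicitly
new_cols = set(df.columns) - set(ndf[ndf["unique"] == 1].index)
# Filter in the original column order (a bare set does not preserve it)
df = df[[c for c in df.columns if c in new_cols]]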
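Another possible way around the unhashable-list error is to compare a hashable stand-in for each cell instead of the cell itself. The sketch below is an alternative to the `describe` approach, not that answer's method: it maps every value through `repr()` before counting. `is_constant` is a hypothetical helper, and it assumes that equal cells always have the same `repr` (which may not hold for mixed types such as 1 and 1.0 in an object column).

```python
import pandas as pd

df = pd.DataFrame({
    "tags": [["a"], ["a"], ["a"]],   # constant column of lists
    "items": [["a"], ["b"], ["a"]],  # varying column of lists
    "n": [1, 1, 1],                  # constant scalar column
})


def is_constant(series):
    # repr() turns each cell (including lists) into a hashable string.
    return series.map(repr).nunique(dropna=False) <= 1


keep = [c for c in df.columns if not is_constant(df[c])]
result = df[keep]
print(list(result.columns))  # ['items']
```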
Answered by amalik2205
The most 'pythonic' way of doing it that I could find:
df = df.loc[:, (df != df.iloc[0]).any()]
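One thing to keep in mind with this comparison-based one-liner is that NaN never compares equal to itself, so a column containing NaN is treated as non-constant. The sketch below illustrates this with made-up column names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "const": [1, 1, 1],
    "varies": [1, 2, 3],
    "all_nan": [np.nan, np.nan, np.nan],
})

# Compare every row against the first row; keep columns where any cell differs.
result = df.loc[:, (df != df.iloc[0]).any()]

# "all_nan" survives because NaN != NaN evaluates to True.
print(list(result.columns))  # ['varies', 'all_nan']
```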
Answered by vasili111
Many of the examples in this thread and in the related thread (both linked in the code comments below) did not work for my df. These worked:
# from: https://stackoverflow.com/questions/33144813/quickly-drop-dataframe-columns-with-only-one-distinct-value
# from: https://stackoverflow.com/questions/20209600/pandas-dataframe-remove-constant-column
import pandas as pd
import numpy as np
data = {'var1': [1,2,3,4,5,np.nan,7,8,9],
        'var2':['Order',np.nan,'Inv','Order','Order','Shp','Order', 'Order','Inv'],
        'var3':[101,101,101,102,102,102,103,103,np.nan],
        'var4':[np.nan,1,1,1,1,1,1,1,1],
        'var5':[1,1,1,1,1,1,1,1,1],
        'var6':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
        'var7':["a","a","a","a","a","a","a","a","a"],
        'var8': [1,2,3,4,5,6,7,8,9]}
df = pd.DataFrame(data)
df_original = df.copy()
#-------------------------------------------------------------------------------------------------
df2 = df[[c for c
          in list(df)
          if len(df[c].unique()) > 1]]
#-------------------------------------------------------------------------------------------------
keep = [c for c
        in list(df)
        if len(df[c].unique()) > 1]
df3 = df[keep]
#-------------------------------------------------------------------------------------------------
keep_columns = [col for col in df.columns if len(df[col].unique()) > 1]
df5 = df[keep_columns].copy()
#-------------------------------------------------------------------------------------------------
for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)
Answered by Ben JW
Another one-liner (inspired by jz0410's answer):
df.loc[:,df.nunique()!=1]
or inplace (via drop()):
df.drop(columns=df.columns[df.nunique()==1], inplace=True)
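A detail worth checking with the `!= 1` test is the all-NaN column: `nunique` reports 0 for it, so it is kept. If such columns should go as well, comparing with `> 1` is one option. The sketch below uses made-up column names to show the difference.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "const": [7, 7, 7],
    "empty": [np.nan, np.nan, np.nan],
    "varies": [1, 2, 3],
})

counts = df.nunique()  # const -> 1, empty -> 0, varies -> 3

kept_ne = df.loc[:, counts != 1]  # drops "const", keeps "empty" (0 != 1)
kept_gt = df.loc[:, counts > 1]   # drops both "const" and "empty"

print(list(kept_ne.columns))  # ['empty', 'varies']
print(list(kept_gt.columns))  # ['varies']
```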

