pandas: quickly drop dataframe columns with only one distinct value

Question

Asked by Alexis Eggermont

Is there a faster way to drop columns that only contain one distinct value than the code below?

cols=df.columns.tolist()
for col in cols:
    if len(set(df[col].tolist()))<2:
        df=df.drop(col, axis=1)

This is really quite slow for large dataframes. Logically, this counts the number of values in each column when in fact it could just stop counting after reaching 2 different values.

Answer 1

Answered by Anand S Kumar

You can use the Series.unique() method to find all the unique elements in a column, and drop each column for which .unique() returns only 1 element. Example -

for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)

A method that does not drop in place -

res = df
for col in df.columns:
    if len(df[col].unique()) == 1:
        res = res.drop(col,axis=1)
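
The non-inplace loop above calls drop once per constant column. A minimal sketch of the same idea, collecting the column names first and dropping them in a single call, is shown below; the sample frame and the name constant_cols are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"a": 1, "b": list(range(5)), "c": [0, 0, 2, 2, 2]})

# Collect the names of the columns whose values are all identical ...
constant_cols = [col for col in df.columns if len(df[col].unique()) == 1]

# ... then drop them in one call; df itself is left unchanged.
res = df.drop(columns=constant_cols)
```

Here res holds only the non-constant columns, while df still has all of its original columns.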

Demo -

In [154]: df = pd.DataFrame([[1,2,3],[1,3,3],[1,2,3]])

In [155]: for col in df.columns:
   .....:     if len(df[col].unique()) == 1:
   .....:         df.drop(col,inplace=True,axis=1)
   .....:

In [156]: df
Out[156]:
   1
0  2
1  3
2  2

Timing results -

In [166]: %paste
def func1(df):
        res = df
        for col in df.columns:
                if len(df[col].unique()) == 1:
                        res = res.drop(col,axis=1)
        return res

## -- End pasted text --

In [172]: df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})

In [178]: %timeit func1(df)
1000 loops, best of 3: 1.05 ms per loop

In [180]: %timeit df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
100 loops, best of 3: 8.81 ms per loop

In [181]: %timeit df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
100 loops, best of 3: 5.81 ms per loop

The fastest method still seems to be the one that uses unique and loops through the columns.
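
The question notes that counting could stop as soon as a second distinct value is seen. One way to approximate that with the unique-based loop is to check a small sample of rows first and run the full scan only on columns that look constant in the sample. The sketch below assumes this sampling idea; the function name and the sample_rows parameter are hypothetical, and it has not been benchmarked here.

```python
import pandas as pd


def drop_constant_columns_sampled(df, sample_rows=1000):
    """Hypothetical helper: drop columns with fewer than 2 distinct values."""
    # Cheap pre-check: a column that already varies within the first
    # `sample_rows` rows cannot be constant, so it needs no full scan.
    head = df.head(sample_rows)
    candidates = [c for c in df.columns if len(head[c].unique()) < 2]

    # Full scan only for the columns that looked constant in the sample.
    constant = [c for c in candidates if len(df[c].unique()) < 2]
    return df.drop(columns=constant)
```

Whether this helps depends on the data: it saves work only when most non-constant columns show a second value within the sampled rows.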

Answer 2

Answered by kait

One step:

df = df[[c for c
        in list(df)
        if len(df[c].unique()) > 1]]

Two steps:

Create a list of column names that have more than 1 distinct value.

keep = [c for c
        in list(df)
        if len(df[c].unique()) > 1]

Drop the columns that are not in 'keep'

df = df[keep]

Note: this step can also be done using a list of columns to drop:

drop_cols = [c for c
             in list(df)
             if df[c].nunique() <= 1]
df = df.drop(columns=drop_cols)

Answer 3

Answered by jz0410

df.loc[:,df.apply(pd.Series.nunique) != 1]

For example

In:
df = pd.DataFrame({'A': [10, 20, np.nan, 30], 'B': [10, np.nan, 10, 10]})
df.loc[:,df.apply(pd.Series.nunique) != 1]

Out:
   A
0  10
1  20
2  NaN
3  30
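
In the example above, column B is dropped even though it contains a NaN, because nunique ignores missing values by default. If NaN should count as a distinct value, the dropna=False argument of nunique can be passed instead. A small sketch on the same data (the result variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 20, np.nan, 30], "B": [10, np.nan, 10, 10]})

# Default: NaN is ignored, so B has a single distinct value and is dropped.
default_result = df.loc[:, df.nunique() != 1]

# dropna=False: NaN counts as its own value, so B has two and is kept.
nan_aware_result = df.loc[:, df.nunique(dropna=False) != 1]
```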

Answer 4

Answered by EdChum

You can create a mask of your df by calling apply with value_counts. For a column with a single distinct value, this produces NaN in every row except one. You can then call dropna column-wise and pass thresh=2 so that a column must have 2 or more non-NaN values to be kept:

In [329]:   
df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})
df

Out[329]:
   a  b  c
0  1  0  0
1  1  1  0
2  1  2  2
3  1  3  2
4  1  4  2

In [342]:
df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]

Out[342]:
   b  c
0  0  0
1  1  0
2  2  2
3  3  2
4  4  2

Output of the intermediate steps:

In [344]:
df.apply(pd.Series.value_counts)

Out[344]:
    a  b   c
0 NaN  1   2
1   5  1 NaN
2 NaN  1   3
3 NaN  1 NaN
4 NaN  1 NaN

In [345]:
df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)

Out[345]:
   b   c
0  1   2
1  1 NaN
2  1   3
3  1 NaN
4  1 NaN

Answer 5

Answered by shantanuo

None of the solutions worked in my use case, because my dataframe contains list items and I got this error:

TypeError: unhashable type: 'list'

The solution that worked for me is this:

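ndf = df.describe(include="all").T
# Columns that describe() reports as having exactly one unique value
constant_cols = set(ndf[ndf["unique"] == 1].index)
# Filter with a list comprehension so the original column order is kept
df = df[[c for c in df.columns if c not in constant_cols]]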
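
Another possible workaround for unhashable cells, not taken from the answer above, is to compare string representations of the values. The sketch below assumes that two cells count as equal when their str() forms match; the sample frame and variable names are illustrative.

```python
import pandas as pd

# Example frame with columns of lists, which are not hashable.
df = pd.DataFrame({
    "tags": [["x"], ["x"], ["x"]],
    "ids": [[1], [2], [3]],
    "n": [7, 7, 7],
})

# Compare string representations instead of the raw objects.
# Assumption: two cells are "the same" when their str() forms match.
distinct_counts = df.astype(str).nunique()
df_reduced = df.loc[:, distinct_counts > 1]
```

Note that after the string conversion a missing value becomes an ordinary string, so it is counted like any other value.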

Answer 6

Answered by amalik2205

The most 'pythonic' way of doing it that I could find:

df = df.loc[:, (df != df.iloc[0]).any()]
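
The one-liner above compares every row with the first row and keeps a column if any of its cells differs from the cell in row 0. A short sketch of what happens step by step, on an illustrative frame that is not from the original answer, including the behavior for an all-NaN column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1], "b": [1, 2, 3], "c": [np.nan] * 3})

# Each row is compared with the first row; a column is kept if any of
# its cells differs from the cell in row 0.
differs = (df != df.iloc[0]).any()
kept = df.loc[:, differs]

# Caveat: NaN != NaN evaluates to True, so the all-NaN column "c" is kept.
```

If all-NaN columns should be dropped as well, a count-based check such as nunique may be a better fit.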

Answer 7

Answered by vasili111

Many examples in this thread and in this other thread did not work for my df. These worked:

# from: https://stackoverflow.com/questions/33144813/quickly-drop-dataframe-columns-with-only-one-distinct-value
# from: https://stackoverflow.com/questions/20209600/pandas-dataframe-remove-constant-column

import pandas as pd
import numpy as np


data = {'var1': [1,2,3,4,5,np.nan,7,8,9],
       'var2':['Order',np.nan,'Inv','Order','Order','Shp','Order', 'Order','Inv'],
       'var3':[101,101,101,102,102,102,103,103,np.nan], 
       'var4':[np.nan,1,1,1,1,1,1,1,1],
       'var5':[1,1,1,1,1,1,1,1,1],
       'var6':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
       'var7':["a","a","a","a","a","a","a","a","a"],
       'var8': [1,2,3,4,5,6,7,8,9]}


df = pd.DataFrame(data)
df_original = df.copy()



#-------------------------------------------------------------------------------------------------


df2 = df[[c for c
        in list(df)
        if len(df[c].unique()) > 1]]


#-------------------------------------------------------------------------------------------------


keep = [c for c
        in list(df)
        if len(df[c].unique()) > 1]

df3 = df[keep]



#-------------------------------------------------------------------------------------------------



keep_columns = [col for col in df.columns if len(df[col].unique()) > 1]

df5 = df[keep_columns].copy()



#-------------------------------------------------------------------------------------------------



for col in df.columns:
     if len(df[col].unique()) == 1:
         df.drop(col,inplace=True,axis=1)

Answer 8

Answered by Ben JW

Another one-liner (inspired by jz0410's answer):

df.loc[:,df.nunique()!=1]

or in place (via drop()):

df.drop(columns=df.columns[df.nunique()==1], inplace=True)
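
One detail of the comparison with 1: an all-NaN column has zero distinct values under the default nunique, so `!= 1` keeps it. A minimal sketch of a wrapper that uses `> 1` instead is shown below; the function name and the count_nan parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_single_valued(df, count_nan=False):
    """Hypothetical helper: keep only columns with more than one distinct value."""
    # With count_nan=False, NaN is ignored, so an all-NaN column has
    # zero distinct values; `> 1` drops it, whereas `!= 1` would keep it.
    return df.loc[:, df.nunique(dropna=not count_nan) > 1]


df = pd.DataFrame({"x": [1, 1, 1], "y": [1, 2, 3], "z": [np.nan] * 3})
result = drop_single_valued(df)
```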
