Which columns are binary in a Pandas DataFrame?

Disclaimer: This page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/32982034/


Which columns are binary in a Pandas DataFrame?

Tags: python, numpy, pandas

Asked by na899

I have a pandas dataframe with a large number of columns and I need to find which columns are binary (with values 0 or 1 only) without looking at the data. Which function should be used?


Answered by Alexander

To my knowledge, there is no direct function to test for this. Rather, you need to build something based on how the data was encoded (e.g. 1/0, T/F, True/False, etc.). In addition, if your column has a missing value, the entire column will be encoded as a float instead of an int.

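For instance, here is a minimal sketch (not from the original answer) of the float promotion that a single missing value causes:

import pandas as pd

# An integer column is promoted to float64 as soon as it contains a missing value,
# so its 0/1 values become 0.0/1.0.
pd.Series([1, 0, 1]).dtype     # dtype('int64')
pd.Series([1, 0, None]).dtype  # dtype('float64')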

In the example below, I test whether all unique non-null values are either 0 or 1. It returns a list of all such columns.


import numpy as np
import pandas as pd

df = pd.DataFrame({'bool': [1, 0, 1, None], 
                   'floats': [1.2, 3.1, 4.4, 5.5], 
                   'ints': [1, 2, 3, 4], 
                   'str': ['a', 'b', 'c', 'd']})

bool_cols = [col for col in df 
             if df[[col]].dropna().isin([0, 1]).all().values]

# 2019-09-10 EDIT (per Hardik Gupta)
bool_cols = [col for col in df 
             if np.isin(df[col].dropna().unique(), [0, 1]).all()]

>>> bool_cols
['bool']

>>> df[bool_cols]
   bool
0     1
1     0
2     1
3   NaN
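
As a side note on the True/False encoding mentioned above: one possible way to also pick up columns that already have a boolean dtype is pandas' own dtype check. This is a sketch, not part of the original answer; the df2 and flag names are made up for illustration.

from pandas.api.types import is_bool_dtype

# Add a native True/False column to the example frame from above.
df2 = df.assign(flag=[True, False, True, False])

bool_dtype_cols = [col for col in df2 if is_bool_dtype(df2[col])]
# ['flag'] -- the 0/1 'bool' column is float64 (because of the NaN), so it is not caught here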

Answered by lucas

def is_binary(series, allow_na=False):
    """Return True if the series contains only the values 0 and 1."""
    if allow_na:
        series = series.dropna()  # work on a copy rather than mutating the caller's data
    return sorted(series.unique()) == [0, 1]

This is the most efficient solution I found. It is quicker than the answers above. When handling large data sets, the difference in timing becomes relevant.

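For example, applied per column to the df from the first answer (a usage sketch, not part of the original post):

bool_cols = [col for col in df if is_binary(df[col], allow_na=True)]
# ['bool']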

Answered by Aiden

To expand on the answer just above, using value_counts().index instead of unique() should do the trick:


bool_cols = [col for col in df if 
               df[col].dropna().value_counts().index.isin([0,1]).all()]

Answered by sedeh

Improving on @Aiden's answer so that a column that is entirely empty is not reported as binary:


bool_cols = [col for col in df
             if len(df[col].value_counts()) > 0
             and df[col].value_counts().index.isin([0, 1]).all()]

Answered by Hardik Gupta

Using Alexander's approach, with Python 3.6.6:


[col for col in df if np.isin(df[col].unique(), [0, 1]).all()]