Pandas: Find rows where a particular column is not NA but all other columns are

Notice: this page is based on a popular StackOverFlow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverFlow original URL: http://stackoverflow.com/questions/50397644/

Date: 2020-09-14 05:34:23  Source: igfitidea

python pandas

Asked by Tom Cooper

I have a DataFrame which contains a lot of NA values. I want to write a query which returns rows where a particular column is not NA but all other columns are NA.

I can easily get a DataFrame containing only the rows where that column is not NA:

df[df.interesting_column.notna()]

However, I can't figure out how to then say "from that DataFrame, return only rows where every column other than 'interesting_column' is NA". I can't use .dropna because all rows and columns contain at least one NA value.

I realise this is probably embarrassingly simple. I have tried lots of .loc variations and joins/merges in various configurations, and I am not getting anywhere.

Any pointers before I just do a for loop over this thing would be appreciated.

Accepted answer by Ami Tavory

You can simply use a conjunction of the conditions:

df[df.interesting_column.notna() & (df.isnull().sum(axis=1) == len(df.columns) - 1)]

  • df.interesting_column.notna() checks that the column is non-null.

  • df.isnull().sum(axis=1) == len(df.columns) - 1 checks that the number of nulls in the row is the number of columns minus 1.

Both conditions together mean that the entry in the column is the only one that is non-null.
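
To see the two conditions at work, here is a minimal sketch on a made-up DataFrame. The sample values and the names null_counts, mask and result are assumptions for illustration only, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: only row 1 has a value in
# "interesting_column" and nothing anywhere else.
df = pd.DataFrame({
    "a": [1.0, np.nan, 2.0, np.nan],
    "b": [1.0, np.nan, 3.0, np.nan],
    "interesting_column": [4.0, 5.0, np.nan, np.nan],
})

# Number of null cells in each row.
null_counts = df.isnull().sum(axis=1)

# The row has exactly one non-null cell, and that cell is the interesting one.
mask = df.interesting_column.notna() & (null_counts == len(df.columns) - 1)
result = df[mask]

print(null_counts.tolist())
print(result)
```

Row 3 is entirely NA, so its null count is one too high and the notna() check fails as well; row 0 has no nulls at all. Only row 1 passes both checks.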

Answer by Tim Johns

The & operator lets you "and" together two boolean columns row by row. Right now, you are using df.interesting_column.notna() to give you a column of True or False values. You could repeat this for all columns, using notna() or isna() as desired, and use the & operator to combine the results.
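
As a small illustration of what & does with two boolean columns, here is a sketch using made-up Series (left, right and counts are hypothetical names). It also shows one thing to watch for: in Python, & binds more tightly than comparison operators such as ==, so a comparison used as one side of & needs its own parentheses.

```python
import pandas as pd

left = pd.Series([True, True, False])
right = pd.Series([True, False, False])

# Element-wise "and": True only where both Series are True at the same index label.
both = left & right

# & has higher precedence than ==, so the comparison is wrapped in parentheses.
# Without them, Python would parse this as (left & counts) == 2.
counts = pd.Series([2, 0, 2])
combined = left & (counts == 2)

print(both.tolist())
print(combined.tolist())
```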

For example, if you have columns a, b, and c, and you want to find rows where the value in column a is not NaN and the values in the other columns are NaN, then do the following:

df[df.a.notna() & df.b.isna() & df.c.isna()]

This is clear and simple when you have a small number of columns that you know about ahead of time. But if you have many columns, or if you don't know the column names, you would want a solution that loops over all columns and checks notna() for the interesting_column and isna() for the other columns. The solution by @AmiTavory is a clever way to achieve this. But if you didn't know about that solution, here is a simpler approach:

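# Filter into a new variable so the original df is left untouched.
result = df
for col_name in df.columns:
    if col_name == "interesting_column":
        # Keep rows where the interesting column has a value.
        result = result[result[col_name].notna()]
    else:
        # Keep rows where this other column is NA.
        result = result[result[col_name].isna()]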
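
An alternative to filtering inside the loop is to accumulate one boolean mask and index the DataFrame once at the end. The following is only a sketch under assumed sample data; the names mask and result are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "a": [1.0, np.nan, 2.0],
    "b": [1.0, np.nan, 3.0],
    "interesting_column": [4.0, 5.0, np.nan],
})

# Start with every row selected, then narrow the mask one column at a time.
mask = pd.Series(True, index=df.index)
for col_name in df.columns:
    if col_name == "interesting_column":
        mask &= df[col_name].notna()
    else:
        mask &= df[col_name].isna()

# Index once at the end; df itself is never modified.
result = df[mask]
print(result)
```

Keeping the mask separate also makes it easy to inspect which rows matched, or to reuse the same mask with .loc.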

Answer by llllllllll

You can use:

rows = df.drop('interesting_column', axis=1).isna().all(1) & df['interesting_column'].notna()

Example (suppose c is the interesting column):

In [99]: df = pd.DataFrame({'a': [1, np.nan, 2], 'b': [1, np.nan, 3], 'c':[4, 5, np.nan]})

In [100]: df
Out[100]: 
     a    b    c
0  1.0  1.0  4.0
1  NaN  NaN  5.0
2  2.0  3.0  NaN

In [101]: rows = df.drop('c', axis=1).isna().all(1) & df.c.notna()

In [102]: rows
Out[102]: 
0    False
1     True
2    False
dtype: bool

In [103]: df[rows]
Out[103]: 
    a   b    c
1 NaN NaN  5.0
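
If you need this in more than one place, the same expression could be wrapped in a small function. This is just a sketch: only_value_in is a hypothetical helper name, not a pandas function, and the cross-check at the end compares it against the null-count condition from the accepted answer on the same example data.

```python
import numpy as np
import pandas as pd


def only_value_in(df, column):
    """Hypothetical helper: rows where `column` is the only non-NA entry."""
    others_all_na = df.drop(columns=column).isna().all(axis=1)
    return df[others_all_na & df[column].notna()]


df = pd.DataFrame({'a': [1, np.nan, 2], 'b': [1, np.nan, 3], 'c': [4, 5, np.nan]})
result = only_value_in(df, 'c')

# Cross-check against the null-count formulation.
by_count = df[df['c'].notna() & (df.isnull().sum(axis=1) == len(df.columns) - 1)]

print(result)
print(result.equals(by_count))
```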