Pandas: Find rows where a particular column is not NA but all other columns are
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/50397644/
Asked by Tom Cooper
I have a DataFrame which contains a lot of NA values. I want to write a query which returns rows where a particular column is not NA but all other columns are NA.
I can easily get a DataFrame containing only the rows where that column is not NA:
df[df.interesting_column.notna()]
However, I can't figure out how to then say "from that DataFrame, return only the rows where every column other than 'interesting_column' is NA". I can't use .dropna, as all rows and columns will contain at least one NA value.
I realise this is probably embarrassingly simple. I have tried lots of .loc variations and joins/merges in various configurations, and I am not getting anywhere.
Any pointers before I just do a for loop over this thing would be appreciated.
Accepted answer by Ami Tavory
You can simply use a conjunction of the conditions:
df[df.interesting_column.notna() & (df.isnull().sum(axis=1) == len(df.columns) - 1)]
df.interesting_column.notna() checks that the column is non-null.
df.isnull().sum(axis=1) == len(df.columns) - 1 checks that the number of nulls in the row is the number of columns minus 1.
Both conditions together mean that the entry in the column is the only one that is non-null.
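To see the two conditions at work, here is a minimal sketch on a made-up DataFrame. The column names other_1 and other_2 and the sample values are assumptions for illustration only; they are not part of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: only row 1 has a value in
# interesting_column and NA in every other column.
df = pd.DataFrame({
    "interesting_column": [4.0, 5.0, np.nan],
    "other_1": [1.0, np.nan, 2.0],
    "other_2": [1.0, np.nan, 3.0],
})

# Condition 1: the interesting column holds a value.
has_value = df.interesting_column.notna()

# Condition 2: the row contains exactly (number of columns - 1) nulls.
nulls_per_row = df.isnull().sum(axis=1)
all_others_null = nulls_per_row == len(df.columns) - 1

# Both conditions together leave only the rows we are after.
result = df[has_value & all_others_null]
print(result)
```

Splitting the expression into named masks is only a readability choice here; it should select the same rows as the one-liner above.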
Answer by Tim Johns
The & operator lets you "and" together two boolean columns row by row. Right now, you are using df.interesting_column.notna() to give you a column of True or False values. You could repeat this for all columns, using notna() or isna() as desired, and use the & operator to combine the results.
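As a small illustration of how the boolean columns combine, here is a sketch on a hypothetical two-column frame; the column names x and y and the values are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame, used only to show how the masks combine.
df = pd.DataFrame({"x": [1.0, 2.0, np.nan], "y": [np.nan, 3.0, np.nan]})

x_present = df.x.notna()  # boolean Series: True where x has a value
y_missing = df.y.isna()   # boolean Series: True where y is NA

# & works element-wise, so a row is True only if both masks are True.
both = x_present & y_missing
print(both)
```

Note that & binds more tightly than comparison operators such as ==, so a comparison used as one of the operands needs its own parentheses, as in the accepted answer.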
For example, if you have columns a, b, and c, and you want to find rows where the value in column a is not NaN and the values in the other columns are NaN, then do the following:
df[df.a.notna() & df.b.isna() & df.c.isna()]
This is clear and simple when you have a small number of columns that you know about ahead of time. But if you have many columns, or if you don't know the column names, you would want a solution that loops over all columns and checks notna() for the interesting_column and isna() for the other columns. The solution by @AmiTavory is a clever way to achieve this. But, if you didn't know about that solution, here is a simpler approach.
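for colName in df.columns:
    if colName == "interesting_column":
        # keep only rows where the interesting column has a value
        df = df[df[colName].notna()]
    else:
        # keep only rows where this other column is NA
        df = df[df[colName].isna()]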
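The loop above reassigns df at every step, so the original DataFrame is overwritten by the filtered one. If the original needs to stay available, one possible variant is to accumulate a boolean mask instead. The following is a sketch under that assumption; the helper name only_column_present and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def only_column_present(frame, column):
    """Hypothetical helper: mask of rows where `column` is the only non-NA entry."""
    # Start with every row selected, then narrow the mask column by column.
    mask = pd.Series(True, index=frame.index)
    for col_name in frame.columns:
        if col_name == column:
            mask &= frame[col_name].notna()
        else:
            mask &= frame[col_name].isna()
    return mask

# Made-up sample data for illustration.
df = pd.DataFrame({
    "interesting_column": [4.0, 5.0, np.nan],
    "other_1": [1.0, np.nan, 2.0],
    "other_2": [1.0, np.nan, 3.0],
})

mask = only_column_present(df, "interesting_column")
filtered = df[mask]  # df itself is left unchanged
print(filtered)
```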
Answer by llllllllll
You can use:
rows = df.drop('interesting_column', axis=1).isna().all(1) & df['interesting_column'].notna()
Example (suppose c is the interesting column):
In [99]: df = pd.DataFrame({'a': [1, np.nan, 2], 'b': [1, np.nan, 3], 'c':[4, 5, np.nan]})
In [100]: df
Out[100]:
     a    b    c
0  1.0  1.0  4.0
1  NaN  NaN  5.0
2  2.0  3.0  NaN
In [101]: rows = df.drop('c', axis=1).isna().all(1) & df.c.notna()
In [102]: rows
Out[102]:
0    False
1     True
2    False
dtype: bool
In [103]: df[rows]
Out[103]:
a b c
1 NaN NaN 5.0
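As a cross-check, the following sketch builds the masks from the three answers on the same example DataFrame (with c as the interesting column) and compares them. The variable names by_count, by_columns and by_drop are made up for this comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 2], 'b': [1, np.nan, 3], 'c': [4, 5, np.nan]})

# Count-based condition from the accepted answer.
by_count = df.c.notna() & (df.isnull().sum(axis=1) == len(df.columns) - 1)

# Explicit per-column conditions combined with &.
by_columns = df.c.notna() & df.a.isna() & df.b.isna()

# Drop the interesting column, then require every remaining value to be NA.
by_drop = df.drop('c', axis=1).isna().all(axis=1) & df.c.notna()

# Compare the three masks row by row.
same = by_count.tolist() == by_columns.tolist() == by_drop.tolist()
print(same)
```

On this data the three approaches should agree; which one to use is mostly a matter of readability and of whether the column names are known in advance.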