Filter out rows with more than a certain number of NaN
Original question: http://stackoverflow.com/questions/23203638/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must attribute it to the original authors on StackOverflow.
Asked by AMM
In a Pandas dataframe, I would like to filter out all the rows that have more than 2 NaNs.
Essentially, I have 4 columns and I would like to keep only those rows where at least 2 columns have finite values.
Can somebody advise on how to achieve this?
Answered by EdChum
The following should work:
df.dropna(thresh=2)
See the online docs
What we are doing here is dropping any row that does not have at least 2 non-NaN values; rows with 2 or more non-NaN values are kept.
Example:
In [25]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':[1,2,np.nan,4,5], 'b':[np.nan,2,np.nan,4,5], 'c':[1,2,np.nan,np.nan,np.nan], 'd':[1,2,3,np.nan,5]})
df
Out[25]:
    a   b   c   d
0   1 NaN   1   1
1   2   2   2   2
2 NaN NaN NaN   3
3   4   4 NaN NaN
4   5   5 NaN   5
[5 rows x 4 columns]
In [26]:
df.dropna(thresh=2)
Out[26]:
   a   b   c   d
0  1 NaN   1   1
1  2   2   2   2
3  4   4 NaN NaN
4  5   5 NaN   5
[4 rows x 4 columns]
EDIT
This works for the above example, but note that you have to know the number of columns and set the thresh value appropriately. I originally thought it meant the number of NaN values, but it actually means the number of non-NaN values.
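To make the meaning of thresh concrete, here is a minimal sketch that reuses the example frame above (with np.nan standing in for NaN). It counts the non-NaN values per row and checks that keeping rows with a count of at least 2 gives the same result as dropna(thresh=2). The names non_nan_per_row and kept are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, np.nan, 4, 5],
    'b': [np.nan, 2, np.nan, 4, 5],
    'c': [1, 2, np.nan, np.nan, np.nan],
    'd': [1, 2, 3, np.nan, 5],
})

# Number of non-NaN values in each row
non_nan_per_row = df.notna().sum(axis=1)
print(non_nan_per_row.tolist())  # [3, 4, 1, 2, 3]

# Keep rows with at least 2 non-NaN values, using an explicit boolean mask
kept = df[non_nan_per_row >= 2]
print(kept.index.tolist())  # [0, 1, 3, 4]

# The mask selects the same rows as dropna(thresh=2)
print(kept.equals(df.dropna(thresh=2)))  # True
```

The explicit mask is more verbose than dropna, but it may be easier to read when the condition is phrased in terms of NaN counts rather than non-NaN counts.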
Answered by jpp
You have phrased 2 slightly different questions here. In the general case, they have different answers.
I would like to keep only those rows where at least 2 columns have finite values.
df = df.dropna(thresh=2)
This keeps rows with 2 or more non-null values.
I would like to filter out all the rows that have more than 2 NaNs.
df = df.dropna(thresh=df.shape[1]-2)
This filters out rows with more than 2 null values, i.e. it keeps rows with at most 2 nulls.
In your example dataframe of 4 columns, these operations are equivalent, since df.shape[1] - 2 == 2. However, you will notice discrepancies with dataframes which do not have exactly 4 columns.
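As a rough illustration of that discrepancy, the sketch below uses a hypothetical 5-column frame (not from the question) in which row i contains i NaNs. The variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical 5-column frame; row i contains i NaN values
wide = pd.DataFrame(
    [
        [1, 2, 3, 4, 5],
        [1, 2, 3, 4, np.nan],
        [1, 2, 3, np.nan, np.nan],
        [1, 2, np.nan, np.nan, np.nan],
        [1, np.nan, np.nan, np.nan, np.nan],
    ],
    columns=list('abcde'),
)

# Keep rows with at least 2 non-NaN values
at_least_two_values = wide.dropna(thresh=2)

# Keep rows with at most 2 NaN values (at least 5 - 2 = 3 non-NaN values)
at_most_two_nans = wide.dropna(thresh=wide.shape[1] - 2)

print(at_least_two_values.index.tolist())  # [0, 1, 2, 3]
print(at_most_two_nans.index.tolist())     # [0, 1, 2]
```

Row 3 has 2 values and 3 NaNs, so it satisfies the first condition but not the second.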
Note that dropna also has a subset argument, should you wish to include only specified columns when applying a threshold. For example:
df = df.dropna(subset=['col1', 'col2', 'col3'], thresh=2)
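A small sketch of how subset changes the count, assuming a made-up frame whose column names col1, col2, col3 match the placeholders above and which has one extra column named other (hypothetical). Only the columns listed in subset count toward the threshold.

```python
import numpy as np
import pandas as pd

# Made-up data; 'other' is never NaN but is excluded from the subset
data = pd.DataFrame({
    'col1': [1, np.nan, np.nan],
    'col2': [2, 2, np.nan],
    'col3': [np.nan, np.nan, np.nan],
    'other': [9, 9, 9],
})

# Threshold applied to all four columns
all_columns = data.dropna(thresh=2)
print(all_columns.index.tolist())  # [0, 1]

# Threshold applied to col1..col3 only
subset_only = data.dropna(subset=['col1', 'col2', 'col3'], thresh=2)
print(subset_only.index.tolist())  # [0]
```

Row 1 has 2 non-NaN values overall, but only 1 within the subset, so it is dropped in the second call. The returned frame still contains all four columns.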
Answered by Grant Shannon
I had a slightly different problem, i.e. to filter out columns with more than a certain number of NaNs:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,np.nan,4,5], 'b':[np.nan,2,np.nan,4,5], 'c':[1,2,np.nan,np.nan,np.nan], 'd':[1,2,3,np.nan,5]})
df
    a   b   c   d
0   1.0 NaN 1.0 1.0
1   2.0 2.0 2.0 2.0
2   NaN NaN NaN 3.0
3   4.0 4.0 NaN NaN
4   5.0 5.0 NaN 5.0
Assume you want to filter out columns with 3 or more NaNs:
num_rows = df.shape[0]
drop_cols_with_this_amount_of_nans_or_more = 3
keep_cols_with_at_least_this_number_of_non_nans = num_rows - drop_cols_with_this_amount_of_nans_or_more + 1
df.dropna(axis=1,thresh=keep_cols_with_at_least_this_number_of_non_nans)
Output (column c has been dropped as expected):
    a   b   d
0   1.0 NaN 1.0
1   2.0 2.0 2.0
2   NaN NaN 3.0
3   4.0 4.0 NaN
4   5.0 5.0 5.0
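For comparison, the same column filter can be sketched with an explicit NaN count per column instead of converting to a thresh value. This is an alternative formulation, not part of the answer above; nans_per_column and filtered are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, np.nan, 4, 5],
    'b': [np.nan, 2, np.nan, 4, 5],
    'c': [1, 2, np.nan, np.nan, np.nan],
    'd': [1, 2, 3, np.nan, 5],
})

# Number of NaN values in each column, in column order a, b, c, d
nans_per_column = df.isna().sum()
print(nans_per_column.tolist())  # [1, 2, 3, 1]

# Keep only columns with fewer than 3 NaNs
filtered = df.loc[:, nans_per_column < 3]
print(filtered.columns.tolist())  # ['a', 'b', 'd']

# Same result as the thresh-based call: thresh = 5 - 3 + 1 = 3
print(filtered.equals(df.dropna(axis=1, thresh=df.shape[0] - 3 + 1)))  # True
```

The mask states the condition directly in terms of NaN counts, which avoids the off-by-one arithmetic needed to derive thresh.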

