Retrieve indices of NaN values in a pandas dataframe
Source: http://stackoverflow.com/questions/33641231/
This content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by dooms
For each row that contains NaN values, I am trying to retrieve the indices of all the columns where those NaN values occur.
import numpy as np
import pandas as pd

d = [[11.4, 1.3, 2.0, np.nan], [11.4, 1.3, np.nan, np.nan], [11.4, 1.3, 2.8, 0.7], [np.nan, np.nan, 2.8, 0.7]]
df = pd.DataFrame(data=d, columns=['A', 'B', 'C', 'D'])
print(df)
      A    B    C    D
0  11.4  1.3  2.0  NaN
1  11.4  1.3  NaN  NaN
2  11.4  1.3  2.8  0.7
3   NaN  NaN  2.8  0.7
I've already done the following:
- added a column with the count of NaN values for each row
- retrieved the indices of the rows containing NaN values
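The two steps listed above might look like the following minimal sketch. The column name nan_count and the variable names are assumptions made for illustration; they do not appear in the original question.

```python
import numpy as np
import pandas as pd

d = [[11.4, 1.3, 2.0, np.nan],
     [11.4, 1.3, np.nan, np.nan],
     [11.4, 1.3, 2.8, 0.7],
     [np.nan, np.nan, 2.8, 0.7]]
df = pd.DataFrame(data=d, columns=['A', 'B', 'C', 'D'])

# Boolean mask that is True wherever a value is missing
mask = df.isnull()

# Step 1: number of NaN values in each row
# ("nan_count" is a hypothetical column name)
counted = df.assign(nan_count=mask.sum(axis=1))

# Step 2: index labels of the rows that contain at least one NaN
rows_with_nan = df.index[mask.any(axis=1)].tolist()

print(counted['nan_count'].tolist())
print(rows_with_nan)
```

Using assign returns a new frame, so the original df stays free of the helper column when the mask is computed again later.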
What I want is a list like this (ideally with the column names):
[ ['D'],['C','D'],['A','B'] ]
I hope to find a way that avoids testing every column of every row like this:
if pd.isnull(df.loc[i, column]):
I'm looking for a pandas-native approach that can cope with my huge dataset.
Thanks in advance.
Accepted answer by Andy Hayden
Another way is to extract the entries that are NaN:
In [11]: df_null = df.isnull().unstack()
In [12]: t = df_null[df_null]
In [13]: t
Out[13]:
A  3    True
B  3    True
C  1    True
D  0    True
   1    True
dtype: bool
This gets you most of the way and may be enough.
It may be easier, though, to work with a Series indexed by row:
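In [14]: t2 = df.isnull().stack()  # assumption: t2 is not defined above; a stacked (row, column) mask matches Out[15]
    ...: t2 = t2[t2]
    ...: s = pd.Series(t2.index.get_level_values(1), t2.index.get_level_values(0))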
In [15]: s
Out[15]:
0 D
1 C
1 D
3 A
3 B
dtype: object
For example, if you want the lists (though I don't think you need them):
In [16]: s.groupby(level=0).apply(list)
Out[16]:
0 [D]
1 [C, D]
3 [A, B]
dtype: object
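For readers who want to run the whole approach end to end, here is a self-contained sketch. It assumes the mask is built with stack(), so that the index is (row label, column name); the variable names are illustrative and not taken from the answer.

```python
import numpy as np
import pandas as pd

d = [[11.4, 1.3, 2.0, np.nan],
     [11.4, 1.3, np.nan, np.nan],
     [11.4, 1.3, 2.8, 0.7],
     [np.nan, np.nan, 2.8, 0.7]]
df = pd.DataFrame(data=d, columns=['A', 'B', 'C', 'D'])

# stack() produces a boolean Series indexed by (row label, column name)
stacked = df.isnull().stack()

# Keep only the positions where the value is NaN
nan_positions = stacked[stacked]

# Series of column names, indexed by the row label they belong to
s = pd.Series(nan_positions.index.get_level_values(1),
              index=nan_positions.index.get_level_values(0))

# Collect the column names of each row into a list
columns_by_row = s.groupby(level=0).apply(list).tolist()
print(columns_by_row)
```

Rows without any NaN never appear in nan_positions, so they are absent from the result rather than represented by an empty list.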
Answer by maxymoo
It should be efficient to use a scipy coordinate-format sparse matrix to retrieve the coordinates of the null values:
import scipy.sparse as sp
x,y = sp.coo_matrix(df.isnull()).nonzero()
print(list(zip(x,y)))
[(0, 3), (1, 2), (1, 3), (3, 0), (3, 1)]
Note that I'm calling the nonzero method in order to output only the coordinates of the nonzero entries in the underlying sparse matrix, since I don't care about the actual values, which are all True.
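The coordinates above are integer positions, while the question asks for column names. The sketch below shows one way to translate them. To keep the example free of extra dependencies it swaps in numpy.nonzero on the dense boolean mask instead of the scipy sparse matrix, so it is a variation on the answer and not the answer's own method.

```python
import numpy as np
import pandas as pd

d = [[11.4, 1.3, 2.0, np.nan],
     [11.4, 1.3, np.nan, np.nan],
     [11.4, 1.3, 2.8, 0.7],
     [np.nan, np.nan, 2.8, 0.7]]
df = pd.DataFrame(data=d, columns=['A', 'B', 'C', 'D'])

# Integer positions of every NaN, in row-major order
row_pos, col_pos = np.nonzero(df.isnull().to_numpy())

# Translate positions into row labels and column names
pairs = list(zip(df.index[row_pos].tolist(), df.columns[col_pos].tolist()))
print(pairs)
```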
Answer by Alexander
You can iterate through each row in the dataframe, create a mask of null values, and output the index of the masked entries (i.e. the column names of the dataframe).
lst = []
for _, row in df.iterrows():
    mask = row.isnull()
    lst.append(row[mask].index.tolist())
>>> lst
[['D'], ['C', 'D'], [], ['A', 'B']]
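The loop's result contains an empty list for row 2, which the list requested in the question leaves out. As a hedged sketch, the same per-row lists can be built from the boolean mask array and the empty ones filtered afterwards; the names per_row and non_empty are illustrative only.

```python
import numpy as np
import pandas as pd

d = [[11.4, 1.3, 2.0, np.nan],
     [11.4, 1.3, np.nan, np.nan],
     [11.4, 1.3, 2.8, 0.7],
     [np.nan, np.nan, 2.8, 0.7]]
df = pd.DataFrame(data=d, columns=['A', 'B', 'C', 'D'])

# 2-D boolean array: one row of flags per dataframe row
mask = df.isnull().to_numpy()

# Select the column names flagged in each row
per_row = [df.columns[row_mask].tolist() for row_mask in mask]

# Drop rows that have no NaN at all
non_empty = [cols for cols in per_row if cols]

print(per_row)
print(non_empty)
```

This still loops over rows in Python, but it skips building a Series object for every row, which is the costly part of iterrows.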
Answer by muon
Another, simpler way is:
>>> df.isnull().any(axis=1)
0 True
1 True
2 False
3 True
dtype: bool
To subset:
>>> bool_idx = df.isnull().any(axis=1)
>>> df[bool_idx]
      A    B    C    D
0  11.4  1.3  2.0  NaN
1  11.4  1.3  NaN  NaN
3   NaN  NaN  2.8  0.7
To get the integer index:
>>> df[bool_idx].index
Int64Index([0, 1, 3], dtype='int64')
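This answer identifies the rows but not the columns. One possible way to go from the row labels to the column names is sketched below; the dictionary nan_columns is a hypothetical helper, not part of the answer. Newer pandas releases display the index above as Index([0, 1, 3], dtype='int64'), so the sketch converts it to a plain list.

```python
import numpy as np
import pandas as pd

d = [[11.4, 1.3, 2.0, np.nan],
     [11.4, 1.3, np.nan, np.nan],
     [11.4, 1.3, 2.8, 0.7],
     [np.nan, np.nan, 2.8, 0.7]]
df = pd.DataFrame(data=d, columns=['A', 'B', 'C', 'D'])

# Rows that contain at least one NaN
bool_idx = df.isnull().any(axis=1)
row_labels = df[bool_idx].index.tolist()

# For each such row, look up which columns hold the NaN values
nan_columns = {
    label: df.columns[df.loc[label].isnull().to_numpy()].tolist()
    for label in row_labels
}

print(row_labels)
print(nan_columns)
```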