Python: best way to count the number of rows with missing values in a pandas DataFrame

Question

I have come up with some workarounds to count the number of missing values in a pandas DataFrame. They are quite ugly, and I am wondering if there is a better way to do it.

Let's create an example DataFrame:

import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
# Reindexing with extra labels adds all-NaN rows for 'b', 'd' and 'g'
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

What I currently have is

a) Counting cells with missing values:

>>> sum(df.isnull().values.ravel())
9

b) Counting rows that have missing values somewhere:

>>> sum([True for idx,row in df.iterrows() if any(row.isnull())])
3

Answer 1

Accepted answer by EdChum

For the second count, I think you can just subtract the number of rows returned by dropna from the total number of rows:

In [14]:

import pandas as pd
from numpy.random import randn
df = pd.DataFrame(randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
               columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df
Out[14]:
        one       two     three
a -0.209453 -0.881878  3.146375
b       NaN       NaN       NaN
c  0.049383 -0.698410 -0.482013
d       NaN       NaN       NaN
e -0.140198 -1.285411  0.547451
f -0.219877  0.022055 -2.116037
g       NaN       NaN       NaN
h -0.224695 -0.025628 -0.703680
In [18]:

df.shape[0] - df.dropna().shape[0]
Out[18]:
3
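
As a reproducible check of the dropna subtraction, here is a minimal sketch. The frame `demo` and its values are made up for illustration (they are not the random data above); only the index layout mirrors the example.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data so the result does not depend on random numbers.
demo = pd.DataFrame(
    np.arange(15, dtype=float).reshape(5, 3),
    index=["a", "c", "e", "f", "h"],
    columns=["one", "two", "three"],
)
# Reindexing introduces all-NaN rows for the new labels b, d and g.
demo = demo.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])

# By default dropna() removes every row that contains at least one NaN,
# so the difference in length is the number of rows with missing values.
rows_with_nan = len(demo) - len(demo.dropna())
print(rows_with_nan)
```

Note that this relies on the default `dropna()` behaviour of dropping a row if any value in it is missing.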

The first can be achieved using built-in methods:

In [30]:

df.isnull().values.ravel().sum()
Out[30]:
9

Timings

In [34]:

%timeit sum([True for idx,row in df.iterrows() if any(row.isnull())])
%timeit df.shape[0] - df.dropna().shape[0]
%timeit sum(map(any, df.apply(pd.isnull)))
1000 loops, best of 3: 1.55 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.82 ms per loop
In [33]:

%timeit sum(df.isnull().values.ravel())
%timeit df.isnull().values.ravel().sum()
%timeit df.isnull().sum().sum()
1000 loops, best of 3: 215 μs per loop
1000 loops, best of 3: 210 μs per loop
1000 loops, best of 3: 605 μs per loop

So my alternatives are a little faster for a df of this size.

Update

So for a df with 80,000 rows I get the following:

In [39]:

%timeit sum([True for idx,row in df.iterrows() if any(row.isnull())])
%timeit df.shape[0] - df.dropna().shape[0]
%timeit sum(map(any, df.apply(pd.isnull)))
%timeit np.count_nonzero(df.isnull())
1 loops, best of 3: 9.33 s per loop
100 loops, best of 3: 6.61 ms per loop
100 loops, best of 3: 3.84 ms per loop
1000 loops, best of 3: 395 μs per loop
In [40]:

%timeit sum(df.isnull().values.ravel())
%timeit df.isnull().values.ravel().sum()
%timeit df.isnull().sum().sum()
%timeit np.count_nonzero(df.isnull().values.ravel())
1000 loops, best of 3: 675 μs per loop
1000 loops, best of 3: 679 μs per loop
100 loops, best of 3: 6.56 ms per loop
1000 loops, best of 3: 368 μs per loop

Actually, np.count_nonzero wins this hands down.
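
To make it concrete that the timed expressions for the total count compute the same number, here is a small sketch on a made-up frame (`demo` is hypothetical). It uses `to_numpy()` to get the boolean array, which is my choice here; the timings above use `.values`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three missing cells in total.
demo = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, 6.0]})
mask = demo.isnull().to_numpy()  # plain boolean ndarray

total_count_nonzero = int(np.count_nonzero(mask))
total_ravel_sum = int(mask.ravel().sum())
total_double_sum = int(demo.isnull().sum().sum())

# All three approaches count the same True cells.
assert total_count_nonzero == total_ravel_sum == total_double_sum
print(total_count_nonzero)
```

The approaches differ only in speed, so the choice can be made on readability unless the frame is large.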

Answer 2

Answered by ely

Total missing:

df.isnull().sum().sum()

Rows with missing:

sum(map(any, df.isnull()))
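
One thing to watch for with the second expression: iterating over a DataFrame yields its column labels rather than its rows, so `map(any, ...)` ends up testing the label strings. The sketch below, on a made-up frame `demo`, contrasts that with a row-wise `any(axis=1)`; treat it as an illustration of the iteration behaviour, not as the answerer's intended code.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 2 columns, 3 of its 4 rows contain a NaN.
demo = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, np.nan], "b": [4.0, 5.0, np.nan, np.nan]}
)
mask = demo.isnull()

# Iterating a DataFrame gives the column labels.
labels = list(mask)

# any("a") and any("b") are both True (non-empty strings),
# so this counts columns, not rows with missing values.
count_via_map = sum(map(any, mask))

# Row-wise reduction: True for each row with at least one NaN.
count_rows = int(mask.any(axis=1).sum())

print(labels, count_via_map, count_rows)
```

On the question's example both numbers happen to be 3, because the frame has three columns and three rows with NaNs, which makes the difference easy to miss.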

Answer 3

Answered by Paul Jtheitroademan

What about numpy.count_nonzero:

np.count_nonzero(df.isnull().values)
np.count_nonzero(df.isnull())  # also works

count_nonzero is pretty quick. However, I constructed a dataframe from a (1000, 1000) array, randomly inserted 100 NaN values at different positions, and measured the times of the various answers in IPython:

%timeit np.count_nonzero(df.isnull().values)
1000 loops, best of 3: 1.89 ms per loop

%timeit df.isnull().values.ravel().sum()
100 loops, best of 3: 3.15 ms per loop

%timeit df.isnull().sum().sum()
100 loops, best of 3: 15.7 ms per loop

Not a huge time improvement over the OP's original, but possibly less confusing in the code; your decision. There isn't really any difference in execution time between the two count_nonzero methods (with and without .values).
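
The answer does not show how the test frame was built. One possible way to set up a comparable benchmark is sketched below; the seed, the use of `numpy.random.default_rng`, and the choice of 100 distinct positions are all my assumptions, and the names `values` and `big` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
values = rng.standard_normal((1000, 1000))

# Choose 100 distinct flat positions and convert them to (row, col) pairs.
flat_positions = rng.choice(values.size, size=100, replace=False)
rows, cols = np.unravel_index(flat_positions, values.shape)
values[rows, cols] = np.nan

big = pd.DataFrame(values)

# Sanity check: every inserted NaN is counted.
n_missing = int(np.count_nonzero(big.isnull().values))
print(n_missing)
```

With a frame like `big` in an IPython session, the `%timeit` lines above can be re-run directly; absolute timings will differ by machine and library version.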

Answer 4

Answered by Alvaro Fuentes

A simple approach to counting the missing values in each row or in each column:

df.apply(lambda x: sum(x.isnull().values), axis = 0) # For columns
df.apply(lambda x: sum(x.isnull().values), axis = 1) # For rows

Number of rows with at least one missing value:

sum(df.apply(lambda x: sum(x.isnull().values), axis = 1)>0)
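
For comparison, the same per-column and per-row counts can also be obtained without `apply`, by summing the boolean mask along an axis. This is a sketch on a made-up frame (`demo` is hypothetical), shown as an alternative rather than as this answer's method.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# The apply-based versions from the answer.
per_column_apply = demo.apply(lambda x: sum(x.isnull().values), axis=0)
per_row_apply = demo.apply(lambda x: sum(x.isnull().values), axis=1)

# Vectorized equivalents: sum the boolean mask along each axis.
per_column = demo.isnull().sum(axis=0)  # missing values in each column
per_row = demo.isnull().sum(axis=1)     # missing values in each row

assert per_column.tolist() == per_column_apply.tolist()
assert per_row.tolist() == per_row_apply.tolist()

# Rows with at least one missing value.
rows_with_missing = int((per_row > 0).sum())
print(rows_with_missing)
```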

Answer 5

Answered by W.P. McNeill

Use sum(df.count(axis=1) < len(df.columns)), which gives the number of rows that have fewer non-null values than there are columns.

For example, the following data frame has two rows with missing values.

>>> df = pd.DataFrame({"a":[1, None, 3], "b":[4, 5, None]})
>>> df
    a   b
0   1   4
1 NaN   5
2   3 NaN
>>> df.count(axis=1)
0    2
1    1
2    1
dtype: int64
>>> df.count(axis=1) < len(df.columns)
0    False
1     True
2     True
dtype: bool
>>> sum(df.count(axis=1) < len(df.columns))
2

Answer 6

Answered by ConanG

So many wrong answers here. OP asked for number of rows with null values, not columns.

Here is a better example:

import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
# The extra label 'asdf' adds a fourth all-NaN row
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'asdf'])
print(df)

Now there are obviously 4 rows with null values.

           one       two     three
a    -0.571617  0.952227  0.030825
b          NaN       NaN       NaN
c     0.627611 -0.462141  1.047515
d          NaN       NaN       NaN
e     0.043763  1.351700  1.480442
f     0.630803  0.931862  1.500602
g          NaN       NaN       NaN
h     0.729103 -1.198237 -0.207602
asdf       NaN       NaN       NaN

You would get 3 (the number of columns with NaNs) as the answer if you used some of the answers here. Fuentes' answer works.

Here is how I got it:

df.isnull().any(axis=1).sum()
#4
timeit df.isnull().any(axis=1).sum()
#10000 loops, best of 3: 193 μs per loop

Fuentes' approach, for comparison:

sum(df.apply(lambda x: sum(x.isnull().values), axis = 1)>0)
#4
timeit sum(df.apply(lambda x: sum(x.isnull().values), axis = 1)>0)
#1000 loops, best of 3: 677 μs per loop

Answer 7

Answered by ruining.z

If you just want to take a look at the result, there is a pandas function for that: pandas.DataFrame.count.

Back to this topic: using df.count(axis=1), you will get a result like this:

a    3
b    0
c    3
d    0
e    3
f    3
g    0
h    3
dtype: int64

This tells you how many non-NaN values there are in each row. Meanwhile, -(df.count(axis=1) - df.shape[1]) gives the number of NaN values in each row:

a    0
b    3
c    0
d    3
e    0
f    0
g    3
h    0
dtype: int64
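
The relationship between the two outputs above can be checked with a short sketch. The frame `demo` uses made-up values with the same index layout as the question's example, and `df.shape[1] - df.count(axis=1)` is just a rearrangement of the expression in this answer.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data; rows b, d and g become all-NaN after reindexing.
demo = pd.DataFrame(
    np.arange(15, dtype=float).reshape(5, 3),
    index=["a", "c", "e", "f", "h"],
    columns=["one", "two", "three"],
).reindex(["a", "b", "c", "d", "e", "f", "g", "h"])

non_null_per_row = demo.count(axis=1)               # non-NaN values per row
missing_per_row = demo.shape[1] - non_null_per_row  # NaN values per row

# The complement of count() matches a direct count of the null mask.
assert missing_per_row.tolist() == demo.isnull().sum(axis=1).tolist()

# Rows with at least one missing value.
rows_with_missing = int((missing_per_row > 0).sum())
print(rows_with_missing)
```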
