Python: How do you count NaN values in a Pandas DataFrame?

Question

Asked by SpeedCoder5

What is the best way to count NaN (not a number) values in a pandas DataFrame?

The following code:

import numpy as np
import pandas as pd
dfd = pd.DataFrame([1, np.nan, 3, 3, 3, np.nan], columns=['a'])
dfv = dfd.a.value_counts().sort_index()
print("nan: %d" % dfv[np.nan].sum())
print("1: %d" % dfv[1].sum())
print("3: %d" % dfv[3].sum())
print("total: %d" % dfv[:].sum())

Outputs:

nan: 0
1: 1
3: 3
total: 4

While the desired output is:

nan: 2
1: 1
3: 3
total: 6

I am using pandas 0.17 with Python 3.5.0 with Anaconda 2.4.0.

Answer 1

Accepted answer by Alex Riley

If you want to count only the NaN values in column 'a' of a DataFrame df, use:

len(df) - df['a'].count()

Here count() gives the number of non-NaN values, which is subtracted from the total number of values (given by len(df)).

To count NaN values in every column of df, use:

len(df) - df.count()
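
As a minimal sketch of the two expressions above, the following uses a small two-column frame invented for illustration (the column names and values are assumptions, not the asker's data). Because df.count() returns a Series with one entry per column, subtracting it from len(df) yields a per-column NaN count.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, used only to illustrate the subtraction.
df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6]})

# Single column: total rows minus the non-NaN entries in 'a'.
nan_in_a = len(df) - df['a'].count()

# Every column: count() is a Series, so the scalar len(df) broadcasts over it.
nan_per_column = len(df) - df.count()

print(nan_in_a)                    # 1
print(nan_per_column.to_dict())    # {'a': 1, 'b': 2}
```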

If you want to use value_counts, tell it not to drop NaN values by setting dropna=False (added in 0.14.1):

dfv = dfd['a'].value_counts(dropna=False)

This allows the missing values in the column to be counted too:

 3     3
NaN    2
 1     1
Name: a, dtype: int64

The rest of your code should then work as you expect (note that it's not necessary to call sum; just print("nan: %d" % dfv[np.nan]) suffices).
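
For reference, here is a sketch of the question's script with dropna=False applied. One deliberate difference, which is my own choice rather than part of the answer: the NaN bucket is selected with a boolean mask on the index (dfv.index.isna()) instead of the label lookup dfv[np.nan], on the assumption that the mask does not depend on how a given pandas version resolves NaN as a label.

```python
import numpy as np
import pandas as pd

dfd = pd.DataFrame([1, np.nan, 3, 3, 3, np.nan], columns=['a'])

# Keep NaN as its own bucket instead of dropping it.
dfv = dfd['a'].value_counts(dropna=False)

# Pick the NaN bucket with a mask on the index rather than by label.
nan_count = int(dfv[dfv.index.isna()].sum())
total = int(dfv.sum())

print("nan: %d" % nan_count)      # nan: 2
print("1: %d" % dfv.loc[1.0])     # 1: 1
print("3: %d" % dfv.loc[3.0])     # 3: 3
print("total: %d" % total)        # total: 6
```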

Answer 2

Answer by ilyas patanam

To count just null values, you can use isnull():

In [11]:
dfd.isnull().sum()

Out[11]:
a    2
dtype: int64

Here a is the column name, and the null value occurs 2 times in that column.

Answer 3

Answer by Thom Ives

A clean way to count all NaNs across all columns of your DataFrame is:

import pandas as pd 
import numpy as np


df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})
print(df.isna().sum().sum())

A single sum gives the count of NaNs in each column. The second sum adds up those per-column counts.
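
The sketch below spells out the intermediate steps on the same frame. The per-row count using axis=1 is an extra not mentioned in the answer, included only to show that the same boolean mask can be summed along either axis.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, 1, np.nan]})

mask = df.isna()                 # boolean frame, True where a value is missing
per_column = mask.sum()          # first sum: NaNs per column
per_row = mask.sum(axis=1)       # same mask summed across columns instead
total = int(per_column.sum())    # second sum: NaNs in the whole frame

print(per_column.to_dict())      # {'a': 1, 'b': 2}
print(per_row.tolist())          # [1, 0, 2]
print(total)                     # 3
```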

Answer 4

Answer by shuishoudage

If you only want the number of null values in each column, use df.isnull().sum(). If you want the total number of null values in the whole DataFrame, use df.isnull().sum().sum().

Answer 5

Answer by Mr_and_Mrs_D

Yet another way to count all the NaNs in a df:

num_nans = df.size - df.count().sum()
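
This works because df.size is the number of rows times the number of columns, while df.count().sum() is the number of non-null cells. A quick sanity check on a small frame (the data here is made up for illustration) suggests it agrees with the isna and isnull approaches:

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 frame with three missing cells.
df = pd.DataFrame({
    'group': [1, 2, 2],
    'value': [np.nan, np.nan, 12],
    'value2': [100, 101, np.nan],
})

by_size = df.size - df.count().sum()   # all cells minus non-null cells
by_isna = df.isna().sum().sum()        # sum of the boolean missing-value mask
by_isnull = df.isnull().sum().sum()    # isnull is an alias of isna

print(by_size, by_isna, by_isnull)     # 3 3 3
```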

Timings:

import timeit

import numpy as np
import pandas as pd

df_scale = 100000
df = pd.DataFrame(
    [[1, np.nan, 100, 63], [2, np.nan, 101, 63], [2, 12, 102, 63],
     [2, 14, 102, 63], [2, 14, 102, 64], [1, np.nan, 200, 63]] * df_scale,
    columns=['group', 'value', 'value2', 'dummy'])

repeat = 3
numbers = 100

setup = """import pandas as pd
from __main__ import df
"""

def timer(statement, _setup=None):
    print(min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers)))

timer('df.size - df.count().sum()')
timer('df.isna().sum().sum()')
timer('df.isnull().sum().sum()')

prints:

3.998805362999999
3.7503365439999996
3.689461442999999

So the three approaches are pretty much equivalent in speed.
