Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/22257527/
How do I get a summary count of missing/NaN data by column in 'pandas'?
Asked by orome
In R I can quickly see a count of missing data using the summary command, but the equivalent pandas DataFrame method, describe, does not report these values.
I gather I can do something like
len(mydata.index) - mydata.count()
to compute the number of missing values for each column, but I wonder if there's a better idiom (or if my approach is even right).
Answered by Jeff
Both describe and info report the count of non-missing values.
In [1]: df = DataFrame(np.random.randn(10,2))
In [2]: df.iloc[3:6,0] = np.nan
In [3]: df
Out[3]:
          0         1
0 -0.560342  1.862640
1 -1.237742  0.596384
2  0.603539 -1.561594
3       NaN  3.018954
4       NaN -0.046759
5       NaN  0.480158
6  0.113200 -0.911159
7  0.990895  0.612990
8  0.668534 -0.701769
9 -0.607247 -0.489427
[10 rows x 2 columns]
In [4]: df.describe()
Out[4]:
               0          1
count   7.000000  10.000000
mean   -0.004166   0.286042
std     0.818586   1.363422
min    -1.237742  -1.561594
25%    -0.583795  -0.648684
50%     0.113200   0.216699
75%     0.636036   0.608839
max     0.990895   3.018954
[8 rows x 2 columns]
In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 2 columns):
0 7 non-null float64
1 10 non-null float64
dtypes: float64(2)
To get a count of missing values, your solution is correct:
In [20]: len(df.index)-df.count()
Out[20]:
0 3
1 0
dtype: int64
You could do this too:
In [23]: df.isnull().sum()
Out[23]:
0 3
1 0
dtype: int64
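For readers who want to run this outside an IPython session, here is a minimal self-contained sketch of the two idioms above. The imports, the seeded random generator, and the variable names missing_by_count and missing_by_mask are assumptions added for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10 rows, 2 columns, rows 3-5 of column 0 set to NaN.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 2)))
df.iloc[3:6, 0] = np.nan

# Idiom 1: total number of rows minus the non-missing count per column.
missing_by_count = len(df.index) - df.count()

# Idiom 2: sum the boolean mask; each True (missing) counts as 1.
missing_by_mask = df.isnull().sum()

print(missing_by_mask)
```

Both expressions should give the same per-column counts; the second reads more directly as "count the missing values".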
Answered by Ricky McMaster
As a tiny addition, to get percentage missing by DataFrame column, combining @Jeff and @userS's answers above gets you:
df.isnull().sum()/len(df)*100
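As a rough illustration of the expression above, the sketch below uses a small hypothetical frame (the column names "a" and "b" are made up). It also shows an equivalent form using mean(), which works because the mean of a boolean mask is the fraction of True values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "a" has 1 of 4 values missing, "b" has 2 of 4.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                   "b": [np.nan, np.nan, 1.0, 2.0]})

# Percentage missing per column: missing count over total rows.
pct_sum = df.isnull().sum() / len(df) * 100

# Equivalent: the mean of the boolean mask is the fraction missing.
pct_mean = df.isnull().mean() * 100

print(pct_sum)
```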
Answered by userS
This isn't quite a full summary, but it will give you a quick sense of your column-level data:
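def getPctMissing(series):
    # Percentage of entries in the series that are missing.
    num = series.isnull().sum()
    # Divide by the total length; series.count() would exclude the missing entries.
    den = len(series)
    return 100.0 * num / den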
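One possible way to turn this into something closer to a summary is to collect the count and the percentage side by side. The sketch below is only an illustration: the helper name missing_summary and the column labels n_missing and pct_missing are hypothetical. The denominator here is the total row count, len(df); dividing by the non-missing count would instead give the ratio of missing to present values.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Hypothetical helper: per-column missing count and percent of all rows."""
    n_missing = df.isnull().sum()
    return pd.DataFrame({
        "n_missing": n_missing,
        # Percent of all rows, not of the non-missing rows.
        "pct_missing": 100 * n_missing / len(df),
    })

# Made-up data: column "a" has one missing value, "b" has none.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                   "b": [1.0, 2.0, 3.0, 4.0]})
summary = missing_summary(df)
print(summary)
```

Note that an empty DataFrame would produce NaN percentages here, since the row count is zero.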
Answered by Drafter250
I can't make comments yet, but to add on to Jeff's answer: if you don't care which columns have NaNs and just want to check overall, add a second .sum() to get a single value.
result = df.isnull().sum().sum()
result > 0
A Series would need only one .sum(), and a Panel would need three.
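A small sketch of the overall check, using made-up data. The variable names are hypothetical, and the .values.any() line is an alternative not mentioned in the answer: it only tells you whether anything is missing, without counting.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three missing values in total.
df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

# DataFrame: the first sum gives per-column counts, the second adds them up.
total_missing = df.isnull().sum().sum()

# Series: a single sum is enough.
series_missing = df["a"].isnull().sum()

# Alternative yes/no check on the underlying boolean array.
has_missing = df.isnull().values.any()

print(total_missing, series_missing, has_missing)
```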
Answered by Kshitij
The following will do the trick and return the count of nulls for every column:
df.isnull().sum(axis=0)
df.isnull() returns a DataFrame of True/False values, and sum(axis=0) sums those values down the rows of each column, so each True counts as 1.
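To make the role of the axis argument concrete, here is a short sketch on hypothetical data. The axis=1 line is an addition for contrast: it counts missing values per row instead of per column.

```python
import numpy as np
import pandas as pd

# Made-up frame: "a" has one NaN, "b" has two.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, 6.0]})

# axis=0 collapses the rows: one count per column.
per_column = df.isnull().sum(axis=0)

# axis=1 collapses the columns: one count per row.
per_row = df.isnull().sum(axis=1)

print(per_column)
print(per_row)
```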