pandas: counting negative values in a DataFrame
Note: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/36155942/
Need count of negative values in a dataframe
Asked by Sanchit Aluna
I need the total count of negative values in a DataFrame. I am able to get it for an array but cannot work out how to do it for a DataFrame. For an array I am using the code below. Can anyone suggest how to get the count for the DataFrame below?
sum(n<0 for n in numbers)
Below is my DataFrame; the expected result is 4.
a b c d
-3 -2 -1 1
-2 2 3 4
4 5 7 8
Accepted answer by bakkal
"I am able to get for an array but unable to find for DataFrame"
It's possible to flatten the DataFrame so that you can use functions that operate on 1D arrays. So if you're okay with that (it is likely to be slower than EdChum's answer):
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [-3, -2, 4], 'b': [-2, 2, 5], 'c': [-1, 3, 7], 'd': [1, 4, 8]})
>>> df.values
array([[-3, -2, -1, 1],
[-2, 2, 3, 4],
[ 4, 5, 7, 8]])
>>> df.values.flatten()
array([-3, -2, -1, 1, -2, 2, 3, 4, 4, 5, 7, 8])
>>> sum(n < 0 for n in df.values.flatten())
4
Answer by EdChum
You can call .lt to compare the df against a scalar value and then call sum twice (the first sum adds up each column, returning a Series with one count per column):
In [66]:
df.lt(0).sum()
Out[66]:
a 2
b 1
c 1
d 0
dtype: int64
Call sum again to sum the Series:
In [58]:
df.lt(0).sum().sum()
Out[58]:
4
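As a minimal sketch of why the double sum works, the snippet below rebuilds the question's DataFrame and inspects each intermediate step. The names mask, per_column and total are illustrative only and do not come from the original answer.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({'a': [-3, -2, 4], 'b': [-2, 2, 5], 'c': [-1, 3, 7], 'd': [1, 4, 8]})

mask = df.lt(0)            # boolean DataFrame; equivalent to df < 0
per_column = mask.sum()    # True counts as 1, so this is one count per column
total = per_column.sum()   # summing the Series gives a single number

print(mask.equals(df < 0))   # True
print(per_column.tolist())   # [2, 1, 1, 0]
print(total)                 # 4
```

The same pattern should work for other thresholds by swapping the comparison method, for example df.le(0) to include zeros.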
You can also convert the boolean df to a 1-D array and call np.sum:
In [62]:
np.sum((df < 0).values.ravel())
Out[62]:
4
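The session above assumes NumPy has already been imported as np. A self-contained sketch of the same idea follows; the np.count_nonzero variant is an alternative added here for comparison and is not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [-3, -2, 4], 'b': [-2, 2, 5], 'c': [-1, 3, 7], 'd': [1, 4, 8]})

# The answer's approach: boolean frame -> NumPy array -> 1-D -> sum.
# ravel() returns a view when it can, whereas flatten() always copies.
count_ravel = int(np.sum((df < 0).values.ravel()))

# Alternative (not from the answer): count the True entries directly,
# without flattening first.
count_nonzero = int(np.count_nonzero(df.to_numpy() < 0))

print(count_ravel, count_nonzero)  # 4 4
```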
Timings
For a 30K row df:
In [70]:
%timeit sum(n < 0 for n in df.values.flatten())
%timeit df.lt(0).sum().sum()
%timeit np.sum((df < 0).values.ravel())
1 loops, best of 3: 405 ms per loop
100 loops, best of 3: 2.36 ms per loop
1000 loops, best of 3: 770 μs per loop
The np method wins easily here: ~525x faster than the loop method and ~3x faster than the pure pandas method.
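To rerun the comparison outside IPython, a rough sketch using the standard timeit module might look like this. The test frame is an assumption: the answer does not say how its 30K-row df was built, so 30,000 rows by 4 columns of random integers are used here, and the names big, candidates and results are illustrative. Absolute timings will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: 30,000 rows x 4 columns of random integers in [-10, 10).
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.integers(-10, 10, size=(30_000, 4)), columns=list('abcd'))

# The three approaches timed in the answer.
candidates = {
    'python loop': lambda: sum(n < 0 for n in big.values.flatten()),
    'pandas lt/sum': lambda: big.lt(0).sum().sum(),
    'numpy sum': lambda: np.sum((big < 0).values.ravel()),
}

# Check that every approach agrees before comparing speed.
results = {name: int(fn()) for name, fn in candidates.items()}

for name, fn in candidates.items():
    # Best of 3 single calls, reported in milliseconds.
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f'{name}: {seconds * 1e3:.3f} ms per call, count = {results[name]}')
```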
Answer by Sid
I am using the following. It might not be the best way to go about it.
# Count the rows with a negative value in each column, then add the counts.
negatives = (len(df.loc[df.a < 0]) + len(df.loc[df.b < 0]) +
             len(df.loc[df.c < 0]) + len(df.loc[df.d < 0]))
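The per-column approach above hard-codes the column names a to d. A sketch of the same idea that loops over whatever columns the frame has could look like this; it is a generalization added for illustration, not part of Sid's answer, and it assumes every column is numeric.

```python
import pandas as pd

df = pd.DataFrame({'a': [-3, -2, 4], 'b': [-2, 2, 5], 'c': [-1, 3, 7], 'd': [1, 4, 8]})

# For each column, keep only the rows where that column is negative
# and count them; the total is the sum of those per-column counts.
negatives = sum(len(df.loc[df[col] < 0]) for col in df.columns)

print(negatives)  # 4
```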
Answer by Daniel Reeves
EdChum's solution is very good, but I'd like to add another simple and acceptable solution that uses the pd.DataFrame.agg method, which is very commonly used and should therefore be easy to remember:
# Set up dataframe
import pandas as pd
df = pd.DataFrame({'a': [-3, -2, 4],
'b': [-2, 2, 5],
'c': [-1, 3, 7],
'd': [1, 4, 8]})
The pd.DataFrame.agg method aggregates each row or column (columns by default) into a Series object. You can then aggregate that Series to get a scalar:
# Count all negative values in a dataframe.
df.agg(lambda x: sum(x < 0)).sum()
Output:
4
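To illustrate the row-versus-column choice mentioned above, here is a small sketch that passes axis=1 to agg so each row is aggregated instead of each column. It uses the vectorized (x < 0).sum() in place of the built-in sum, which is a substitution made here; the names per_column and per_row are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [-3, -2, 4],
                   'b': [-2, 2, 5],
                   'c': [-1, 3, 7],
                   'd': [1, 4, 8]})

# Default (axis=0): the lambda receives each column as a Series.
per_column = df.agg(lambda x: (x < 0).sum())

# axis=1: the lambda receives each row as a Series.
per_row = df.agg(lambda x: (x < 0).sum(), axis=1)

print(per_column.tolist())  # [2, 1, 1, 0]
print(per_row.tolist())     # [3, 1, 0]
print(per_row.sum())        # 4
```

Either intermediate Series sums to the same total, so the choice of axis only matters if you also want the per-column or per-row breakdown.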