Python Data Frame: cumulative sum of column until condition is reached and return the index
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/41488676/
Asked by AMaz
I am new to Python and am currently facing an issue I can't solve. I really hope you can help me out. English is not my native language, so I am sorry if I am not able to express myself properly.
Say I have a simple data frame with two columns:
index Num_Albums Num_authors
0 10 4
1 1 5
2 4 4
3 7 1000
4 1 44
5 3 8
Num_Albums_tot = sum(Num_Albums) = 30
I need to do a cumulative sum of the data in Num_Albums until a certain condition is reached, register the index at which the condition is met, and get the corresponding value from Num_authors.
Example: cumulative sum of Num_Albums until the sum equals 50% ± 1/15 of 30 (--> 15±2):
10 = 15±2? No, then continue;
10+1 = 15±2? No, then continue;
10+1+4 = 15±2? Yes, stop.
Condition reached at index 2. Then get Num_authors at that index: Num_authors(2)=4
I would like to see if there's a function already implemented in pandas before I start thinking about how to do it with a while/for loop.
[I would like to specify the column from which I retrieve the value at the relevant index. This comes in handy when I have, e.g., 4 columns and I want to sum the elements in column 1 and, once the condition is reached, get the corresponding value in column 2; then do the same with columns 3 and 4.]
Answered by Nickil Maveli
Opt - 1:
You could compute the cumulative sum using cumsum. Then use np.isclose with its built-in tolerance parameter to check whether the values in this series lie within the specified threshold of 15 +/- 2. This returns a boolean array.
Through np.flatnonzero, get the ordinal positions at which the condition is True. We select the first of these positions.
Finally, use .iloc to retrieve the value of the column you require at the position computed earlier.
val = np.flatnonzero(np.isclose(df.Num_Albums.cumsum().values, 15, atol=2))[0]
df['Num_authors'].iloc[val] # for faster access, use .iat
4
Performing np.isclose on the cumsum series, converted to an array, gives:
np.isclose(df.Num_Albums.cumsum().values, 15, atol=2)
array([False, False, True, False, False, False], dtype=bool)
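As a self-contained sketch of Opt - 1, the steps above could be wrapped so that the summed column and the lookup column are passed by name, as the question asks. The helper name first_close is hypothetical, and rtol=0 is an assumption made here so that only the absolute tolerance applies (the lines above leave np.isclose's small default rtol in place).

```python
import numpy as np
import pandas as pd

# Data frame from the question
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

def first_close(df, sum_col, value_col, target, tol):
    """Hypothetical helper: return (position, value of value_col) for the
    first row whose cumulative sum of sum_col lies within target +/- tol,
    or None if no row qualifies."""
    cumulative = df[sum_col].cumsum().to_numpy()
    # rtol=0: compare against the absolute tolerance only
    hits = np.flatnonzero(np.isclose(cumulative, target, rtol=0, atol=tol))
    if hits.size == 0:
        return None
    pos = int(hits[0])
    return pos, df[value_col].iat[pos]

# Target and tolerance as stated in the question: 15 +/- 2
result = first_close(df, "Num_Albums", "Num_authors", target=15, tol=2)
```

With more column pairs, the same call could be repeated with different sum_col and value_col arguments.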
Opt - 2:
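Wrap the cumsum series in a pd.Index and look up the position nearest to 15 within a tolerance. Older pandas versions accept get_loc(15, 'nearest', tolerance=2) for this; from pandas 2.0 on, get_loc no longer takes method or tolerance, so get_indexer is used instead (it returns -1 when nothing lies within the tolerance):
val = pd.Index(df.Num_Albums.cumsum()).get_indexer([15], method='nearest', tolerance=2)[0]
df['Num_authors'].iat[val]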
4
Opt - 3:
Use idxmax to find the first index of a True value in the boolean mask created by the sub and abs operations on the cumsum series:
df.at[df.Num_Albums.cumsum().sub(15).abs().le(2).idxmax(), 'Num_authors']
4
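One caveat with the idxmax approach: when the mask contains no True value at all, idxmax still returns the first index label, so a wrong row would be picked silently. A minimal sketch of a guard, assuming the same data frame as in the question (the helper name authors_at_first_match is hypothetical):

```python
import pandas as pd

# Data frame from the question
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

def authors_at_first_match(df, target, tol):
    """Hypothetical helper: Num_authors at the first row whose cumulative
    Num_Albums lies within target +/- tol, or None when no row qualifies."""
    mask = df["Num_Albums"].cumsum().sub(target).abs().le(tol)
    if not mask.any():
        # Without this check, idxmax would return the first label (0 here)
        return None
    return df.at[mask.idxmax(), "Num_authors"]

match = authors_at_first_match(df, target=15, tol=2)
no_match = authors_at_first_match(df, target=100, tol=2)
```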
Answered by Fabio Lamanna
I think you can directly add a column with the cumulative sum:
In [3]: df
Out[3]:
index Num_Albums Num_authors
0 0 10 4
1 1 1 5
2 2 4 4
3 3 7 1000
4 4 1 44
5 5 3 8
In [4]: df['cumsum'] = df['Num_Albums'].cumsum()
In [5]: df
Out[5]:
index Num_Albums Num_authors cumsum
0 0 10 4 10
1 1 1 5 11
2 2 4 4 15
3 3 7 1000 22
4 4 1 44 23
5 5 3 8 26
And then apply the condition you want on the cumsum column. For instance, you can use where to get the full row that passes the filter, after setting the tolerance tol:
In [18]: tol = 2
In [19]: cond = df.where((df['cumsum']>=15-tol)&(df['cumsum']<=15+tol)).dropna()
In [20]: cond
Out[20]:
index Num_Albums Num_authors cumsum
2 2.0 4.0 4.0 15.0
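The values in Out[20] appear as floats because where fills the non-matching rows with NaN before dropna removes them. As a possible alternative (not part of the answer above), plain boolean indexing with Series.between keeps the integer dtypes. A minimal sketch, assuming the same data and tolerance:

```python
import pandas as pd

# Data frame from the question, plus the cumulative-sum column
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})
df["cumsum"] = df["Num_Albums"].cumsum()

tol = 2
# between() includes both endpoints by default, matching >= and <= above
cond = df[df["cumsum"].between(15 - tol, 15 + tol)]

# First matching row, if any
first_authors = cond["Num_authors"].iloc[0] if not cond.empty else None
```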
Answered by Rahul Singh
This could also be done with a plain loop, as in the following code:
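def your_function(df, your_condition):
    # your_condition: a function that takes the running total and
    # returns True when summing should stop
    total = 0
    for index, value in enumerate(df['Num_Albums'].tolist()):
        total += value
        if your_condition(total):
            # look up Num_authors at the position where the condition was reached
            return (index, df['Num_authors'].iloc[index])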
This returns a tuple of the index and the corresponding value of Num_authors as soon as "your_condition" is reached.
Alternatively, the index could be returned as an array:
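def your_function(df, your_condition):
    # your_condition: a function that takes the running total and
    # returns True when summing should stop
    total = 0
    for index, value in enumerate(df['Num_Albums'].tolist()):
        total += value
        if your_condition(total):
            # one-element array holding the index label of the matching row
            return df.index[[index]].values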
I could not work out the stopping condition you described for the cumulative sum, so I left it as "your_condition" in the code!
I am also new, so I hope it helps!
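For the condition stated in the question (a running total within 15 ± 2), a loop of this kind could be filled in roughly as follows. This is only a sketch: the helper name first_index_where and the idea of passing the condition as a function are assumptions, not part of the answer above.

```python
import pandas as pd

# Data frame from the question
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

def first_index_where(df, sum_col, value_col, condition):
    """Hypothetical helper: accumulate sum_col row by row and return
    (position, value of value_col) at the first row where condition(total)
    is True, or None if the condition is never met."""
    total = 0
    for position, amount in enumerate(df[sum_col].tolist()):
        total += amount
        if condition(total):
            return position, df[value_col].iloc[position]
    return None

# Stop when the running total is within 15 +/- 2
found = first_index_where(df, "Num_Albums", "Num_authors",
                          lambda total: abs(total - 15) <= 2)
```

The loop stops at the first match, so it does not need the full cumulative sum.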