pandas Python data frame: cumulative sum of a column until a condition is reached, returning the index

Question

Asked by AMaz

I am new to Python and am currently facing an issue I can't solve. I really hope you can help me out. English is not my native language, so I am sorry if I am not able to express myself properly.

Say I have a simple data frame with two columns:

index  Num_Albums  Num_authors
0      10          4
1      1           5
2      4           4
3      7           1000
4      1           44
5      3           8

Num_Albums_tot = sum(Num_Albums) = 30

I need to compute a cumulative sum of the data in Num_Albums until a certain condition is reached, register the index at which the condition is met, and get the corresponding value from Num_authors.

Example: cumulative sum of Num_Albums until the sum equals 50% ± 1/15 of 30 (--> 15±2):

10 = 15±2? No, then continue;
10+1 = 15±2? No, then continue;
10+1+4 = 15±2? Yes, stop.

The condition is reached at index 2. Then get Num_authors at that index: Num_authors(2) = 4

I would like to see if there's a function already implemented in pandas before I start thinking about how to do it with a while/for loop.

[I would like to specify the column from which I want to retrieve the value at the relevant index. This comes in handy when I have e.g. 4 columns and I want to sum the elements in column 1 and, once the condition is met, get the corresponding value in column 2; then do the same with columns 3 and 4.]

Answer 1

Answered by Nickil Maveli

Opt - 1:

You could compute the cumulative sum using cumsum. Then use np.isclose with its built-in tolerance parameter (atol) to check whether the values in this series lie within the specified threshold of 15 +/- 2. This returns a boolean array.

Through np.flatnonzero, get the ordinal positions at which the condition is True. We select the first such position.

Finally, use .iloc to retrieve the value of the column you require, based on the position computed earlier.

val = np.flatnonzero(np.isclose(df.Num_Albums.cumsum().values, 15, atol=2))[0]
df['Num_authors'].iloc[val]      # for faster access, use .iat 
4

The result of np.isclose on the series, after it has been converted to an array:

np.isclose(df.Num_Albums.cumsum().values, 15, atol=2)
array([False, False,  True, False, False, False], dtype=bool)
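
Opt 1 can be wrapped into a small reusable helper that also lets you choose which column is summed and which column the value is read from. This is only a sketch: the function name first_within_tolerance and its parameters are made up for illustration, and returning None when nothing matches is an assumption about the desired behaviour (indexing the empty result of np.flatnonzero with [0] would raise an IndexError instead).

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

def first_within_tolerance(df, sum_col, value_col, target, tol):
    """Hypothetical helper: value of value_col at the first row where the
    cumulative sum of sum_col lies within target +/- tol, else None."""
    running = df[sum_col].cumsum().to_numpy()
    # rtol=0 so that only the absolute tolerance is applied
    hits = np.flatnonzero(np.isclose(running, target, rtol=0, atol=tol))
    if hits.size == 0:
        return None  # no cumulative sum falls inside the window
    return df[value_col].iat[hits[0]]

result = first_within_tolerance(df, "Num_Albums", "Num_authors", target=15, tol=2)
print(result)  # 4
```

With four columns, the same helper could be called once per pair, for example once for columns 1 and 2 and once for columns 3 and 4.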

Opt - 2:

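Use pd.Index.get_indexer on the series computed by cumsum; it supports a tolerance parameter together with the nearest method. (Older pandas versions accepted the same method and tolerance arguments in pd.Index.get_loc, but they were removed in pandas 2.0. DataFrame.get_value was removed in pandas 1.0, so the value is read with .iat here.)

val = pd.Index(df.Num_Albums.cumsum()).get_indexer([15], method='nearest', tolerance=2)[0]
df['Num_authors'].iat[val]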
4
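
A nearest-match lookup on the cumulative sum can also report that nothing was found. The sketch below assumes a recent pandas in which Index.get_indexer accepts method and tolerance; the variable names are illustrative only. get_indexer returns -1 for a target that has no match within the tolerance, so that case is checked explicitly.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

running = pd.Index(df["Num_Albums"].cumsum())

# Position of the cumulative sum nearest to 15, provided it is within +/- 2
pos = running.get_indexer([15], method="nearest", tolerance=2)[0]
value = df["Num_authors"].iat[pos] if pos != -1 else None
print(pos, value)  # 2 4

# A target that no cumulative sum comes close to yields -1
miss = running.get_indexer([100], method="nearest", tolerance=2)[0]
print(miss)  # -1
```

Note that this picks the nearest cumulative sum, not the first one inside the window. With a wider tolerance the two can differ: for 15 ± 5, the first sum inside the window is 10 (position 0), while the nearest one is 15 (position 2).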

Opt - 3:

Use idxmax to find the first index of a True value in the boolean mask created by applying sub and abs to the cumsum series:

df.at[df.Num_Albums.cumsum().sub(15).abs().le(2).idxmax(), 'Num_authors']
4
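
One caveat with idxmax: on a mask that is False everywhere it still returns the first label, which would look like a match at index 0. A minimal sketch of a guarded version follows; the helper name value_at_first_hit is hypothetical, and it assumes the frame has unique index labels so that .at returns a scalar.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

def value_at_first_hit(df, sum_col, value_col, target, tol):
    """Hypothetical helper: value_col at the first row whose cumulative sum
    of sum_col is within target +/- tol, or None when no row qualifies."""
    mask = df[sum_col].cumsum().sub(target).abs().le(tol)
    if not mask.any():
        return None  # idxmax alone would wrongly point at the first row
    return df.at[mask.idxmax(), value_col]

print(value_at_first_hit(df, "Num_Albums", "Num_authors", 15, 2))   # 4
print(value_at_first_hit(df, "Num_Albums", "Num_authors", 100, 2))  # None
```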

Answer 2

Answered by Fabio Lamanna

I think you can directly add a column with the cumulative sum as:

In [3]: df
Out[3]: 
   index  Num_Albums  Num_authors
0      0          10            4
1      1           1            5
2      2           4            4
3      3           7         1000
4      4           1           44
5      5           3            8

In [4]: df['cumsum'] = df['Num_Albums'].cumsum()

In [5]: df
Out[5]: 
   index  Num_Albums  Num_authors  cumsum
0      0          10            4      10
1      1           1            5      11
2      2           4            4      15
3      3           7         1000      22
4      4           1           44      23
5      5           3            8      26

And then apply the condition you want on the cumsum column. For instance, you can use where to get the full row according to the filter. Setting the tolerance tol:

In [18]: tol = 2

In [19]: cond = df.where((df['cumsum']>=15-tol)&(df['cumsum']<=15+tol)).dropna()

In [20]: cond
Out[20]: 
   index  Num_Albums  Num_authors  cumsum
2    2.0         4.0          4.0    15.0
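
As the output shows, where followed by dropna turns the integer columns into floats, because the non-matching rows are filled with NaN before being dropped. A possible alternative, sketched here with plain boolean indexing and Series.between (inclusive on both ends by default), selects the same row while keeping the original dtypes. The variable names are only illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

tol = 2
running = df["Num_Albums"].cumsum()

# Keep only the rows whose cumulative sum lies in [15 - tol, 15 + tol]
cond = df[running.between(15 - tol, 15 + tol)]
print(cond)

# Value of Num_authors at the first matching row, if there is one
first_value = cond["Num_authors"].iloc[0] if not cond.empty else None
```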

Answer 3

Answered by Rahul Singh

This could also be done with code like the following:

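def your_function(df, your_condition):
    # your_condition: a function that takes the running total and
    # returns True when summing should stop
    total = 0
    for index, albums in enumerate(df['Num_Albums'].tolist()):
        total += albums
        if your_condition(total):
            # positional lookup, so repeated Num_Albums values cannot select the wrong row
            return (index, df['Num_authors'].iloc[index])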

This returns a tuple of your index and the corresponding value of Num_authors as soon as your condition is reached.

Or the index could even be returned as an array:

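def your_function(df, your_condition):
    # your_condition: a function that takes the running total and
    # returns True when summing should stop
    total = 0
    for index, albums in enumerate(df['Num_Albums'].tolist()):
        total += albums
        if your_condition(total):
            # one-element array holding the index label of the matching row
            return df['Num_authors'].iloc[[index]].index.values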

I was not able to figure out from your description when exactly the cumulative sum should stop, so I left it as a placeholder called your_condition in the code!
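
Going by the example in the question, the stopping condition could be "the running total is within 15 ± 2". The loop below is a minimal sketch under that assumption; the function name cumsum_until and its parameters are hypothetical, and it returns None if the window is never hit.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

def cumsum_until(df, sum_col, value_col, target, tol):
    """Hypothetical helper: walk down sum_col, keep a running total, and stop
    at the first row where the total is within target +/- tol."""
    total = 0
    for position, amount in enumerate(df[sum_col].tolist()):
        total += amount
        if abs(total - target) <= tol:
            return position, df[value_col].iloc[position]
    return None  # the running total never landed inside the window

position, authors = cumsum_until(df, "Num_Albums", "Num_authors", target=15, tol=2)
print(position, authors)  # 2 4
```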

I am also new, so I hope it helps!
