Pandas - group by consecutive ranges

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must likewise follow the CC BY-SA terms and attribute the original authors (not the translator), linking the original: http://stackoverflow.com/questions/36835793/

Pandas - group by consecutive ranges

python, pandas, group-by, intervals

Asked by Moshe Einhorn

I have a dataframe with the following structure - Start, End and Height.

Some properties of the dataframe:

  • A row in the dataframe always starts from where the previous row ended, i.e. if the end of row n is 100 then the start of row n+1 is 101.
  • The height of row n+1 is always different from the height of row n (this is the reason the data is in separate rows; both properties are illustrated by the short check after this list).
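
A quick way to verify these two properties on a concrete frame (this snippet is not part of the original question; it only illustrates the invariants stated above):

import pandas as pd

# A small excerpt of the example data from the question.
d = pd.DataFrame([[1, 3, 8], [4, 10, 7], [11, 17, 6]],
                 columns=['start', 'end', 'height'])

# Each row starts right after the previous row ends ...
assert (d['start'].iloc[1:].values == d['end'].iloc[:-1].values + 1).all()
# ... and consecutive rows never share the same height.
assert (d['height'].diff().iloc[1:] != 0).all()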

I'd like to group the dataframe so that heights are grouped into buckets of 5, i.e. the buckets are 0, 1-5, 6-10, 11-15 and >15.

See the code example below, where what I'm looking for is the implementation of the group_by_bucket function.

I tried looking at other questions but couldn't get exact answer to what I was looking for.

Thanks in advance!

>>> import pandas as pd
>>> d = pd.DataFrame([[1,3,8], [4,10,7], [11,17,6], [18,26, 12], [27,30, 15], [31,40,6], [41, 42, 7]], columns=['start','end', 'height'])
>>> d
   start  end  height
0      1    3       8
1      4   10       7
2     11   17       6
3     18   26      12
4     27   30      15
5     31   40       6
6     41   42       7
>>> d_gb = group_by_bucket(d)
>>> d_gb
   start  end height_grouped
0      1   17           6_10
1     18   30          11_15
2     31   42           6_10
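
For reference, a minimal sketch of what group_by_bucket could look like. This is not code from the question or the answers; it simply follows the cut + shift/cumsum approach used in the answers below, and the bucket edges and label strings are my own choice:

import pandas as pd

def group_by_bucket(d):
    # Bucket the heights: 0, 1-5, 6-10, 11-15 and >15, as requested above.
    buckets = pd.cut(d['height'],
                     bins=[-1, 0, 5, 10, 15, float('inf')],
                     labels=['0', '1_5', '6_10', '11_15', '>15'])
    # A new run starts whenever the bucket changes from one row to the next.
    runs = (buckets != buckets.shift()).cumsum()
    out = d.groupby(runs).agg({'start': 'first', 'end': 'last'})
    out['height_grouped'] = buckets.groupby(runs).first().astype(str)
    return out.reset_index(drop=True)

With the example d above this reproduces the three rows shown for d_gb.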

Accepted answer by B. M.

A way to do that:

df = pd.DataFrame([[1,3,10], [4,10,7], [11,17,6], [18,26, 12],
                   [27,30, 15], [31,40,6], [41, 42, 6]], columns=['start','end', 'height'])

Use cut to make the groups:

df['groups']=pd.cut(df.height,[-1,0,5,10,15,1000])

Find the break points (each time the bucket changes between consecutive rows, the comparison is True, so the cumulative sum starts a new group):

df['categories']=(df.groups!=df.groups.shift()).cumsum()

Then df is:

"""
   start  end  height    groups  categories
0      1    3      10   (5, 10]           0
1      4   10       7   (5, 10]           0
2     11   17       6   (5, 10]           0
3     18   26      12  (10, 15]           1
4     27   30      15  (10, 15]           1
5     31   40       6   (5, 10]           2
6     41   42       6   (5, 10]           2
"""

Define the aggregations of interest:

f = {'start':['first'],'end':['last'], 'groups':['first']}

And use the groupby.agg function:

df.groupby('categories').agg(f)
"""
              groups  end start
               first last first
categories                     
0            (5, 10]   17     1
1           (10, 15]   30    18
2            (5, 10]   42    31
"""

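If you want flat columns in the shape the question asked for, one possible way to tidy this aggregated frame (this step is not part of the original answer; the column handling below assumes the agg result shown above):

res = df.groupby('categories').agg(f)
# agg with a dict of lists produces MultiIndex columns such as
# ('start', 'first'); drop the 'first'/'last' level and rename.
res.columns = res.columns.droplevel(1)
res = res.rename(columns={'groups': 'height_grouped'})
res = res[['start', 'end', 'height_grouped']].reset_index(drop=True)

This leaves three rows, with interval objects such as (5, 10] in the height_grouped column.
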
Answer by jezrael

You can use cut, then groupby by the cumsum of the cut Series' changes together with the cut Series itself to generate the groups, and aggregate with agg using first and last:

bins = [-1,0,1,5,10,15,100]
print bins
[-1, 0, 1, 5, 10, 15, 100]

cut_ser = pd.cut(d['height'], bins=bins)
print cut_ser
0     (5, 10]
1     (5, 10]
2     (5, 10]
3    (10, 15]
4    (10, 15]
5     (5, 10]
6     (5, 10]
Name: height, dtype: category
Categories (6, object): [(-1, 0] < (0, 1] < (1, 5] < (5, 10] < (10, 15] < (15, 100]]

print (cut_ser.shift() != cut_ser).cumsum()
0    0
1    0
2    0
3    1
4    1
5    2
6    2
Name: height, dtype: int32

print (d.groupby([(cut_ser.shift() != cut_ser).cumsum(), cut_ser])
        .agg({'start' : 'first','end' : 'last'})
        .reset_index(level=1).reset_index(drop=True)
        .rename(columns={'height':'height_grouped'}))

  height_grouped  start  end
0        (5, 10]      1   17
1       (10, 15]     18   30
2        (5, 10]     31   42
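
If you prefer readable labels such as 6_10 instead of the interval objects, pd.cut also accepts a labels argument. A possible variant (the label strings and the observed=True flag are my additions, not part of the original answer):

# Same bins, but with string labels; note these bins keep 0 and 1 in
# separate buckets, exactly like the bins list above.
cut_lab = pd.cut(d['height'], bins=bins,
                 labels=['0', '1', '2_5', '6_10', '11_15', '>15'])

# observed=True keeps only label/run combinations that actually occur,
# so the unused categories do not add empty rows.
grouped = (d.groupby([(cut_lab.shift() != cut_lab).cumsum(), cut_lab],
                     observed=True)
             .agg({'start': 'first', 'end': 'last'})
             .reset_index(level=1).reset_index(drop=True)
             .rename(columns={'height': 'height_grouped'}))

For the example data this yields the same three rows, with height_grouped values 6_10, 11_15 and 6_10.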

EDIT:

Timings:

In [307]: %timeit a(df)
100 loops, best of 3: 5.45 ms per loop

In [308]: %timeit b(d)
The slowest run took 4.45 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 3.28 ms per loop

Code:

import pandas as pd

d = pd.DataFrame([[1,3,8], [4,10,7], [11,17,6], [18,26, 12], [27,30, 15], [31,40,6], [41, 42, 7]], columns=['start','end', 'height'])
print d

df = d.copy()


def a(df):
    df['groups']=pd.cut(df.height,[-1,0,5,10,15,1000])
    df['categories']=(df.groups!=df.groups.shift()).cumsum()
    f = {'start':['first'],'end':['last'], 'groups':['first']}
    return df.groupby('categories').agg(f)

def b(d):
    bins = [-1,0,1,5,10,15,100]
    cut_ser = pd.cut(d['height'], bins=bins)
    return (d.groupby([(cut_ser.shift() != cut_ser).cumsum(), cut_ser])
             .agg({'start' : 'first','end' : 'last'})
             .reset_index(level=1).reset_index(drop=True)
             .rename(columns={'height':'height_grouped'}))


print a(df)    
print b(d)
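
To reproduce the comparison outside IPython, where %timeit is not available, a rough sketch using the standard timeit module and the a/b functions above (absolute numbers will differ with machine and pandas version):

import timeit

# Total seconds for 100 calls of each implementation; a() adds columns
# to its argument, so pass a fresh copy on every call.
print(timeit.timeit(lambda: a(df.copy()), number=100))
print(timeit.timeit(lambda: b(d), number=100))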