SQL-like window functions in PANDAS: Row Numbering in Python Pandas Dataframe

Question

Asked by AllenQ

I come from a SQL background and frequently use the following data processing step:

Partition the table of data by one or more fields.
For each partition, add a row number to each of its rows that ranks the row by one or more other fields, where the analyst specifies ascending or descending order.

Example:

import pandas as pd

df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
                   'data1' : [1,2,2,3,3],
                   'data2' : [1,10,2,3,30]})
df
     data1        data2     key1    
0    1            1         a           
1    2            10        a        
2    2            2         a       
3    3            3         b       
4    3            30        a

I'm looking for the pandas equivalent of this SQL window function:

RN = ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 ASC, Data2 DESC)


    data1        data2     key1    RN
0    1            1         a       1    
1    2            10        a       2 
2    2            2         a       3
3    3            3         b       1
4    3            30        a       4

I've tried the following, which I've gotten to work when there are no 'partitions':

def row_number(frame,orderby_columns, orderby_direction,name):
    frame.sort_index(by = orderby_columns, ascending = orderby_direction, inplace = True)
    frame[name] = list(xrange(len(frame.index)))

I tried to extend this idea to work with partitions (groups in pandas), but the following didn't work:

df1 = df.groupby('key1').apply(lambda t: t.sort_index(by=['data1', 'data2'], ascending=[True, False], inplace = True)).reset_index()

def nf(x):
    x['rn'] = list(xrange(len(x.index)))

df1['rn1'] = df1.groupby('key1').apply(nf)

But I just get a lot of NaNs when I do this.

Ideally, there'd be a succinct way to replicate the window function capability of SQL (I've figured out the window-based aggregates; that's a one-liner in pandas). Can someone share the most idiomatic way to number rows like this in pandas?

Answer 1

Accepted answer by Andy Hayden

You can do this by using groupby twice along with the rank method:

In [11]: g = df.groupby('key1')

Use the method='min' argument so that rows sharing the same data1 value get the same RN:

In [12]: g['data1'].rank(method='min')
Out[12]:
0    1
1    2
2    2
3    1
4    4
dtype: float64

In [13]: df['RN'] = g['data1'].rank(method='min')

Then group by these results and add the rank with respect to data2:

In [14]: g1 = df.groupby(['key1', 'RN'])

In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0    0
1    0
2    1
3    0
4    0
dtype: float64

In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1

In [17]: df
Out[17]:
   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4

It feels like there ought to be a native way to do this (there may well be!...).
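
For reference, the two-step approach above can be collected into one runnable script. This is only a minimal sketch using the question's example frame. The final astype(int) cast is an addition that is not part of the original session (rank returns floats), and the sketch assumes no two rows in a group tie on both data1 and data2, since such ties would produce fractional ranks in step 2.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})

# Step 1: rank by data1 within each key1 group; ties share the lowest rank.
df['RN'] = df.groupby('key1')['data1'].rank(method='min')

# Step 2: within each (key1, RN) tie group, rank data2 in descending order
# and add the zero-based offset to break the tie.
df['RN'] += df.groupby(['key1', 'RN'])['data2'].rank(ascending=False) - 1

# Addition (not in the original answer): rank() yields floats, so cast to
# int to get a SQL-style row number. Assumes no full ties, see above.
df['RN'] = df['RN'].astype(int)

print(df)
```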

Answer 2

Answer by HYRY

pandas.lib.fast_zip() can create an array of tuples from a list of arrays. You can use this function to create a Series of tuples and then rank it:

values = {'key1' : ['a','a','a','b','a','b'],
          'data1' : [1,2,2,3,3,3],
          'data2' : [1,10,2,3,30,20]}

df = pd.DataFrame(values, index=list("abcdef"))

def rank_multi_columns(df, cols, **kw):
    data = []
    for col in cols:
        if col.startswith("-"):
            flag = -1
            col = col[1:]
        else:
            flag = 1
        data.append(flag*df[col])
    values = pd.lib.fast_zip(data)
    s = pd.Series(values, index=df.index)
    return s.rank(**kw)

rank = df.groupby("key1").apply(lambda df:rank_multi_columns(df, ["data1", "-data2"]))

print(rank)

The result:

a    1
b    2
c    3
d    2
e    4
f    1
dtype: float64
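
As far as I can tell, pd.lib is no longer exposed in recent pandas releases, so the code above may not run there as written. The following is a hedged sketch of the same multi-column idea that swaps fast_zip for numpy.lexsort. The helper name row_number is made up for this illustration, and, unlike rank, it always assigns distinct consecutive integers (ties keep their original order). Negating a column for descending order only works for numeric columns.

```python
import numpy as np
import pandas as pd

values = {'key1': ['a', 'a', 'a', 'b', 'a', 'b'],
          'data1': [1, 2, 2, 3, 3, 3],
          'data2': [1, 10, 2, 3, 30, 20]}
df = pd.DataFrame(values, index=list("abcdef"))

def row_number(group, cols):
    """Hypothetical helper: 1-based row numbers for one group.

    A leading "-" on a column name means descending (numeric columns only).
    """
    keys = []
    for col in cols:
        if col.startswith("-"):
            keys.append(-group[col[1:]].to_numpy())
        else:
            keys.append(group[col].to_numpy())
    # np.lexsort treats the LAST key as the primary one, so reverse the list.
    order = np.lexsort(keys[::-1])
    rn = np.empty(len(group), dtype=int)
    rn[order] = np.arange(1, len(group) + 1)
    return pd.Series(rn, index=group.index)

# Iterate over the groups explicitly and stitch the pieces back together
# in the original row order.
parts = [row_number(g, ["data1", "-data2"]) for _, g in df.groupby("key1")]
rank = pd.concat(parts).reindex(df.index)
print(rank)
```

Iterating over the groups and concatenating is a deliberate choice here: it avoids depending on how groupby.apply shapes its result, which has varied between pandas versions.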

Answer 3

Answer by MaxU

You can also use sort_values(), groupby(), and finally cumcount() + 1:

df['RN'] = df.sort_values(['data1','data2'], ascending=[True,False]) \
             .groupby(['key1']) \
             .cumcount() + 1
print(df)

This yields:

   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4

PS tested with pandas 0.18
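
It may not be obvious why this works even though df itself is never reordered. The sketch below, using the question's example frame, splits the one-liner into steps: the intermediate Series (named rn here purely for illustration) carries the index labels in sorted order, and the column assignment then aligns on those labels. This relies on df having unique index labels.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})

# Sort a copy, then number the rows within each key1 group.
rn = (df.sort_values(['data1', 'data2'], ascending=[True, False])
        .groupby('key1')
        .cumcount() + 1)

# The index labels follow the sorted order, not the original row order.
print(rn.index.tolist())

# Assignment aligns on index labels, so each number lands on its own row.
df['RN'] = rn
print(df)
```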

Answer 4

Answer by sushmit

You can use transform and rank together. Here is an example:

df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
           'C2' : [1,2,3,4,5]})
df['Rank'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.rank())
df

Have a look at Pandas Rank method for more information
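
One thing to watch: rank() defaults to method='average', which gives tied values a shared fractional rank rather than distinct row numbers. The sketch below contrasts the tie-handling methods on a small frame with deliberate ties. The data and the rank_* column names are invented for this illustration, and the SQL analogies in the comments are rough correspondences rather than exact equivalences.

```python
import pandas as pd

# Example data with ties inside each group (made up for this sketch).
df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'],
                   'C2': [1, 1, 3, 4, 4]})

g = df.groupby('C1')['C2']

df['rank_average'] = g.rank()               # default: ties share the mean rank
df['rank_first'] = g.rank(method='first')   # roughly ROW_NUMBER()
df['rank_min'] = g.rank(method='min')       # roughly RANK()
df['rank_dense'] = g.rank(method='dense')   # roughly DENSE_RANK()

print(df)
```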

Answer 5

Answer by Gokulakrishnan

Use the groupby.rank function. Here is a working example:

df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df

C1 C2
a  1
a  2
a  3
b  4
b  5

df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df

C1 C2 RANK
a  1  1
a  2  2
a  3  3
b  4  1
b  5  2
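
The same call can number rows in descending order by passing ascending=False. A small sketch on the same frame follows; the RANK_DESC column name and the integer cast are additions for illustration. The cast assumes C2 contains no missing values, because rank leaves NaN in place by default and astype(int) would then fail.

```python
import pandas as pd

df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})

# Largest C2 in each group gets row number 1.
df['RANK_DESC'] = (df.groupby('C1')['C2']
                     .rank(method='first', ascending=False)
                     .astype(int))  # rank() returns floats; cast for display

print(df)
```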
