Notice: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/17775935/
SQL-like window functions in PANDAS: Row Numbering in Python Pandas Dataframe
Asked by AllenQ
I come from a SQL background and frequently use the following data-processing step:
- Partition the table of data by one or more fields
- For each partition, add a row number to each of its rows that ranks the row by one or more other fields, where the analyst specifies ascending or descending order
Example:
df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
                   'data1' : [1,2,2,3,3],
                   'data2' : [1,10,2,3,30]})
df
   data1  data2 key1
0      1      1    a
1      2     10    a
2      2      2    a
3      3      3    b
4      3     30    a
I'm looking for the pandas equivalent of this SQL window function:
RN = ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 ASC, Data2 DESC)
   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4
I've tried the following, which works when there are no 'partitions':
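def row_number(frame, orderby_columns, orderby_direction, name):
    # Sort in place, then number the rows 0..n-1 in the sorted order.
    # Written for Python 2 and an old pandas: later versions use
    # sort_values(...) instead of sort_index(by=...), and range instead of xrange.
    frame.sort_index(by = orderby_columns, ascending = orderby_direction, inplace = True)
    frame[name] = list(xrange(len(frame.index)))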
I tried to extend this idea to work with partitions (groups in pandas), but the following didn't work:
df1 = df.groupby('key1').apply(lambda t: t.sort_index(by=['data1', 'data2'], ascending=[True, False], inplace = True)).reset_index()
def nf(x):
    x['rn'] = list(xrange(len(x.index)))
df1['rn1'] = df1.groupby('key1').apply(nf)
But I just get a lot of NaNs when I do this.
Ideally, there would be a succinct way to replicate the window-function capability of SQL (I've figured out the window-based aggregates; that's a one-liner in pandas). Can someone share the most idiomatic way to number rows like this in pandas?
Accepted answer by Andy Hayden
You can do this by using groupby twice along with the rank method:
In [11]: g = df.groupby('key1')
Pass method='min' so that rows sharing the same data1 value get the same RN:
In [12]: g['data1'].rank(method='min')
Out[12]:
0 1
1 2
2 2
3 1
4 4
dtype: float64
In [13]: df['RN'] = g['data1'].rank(method='min')
Then group by these results and add the rank with respect to data2:
In [14]: g1 = df.groupby(['key1', 'RN'])
In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0 0
1 0
2 1
3 0
4 0
dtype: float64
In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1
In [17]: df
Out[17]:
   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4
It feels like there ought to be a native way to do this (there may well be!).
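For reference, the two-step ranking described in this answer can be collected into one self-contained script. This is only a sketch of the same idea, not the answerer's exact session: the variable names `rn` and `tie_break` are made up here, and the second rank uses method='first' (an assumption, the answer uses the default method) so that the tie-break values are always whole numbers.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})

# Step 1: rank data1 within each key1 partition; ties share the lowest rank.
rn = df.groupby('key1')['data1'].rank(method='min')

# Step 2: within each (key1, step-1 rank) group, rank data2 in descending
# order and subtract 1, giving a zero-based offset that separates the ties.
# method='first' is an assumption of this sketch, not part of the answer.
tie_break = df['data2'].groupby([df['key1'], rn]).rank(method='first', ascending=False) - 1

# rank() returns floats, so cast the combined result back to integers.
df['RN'] = (rn + tie_break).astype(int)
print(df)
```

Grouping by the step-1 rank works because rows only share that rank when they tie on data1, which is exactly when the data2 ordering has to decide.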
Answer by HYRY
pandas.lib.fast_zip() can create a tuple array from a list of arrays. You can use this function to create a tuple series, and then rank it:
values = {'key1' : ['a','a','a','b','a','b'],
          'data1' : [1,2,2,3,3,3],
          'data2' : [1,10,2,3,30,20]}
df = pd.DataFrame(values, index=list("abcdef"))

def rank_multi_columns(df, cols, **kw):
    # A leading "-" on a column name means "rank this column in descending order".
    data = []
    for col in cols:
        if col.startswith("-"):
            flag = -1
            col = col[1:]
        else:
            flag = 1
        data.append(flag*df[col])
    # pd.lib was an internal module of older pandas releases, not public API.
    values = pd.lib.fast_zip(data)
    s = pd.Series(values, index=df.index)
    return s.rank(**kw)

rank = df.groupby("key1").apply(lambda df:rank_multi_columns(df, ["data1", "-data2"]))
print(rank)
The result:
a 1
b 2
c 3
d 2
e 4
f 1
dtype: float64
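Because pandas.lib is not available in current pandas releases, the snippet above may not run as written. The sketch below reaches the same numbering with a different technique, numpy.lexsort, instead of ranking tuples. The helper name `row_number_lexsort` is hypothetical, and the "-" prefix trick only works for numeric columns because it negates the values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a', 'b'],
                   'data1': [1, 2, 2, 3, 3, 3],
                   'data2': [1, 10, 2, 3, 30, 20]},
                  index=list("abcdef"))

def row_number_lexsort(group, cols):
    """Number the rows of one group; a leading '-' means descending (numeric columns only)."""
    keys = []
    for col in cols:
        if col.startswith("-"):
            keys.append(-group[col[1:]].to_numpy())
        else:
            keys.append(group[col].to_numpy())
    # np.lexsort treats the LAST key as the primary one, so reverse the list.
    order = np.lexsort(keys[::-1])
    # Invert the permutation: the row at sorted position i gets number i + 1.
    rn = np.empty(len(group), dtype=int)
    rn[order] = np.arange(1, len(group) + 1)
    return pd.Series(rn, index=group.index)

# Number each partition separately, then restore the original row order.
parts = [row_number_lexsort(g, ["data1", "-data2"]) for _, g in df.groupby("key1")]
rank = pd.concat(parts).reindex(df.index)
print(rank)
```

Iterating over the groups and concatenating the pieces avoids relying on how groupby.apply shapes its result, which has varied between pandas versions.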
Answer by MaxU
You can also use sort_values(), groupby() and finally cumcount() + 1:
df['RN'] = df.sort_values(['data1','data2'], ascending=[True,False]) \
             .groupby(['key1']) \
             .cumcount() + 1
print(df)
This yields:
   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4
P.S. Tested with pandas 0.18.
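The same pattern can be wrapped in a small reusable helper. This is a minimal sketch, assuming the DataFrame has a unique index; the function name `row_number` and its parameter names are illustrative, not part of pandas.

```python
import pandas as pd

def row_number(df, partition_by, order_by, ascending=True):
    """Return a 1-based row number per partition, similar to SQL ROW_NUMBER()."""
    # Sort a copy of the frame, then count rows within each partition
    # in that sorted order. cumcount() starts at 0, hence the + 1.
    ordered = df.sort_values(order_by, ascending=ascending)
    return ordered.groupby(partition_by).cumcount() + 1

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})

# The returned Series keeps the original index labels, so assignment
# aligns on the index and the frame's row order is left unchanged.
df['RN'] = row_number(df, ['key1'], ['data1', 'data2'], ascending=[True, False])
print(df)
```

The index alignment on assignment is what makes this work without sorting df itself; with duplicate index labels the alignment would be ambiguous, which is why a unique index is assumed.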
Answer by sushmit
You can use transform and rank together. Here is an example:
df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
                   'C2' : [1,2,3,4,5]})
df['Rank'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.rank())
df
Have a look at the pandas rank method for more information.
Answer by Gokulakrishnan
Use the groupby.rank function. Here is a working example:
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
C1  C2
 a   1
 a   2
 a   3
 b   4
 b   5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
C1  C2  RANK
 a   1     1
 a   2     2
 a   3     3
 b   4     1
 b   5     2
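The method argument of rank decides how ties are numbered, which roughly corresponds to the different SQL ranking functions. The sketch below uses a modified version of the data above with a deliberate tie in group 'a' (an assumption made here for illustration) to show the difference.

```python
import pandas as pd

# Hypothetical data: same shape as the example above, but with a tie in group 'a'.
df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'],
                   'C2': [1, 1, 3, 4, 5]})

g = df.groupby('C1')['C2']

# rank() returns floats; cast to int for SQL-style integer numbering.
df['row_number'] = g.rank(method='first').astype(int)   # like ROW_NUMBER(): ties broken by order of appearance
df['rank'] = g.rank(method='min').astype(int)           # like RANK(): ties share a rank, gaps follow
df['dense_rank'] = g.rank(method='dense').astype(int)   # like DENSE_RANK(): ties share a rank, no gaps
print(df)
```

For the row numbering asked about in the question, method='first' is the closest match, since it never assigns the same number twice within a group.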