pandas - get most recent value of a particular column indexed by another column (get maximum value of a particular column indexed by another column)
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow CC BY-SA as well and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/9850954/
Asked by enrishi
I have the following dataframe:
obj_id data_date value
0 4 2011-11-01 59500
1 2 2011-10-01 35200
2 4 2010-07-31 24860
3 1 2009-07-28 15860
4 2 2008-10-15 200200
I want to get a subset of this data so that I only have the most recent (largest 'data_date') 'value' for each 'obj_id'.
I've hacked together a solution, but it feels dirty. I was wondering if anyone has a better way. I'm sure I must be missing some easy way to do it through pandas.
My method is essentially to group, sort, retrieve, and recombine as follows:
from pandas import DataFrame

row_arr = []
for grp, grp_df in df.groupby('obj_id'):
    # keep only the row with the largest data_date in each group
    row_arr.append(grp_df.sort_values('data_date', ascending=False)[:1].values[0])
df_new = DataFrame(row_arr, columns=('obj_id', 'data_date', 'value'))
Answered by pdifranc
This is another possible solution. I believe it is the fastest.
df.loc[df.groupby('obj_id').data_date.idxmax(),:]
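As a sanity check, this one-liner can be run on the question's sample data (a self-contained sketch; the frame below is reconstructed from the table above):

```python
import pandas as pd

df = pd.DataFrame({
    'obj_id': [4, 2, 4, 1, 2],
    'data_date': pd.to_datetime(['2011-11-01', '2011-10-01',
                                 '2010-07-31', '2009-07-28', '2008-10-15']),
    'value': [59500, 35200, 24860, 15860, 200200],
})

# idxmax() returns, per group, the index label of the row with the
# largest data_date; .loc then selects exactly those rows
latest = df.loc[df.groupby('obj_id')['data_date'].idxmax()]
print(latest)
```

This keeps the most recent row per obj_id while preserving all original columns.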
Answered by thetainted1
If the number of "obj_id"s is very high, you'll want to sort the entire dataframe and then drop duplicates to get the last element.
sorted = df.sort_index(by='data_date')
result = sorted.drop_duplicates('obj_id', keep='last').values
This should be faster (sorry I didn't test it) because you don't have to do a custom agg function, which is slow when there is a large number of keys. You might think it's worse to sort the entire dataframe, but in practice Python sorts are fast and native loops are slow.
Answered by Maximilian
I like crewbum's answer; this is probably faster (sorry, I haven't tested it yet, but it avoids sorting everything):
df.groupby('obj_id').agg(lambda df: df.values[df['data_date'].values.argmax()])
It uses NumPy's argmax function to find the row index at which the maximum appears.
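Note that in current pandas, agg with a callable is applied to each column separately rather than to the whole group frame, so the same row-picking idea is usually expressed with apply. A sketch under that assumption:

```python
import pandas as pd

df = pd.DataFrame({
    'obj_id': [4, 2, 4, 1, 2],
    'data_date': pd.to_datetime(['2011-11-01', '2011-10-01',
                                 '2010-07-31', '2009-07-28', '2008-10-15']),
    'value': [59500, 35200, 24860, 15860, 200200],
})

# apply receives each group as a DataFrame; iloc + argmax selects the
# row holding the group's maximum data_date
latest = (df.groupby('obj_id')[['data_date', 'value']]
            .apply(lambda g: g.iloc[g['data_date'].values.argmax()]))
print(latest)
```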
Answered by Tamelise
Updating thetainted1's answer, since some of the functions raise future warnings now, as tommy.carstensen pointed out. Here's what worked for me:
sorted = df.sort_values(by='data_date')
result = sorted.drop_duplicates('obj_id', keep='last')
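Run against the sample frame from the question, this produces one row per obj_id (a self-contained sketch for checking):

```python
import pandas as pd

df = pd.DataFrame({
    'obj_id': [4, 2, 4, 1, 2],
    'data_date': pd.to_datetime(['2011-11-01', '2011-10-01',
                                 '2010-07-31', '2009-07-28', '2008-10-15']),
    'value': [59500, 35200, 24860, 15860, 200200],
})

# sort ascending by date, then keep='last' retains the most recent
# row of each obj_id
result = df.sort_values(by='data_date').drop_duplicates('obj_id', keep='last')
print(result)
```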
Answered by Garrett
The aggregate() method on groupby objects can be used to create a new DataFrame from a groupby object in a single step. (I'm not aware of a cleaner way to extract the first/last row of a DataFrame, though.)
In [12]: df.groupby('obj_id').agg(lambda df: df.sort('data_date')[-1:].values[0])
Out[12]:
data_date value
obj_id
1 2009-07-28 15860
2 2011-10-01 35200
4 2011-11-01 59500
You can also perform aggregation on individual columns, in which case the aggregate function works on a Series object.
In [25]: df.groupby('obj_id')['value'].agg({'diff': lambda s: s.max() - s.min()})
Out[25]:
diff
obj_id
1 0
2 165000
4 34640
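The dict-renaming form of agg shown above was deprecated and later removed; in current pandas the same per-column aggregation is usually written with named aggregation. A sketch under that assumption:

```python
import pandas as pd

df = pd.DataFrame({
    'obj_id': [4, 2, 4, 1, 2],
    'value': [59500, 35200, 24860, 15860, 200200],
})

# named aggregation: new_column=(source_column, function)
diff = df.groupby('obj_id').agg(diff=('value', lambda s: s.max() - s.min()))
print(diff)
```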
Answered by Zihs
I believe I have found a more appropriate solution based on the ones in this thread. However, mine uses the DataFrame apply function instead of aggregate. It also returns a new dataframe with the same columns as the original.
import pandas as pd

df = pd.DataFrame({
    'CARD_NO': ['000', '001', '002', '002', '001', '111'],
    'DATE': ['2006-12-31 20:11:39', '2006-12-27 20:11:53', '2006-12-28 20:12:11',
             '2006-12-28 20:12:13', '2008-12-27 20:11:53', '2006-12-30 20:11:39']})
print(df)
df.groupby('CARD_NO').apply(lambda df: df['DATE'].values[df['DATE'].values.argmax()])
Original
CARD_NO DATE
0 000 2006-12-31 20:11:39
1 001 2006-12-27 20:11:53
2 002 2006-12-28 20:12:11
3 002 2006-12-28 20:12:13
4 001 2008-12-27 20:11:53
5 111 2006-12-30 20:11:39
Returned dataframe:
CARD_NO
000 2006-12-31 20:11:39
001 2008-12-27 20:11:53
002 2006-12-28 20:12:13
111 2006-12-30 20:11:39
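A self-contained variant of the same idea, grouped on the DATE column alone (comparing the zero-padded timestamp strings lexicographically still orders them chronologically):

```python
import pandas as pd

df = pd.DataFrame({
    'CARD_NO': ['000', '001', '002', '002', '001', '111'],
    'DATE': ['2006-12-31 20:11:39', '2006-12-27 20:11:53',
             '2006-12-28 20:12:11', '2006-12-28 20:12:13',
             '2008-12-27 20:11:53', '2006-12-30 20:11:39'],
})

# argmax on the string array finds each card's latest timestamp,
# since 'YYYY-MM-DD HH:MM:SS' strings sort chronologically
latest = df.groupby('CARD_NO')['DATE'].apply(
    lambda s: s.values[s.values.argmax()])
print(latest)
```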

