pandas - get most recent value of a particular column indexed by another column (get maximum value of a particular column indexed by another column)
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow CC BY-SA as well and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/9850954/
Asked by enrishi
I have the following dataframe:
obj_id data_date value
0 4 2011-11-01 59500
1 2 2011-10-01 35200
2 4 2010-07-31 24860
3 1 2009-07-28 15860
4 2 2008-10-15 200200
I want to get a subset of this data so that I only have the most recent (largest 'data_date') 'value' for each 'obj_id'.
I've hacked together a solution, but it feels dirty. I was wondering if anyone has a better way. I'm sure I must be missing some easy way to do it through pandas.
My method is essentially to group, sort, retrieve, and recombine as follows:
from pandas import DataFrame

row_arr = []
for grp, grp_df in df.groupby('obj_id'):
    # keep only the row with the largest data_date in each group
    row_arr.append(grp_df.sort_values('data_date', ascending=False)[:1].values[0])
df_new = DataFrame(row_arr, columns=('obj_id', 'data_date', 'value'))
Answered by pdifranc
This is another possible solution. I believe it is the fastest.
df.loc[df.groupby('obj_id').data_date.idxmax(),:]
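As a sanity check, this one-liner can be run on the question's sample data (a self-contained sketch; the frame below is reconstructed from the table above):

```python
import pandas as pd

df = pd.DataFrame({
    'obj_id': [4, 2, 4, 1, 2],
    'data_date': pd.to_datetime(['2011-11-01', '2011-10-01',
                                 '2010-07-31', '2009-07-28', '2008-10-15']),
    'value': [59500, 35200, 24860, 15860, 200200],
})

# idxmax() returns, per group, the index label of the row with the
# largest data_date; .loc then selects exactly those rows
latest = df.loc[df.groupby('obj_id')['data_date'].idxmax()]
print(latest)
```

This keeps the most recent row per obj_id while preserving all original columns.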
Answered by thetainted1
If the number of "obj_id"s is very high, you'll want to sort the entire dataframe and then drop duplicates to get the last element.
sorted = df.sort_index(by='data_date')
result = sorted.drop_duplicates('obj_id', keep='last').values
This should be faster (sorry I didn't test it) because you don't have to do a custom agg function, which is slow when there is a large number of keys. You might think it's worse to sort the entire dataframe, but in practice Python sorts are fast and native loops are slow.
Answered by Maximilian
I like crewbum's answer; this is probably faster (sorry, I haven't tested it yet, but it avoids sorting everything):
df.groupby('obj_id').agg(lambda df: df.values[df['data_date'].values.argmax()])
It uses NumPy's argmax function to find the row index at which the maximum appears.
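Note that in current pandas, agg with a callable is applied to each column separately rather than to the whole group frame, so the same row-picking idea is usually expressed with apply. A sketch under that assumption:

```python
import pandas as pd

df = pd.DataFrame({
    'obj_id': [4, 2, 4, 1, 2],
    'data_date': pd.to_datetime(['2011-11-01', '2011-10-01',
                                 '2010-07-31', '2009-07-28', '2008-10-15']),
    'value': [59500, 35200, 24860, 15860, 200200],
})

# apply receives each group as a DataFrame; iloc + argmax selects the
# row holding the group's maximum data_date
latest = (df.groupby('obj_id')[['data_date', 'value']]
            .apply(lambda g: g.iloc[g['data_date'].values.argmax()]))
print(latest)
```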
Answered by Tamelise
Updating thetainted1's answer, since some of the functions raise future warnings now, as tommy.carstensen pointed out. Here's what worked for me:
sorted = df.sort_values(by='data_date')
result = sorted.drop_duplicates('obj_id', keep='last')
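Run against the sample frame from the question, this produces one row per obj_id (a self-contained sketch for checking):

```python
import pandas as pd

df = pd.DataFrame({
    'obj_id': [4, 2, 4, 1, 2],
    'data_date': pd.to_datetime(['2011-11-01', '2011-10-01',
                                 '2010-07-31', '2009-07-28', '2008-10-15']),
    'value': [59500, 35200, 24860, 15860, 200200],
})

# sort ascending by date, then keep='last' retains the most recent
# row of each obj_id
result = df.sort_values(by='data_date').drop_duplicates('obj_id', keep='last')
print(result)
```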
Answered by Garrett
The aggregate() method on groupby objects can be used to create a new DataFrame from a groupby object in a single step. (I'm not aware of a cleaner way to extract the first/last row of a DataFrame, though.)
In [12]: df.groupby('obj_id').agg(lambda df: df.sort('data_date')[-1:].values[0])
Out[12]:
data_date value
obj_id
1 2009-07-28 15860
2 2011-10-01 35200
4 2011-11-01 59500
You can also perform aggregation on individual columns, in which case the aggregate function works on a Series object.
In [25]: df.groupby('obj_id')['value'].agg({'diff': lambda s: s.max() - s.min()})
Out[25]:
diff
obj_id
1 0
2 165000
4 34640
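The dict-renaming form of agg shown above was deprecated and later removed; in current pandas the same per-column aggregation is usually written with named aggregation. A sketch under that assumption:

```python
import pandas as pd

df = pd.DataFrame({
    'obj_id': [4, 2, 4, 1, 2],
    'value': [59500, 35200, 24860, 15860, 200200],
})

# named aggregation: new_column=(source_column, function)
diff = df.groupby('obj_id').agg(diff=('value', lambda s: s.max() - s.min()))
print(diff)
```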
Answered by Zihs
I believe I have found a more appropriate solution based on the ones in this thread. However, mine uses the DataFrame apply function instead of aggregate. It also returns a new dataframe with the same columns as the original.
import pandas as pd

df = pd.DataFrame({
    'CARD_NO': ['000', '001', '002', '002', '001', '111'],
    'DATE': ['2006-12-31 20:11:39', '2006-12-27 20:11:53', '2006-12-28 20:12:11',
             '2006-12-28 20:12:13', '2008-12-27 20:11:53', '2006-12-30 20:11:39']})
print(df)
df.groupby('CARD_NO').apply(lambda df: df['DATE'].values[df['DATE'].values.argmax()])
Original
CARD_NO DATE
0 000 2006-12-31 20:11:39
1 001 2006-12-27 20:11:53
2 002 2006-12-28 20:12:11
3 002 2006-12-28 20:12:13
4 001 2008-12-27 20:11:53
5 111 2006-12-30 20:11:39
Returned dataframe:
CARD_NO
000 2006-12-31 20:11:39
001 2008-12-27 20:11:53
002 2006-12-28 20:12:13
111 2006-12-30 20:11:39
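A self-contained variant of the same idea, grouped on the DATE column alone (comparing the zero-padded timestamp strings lexicographically still orders them chronologically):

```python
import pandas as pd

df = pd.DataFrame({
    'CARD_NO': ['000', '001', '002', '002', '001', '111'],
    'DATE': ['2006-12-31 20:11:39', '2006-12-27 20:11:53',
             '2006-12-28 20:12:11', '2006-12-28 20:12:13',
             '2008-12-27 20:11:53', '2006-12-30 20:11:39'],
})

# argmax on the string array finds each card's latest timestamp,
# since 'YYYY-MM-DD HH:MM:SS' strings sort chronologically
latest = df.groupby('CARD_NO')['DATE'].apply(
    lambda s: s.values[s.values.argmax()])
print(latest)
```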

