Python: computing diffs within groups of a dataframe

Notice: this page is a Chinese-English translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/20648346/

Time: 2020-08-18 20:56:33  Source: igfitidea

Computing diffs within groups of a dataframe

python pandas

Asked by 8one6

Say I have a dataframe with 3 columns: Date, Ticker, Value (no index, at least to start with). I have many dates and many tickers, but each (ticker, date) tuple is unique. (But obviously the same date will show up in many rows since it will be there for multiple tickers, and the same ticker will show up in multiple rows since it will be there for many dates.)

Initially, my rows are in a specific order, but not sorted by any of the columns.

I would like to compute first differences (daily changes) of each ticker (ordered by date) and put these in a new column in my dataframe. Given this context, I cannot simply do

df['diffs'] = df['value'].diff()

because adjacent rows do not come from the same ticker. Sorting like this:

df = df.sort_values(['ticker', 'date'])
df['diffs'] = df['value'].diff()

doesn't solve the problem because there will be "borders". I.e., after that sort, the last value for one ticker will sit directly above the first value for the next ticker, so computing differences would take a difference between two tickers. I don't want this. I want the earliest date for each ticker to wind up with a NaN in its diff column.

This seems like an obvious time to use groupby, but for whatever reason I can't seem to get it to work properly. To be clear, I would like to perform the following process:

  1. Group rows based on their ticker
  2. Within each group, sort rows by their date
  3. Within each sorted group, compute differences of the value column
  4. Put these differences into the original dataframe in a new diffs column (ideally leaving the original dataframe order intact.)

I have to imagine this is a one-liner. But what am I missing?

Edit at 9:00pm 2013-12-17

Ok...some progress. I can do the following to get a new dataframe:

result = df.set_index(['ticker', 'date'])\
    .groupby(level='ticker')\
    .transform(lambda x: x.sort_index().diff())\
    .reset_index()

But if I understand the mechanics of groupby, my rows will now be sorted first by ticker and then by date. Is that correct? If so, would I need to do a merge to append the differences column (currently in result['current']) to the original dataframe df?

Accepted answer by behzad.nouri

Wouldn't it be easier to just do what you describe yourself, namely

df.sort_values(['ticker', 'date'], inplace=True)
df['diffs'] = df['value'].diff()

and then correct for borders:

mask = df.ticker != df.ticker.shift(1)
df.loc[mask, 'diffs'] = np.nan

To maintain the original index you may do idx = df.index in the beginning, and then at the end you can do df.reindex(idx). Or, if it is a huge dataframe, perform the operations on

df.filter(['ticker', 'date', 'value'])

and then join the two dataframes at the end.
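
As a rough end-to-end sketch of the sort, diff, mask, and reindex steps described above, the following uses the sample data from this answer. It assumes a reasonably recent pandas in which `sort_values` replaces the older `sort`; the name `out` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the example table in this answer.
df = pd.DataFrame({
    "date":   [63, 88, 22, 76, 72, 34, 92, 22, 32, 59],
    "ticker": list("CCCABABAAB"),
    "value":  [1.65, -1.93, -1.29, -0.79, -1.24, -0.23, 2.43, 0.55, -2.50, -1.01],
})

idx = df.index                            # remember the original row order
out = df.sort_values(["ticker", "date"])  # "out" is a hypothetical name
out["diffs"] = out["value"].diff()

# The first row of each ticker has no predecessor inside its own group,
# so the difference computed across the border is discarded.
mask = out["ticker"] != out["ticker"].shift(1)
out.loc[mask, "diffs"] = np.nan

out = out.reindex(idx)                    # restore the original row order
```

Note that `reindex` returns a new dataframe rather than modifying in place, so the result has to be assigned.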

Edit: alternatively (though still not using groupby):

df.set_index(['ticker','date'], inplace=True)
df.sort_index(inplace=True)
df['diffs'] = np.nan 

for idx in df.index.levels[0]:
    df.diffs[idx] = df.value[idx].diff()

for

   date ticker  value
0    63      C   1.65
1    88      C  -1.93
2    22      C  -1.29
3    76      A  -0.79
4    72      B  -1.24
5    34      A  -0.23
6    92      B   2.43
7    22      A   0.55
8    32      A  -2.50
9    59      B  -1.01

this will produce:

             value  diffs
ticker date              
A      22     0.55    NaN
       32    -2.50  -3.05
       34    -0.23   2.27
       76    -0.79  -0.56
B      59    -1.01    NaN
       72    -1.24  -0.23
       92     2.43   3.67
C      22    -1.29    NaN
       63     1.65   2.94
       88    -1.93  -3.58

Answer by HYRY

You can use pivot to convert the dataframe into a date-by-ticker table. Here is an example:

create the test data first:

import pandas as pd
import numpy as np
import random
from itertools import product

dates = pd.date_range(start="2013-12-01", periods=10).strftime("%Y-%m-%d")
ticks = "ABCDEF"
pairs = list(product(dates, ticks))
random.shuffle(pairs)
pairs = pairs[:-5]
values = np.random.rand(len(pairs))

dates, ticks = zip(*pairs)
df = pd.DataFrame({"date":dates, "tick":ticks, "value":values})

convert the dataframe to pivot format:

df2 = df.pivot(index="date", columns="tick", values="value")

fill NaN:

df2 = df2.ffill()

call the diff() method:

df2.diff()

here is what df2 looks like:

tick               A         B         C         D         E         F
date                                                                  
2013-12-01  0.077260  0.084008  0.711626  0.071267  0.811979  0.429552
2013-12-02  0.106349  0.141972  0.457850  0.338869  0.721703  0.217295
2013-12-03  0.330300  0.893997  0.648687  0.628502  0.543710  0.217295
2013-12-04  0.640902  0.827559  0.243816  0.819218  0.543710  0.190338
2013-12-05  0.263300  0.604084  0.655723  0.299913  0.756980  0.135087
2013-12-06  0.278123  0.243264  0.907513  0.723819  0.506553  0.717509
2013-12-07  0.960452  0.243264  0.357450  0.160799  0.506553  0.194619
2013-12-08  0.670322  0.256874  0.637153  0.582727  0.628581  0.159636
2013-12-09  0.226519  0.284157  0.388755  0.325461  0.957234  0.810376
2013-12-10  0.958412  0.852611  0.472012  0.832173  0.957234  0.723234
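
The answer stops at the wide table. One possible way to bring the differences back onto the original long-format rows is sketched below; it is an illustration under assumptions, not part of the answer. The small data set and the names `wide`, `long_diffs`, and `result` are made up here. Keep in mind that forward-filling means a diff is taken against the last known value, which may or may not be what you want when a (date, tick) pair is missing.

```python
import pandas as pd

# Small hand-made long-format frame (hypothetical data, not from the answer).
# Ticker B has no row for 2013-12-02.
df = pd.DataFrame({
    "date":  ["2013-12-02", "2013-12-01", "2013-12-01", "2013-12-03", "2013-12-03"],
    "tick":  ["A", "A", "B", "A", "B"],
    "value": [2.0, 1.0, 10.0, 4.0, 13.0],
})

wide = df.pivot(index="date", columns="tick", values="value").ffill()

# unstack() on a frame with a single-level index returns a Series
# indexed by (tick, date), one entry per cell of the wide table.
long_diffs = wide.diff().unstack().rename("diffs")

# Look up each original row's diff by its (tick, date) key.
# The row order and index of df are left unchanged.
result = df.join(long_diffs, on=["tick", "date"])
```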

Answer by 8one6

Ok. Lots of thinking about this, and I think this is my favorite combination of the solutions above and a bit of playing around. Original data lives in df:

df.sort_values(['ticker', 'date'], inplace=True)

# for this example, with diff, I think this syntax is a bit clunky
# but for more general examples, this should be good.  But can we do better?
df['diffs'] = df.groupby(['ticker'])['value'].transform(lambda x: x.diff()) 

df.sort_index(inplace=True)

This will accomplish everything I want. What I really like is that it generalizes to cases where you want to apply a function more intricate than diff. In particular, you could use something like lambda x: pd.rolling_mean(x, 20, 20) to make a column of rolling means without worrying about each ticker's data being corrupted by that of any other ticker (groupby takes care of that for you...).
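
A minimal sketch of the rolling-mean variant follows. `pd.rolling_mean` has since been removed from pandas, so this uses `Series.rolling(...).mean()` instead. The tiny data set, the window of 2 (rather than 20), and the column name `roll2` are assumptions chosen to keep the example short.

```python
import pandas as pd

# Hypothetical data: two tickers, three dates each, in scrambled order.
df = pd.DataFrame({
    "date":   [3, 1, 2, 1, 3, 2],
    "ticker": ["A", "A", "A", "B", "B", "B"],
    "value":  [4.0, 1.0, 2.0, 10.0, 16.0, 12.0],
})

# Put each ticker's rows in date order before computing.
df.sort_values(["ticker", "date"], inplace=True)

# The rolling window is applied per ticker, so it never spans two tickers.
df["roll2"] = df.groupby("ticker")["value"].transform(
    lambda x: x.rolling(window=2, min_periods=2).mean()
)

# Restore the original row order.
df.sort_index(inplace=True)
```

With `min_periods` equal to the window size, the first rows of each ticker stay NaN until a full window is available.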

So here's the question I'm left with...why doesn't the following work for the line that starts df['diffs']:

df['diffs'] = df.groupby(['ticker'])['value'].transform(np.diff)

When I do that, I get a diffs column full of 0's. Any thoughts on that?
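
The thread does not settle why zeros appeared in the pandas version of the time. One difference that is certain, though, is that `np.diff` returns one element fewer than its input, while `Series.diff` keeps the length and puts NaN first, which is the shape a transform needs. The sketch below shows that, plus the built-in grouped `diff()` as an alternative to the lambda; the small data set is made up for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 4.0, 9.0])
n_numpy = len(np.diff(values.to_numpy()))   # one element shorter than the input
n_pandas = len(values.diff())               # same length, with a leading NaN

# Hypothetical data in scrambled order.
df = pd.DataFrame({
    "date":   [2, 1, 1, 2],
    "ticker": ["A", "A", "B", "B"],
    "value":  [3.0, 1.0, 10.0, 14.0],
})

# Grouped diff on a sorted copy. The result keeps df's original index
# labels, so the assignment aligns it back onto the unsorted rows.
df["diffs"] = df.sort_values(["ticker", "date"]).groupby("ticker")["value"].diff()
```

Because the assignment aligns on the index, this form does not need the sort and sort_index round trip on `df` itself.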

Answer by Amelio Vazquez-Reina

Here is a solution that builds on what @behzad.nouri wrote, but using pd.IndexSlice:

df =  df.set_index(['ticker', 'date']).sort_index()[['value']]
df['diff'] = np.nan
idx = pd.IndexSlice

for ix in df.index.levels[0]:
    df.loc[ idx[ix,:], 'diff'] = df.loc[idx[ix,:], 'value' ].diff()

For:

> df
   date ticker  value
0    63      C   1.65
1    88      C  -1.93
2    22      C  -1.29
3    76      A  -0.79
4    72      B  -1.24
5    34      A  -0.23
6    92      B   2.43
7    22      A   0.55
8    32      A  -2.50
9    59      B  -1.01

It returns:

> df
             value  diff
ticker date             
A      22     0.55   NaN
       32    -2.50 -3.05
       34    -0.23  2.27
       76    -0.79 -0.56
B      59    -1.01   NaN
       72    -1.24 -0.23
       92     2.43  3.67
C      22    -1.29   NaN
       63     1.65  2.94
       88    -1.93 -3.58