Computing diffs within groups of a dataframe
Original question: http://stackoverflow.com/questions/20648346/
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors on Stack Overflow (not to this page).
Asked by 8one6
Say I have a dataframe with 3 columns: Date, Ticker, Value (no index, at least to start with). I have many dates and many tickers, but each (ticker, date) tuple is unique. (Obviously the same date will show up in many rows, since it will be there for multiple tickers, and the same ticker will show up in multiple rows, since it will be there for many dates.)
Initially, my rows are in a specific order, but not sorted by any of the columns.
I would like to compute first differences (daily changes) of each ticker (ordered by date) and put these in a new column in my dataframe. Given this context, I cannot simply do
df['diffs'] = df['value'].diff()
because adjacent rows do not come from the same ticker. Sorting like this:
df = df.sort_values(['ticker', 'date'])
df['diffs'] = df['value'].diff()
doesn't solve the problem because there will be "borders". That is, after that sort, the last value for one ticker will sit directly above the first value for the next ticker, so computing differences would take a difference between two tickers. I don't want this. I want the earliest date for each ticker to wind up with a NaN in its diff column.
This seems like an obvious time to use groupby, but for whatever reason I can't seem to get it to work properly. To be clear, I would like to perform the following process:
- Group rows based on their ticker
- Within each group, sort rows by their date
- Within each sorted group, compute differences of the value column
- Put these differences into the original dataframe in a new diffs column (ideally leaving the original dataframe order intact.)
I have to imagine this is a one-liner. But what am I missing?
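The four steps above can be sketched roughly as follows. This is only one possible reading of the question, assuming lower-case column names (date, ticker, value) and reusing the small sample table that appears in the answers further down; it relies on pandas aligning the result on the original row index when the new column is assigned.

```python
import pandas as pd

# Sample data copied from the answers on this page (column names assumed lower-case).
df = pd.DataFrame({
    "date":   [63, 88, 22, 76, 72, 34, 92, 22, 32, 59],
    "ticker": ["C", "C", "C", "A", "B", "A", "B", "A", "A", "B"],
    "value":  [1.65, -1.93, -1.29, -0.79, -1.24, -0.23, 2.43, 0.55, -2.50, -1.01],
})

# Sort a temporary copy by date, diff within each ticker, and assign back.
# The result keeps the original index labels, so the assignment aligns on
# them and df itself stays in its original row order.
df["diffs"] = df.sort_values("date").groupby("ticker")["value"].diff()

print(df)
```

The earliest date of each ticker ends up with NaN, because the diff never crosses a group boundary.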
Edit at 9:00pm 2013-12-17
Ok...some progress. I can do the following to get a new dataframe:
result = df.set_index(['ticker', 'date'])\
.groupby(level='ticker')\
.transform(lambda x: x.sort_index().diff())\
.reset_index()
But if I understand the mechanics of groupby, my rows will now be sorted first by ticker and then by date. Is that correct? If so, would I need to do a merge to append the differences column (currently in result['current']) to the original dataframe df?
Accepted answer by behzad.nouri
Wouldn't it be easier to just do what you describe yourself, namely
df.sort_values(['ticker', 'date'], inplace=True)
df['diffs'] = df['value'].diff()
and then correct for borders:
mask = df.ticker != df.ticker.shift(1)
df.loc[mask, 'diffs'] = np.nan
To maintain the original index, you may do idx = df.index in the beginning, and then at the end you can do df.reindex(idx). Or, if it is a huge dataframe, perform the operations on
df.filter(['ticker', 'date', 'value'])
and then join the two dataframes at the end.
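A minimal sketch of the sort, diff, and border-correction idea, written so that the original row order is left untouched. It assumes lower-case column names and the sample table used later in this answer; the names tmp and border are only illustrative, and Series.mask is used here in place of the explicit NaN assignment.

```python
import pandas as pd

df = pd.DataFrame({
    "date":   [63, 88, 22, 76, 72, 34, 92, 22, 32, 59],
    "ticker": ["C", "C", "C", "A", "B", "A", "B", "A", "A", "B"],
    "value":  [1.65, -1.93, -1.29, -0.79, -1.24, -0.23, 2.43, 0.55, -2.50, -1.01],
})

# Work on a sorted copy; df keeps its original order and index.
tmp = df.sort_values(["ticker", "date"])

# True on the first row of each ticker, i.e. at every "border".
border = tmp["ticker"] != tmp["ticker"].shift(1)

# Plain diff over the sorted values, with the border rows blanked out.
diffs = tmp["value"].diff().mask(border)

# Assigning a Series aligns on the index, so each diff lands on its original row.
df["diffs"] = diffs

print(df)
```

Because the assignment aligns on index labels, no explicit reindex or join is needed in this small case.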
Edit: alternatively (though still not using groupby):
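# index by (ticker, date) and sort so each ticker's rows are contiguous and in date order
df.set_index(['ticker','date'], inplace=True)
df.sort_index(inplace=True)
df['diffs'] = np.nan
for idx in df.index.levels[0]:
    # select one ticker while keeping both index levels, so the assignment aligns
    rows = (idx, slice(None))
    df.loc[rows, 'diffs'] = df.loc[rows, 'value'].diff()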
For:
   date ticker  value
0    63      C   1.65
1    88      C  -1.93
2    22      C  -1.29
3    76      A  -0.79
4    72      B  -1.24
5    34      A  -0.23
6    92      B   2.43
7    22      A   0.55
8    32      A  -2.50
9    59      B  -1.01
this will produce:
             value  diffs
ticker date
A      22     0.55    NaN
       32    -2.50  -3.05
       34    -0.23   2.27
       76    -0.79  -0.56
B      59    -1.01    NaN
       72    -1.24  -0.23
       92     2.43   3.67
C      22    -1.29    NaN
       63     1.65   2.94
       88    -1.93  -3.58
Answer by HYRY
You can use pivot to convert the dataframe into a date-by-ticker table. Here is an example.
Create the test data first:
import pandas as pd
import numpy as np
import random
from itertools import product
dates = pd.date_range(start="2013-12-01", periods=10).strftime("%Y-%m-%d")
ticks = "ABCDEF"
pairs = list(product(dates, ticks))
random.shuffle(pairs)
pairs = pairs[:-5]
values = np.random.rand(len(pairs))
dates, ticks = zip(*pairs)
df = pd.DataFrame({"date":dates, "tick":ticks, "value":values})
Convert the dataframe to a wide table with pivot:
df2 = df.pivot(index="date", columns="tick", values="value")
Forward-fill the NaN values:
df2 = df2.ffill()
Call the diff() method:
df2.diff()
Here is what df2 looks like:
tick               A         B         C         D         E         F
date
2013-12-01  0.077260  0.084008  0.711626  0.071267  0.811979  0.429552
2013-12-02  0.106349  0.141972  0.457850  0.338869  0.721703  0.217295
2013-12-03  0.330300  0.893997  0.648687  0.628502  0.543710  0.217295
2013-12-04  0.640902  0.827559  0.243816  0.819218  0.543710  0.190338
2013-12-05  0.263300  0.604084  0.655723  0.299913  0.756980  0.135087
2013-12-06  0.278123  0.243264  0.907513  0.723819  0.506553  0.717509
2013-12-07  0.960452  0.243264  0.357450  0.160799  0.506553  0.194619
2013-12-08  0.670322  0.256874  0.637153  0.582727  0.628581  0.159636
2013-12-09  0.226519  0.284157  0.388755  0.325461  0.957234  0.810376
2013-12-10  0.958412  0.852611  0.472012  0.832173  0.957234  0.723234
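As a small worked illustration of the pivot approach, the sketch below uses a tiny made-up table (the values are invented for this example, not taken from the answer). It shows one consequence worth keeping in mind: after forward-filling, a date on which a ticker has no row gets a diff of 0 rather than being absent.

```python
import pandas as pd

# Hypothetical long-format data: ticker B has no row for 2013-12-02.
df = pd.DataFrame({
    "date":  ["2013-12-01", "2013-12-01", "2013-12-02", "2013-12-03", "2013-12-03"],
    "tick":  ["A", "B", "A", "A", "B"],
    "value": [1.0, 10.0, 1.5, 1.25, 13.0],
})

# One row per date, one column per ticker.
wide = df.pivot(index="date", columns="tick", values="value")

# Forward-fill the gap for B, then difference down each column.
diffs = wide.ffill().diff()

print(diffs)
```

Each column is differenced independently, so values from different tickers are never mixed. Whether a 0 on the filled date is the desired result depends on the use case.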
Answer by 8one6
Ok. After a lot of thinking about this and a bit of playing around, I think this is my favorite combination of the solutions above. The original data lives in df:
df.sort_values(['ticker', 'date'], inplace=True)
# for this example, with diff, I think this syntax is a bit clunky
# but for more general examples, this should be good. But can we do better?
df['diffs'] = df.groupby(['ticker'])['value'].transform(lambda x: x.diff())
df.sort_index(inplace=True)
This will accomplish everything I want. What I really like is that it can be generalized to cases where you want to apply a function more intricate than diff. In particular, you could do things like lambda x: pd.rolling_mean(x, 20, 20) to make a column of rolling means where you don't need to worry about each ticker's data being corrupted by that of any other ticker (groupby takes care of that for you...).
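A rough sketch of that generalization, under a few assumptions: the data below is invented, a window of 2 is used instead of 20 so the effect is visible on a tiny table, and the .rolling() method stands in for the top-level pd.rolling_mean function, which is no longer available in current pandas.

```python
import pandas as pd

# Hypothetical data, already sorted by ticker and date.
df = pd.DataFrame({
    "ticker": ["A"] * 4 + ["B"] * 4,
    "date":   [1, 2, 3, 4] * 2,
    "value":  [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

# Rolling mean computed separately inside each ticker group.
# min_periods equals the window, so the first row of each group is NaN
# and no window ever spans two tickers.
df["rolling_mean"] = df.groupby("ticker")["value"].transform(
    lambda x: x.rolling(window=2, min_periods=2).mean()
)

print(df)
```

The first row of ticker B is NaN rather than an average of A's last value and B's first value, which is the isolation between groups described above.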
So here's the question I'm left with...why doesn't the following work for the line that starts df['diffs']:
df['diffs'] = df.groupby(['ticker'])['value'].transform(np.diff)
When I do that, I get a diffs column full of 0's. Any thoughts on that?
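The thread does not settle why the zeros appeared, and this sketch does not try to reproduce them. It only shows one difference that is certain: np.diff returns an array with one element fewer than its input, whereas Series.diff keeps the original length and puts NaN first, which is the shape transform expects. The helper padded_diff and the data are hypothetical, written only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.55, -2.50, -0.23, -0.79])
print(len(np.diff(s.to_numpy())))  # one element shorter than s
print(len(s.diff()))               # same length as s, NaN in first position

def padded_diff(x):
    """Hypothetical helper: np.diff padded with a leading NaN, same index as x."""
    values = np.concatenate(([np.nan], np.diff(x.to_numpy())))
    return pd.Series(values, index=x.index)

# Invented data, already sorted by ticker and date.
df = pd.DataFrame({
    "ticker": ["A", "A", "A", "B", "B"],
    "value":  [1.0, 4.0, 2.0, 10.0, 7.0],
})
df["diffs"] = df.groupby("ticker")["value"].transform(padded_diff)

print(df)
```

With the padding in place, the result matches what the lambda x: x.diff() version produces on the same data.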
Answer by Amelio Vazquez-Reina
Here is a solution that builds on what @behzad.nouri wrote, but using pd.IndexSlice:
df = df.set_index(['ticker', 'date']).sort_index()[['value']]
df['diff'] = np.nan
idx = pd.IndexSlice
for ix in df.index.levels[0]:
    df.loc[idx[ix, :], 'diff'] = df.loc[idx[ix, :], 'value'].diff()
For:
> df
   date ticker  value
0    63      C   1.65
1    88      C  -1.93
2    22      C  -1.29
3    76      A  -0.79
4    72      B  -1.24
5    34      A  -0.23
6    92      B   2.43
7    22      A   0.55
8    32      A  -2.50
9    59      B  -1.01
It returns:
> df
             value   diff
ticker date
A      22     0.55    NaN
       32    -2.50  -3.05
       34    -0.23   2.27
       76    -0.79  -0.56
B      59    -1.01    NaN
       72    -1.24  -0.23
       92     2.43   3.67
C      22    -1.29    NaN
       63     1.65   2.94
       88    -1.93  -3.58

