pandas: Filter data to get only first day of the month rows
Source: http://stackoverflow.com/questions/25797038/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Asked by tadalendas
I have a dataset of daily data. I need to get only the data for the first day of each month in the dataset (the data runs from 1972 to 2013). For example, I would need the row with index 20, date 2013-12-02, value 0.1555 to be extracted. The problem is that the first available day differs from month to month, so I cannot use a fixed step such as relativedelta(months=1). How would I go about extracting these values from my dataset?
Is there a command similar to the one I found in another post for R? R - XTS: Get the first dates and values for each month from a daily time series with missing rows
17 2013-12-05 0.1621
18 2013-12-04 0.1698
19 2013-12-03 0.1516
20 2013-12-02 0.1555
21 2013-11-29 0.1480
22 2013-11-27 0.1487
23 2013-11-26 0.1648
Accepted answer by Andy Hayden
I would group by month and then take the zeroth row of each group with nth.
First set as index (I think this is necessary):
In [11]: df1 = df.set_index('date')
In [12]: df1
Out[12]:
n val
date
2013-12-05 17 0.1621
2013-12-04 18 0.1698
2013-12-03 19 0.1516
2013-12-02 20 0.1555
2013-11-29 21 0.1480
2013-11-27 22 0.1487
2013-11-26 23 0.1648
Next, sort so that the first element of each group is the first date of that month (note: this doesn't appear to be necessary for nth, but I think that's actually a bug!):
In [13]: df1.sort_index(inplace=True)
In [14]: df1.groupby(pd.TimeGrouper('M')).nth(0)
Out[14]:
n val
date
2013-11-26 23 0.1648
2013-12-02 20 0.1555
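pd.TimeGrouper no longer exists in current pandas releases. A minimal sketch of the same idea, assuming a recent pandas, uses pd.Grouper instead. The frame is rebuilt from the question's sample rows, and head(1) is swapped in for nth(0) here because it always returns the original rows with their original dates:

```python
import pandas as pd

# Sample rows copied from the question; column names n/date/val follow the answer.
df = pd.DataFrame({
    "n": [17, 18, 19, 20, 21, 22, 23],
    "date": pd.to_datetime([
        "2013-12-05", "2013-12-04", "2013-12-03", "2013-12-02",
        "2013-11-29", "2013-11-27", "2013-11-26",
    ]),
    "val": [0.1621, 0.1698, 0.1516, 0.1555, 0.1480, 0.1487, 0.1648],
})

# Index by date and sort ascending so the earliest date leads each month.
df1 = df.set_index("date").sort_index()

# "MS" (month start) only defines the monthly bins; head(1) keeps the
# first row that actually exists in each bin.
first_rows = df1.groupby(pd.Grouper(freq="MS")).head(1)
print(first_rows)
```

The result keeps 2013-11-26 and 2013-12-02 as index values, matching the output shown above.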
Another option is to resample and take the first entry:
In [15]: df1.resample('M', 'first')
Out[15]:
n val
date
2013-11-30 23 0.1648
2013-12-31 20 0.1555
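The two-argument form resample('M', 'first') was removed from later pandas versions, where the aggregation is a method call on the resampler. A sketch under that assumption follows; the actual_date column is a hypothetical helper added only to show how the real first date can be kept, since resampling relabels each row with the bin label:

```python
import pandas as pd

# Sample rows copied from the question; column names n/date/val follow the answer.
df = pd.DataFrame({
    "n": [17, 18, 19, 20, 21, 22, 23],
    "date": pd.to_datetime([
        "2013-12-05", "2013-12-04", "2013-12-03", "2013-12-02",
        "2013-11-29", "2013-11-27", "2013-11-26",
    ]),
    "val": [0.1621, 0.1698, 0.1516, 0.1555, 0.1480, 0.1487, 0.1648],
})
df1 = df.set_index("date").sort_index()

# Copy the index into a column (hypothetical name) before resampling,
# then take the first entry of every monthly bin.
monthly = df1.assign(actual_date=df1.index).resample("MS").first()
print(monthly)
```

With "MS" the bins are labelled by month start (2013-11-01, 2013-12-01) rather than by month end as in the output above; the selected values are the same.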
Thinking about this, you can do it much more simply by extracting the month and then grouping by that:
In [21]: pd.DatetimeIndex(df.date).to_period('M')
Out[21]:
<class 'pandas.tseries.period.PeriodIndex'>
[2013-12, ..., 2013-11]
Length: 7, Freq: M
In [22]: df.groupby(pd.DatetimeIndex(df.date).to_period('M')).nth(0)
Out[22]:
n date val
0 17 2013-12-05 0.1621
4 21 2013-11-29 0.1480
This time the sort order of df.date is (correctly) relevant; if you know it's in descending date order you can use nth(-1):
In [23]: df.groupby(pd.DatetimeIndex(df.date).to_period('M')).nth(-1)
Out[23]:
n date val
3 20 2013-12-02 0.1555
6 23 2013-11-26 0.1648
If this isn't guaranteed, sort by the date column first: df.sort_values('date') (this was df.sort('date') in the older pandas used above).
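Putting the sort and the period grouping together, a minimal sketch for current pandas might look like the following. It assumes the date column is already datetime64 and uses sort_values, since DataFrame.sort has been removed:

```python
import pandas as pd

# Sample rows copied from the question; column names n/date/val follow the answer.
df = pd.DataFrame({
    "n": [17, 18, 19, 20, 21, 22, 23],
    "date": pd.to_datetime([
        "2013-12-05", "2013-12-04", "2013-12-03", "2013-12-02",
        "2013-11-29", "2013-11-27", "2013-11-26",
    ]),
    "val": [0.1621, 0.1698, 0.1516, 0.1555, 0.1480, 0.1487, 0.1648],
})

# Sort ascending so the earliest date comes first within each month.
ordered = df.sort_values("date")

# Group by the year-month period of each date and keep the first row per group.
first_rows = ordered.groupby(ordered["date"].dt.to_period("M")).head(1)
print(first_rows)
```

Because the sort is explicit, the result does not depend on the order in which the rows were loaded.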
Answer by ely
One way is to add columns for the year, month and day:
df['year'] = df.SomeDatetimeColumn.map(lambda x: x.year)
df['month'] = df.SomeDatetimeColumn.map(lambda x: x.month)
df['day'] = df.SomeDatetimeColumn.map(lambda x: x.day)
Then group by the year and month, order by day, and take only the first entry of each group (which will be the entry with the minimum day).
df.groupby(
    ['year', 'month']
).apply(lambda x: x.sort_values('day', ascending=True).head(1))  # head(1) inside the lambda: one row per group
The use of the lambda expressions makes this less than ideal for large data sets. You may also not wish to grow the size of the data by keeping separately stored year, month, and day values. However, for these kinds of ad hoc date alignment problems, sooner or later having these values separated is very helpful.
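For larger frames, one possible way to avoid the per-row lambdas is the vectorized .dt accessor combined with drop_duplicates. This is a sketch of a swapped-in technique, not the answer's own code, and it reuses the question's sample rows with the column name date as an assumption:

```python
import pandas as pd

# Sample rows copied from the question; "date" stands in for SomeDatetimeColumn.
df = pd.DataFrame({
    "n": [17, 18, 19, 20, 21, 22, 23],
    "date": pd.to_datetime([
        "2013-12-05", "2013-12-04", "2013-12-03", "2013-12-02",
        "2013-11-29", "2013-11-27", "2013-11-26",
    ]),
    "val": [0.1621, 0.1698, 0.1516, 0.1555, 0.1480, 0.1487, 0.1648],
})

# Vectorized year/month extraction, no Python-level lambda per row.
keyed = df.assign(year=df["date"].dt.year, month=df["date"].dt.month)

# After sorting by date, the first row seen for each (year, month) pair
# is the earliest available day of that month.
first_rows = keyed.sort_values("date").drop_duplicates(
    subset=["year", "month"], keep="first"
)
print(first_rows)
```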
Another approach is to group directly by a function of the datetime column:
dfrm.groupby(
    by=dfrm['dt'].map(lambda x: (x.year, x.month))
).apply(lambda x: x.sort_values('dt', ascending=True).head(1))
Normally these problems arise because of a dysfunctional database or data storage schema that exists one level prior to the Python/pandas layer.
For example, in this situation, it should be commonplace to rely on a calendar database table or calendar data set that contains (or makes it easy to query for) the earliest active date in a month relative to the given data set (such as the first trading day, the first weekday, the first business day, the first holiday, or whatever).
If a companion database table exists with this data, it should be easy to combine it with the dataset you already have loaded (say, by joining on the date column you already have) and then it's just a matter of applying a logical filter on the calendar data columns.
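As a rough illustration of the calendar-table idea, the sketch below joins the question's December rows against a small calendar frame. The calendar table, its is_first_trading_day column, and the choice of 2013-12-02 as the flagged date are all hypothetical stand-ins for whatever the real companion table provides:

```python
import pandas as pd

# December rows copied from the question.
prices = pd.DataFrame({
    "date": pd.to_datetime(["2013-12-05", "2013-12-04", "2013-12-03", "2013-12-02"]),
    "val": [0.1621, 0.1698, 0.1516, 0.1555],
})

# Hypothetical calendar table: one row per calendar date, with a flag
# marking the first trading day of the month (assumed to be 2013-12-02).
calendar = pd.DataFrame({"date": pd.date_range("2013-12-01", "2013-12-05", freq="D")})
calendar["is_first_trading_day"] = calendar["date"] == pd.Timestamp("2013-12-02")

# Join on the shared date column, then apply a logical filter on the flag.
merged = prices.merge(calendar, on="date", how="inner")
first_days = merged[merged["is_first_trading_day"]]
print(first_days)
```

The date logic lives entirely in the calendar table, so the analysis code only needs the join and the filter.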
This becomes especially important once you need to use date lags: for example, lining up a company's 1-month-ago market capitalization with the company's current-month stock return, to calculate a total return realized over that 1-month period.
This can be done by lagging the columns in pandas with shift, or by attempting a complicated self-join, which is likely to be very bug-prone and perpetuates the particular date convention to every place downstream that uses data from that code.
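A small sketch of the shift-based lag, using made-up monthly figures for two hypothetical companies. The column names and values are illustrative only, and the approach assumes one row per company per month with no missing months:

```python
import pandas as pd

# Hypothetical monthly data: one row per company per month, no gaps.
monthly = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B", "B"],
    "month": ["2013-10", "2013-11", "2013-12"] * 2,
    "market_cap": [100.0, 110.0, 121.0, 50.0, 55.0, 44.0],
})

# Sort so that shift(1) within each company really means "one month earlier".
monthly = monthly.sort_values(["company", "month"])

# Lag market cap by one row per company; each company's first month has no lag value.
monthly["market_cap_lag1"] = monthly.groupby("company")["market_cap"].shift(1)
print(monthly)
```

If months can be missing, a positional shift silently pairs the wrong periods, which is one reason the answer argues for fixing the date conventions upstream.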
It is much better to simply demand (or ensure yourself) that the data has properly normalized date features in its raw format (database, flat files, whatever): stop what you are doing, fix the date problem first, and only then get back to carrying out analysis with the date data.
Answer by Isaac Laughlin
import pandas as pd
dates = pd.date_range('2014-02-05', '2014-03-15', freq='D')
df = pd.DataFrame({'vals': range(len(dates))}, index=dates)
# Group rows by their 'YYYY-MM' label, computed from each index value
g = df.groupby(lambda x: x.strftime('%Y-%m'))
# First date present in each month
g.apply(lambda x: x.index.min())
# Or, depending on whether you want the index or the vals
g.apply(lambda x: x.loc[x.index.min()])
Answer by citynorman
The above didn't work for me because I needed more than one row per month, and the number of rows in each month could vary. This is what I did:
dates_month = pd.bdate_range(df['date'].min(), df['date'].max(), freq='1M')
df_mth = df[df['date'].isin(dates_month)]
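Note that a frequency of 'M' generates month-end dates, so the snippet above matches rows dated on the last day of each month. If the goal is instead every row that falls on the first date present in each month, one possible sketch uses transform to broadcast each month's minimum date. The sample frame below is hypothetical:

```python
import pandas as pd

# Hypothetical data with several rows per date.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2014-02-05", "2014-02-05", "2014-02-06",
        "2014-03-03", "2014-03-03", "2014-03-04",
    ]),
    "vals": range(6),
})

# Earliest date present in each month, broadcast back to every row.
month = df["date"].dt.to_period("M")
first_date = df.groupby(month)["date"].transform("min")

# Keep all rows that fall on that earliest date, however many there are.
df_mth = df[df["date"] == first_date]
print(df_mth)
```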

