pandas: slicing a pandas DataFrame on a date range
Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me): Stack Overflow.
Original URL: http://stackoverflow.com/questions/11360675/
slicing pandas dataframe on date range
Asked by G Garcia
I'm using pandas to analyse financial records.
I have a DataFrame that comes from a CSV file and looks like this:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 800 entries, 2010-10-27 00:00:00 to 2011-07-12 00:00:00
Data columns:
debit 800 non-null values
transaction_type 799 non-null values
transaction_date_raw 800 non-null values
credit 800 non-null values
transaction_description 800 non-null values
account_number 800 non-null values
sort_code 800 non-null values
balance 800 non-null values
dtypes: float64(3), int64(1), object(4)
I am selecting a subset based on transaction amount:
c1 = df['credit'].map(lambda x: x > 1000)
milestones = df[c1].sort()
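For readers on a recent pandas release, the same selection can be sketched with a plain boolean comparison instead of `map` with a lambda, and with `sort_index()` in place of `DataFrame.sort()`, which newer releases no longer provide. The small frame below is made-up stand-in data, not the poster's CSV.

```python
import pandas as pd

# Hypothetical stand-in for the CSV data: a date index with a repeated date.
df = pd.DataFrame(
    {"credit": [50.0, 1500.0, 20.0, 2500.0, 1200.0]},
    index=pd.to_datetime(
        ["2010-11-29", "2010-10-27", "2010-12-01", "2010-12-30", "2010-11-29"]
    ),
)

# Element-wise comparison yields a boolean Series usable as a row mask.
mask = df["credit"] > 1000

# Keep only the large credits and order them by date.
milestones = df[mask].sort_index()
print(milestones)
```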
I want to create slices of the original df based on the dates between the milestones:
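import datetime as dt
from pandas import date_range

delta = dt.timedelta(days=1)
for i in range(len(milestones.index)-1):
    # each range runs from one milestone to the day before the next
    start = milestones.index[i].date()
    end = milestones.index[i+1].date() - delta
    rng = date_range(start, end)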
This generates a new series of dates (a DatetimeIndex) between my milestones:
<class 'pandas.tseries.index.DatetimeIndex'>
[2010-11-29 00:00:00, ..., 2010-12-30 00:00:00]
Length: 32, Freq: D, Timezone: None
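As a rough illustration of the loop above, consecutive milestones can also be paired with `zip`, which avoids the manual index arithmetic. The milestone dates here are invented for the example; the first pair is chosen so that it reproduces the 32-day range shown above.

```python
import datetime as dt
import pandas as pd

# Hypothetical milestone dates (assumed for illustration only).
milestone_dates = pd.DatetimeIndex(["2010-11-29", "2010-12-31", "2011-02-01"])
delta = dt.timedelta(days=1)

ranges = []
# Pair each milestone with the one that follows it.
for start, next_start in zip(milestone_dates[:-1], milestone_dates[1:]):
    end = next_start - delta  # stop the day before the next milestone
    ranges.append(pd.date_range(start, end))

for rng in ranges:
    print(rng[0].date(), rng[-1].date(), len(rng))
```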
I have tried several approaches to slice my df using this new range (rng), but all have failed:
df.ix[start:end] or
df.ix[rng]
This raises: IndexError: invalid slice
df.reindex(rng) or df.reindex(index=rng)
This raises: Exception: Reindexing only valid with uniquely valued Index objects
x = [v for v in rng if v in df.index]
df[x]
df.ix[x]
df.index[x]
This also raises invalid slice, and so does:
df.truncate(start, end)
I'm new to pandas. I'm following the early release of the O'Reilly book and really enjoying it. Any pointers would be appreciated.
Answered by Chang She
It looks like you've hit a couple of known bugs in non-unique index handling:
https://github.com/pydata/pandas/issues/1201/
https://github.com/pydata/pandas/issues/1587/
A bug fix release is coming out very soon so please check the pandas website or PyPI in a week or so.
Thanks
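For anyone reading this on a much later pandas release: `.ix` has since been removed, and label-based slicing with `.loc` on a sorted DatetimeIndex works even when dates repeat. The following is only a sketch on made-up data, not the poster's file, and it assumes the index has been sorted first.

```python
import pandas as pd

# Hypothetical data with a duplicated date, mimicking a non-unique index.
df = pd.DataFrame(
    {"credit": [1500.0, 10.0, 20.0, 30.0, 2500.0]},
    index=pd.to_datetime(
        ["2010-11-29", "2010-11-29", "2010-12-15", "2010-12-30", "2010-12-31"]
    ),
).sort_index()  # label slicing on duplicate dates needs a sorted index

start = pd.Timestamp("2010-11-29")
end = pd.Timestamp("2010-12-30")

# .loc slices by label and includes both endpoints.
subset = df.loc[start:end]
print(subset)
```

Note that the slice is inclusive at both ends, so every row dated on `end` is returned as well.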
Answered by G Garcia
I've managed to circumvent some of the issues highlighted above. Here is a "solution" until the bugs mentioned by Chang She are resolved.
I start with my original time-series-indexed DataFrame as before. I sort the df, which orders the records by date (using the time-series index).
df = df.sort()
Once sorted, I replace df.index with a numerical index:
df.index = range(len(df))
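Overwriting `df.index` this way discards the dates. If they are still needed later, one possible variant is `reset_index()`, which assigns the same 0..n-1 index but keeps the old index as a regular column. The index name `transaction_date` below is an assumption made for the example, as is the data.

```python
import pandas as pd

# Hypothetical data; the index name is assumed for illustration.
df = pd.DataFrame(
    {"credit": [10.0, 1500.0, 20.0]},
    index=pd.to_datetime(["2010-12-01", "2010-10-27", "2010-10-27"]),
)
df.index.name = "transaction_date"

# Sort by date, then move the dates into a column and number the rows 0..n-1.
df = df.sort_index().reset_index()
print(df)
```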
I then extract my milestones as before, with the difference that they now have an integer index, and create a list of that index:
milestones_list = milestones_df.index.tolist()
Then I extract the data between my milestones from the original df using the numeric index, like so:
datasets = {}
for milestone in milestones_list:
    milestone_index = milestones_list.index(milestone)
    print "milestone {0} index {1}".format(milestone, milestone_index)
    if milestone_index < len(milestones_list) - 1:
        x = df[milestones_df.index[milestone_index]:milestones_df.index[milestone_index+1]]
    else:
        x = df[milestones_df.index[milestone_index]:df.index.max()]
    n = str(int(x.index.min())) + '-' + str(int(x.index.max()))
    datasets[n] = x
This creates a dict with a DataFrame for each milestone time interval, keyed by the index interval it represents.
print datasets.keys()
['592-650', '448-527', '382-447', '264-318', '319-381', '118-198', '528-591', '728-798', '54-117', '199-263', '651-727']
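The loop above can be sketched more compactly by pairing each milestone position with the next one and slicing with `iloc`. This is an illustrative variant on invented data, not the code that produced the keys shown. One deliberate difference: the last slice here runs to the end of the frame, whereas slicing up to `df.index.max()` leaves out the final row.

```python
import pandas as pd

# Hypothetical frame with a 0..n-1 integer index; credits over 1000 are milestones.
df = pd.DataFrame({"credit": [2000.0, 5.0, 7.0, 3000.0, 1.0, 4000.0, 2.0, 3.0]})
milestones_list = df.index[df["credit"] > 1000].tolist()

# Append the frame length so the last milestone also gets a closing bound.
bounds = milestones_list + [len(df)]

datasets = {}
for start, stop in zip(bounds[:-1], bounds[1:]):
    chunk = df.iloc[start:stop]  # positional slice, stop is exclusive
    datasets["{0}-{1}".format(start, stop - 1)] = chunk

print(sorted(datasets))
```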
Although admittedly not the ideal solution, I hope it helps someone with similar issues.

