pandas: slicing a pandas DataFrame by date range

Question

Asked by G Garcia

I'm using pandas to analyse financial records.

I have a DataFrame that comes from a CSV file and looks like this:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 800 entries, 2010-10-27 00:00:00 to 2011-07-12 00:00:00
Data columns:
debit                      800  non-null values
transaction_type           799  non-null values
transaction_date_raw       800  non-null values
credit                     800  non-null values
transaction_description    800  non-null values
account_number             800  non-null values
sort_code                  800  non-null values
balance                    800  non-null values
dtypes: float64(3), int64(1), object(4)

I am selecting a subset based on transaction amount:

c1 = df['credit'].map(lambda x: x > 1000)
milestones = df[c1].sort()
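
The same selection can also be written as a vectorized comparison instead of map with a lambda. This is a minimal sketch, assuming a recent pandas release (where DataFrame.sort() no longer exists and sort_index() takes its place) and a small made-up frame standing in for the real data:

```python
import pandas as pd

# Hypothetical stand-in for the real CSV data: a DatetimeIndex with a repeated date.
df = pd.DataFrame(
    {"credit": [50.0, 1500.0, 20.0, 2500.0, 10.0]},
    index=pd.to_datetime(
        ["2010-11-01", "2010-10-27", "2010-11-01", "2010-11-29", "2010-12-30"]
    ),
)

# Boolean mask: True where the credit exceeds 1000.
c1 = df["credit"] > 1000

# Keep the matching rows and order them by date.
milestones = df[c1].sort_index()
print(milestones)
```

The comparison produces the same boolean Series as the map call, without calling a Python function once per row.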

I then want to create slices of the original df based on the dates between the milestones:

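import datetime as dt
from pandas import date_range

delta = dt.timedelta(days=1)
for i in range(len(milestones.index) - 1):
    # each range runs from one milestone to the day before the next
    start = milestones.index[i].date()
    end = milestones.index[i + 1].date() - delta
    rng = date_range(start, end)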

This generates a new DatetimeIndex with the dates between my milestones, for example:

<class 'pandas.tseries.index.DatetimeIndex'>
[2010-11-29 00:00:00, ..., 2010-12-30 00:00:00]
Length: 32, Freq: D, Timezone: None

I have tried several approaches to slicing my df with these new ranges (rng), but all have failed:

df.ix[start:end] or
df.ix[rng]

This raises: IndexError: invalid slice

df.reindex(rng) or df.reindex(index=rng)

This raises: Exception: Reindexing only valid with uniquely valued Index objects

x = [v for v in rng if v in df.index]
df[x]
df.ix[x]
df.index[x]

This also raises invalid slice, as does:

df.truncate(start, end)

I'm new to pandas. I'm following the early release of the O'Reilly book and really enjoying it. Any pointers would be appreciated.

Answer 1

Answered by Chang She

It looks like you've hit a couple of known bugs in non-unique index handling:

https://github.com/pydata/pandas/issues/1201/

https://github.com/pydata/pandas/issues/1587/

A bug fix release is coming out very soon so please check the pandas website or PyPI in a week or so.
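
For readers on a later pandas release, here is a hedged sketch of what the slicing can look like once the index is sorted. It assumes a recent pandas version, uses .loc rather than .ix (which was removed in later releases), and runs on a made-up frame with a duplicated date. It is not taken from the fixes for the issues linked above.

```python
import pandas as pd

# Hypothetical data: two transactions share the date 2010-11-29.
df = pd.DataFrame(
    {"credit": [1500.0, 20.0, 30.0, 2500.0, 10.0]},
    index=pd.to_datetime(
        ["2010-11-29", "2010-12-15", "2010-11-29", "2010-12-31", "2010-10-27"]
    ),
)

print(df.index.is_unique)  # False here, because 2010-11-29 appears twice

# Label slicing on a non-unique index is reliable once the index is sorted.
df = df.sort_index()

start = pd.Timestamp("2010-11-29")
end = pd.Timestamp("2010-12-30")

# .loc slices by label and includes both endpoints.
window = df.loc[start:end]
print(window)
```

Because the slice is label-based, the end date does not need to be present in the index, provided the index is sorted.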

Thanks

Answer 2

Answered by G Garcia

I've managed to circumvent some of the issues highlighted above. Here is a "solution" to use until the bugs mentioned by Chang She are resolved.

I start with my original time-series-indexed DataFrame as before and sort it. This orders the records by date (using the datetime index):

df = df.sort()

Once it is sorted, I replace df.index with a numeric index:

df.index = range(len(df))
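
In recent pandas releases DataFrame.sort() no longer exists. Here is a sketch of the same two steps under that assumption, on made-up data, using sort_index() and reset_index(drop=True):

```python
import pandas as pd

# Hypothetical data with an unsorted, non-unique DatetimeIndex.
df = pd.DataFrame(
    {"credit": [20.0, 1500.0, 30.0]},
    index=pd.to_datetime(["2010-11-29", "2010-10-27", "2010-11-29"]),
)

# Sort rows by the datetime index.
df = df.sort_index()

# Replace the datetime index with 0..len(df)-1; drop=True discards the old index
# (omit it to keep the dates as a regular column).
df = df.reset_index(drop=True)
print(df)
```

Leaving out drop=True keeps the dates available as a column, which can be handy for labelling the slices later.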

I then extract my milestones as before (here into milestones_df), except that they now carry an integer index, and create a list of that index:

milestones_list = milestones_df.index.tolist()

Finally, I extract the data between my milestones from the original df using the numeric index, like so:

datasets = {}
for milestone_index, milestone in enumerate(milestones_list):
    print("milestone {0} index {1}".format(milestone, milestone_index))
    if milestone_index < len(milestones_list) - 1:
        # rows from this milestone up to the row before the next milestone
        x = df[milestones_df.index[milestone_index]:milestones_df.index[milestone_index + 1]]
    else:
        # last milestone: rows from here towards the end of df
        x = df[milestones_df.index[milestone_index]:df.index.max()]

    # name each slice after the first and last index value it contains
    n = str(int(x.index.min())) + '-' + str(int(x.index.max()))
    datasets[n] = x

This creates a dict holding one DataFrame per milestone interval, keyed by the index interval it represents:

print datasets.keys()
['592-650', '448-527', '382-447', '264-318', '319-381', '118-198', '528-591', '728-798', '54-117', '199-263', '651-727']
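
The same splitting can be sketched more compactly by pairing each milestone position with the next one. This is an illustrative variant and not the code above. It assumes a recent pandas release plus NumPy and a made-up credit column, and it keeps the final row in the last slice. split_at_milestones is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def split_at_milestones(df, threshold=1000):
    """Split df into consecutive chunks, each starting at a row whose credit exceeds threshold."""
    # Positions (0-based) of the milestone rows.
    starts = np.flatnonzero(df["credit"].to_numpy() > threshold)
    # Each chunk ends where the next begins; the last one runs to the end of df.
    stops = np.append(starts[1:], len(df))
    datasets = {}
    for start, stop in zip(starts, stops):
        # iloc slices by position and excludes the stop position.
        datasets["{0}-{1}".format(start, stop - 1)] = df.iloc[start:stop]
    return datasets


# Hypothetical, already date-sorted data with a plain 0..n-1 index.
df = pd.DataFrame({"credit": [5.0, 1500.0, 20.0, 30.0, 2500.0, 10.0]})
datasets = split_at_milestones(df)
print(sorted(datasets))
```

As in the loop above, rows before the first milestone do not end up in any slice.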

Although this is admittedly not the ideal solution, I hope it helps someone with similar issues.
