Python: Select DataFrame rows between two dates

Question

Asked by darkpool

I am creating a DataFrame from a csv as follows:

stock = pd.read_csv('data_in/' + filename + '.csv', skipinitialspace=True)

The DataFrame has a date column. Is there a way to create a new DataFrame (or just overwrite the existing one) which only contains rows with date values that fall within a specified date range or between two specified date values?

Answer 1

Accepted answer by unutbu

There are two possible solutions:

Use a boolean mask, then use df.loc[mask]
Set the date column as a DatetimeIndex, then use df.loc[start_date:end_date]

Using a boolean mask:

Ensure df['date'] is a Series with dtype datetime64[ns]:

df['date'] = pd.to_datetime(df['date'])

Make a boolean mask. start_date and end_date can be datetime.datetimes, np.datetime64s, pd.Timestamps, or even datetime strings:

# strictly after the start date and on or before the end date
mask = (df['date'] > start_date) & (df['date'] <= end_date)

Select the sub-DataFrame:

df.loc[mask]

or reassign the result to df:

df = df.loc[mask]

For example,

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])

yields

            0         1         2       date
153  0.208875  0.727656  0.037787 2000-06-02
154  0.750800  0.776498  0.237716 2000-06-03
155  0.812008  0.127338  0.397240 2000-06-04
156  0.639937  0.207359  0.533527 2000-06-05
157  0.416998  0.845658  0.872826 2000-06-06
158  0.440069  0.338690  0.847545 2000-06-07
159  0.202354  0.624833  0.740254 2000-06-08
160  0.465746  0.080888  0.155452 2000-06-09
161  0.858232  0.190321  0.432574 2000-06-10

Using a DatetimeIndex:

If you are going to do a lot of selections by date, it may be quicker to set the date column as the index first. Then you can select rows by date using df.loc[start_date:end_date].

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2000-6-1':'2000-6-10'])

yields

                   0         1         2
date                                    
2000-06-01  0.040457  0.326594  0.492136    # <- includes start_date
2000-06-02  0.279323  0.877446  0.464523
2000-06-03  0.328068  0.837669  0.608559
2000-06-04  0.107959  0.678297  0.517435
2000-06-05  0.131555  0.418380  0.025725
2000-06-06  0.999961  0.619517  0.206108
2000-06-07  0.129270  0.024533  0.154769
2000-06-08  0.441010  0.741781  0.470402
2000-06-09  0.682101  0.375660  0.009916
2000-06-10  0.754488  0.352293  0.339337

Python list slicing, e.g. seq[start:end], includes start but not end. In contrast, pandas df.loc[start_date:end_date] includes both endpoints in the result if they are in the index. Neither start_date nor end_date has to be in the index, however.
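The endpoint behaviour described above can be checked on a small frame. This is a minimal sketch, assuming a sorted DatetimeIndex; the column name value and the dates are made up for illustration.

```python
import pandas as pd

# Hypothetical data: a sorted DatetimeIndex with gaps between the dates
idx = pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-07', '2000-01-10'])
df = pd.DataFrame({'value': [1, 2, 3, 4]}, index=idx)

# Both bounds are present in the index, and both are included in the result
both_present = df.loc['2000-01-03':'2000-01-07']

# Neither bound is present in the index; the slice still selects the rows in between
neither_present = df.loc['2000-01-02':'2000-01-08']

print(both_present)
print(neither_present)
```

Both slices select the rows for 2000-01-03 and 2000-01-07.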

Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64s. Thus, if you use parse_dates, you would not need to use df['date'] = pd.to_datetime(df['date']).
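As a rough illustration of the parse_dates route, the sketch below reads a small in-memory CSV instead of a file. The CSV text and the close column are assumptions made up for the example; only the date column name comes from the question.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the file on disk
csv_text = "date,close\n2000-06-01,10.0\n2000-06-05,11.5\n2000-06-12,12.25\n"

# parse_dates converts the 'date' column while reading,
# so no separate pd.to_datetime call is needed
stock = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'], skipinitialspace=True)

# Same boolean mask as above: after the start date, on or before the end date
mask = (stock['date'] > '2000-06-01') & (stock['date'] <= '2000-06-10')
selected = stock.loc[mask]
print(selected)
```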

Answer 2

Answered by Christin Jose

I think the best option is to filter directly with a boolean condition rather than using the loc function:

df = df[(df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')]

It works for me.

A major issue with using loc with a slice is that the limits should be present in the actual index values; if they are not, this can result in a KeyError (for example, when the index is not sorted).

Answer 3

Answered by Jonny Brooks

You can use the isin method on the date column like so: df[df["date"].isin(pd.date_range(start_date, end_date))]

Note: This only works with dates (as the question asks), not with timestamps that carry a time of day.

Example:

import numpy as np   
import pandas as pd

# Make a DataFrame with dates and random numbers
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')

# Select the rows between two dates
in_range_df = df[df["date"].isin(pd.date_range("2017-01-15", "2017-01-20"))]

print(in_range_df)  # print result

which gives

           0         1         2       date
14  0.960974  0.144271  0.839593 2017-01-15
15  0.814376  0.723757  0.047840 2017-01-16
16  0.911854  0.123130  0.120995 2017-01-17
17  0.505804  0.416935  0.928514 2017-01-18
18  0.204869  0.708258  0.170792 2017-01-19
19  0.014389  0.214510  0.045201 2017-01-20
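The note about timestamps can be seen on a frame whose values carry a time of day. The sketch below uses made-up data; normalizing with .dt.normalize() before calling isin is one possible workaround and is not part of the answer above.

```python
import pandas as pd

# Hypothetical data: only the first row falls exactly on midnight
df = pd.DataFrame({'date': pd.to_datetime([
    '2017-01-15 00:00', '2017-01-16 13:30', '2017-01-21 08:00'])})

days = pd.date_range('2017-01-15', '2017-01-20')

# isin compares exact timestamps, so the 13:30 row is missed
exact_match = df[df['date'].isin(days)]

# Dropping the time of day first matches whole days instead
by_day = df[df['date'].dt.normalize().isin(days)]

print(exact_match)
print(by_day)
```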

Answer 4

Answered by Abhinav Anand

To keep the solution simple and Pythonic, I suggest trying the following.

If you are going to do this frequently, the best solution is to first set the date column as the index, which gives you a DatetimeIndex (provided the column has a datetime dtype), and then use the following condition to select any range of dates.

import pandas as pd

# data_frame is assumed to already exist with a datetime64 'date' column
data_frame = data_frame.set_index('date')

df = data_frame[(data_frame.index > '2017-08-10') & (data_frame.index <= '2017-08-15')]

Answer 5

Answered by Arraval

I prefer not to alter df.

An option is to retrieve the index of the start and end dates:

import numpy as np   
import pandas as pd

#Dummy DataFrame
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')

#Get the index of the start and end dates respectively
start = df[df['date']=='2017-01-07'].index[0]
end = df[df['date']=='2017-01-14'].index[0]

#Show the sliced df (from 2017-01-07 to 2017-01-14)
df.loc[start:end]

which results in:

     0   1   2       date
6  0.5 0.8 0.8 2017-01-07
7  0.0 0.7 0.3 2017-01-08
8  0.8 0.9 0.0 2017-01-09
9  0.0 0.2 1.0 2017-01-10
10 0.6 0.1 0.9 2017-01-11
11 0.5 0.3 0.9 2017-01-12
12 0.5 0.4 0.3 2017-01-13
13 0.4 0.9 0.9 2017-01-14
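The lookup above relies on both dates being present in the column; if one is missing, .index[0] has nothing to return. As a hedged alternative (a swapped-in technique, not the answer's method), Series.searchsorted can locate the positions when the date column is sorted. The data below is made up so that neither bound appears in it.

```python
import pandas as pd

# Hypothetical data: every other day, so the bounds below are absent
df = pd.DataFrame({'date': pd.date_range('2017-01-01', periods=15, freq='2D')})

# Assumes df['date'] is sorted in ascending order
start_pos = df['date'].searchsorted(pd.Timestamp('2017-01-06'), side='left')
end_pos = df['date'].searchsorted(pd.Timestamp('2017-01-14'), side='right')

# Positional slice; df itself is left unchanged
selected = df.iloc[start_pos:end_pos]
print(selected)
```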

Answer 6

Answered by pomber

You can also use between:

df[df.some_date.between(start_date, end_date)]
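For a self-contained version of the one-liner above, here is a small sketch. The frame, the value column, and the two dates are assumptions for illustration; some_date is the column name used in the answer.

```python
import pandas as pd

# Hypothetical data: ten consecutive days
df = pd.DataFrame({
    'some_date': pd.date_range('2021-01-01', periods=10, freq='D'),
    'value': range(10),
})

start_date, end_date = '2021-01-03', '2021-01-06'

# between includes both endpoints by default
selected = df[df['some_date'].between(start_date, end_date)]
print(selected)
```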

Answer 7

Answered by MattR

In my testing with pandas version 0.22.0, you can answer this question more easily and with more readable code by simply using between.

# create a single column DataFrame with dates going from Jan 1st 2018 to Jan 1st 2019
df = pd.DataFrame({'dates':pd.date_range('2018-01-01','2019-01-01')})

Let's say you want to grab the dates between Nov 27th 2018 and Jan 15th 2019:

# use the between statement to get a boolean mask
df['dates'].between('2018-11-27','2019-01-15', inclusive=False)

0    False
1    False
2    False
3    False
4    False

# you can pass this boolean mask straight to loc
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=False)]

    dates
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
335 2018-12-02

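Notice the inclusive argument. It is very helpful when you want to be explicit about your range. Notice that when it is set to True, Nov 27th 2018 is returned as well. (In pandas 1.3 and later, inclusive takes the strings 'both', 'neither', 'left', or 'right'; boolean values were deprecated in 1.3 and are rejected in pandas 2.0.)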

df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]

    dates
330 2018-11-27
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
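On newer pandas the same selections can be written with string values for inclusive. This is a sketch assuming pandas 1.3 or later; on the 0.22.0 version tested above, the boolean form shown earlier applies instead.

```python
import pandas as pd

# Same single-column frame as above
df = pd.DataFrame({'dates': pd.date_range('2018-01-01', '2019-01-01')})

# Exclude both endpoints (the equivalent of the old inclusive=False)
open_interval = df.loc[df['dates'].between('2018-11-27', '2019-01-15', inclusive='neither')]

# Include both endpoints (the equivalent of the old inclusive=True)
closed_interval = df.loc[df['dates'].between('2018-11-27', '2019-01-15', inclusive='both')]

print(open_interval.head())
print(closed_interval.head())
```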

This method is also faster than the previously mentioned isin method:

%%timeit -n 5
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
868 μs ± 164 μs per loop (mean ± std. dev. of 7 runs, 5 loops each)


%%timeit -n 5

df.loc[df['dates'].isin(pd.date_range('2018-01-01','2019-01-01'))]
1.53 ms ± 305 μs per loop (mean ± std. dev. of 7 runs, 5 loops each)

However, it is not faster than the currently accepted answer, provided by unutbu, if the mask is already created. But if the mask is dynamic and needs to be rebuilt over and over, my method may be more efficient:

# create the mask first, THEN time the selection
import datetime as dt

start_date = dt.datetime(2018,11,27)
end_date = dt.datetime(2019,1,15)
mask = (df['dates'] > start_date) & (df['dates'] <= end_date)

%%timeit -n 5
df.loc[mask]
191 μs ± 28.5 μs per loop (mean ± std. dev. of 7 runs, 5 loops each)

Answer 8

Answered by Jaroslav Bezděk

Another option is to use the pandas.DataFrame.query() method. Here is an example on the following data frame called df.

>>> df = pd.DataFrame(np.random.random((5, 1)), columns=['col_1'])
>>> df['date'] = pd.date_range('2020-1-1', periods=5, freq='D')
>>> print(df)
      col_1       date
0  0.015198 2020-01-01
1  0.638600 2020-01-02
2  0.348485 2020-01-03
3  0.247583 2020-01-04
4  0.581835 2020-01-05

Pass the filter condition as a string argument, like this:

>>> start_date, end_date = '2020-01-02', '2020-01-04'
>>> print(df.query('date >= @start_date and date <= @end_date'))
      col_1       date
1  0.638600 2020-01-02
2  0.348485 2020-01-03
3  0.247583 2020-01-04

If you do not want to include the boundaries, change the condition as follows:

>>> print(df.query('date > @start_date and date < @end_date'))
      col_1       date
2  0.348485 2020-01-03
