Select DataFrame rows between two dates
Original question: http://stackoverflow.com/questions/29370057/
Note: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Asked by darkpool
I am creating a DataFrame from a csv as follows:
stock = pd.read_csv('data_in/' + filename + '.csv', skipinitialspace=True)
The DataFrame has a date column. Is there a way to create a new DataFrame (or just overwrite the existing one) which only contains rows with date values that fall within a specified date range or between two specified date values?
Accepted answer by unutbu
There are two possible solutions:
- Use a boolean mask, then use df.loc[mask]
- Set the date column as a DatetimeIndex, then use df[start_date : end_date]
Using a boolean mask:
Ensure df['date'] is a Series with dtype datetime64[ns]:
df['date'] = pd.to_datetime(df['date'])
Make a boolean mask. start_date and end_date can be datetime.datetime, np.datetime64, or pd.Timestamp objects, or even datetime strings:
# greater than the start date and less than or equal to the end date
mask = (df['date'] > start_date) & (df['date'] <= end_date)
Select the sub-DataFrame:
df.loc[mask]
or re-assign to df
df = df.loc[mask]
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])
yields
0 1 2 date
153 0.208875 0.727656 0.037787 2000-06-02
154 0.750800 0.776498 0.237716 2000-06-03
155 0.812008 0.127338 0.397240 2000-06-04
156 0.639937 0.207359 0.533527 2000-06-05
157 0.416998 0.845658 0.872826 2000-06-06
158 0.440069 0.338690 0.847545 2000-06-07
159 0.202354 0.624833 0.740254 2000-06-08
160 0.465746 0.080888 0.155452 2000-06-09
161 0.858232 0.190321 0.432574 2000-06-10
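As a quick check of the statement that the endpoint types are interchangeable, the following sketch builds the same mask from each of the four kinds of endpoints. The DataFrame and the endpoints list are made up for illustration; only the comparison pattern comes from the answer above.

```python
import datetime

import numpy as np
import pandas as pd

# Illustrative frame: 200 consecutive days starting 2000-01-01
df = pd.DataFrame({'date': pd.date_range('2000-1-1', periods=200, freq='D')})

# The same start/end pair expressed in four different types
endpoints = [
    (datetime.datetime(2000, 6, 1), datetime.datetime(2000, 6, 10)),
    (np.datetime64('2000-06-01'), np.datetime64('2000-06-10')),
    (pd.Timestamp('2000-06-01'), pd.Timestamp('2000-06-10')),
    ('2000-06-01', '2000-06-10'),
]

# Build one mask per endpoint type: start excluded, end included
masks = [(df['date'] > start) & (df['date'] <= end) for start, end in endpoints]

# Every mask should select exactly the same rows
all_equal = all(m.equals(masks[0]) for m in masks)
n_selected = int(masks[0].sum())
print(all_equal, n_selected)
```

Because the start date is excluded and the end date is included, the mask covers 2000-06-02 through 2000-06-10.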
Using a DatetimeIndex:
If you are going to do a lot of selections by date, it may be quicker to set the date column as the index first. Then you can select rows by date using df.loc[start_date:end_date].
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2000-6-1':'2000-6-10'])
yields
0 1 2
date
2000-06-01 0.040457 0.326594 0.492136 # <- includes start_date
2000-06-02 0.279323 0.877446 0.464523
2000-06-03 0.328068 0.837669 0.608559
2000-06-04 0.107959 0.678297 0.517435
2000-06-05 0.131555 0.418380 0.025725
2000-06-06 0.999961 0.619517 0.206108
2000-06-07 0.129270 0.024533 0.154769
2000-06-08 0.441010 0.741781 0.470402
2000-06-09 0.682101 0.375660 0.009916
2000-06-10 0.754488 0.352293 0.339337
While Python list indexing, e.g. seq[start:end], includes start but not end, Pandas df.loc[start_date : end_date] in contrast includes both end-points in the result if they are in the index. Neither start_date nor end_date has to be in the index, however.
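To make the last point concrete, here is a small sketch with a sorted DatetimeIndex that deliberately skips every other day, so neither slice endpoint is present in the index. The frame and the column name value are invented for this illustration.

```python
import pandas as pd

# Sorted index holding every other day: Jan 1, 3, 5, 7, 9, ...
idx = pd.date_range('2000-01-01', periods=10, freq='2D')
df = pd.DataFrame({'value': range(10)}, index=idx)

# Neither 2000-01-02 nor 2000-01-08 is in the index; the slice
# simply returns the rows that fall between the two dates
subset = df.loc['2000-01-02':'2000-01-08']
print(subset)
```

This relies on the index being sorted; the sketch makes no claim about unsorted indexes.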
Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64 values. Thus, if you use parse_dates, you would not need to use df['date'] = pd.to_datetime(df['date']).
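A minimal sketch of the parse_dates route follows. The CSV text, the close column, and the use of io.StringIO in place of a real file are assumptions made so the example is self-contained; in the question the data would come from a file on disk.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk (hypothetical contents)
csv_text = "date,close\n2000-06-01,10.0\n2000-06-05,11.5\n2000-06-12,12.0\n"

# parse_dates converts the 'date' column while reading,
# so no separate pd.to_datetime call is needed afterwards
stock = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True,
                    parse_dates=['date'])

mask = (stock['date'] > '2000-06-01') & (stock['date'] <= '2000-06-10')
selected = stock.loc[mask]
print(stock.dtypes)
print(selected)
```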
Answer by Christin Jose
I feel the best option is to use direct comparisons rather than the loc function:
df = df[(df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')]
It works for me.
A major issue with using loc with a slice is that the limits may have to be present in the actual index values (for instance, when the index is not sorted); if they are not, this results in a KeyError.
Answer by Jonny Brooks
You can use the isin method on the date column like so:
df[df["date"].isin(pd.date_range(start_date, end_date))]
Note: This only works with dates (as the question asks) and not timestamps.
Example:
import numpy as np
import pandas as pd
# Make a DataFrame with dates and random numbers
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')
# Select the rows between two dates
in_range_df = df[df["date"].isin(pd.date_range("2017-01-15", "2017-01-20"))]
print(in_range_df) # print result
which gives
0 1 2 date
14 0.960974 0.144271 0.839593 2017-01-15
15 0.814376 0.723757 0.047840 2017-01-16
16 0.911854 0.123130 0.120995 2017-01-17
17 0.505804 0.416935 0.928514 2017-01-18
18 0.204869 0.708258 0.170792 2017-01-19
19 0.014389 0.214510 0.045201 2017-01-20
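The note about timestamps can be illustrated with a short sketch: pd.date_range with day frequency produces midnight values only, so a row carrying a time of day is not matched by isin. The three sample rows are invented, and normalizing with .dt.normalize() is one possible workaround, not something the answer itself proposes.

```python
import pandas as pd

# Two rows at midnight, one row with a time of day (made-up data)
df = pd.DataFrame({'date': pd.to_datetime(
    ['2017-01-15 00:00', '2017-01-16 09:30', '2017-01-17 00:00'])})

days = pd.date_range('2017-01-15', '2017-01-20')

# Exact matching: the 09:30 row is not one of the midnight values
exact = df['date'].isin(days)

# Possible workaround: drop the time of day before matching
by_day = df['date'].dt.normalize().isin(days)

print(exact.tolist())
print(by_day.tolist())
```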
Answer by Abhinav Anand
To keep the solution simple and Pythonic, I would suggest trying this.
If you are going to do this frequently, the best solution is to first set the date column as the index, which turns it into a DatetimeIndex (provided the column already holds datetime values), and then use the following condition to slice any range of dates.
import pandas as pd
data_frame = data_frame.set_index('date')
df = data_frame[(data_frame.index > '2017-08-10') & (data_frame.index <= '2017-08-15')]
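The snippet above assumes data_frame already exists. A self-contained version might look like the following; the sample data and the value column are placeholders chosen for this sketch.

```python
import pandas as pd

# Placeholder data: one row per day of August 2017
data_frame = pd.DataFrame({
    'date': pd.date_range('2017-08-01', periods=31, freq='D'),
    'value': range(31),
})

# Move the datetime column into the index (gives a DatetimeIndex)
data_frame = data_frame.set_index('date')

# Compare the index directly: start excluded, end included
df = data_frame[(data_frame.index > '2017-08-10') &
                (data_frame.index <= '2017-08-15')]
print(df)
```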
Answer by Arraval
I prefer not to alter the df.
An option is to retrieve the index of the start and end dates:
import numpy as np
import pandas as pd
#Dummy DataFrame
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')
#Get the index of the start and end dates respectively
start = df[df['date']=='2017-01-07'].index[0]
end = df[df['date']=='2017-01-14'].index[0]
#Show the sliced df (from 2017-01-07 to 2017-01-14)
df.loc[start:end]
which results in:
0 1 2 date
6 0.5 0.8 0.8 2017-01-07
7 0.0 0.7 0.3 2017-01-08
8 0.8 0.9 0.0 2017-01-09
9 0.0 0.2 1.0 2017-01-10
10 0.6 0.1 0.9 2017-01-11
11 0.5 0.3 0.9 2017-01-12
12 0.5 0.4 0.3 2017-01-13
13 0.4 0.9 0.9 2017-01-14
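The lookup with .index[0] needs both dates to be present in the column. As a variation (a different technique from the one in this answer), a sorted date column can be searched with Series.searchsorted, which also copes with missing dates. The every-other-day data below is made up for the sketch.

```python
import numpy as np
import pandas as pd

# Dummy DataFrame whose dates skip every other day: Jan 1, 3, 5, ...
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='2D')

# Positions of the first date >= start and the first date > end
# (requires the date column to be sorted)
lo = df['date'].searchsorted(pd.Timestamp('2017-01-06'), side='left')
hi = df['date'].searchsorted(pd.Timestamp('2017-01-14'), side='right')

# Positional slice; df itself is not altered
selected = df.iloc[lo:hi]
print(selected)
```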
Answer by pomber
You can also use between:
df[df.some_date.between(start_date, end_date)]
Answer by MattR
In my testing with pandas version 0.22.0, you can answer this question more easily, with more readable code, by simply using between.
# create a single column DataFrame with dates going from Jan 1st 2018 to Jan 1st 2019
df = pd.DataFrame({'dates':pd.date_range('2018-01-01','2019-01-01')})
Let's say you want to grab the dates between Nov 27th 2018 and Jan 15th 2019:
# use the between statement to get a boolean mask
df['dates'].between('2018-11-27','2019-01-15', inclusive=False)
0 False
1 False
2 False
3 False
4 False
# you can pass this boolean mask straight to loc
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=False)]
dates
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
335 2018-12-02
Notice the inclusive argument, which is very helpful when you want to be explicit about your range. When it is set to True, Nov 27th 2018 is returned as well:
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
dates
330 2018-11-27
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
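The boolean form of inclusive shown here matches pandas 0.22.0, the version this answer was tested on. In newer releases (pandas 1.3 and later, which this sketch assumes) the argument takes the strings 'both', 'neither', 'left', or 'right' instead, so an equivalent call would look roughly like this:

```python
import pandas as pd

# Same single-column frame as above: Jan 1st 2018 to Jan 1st 2019
df = pd.DataFrame({'dates': pd.date_range('2018-01-01', '2019-01-01')})

# 'both' plays the role of inclusive=True, 'neither' of inclusive=False
both = df.loc[df['dates'].between('2018-11-27', '2019-01-15', inclusive='both')]
neither = df.loc[df['dates'].between('2018-11-27', '2019-01-15', inclusive='neither')]

print(both.head())
print(neither.head())
```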
This method is also faster than the previously mentioned isin method:
%%timeit -n 5
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
868 μs ± 164 μs per loop (mean ± std. dev. of 7 runs, 5 loops each)
%%timeit -n 5
df.loc[df['dates'].isin(pd.date_range('2018-01-01','2019-01-01'))]
1.53 ms ± 305 μs per loop (mean ± std. dev. of 7 runs, 5 loops each)
However, it is not faster than the currently accepted answer, provided by unutbu, if the mask is already created. But if the mask is dynamic and needs to be reassigned over and over, my method may be more efficient:
# create the mask first, THEN time the selection
import datetime as dt
start_date = dt.datetime(2018,11,27)
end_date = dt.datetime(2019,1,15)
mask = (df['dates'] > start_date) & (df['dates'] <= end_date)
%%timeit -n 5
df.loc[mask]
191 μs ± 28.5 μs per loop (mean ± std. dev. of 7 runs, 5 loops each)
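The %%timeit lines above only run inside IPython or Jupyter. For a plain Python script, a rough equivalent could use the standard timeit module, as sketched below. The three helper function names are invented here, the masks use >= so that all three variants select the same rows, and no timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'dates': pd.date_range('2018-01-01', '2019-01-01')})
start_date, end_date = pd.Timestamp('2018-11-27'), pd.Timestamp('2019-01-15')

def with_between():
    # between() builds the mask on every call (both ends included by default)
    return df.loc[df['dates'].between(start_date, end_date)]

def with_fresh_mask():
    # comparison mask rebuilt on every call
    return df.loc[(df['dates'] >= start_date) & (df['dates'] <= end_date)]

# mask built once, outside the timed function
mask = (df['dates'] >= start_date) & (df['dates'] <= end_date)

def with_prebuilt_mask():
    return df.loc[mask]

for fn in (with_between, with_fresh_mask, with_prebuilt_mask):
    seconds = timeit.timeit(fn, number=200)
    print(f'{fn.__name__}: {seconds / 200 * 1e6:.0f} microseconds per call')
```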
Answer by Jaroslav Bezděk
Another way to achieve this is to use the pandas.DataFrame.query() method. Let me show you an example on the following data frame called df.
>>> df = pd.DataFrame(np.random.random((5, 1)), columns=['col_1'])
>>> df['date'] = pd.date_range('2020-1-1', periods=5, freq='D')
>>> print(df)
col_1 date
0 0.015198 2020-01-01
1 0.638600 2020-01-02
2 0.348485 2020-01-03
3 0.247583 2020-01-04
4 0.581835 2020-01-05
As an argument, use the condition for filtering like this:
>>> start_date, end_date = '2020-01-02', '2020-01-04'
>>> print(df.query('date >= @start_date and date <= @end_date'))
col_1 date
1 0.638600 2020-01-02
2 0.348485 2020-01-03
3 0.247583 2020-01-04
If you do not want to include the boundaries, just change the condition as follows:
>>> print(df.query('date > @start_date and date < @end_date'))
col_1 date
2 0.348485 2020-01-03
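As a sanity check, the query() form can be compared against the equivalent boolean mask. The sketch below rebuilds the same kind of frame (the random values will differ from those printed above) and confirms that both selections pick the same rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((5, 1)), columns=['col_1'])
df['date'] = pd.date_range('2020-1-1', periods=5, freq='D')

start_date, end_date = '2020-01-02', '2020-01-04'

# Selection via query(); @name refers to a local Python variable
via_query = df.query('date >= @start_date and date <= @end_date')

# The same selection written as a boolean mask
via_mask = df.loc[(df['date'] >= start_date) & (df['date'] <= end_date)]

same = via_query.equals(via_mask)
print(same)
print(via_query)
```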