Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/48818653/

Time: 2020-09-14 05:11:38  Source: igfitidea

Pandas dataframe: omit weekends and days near holidays

python pandas

Asked by stackoverflowuser2010

I have a Pandas dataframe with a DatetimeIndex and some other columns, similar to this:

import pandas as pd
import numpy as np

# Named date_index to avoid shadowing Python's built-in range
date_index = pd.date_range('2017-12-01', '2018-01-05', freq='6H')
df = pd.DataFrame(index=date_index)

# Average speed in miles per hour
df['value'] = np.random.randint(low=0, high=60, size=len(df.index))

df.info()
# DatetimeIndex: 141 entries, 2017-12-01 00:00:00 to 2018-01-05 00:00:00
# Freq: 6H
# Data columns (total 1 columns):
# value    141 non-null int64
# dtypes: int64(1)
# memory usage: 2.2 KB

df.head(10)
#                      value
# 2017-12-01 00:00:00     15
# 2017-12-01 06:00:00     54
# 2017-12-01 12:00:00     19
# 2017-12-01 18:00:00     13
# 2017-12-02 00:00:00     35
# 2017-12-02 06:00:00     31
# 2017-12-02 12:00:00     58
# 2017-12-02 18:00:00      6
# 2017-12-03 00:00:00      8
# 2017-12-03 06:00:00     30

How can I select or filter the entries that are:

  1. Weekdays only (that is, not weekend days Saturday or Sunday)

  2. Not within N days of the dates in a list (e.g. U.S. holidays like '12-25' or '01-01')?

I was hoping for something like:

df = exclude_Sat_and_Sun(df)

omit_days = ['12-25', '01-01']
N = 3 # days near the holidays
df = exclude_days_near_omit_days(N, omit_days)

I was thinking of creating a new column to break out the month and day and then comparing them to the criteria for 1 and 2 above. However, I was hoping for something more Pythonic using the DatetimeIndex.

Thanks for any help.

Answered by Bahman Engheta

The first part can be easily accomplished using the Pandas DatetimeIndex.dayofweek property, which numbers the days of the week starting with Monday as 0 and ending with Sunday as 6.

df[df.index.dayofweek < 5] will give you only the weekdays.
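
As a minimal sketch of that weekday filter, assuming a small one-week daily index built only for illustration (the name `idx` and the date range are not from the original post):

```python
import pandas as pd

# Hypothetical one-week daily index: Monday 2017-12-04 through Sunday 2017-12-10
idx = pd.date_range('2017-12-04', '2017-12-10', freq='D')
df = pd.DataFrame({'value': range(len(idx))}, index=idx)

# dayofweek is Monday=0 ... Sunday=6, so values below 5 are Monday through Friday
weekdays_only = df[df.index.dayofweek < 5]

print(weekdays_only.index.strftime('%Y-%m-%d').tolist())
```

The Saturday and Sunday rows (2017-12-09 and 2017-12-10) are dropped, leaving five rows.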


For the second part you can use the datetime module. Below I will give an example for only one date, namely 2017-12-25. You can easily generalize it to a list of dates, for example by defining a helper function.

from datetime import datetime, timedelta

N = 3

df[abs(df.index.date - datetime.strptime("2017-12-25", '%Y-%m-%d').date()) > timedelta(N)]

This will give all dates that are more than N=3 days away from 2017-12-25. That is, it will exclude an interval of 7 days from 2017-12-22 to 2017-12-28.
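
The 7-day window can be checked directly. The sketch below applies the same mask to a hypothetical daily index around the holiday (the index range and the names `keep` and `dropped` are assumptions for illustration) and reports which dates are dropped:

```python
from datetime import datetime, timedelta
import pandas as pd

N = 3
holiday = datetime.strptime('2017-12-25', '%Y-%m-%d').date()

# Hypothetical daily index spanning the holiday
idx = pd.date_range('2017-12-18', '2018-01-01', freq='D')
df = pd.DataFrame({'value': range(len(idx))}, index=idx)

# Same test as in the answer: keep rows strictly more than N days from the holiday
keep = abs(df.index.date - holiday) > timedelta(N)
dropped = df.index[~keep]

# First dropped date, last dropped date, and how many were dropped
print(dropped.min().date(), dropped.max().date(), len(dropped))
```

Because the comparison is strict (`>`), dates exactly N days away are also excluded, which is why the window covers 2N + 1 = 7 days.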


Lastly, you can combine the two criteria using the & operator, as you probably know.

df[
   (df.index.dayofweek < 5) 
   & 
   (abs(df.index.date - datetime.strptime("2017-12-25", '%Y-%m-%d').date()) > timedelta(N))
  ]
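
One way to generalize this to a list of dates, as the answer suggests, is a small helper that builds a single boolean mask. The sketch below is an assumption rather than the answer's own code: the helper name `near_any` is hypothetical, and it uses pandas-native `Timestamp` and `Timedelta` arithmetic on the normalized index instead of the `datetime` module.

```python
import numpy as np
import pandas as pd

def near_any(index, dates, n_days):
    """Return a boolean mask that is True where the calendar day of `index`
    lies within n_days of any date in `dates` (hypothetical helper)."""
    days = index.normalize()  # drop the time of day, keep the calendar date
    mask = np.zeros(len(index), dtype=bool)
    for d in pd.to_datetime(dates):
        mask |= np.asarray(abs(days - d) <= pd.Timedelta(days=n_days))
    return mask

idx = pd.date_range('2017-12-01', '2018-01-05', freq='D')
df = pd.DataFrame({'value': np.arange(len(idx))}, index=idx)

holidays = ['2017-12-25', '2018-01-01']
N = 3

# Weekdays only, and not within N days of any listed holiday
filtered = df[(df.index.dayofweek < 5) & ~near_any(df.index, holidays, N)]
print(filtered.index.strftime('%Y-%m-%d').tolist())
```

Building the mask once and negating it keeps the filter a single indexing step, so the weekday test and the holiday test combine with `&` exactly as in the one-date version.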

Answered by stackoverflowuser2010

I followed the answer by @Bahman Engheta and created a function to omit dates from a dataframe.

import pandas as pd
from datetime import datetime, timedelta

def omit_dates(df, list_years, list_dates, omit_days_near=3, omit_weekends=False):
    '''
    Given a Pandas dataframe with a DatetimeIndex, remove rows that have a date
    near a given list of dates and/or a date on a weekend.

    Parameters:
    ----------

    df : Pandas dataframe

    list_years : list of str
        Contains a list of years in string form
    list_dates : list of str
        Contains a list of dates in string form encoded as MM-DD
    omit_days_near : int
        Threshold of days away from list_dates to remove. For example, if
        omit_days_near=3, then omit all days that are within 3 days of
        any date in list_dates.
    omit_weekends : bool
        If true, omit dates that are on weekends.

    Returns:
    -------
    Pandas dataframe
        New resulting dataframe with dates omitted.
    '''

    if not isinstance(df, pd.DataFrame):
        raise ValueError("df is expected to be a Pandas dataframe, not %s" % type(df).__name__)

    # pd.tseries.index is not available in current pandas releases; use the public pd.DatetimeIndex
    if not isinstance(df.index, pd.DatetimeIndex):
        raise ValueError("Dataframe is expected to have a DatetimeIndex, not %s" %
                         type(df.index).__name__)

    if not isinstance(list_years, list):
        list_years = [list_years]

    if not isinstance(list_dates, list):
        list_dates = [list_dates]

    result = df.copy()

    if omit_weekends:
        result = result.loc[result.index.dayofweek < 5]

    omit_dates = [ '%s-%s' % (year, date) for year in list_years for date in list_dates ]

    for date in omit_dates:
        result = result.loc[abs(result.index.date - datetime.strptime(date, '%Y-%m-%d').date()) > timedelta(omit_days_near)]

    return result
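
The function takes the years as an explicit argument. As a sketch of one way to avoid listing them by hand, the years could be derived from the index itself. The helper name `years_spanned` is hypothetical, and padding by one year on each side is an assumption made so that a window around a date such as '12-31' or '01-01' still applies across a year boundary.

```python
import pandas as pd

def years_spanned(index):
    """Hypothetical helper: years covered by a DatetimeIndex, as strings,
    padded by one year on each side."""
    first, last = index.year.min(), index.year.max()
    return [str(year) for year in range(first - 1, last + 2)]

idx = pd.date_range('2017-12-01', '2018-01-05', freq='D')
print(years_spanned(idx))
```

The returned list could then be passed as list_years, for example omit_dates(df, years_spanned(df.index), ['12-25', '01-01']). A date such as '02-29' would need extra handling, since it does not exist in every year.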

Here is example usage. Suppose you have a dataframe that has a DatetimeIndex and other columns, like this:

import pandas as pd
import numpy as np

# Named date_index to avoid shadowing Python's built-in range
date_index = pd.date_range('2017-12-01', '2018-01-05', freq='1D')
df = pd.DataFrame(index=date_index)

df['value'] = np.random.randint(low=0, high=60, size=len(df.index))

The resulting dataframe looks like this:

            value
2017-12-01     42
2017-12-02     35
2017-12-03     49
2017-12-04     25
2017-12-05     19
2017-12-06     28
2017-12-07     21
2017-12-08     57
2017-12-09      3
2017-12-10     57
2017-12-11     46
2017-12-12     20
2017-12-13      7
2017-12-14      5
2017-12-15     30
2017-12-16     57
2017-12-17      4
2017-12-18     46
2017-12-19     32
2017-12-20     48
2017-12-21     55
2017-12-22     52
2017-12-23     45
2017-12-24     34
2017-12-25     42
2017-12-26     33
2017-12-27     17
2017-12-28      2
2017-12-29      2
2017-12-30     51
2017-12-31     19
2018-01-01      6
2018-01-02     43
2018-01-03     11
2018-01-04     45
2018-01-05     45

Now, let's specify dates to remove. I want to remove the dates '12-10', '12-25', '12-31', and '01-01' (following MM-DD notation) and all dates within 2 days of those dates. Further, I want to remove those dates from both the years '2016' and '2017'. I also want to remove weekend dates.

I'll call my function like this:

years = ['2016', '2017']
holiday_dates = ['12-10', '12-25', '12-31', '01-01']
omit_dates(df, years, holiday_dates, omit_days_near=2, omit_weekends=True)

The result is:

            value
2017-12-01     42
2017-12-04     25
2017-12-05     19
2017-12-06     28
2017-12-07     21
2017-12-13      7
2017-12-14      5
2017-12-15     30
2017-12-18     46
2017-12-19     32
2017-12-20     48
2017-12-21     55
2017-12-22     52
2017-12-28      2
2018-01-03     11
2018-01-04     45
2018-01-05     45

Is that answer correct? Here are the calendars for December 2017 and January 2018:

   December 2017      
Su Mo Tu We Th Fr Sa  
                1  2  
 3  4  5  6  7  8  9  
10 11 12 13 14 15 16  
17 18 19 20 21 22 23  
24 25 26 27 28 29 30  
31   

    January 2018      
Su Mo Tu We Th Fr Sa  
    1  2  3  4  5  6  
 7  8  9 10 11 12 13  
14 15 16 17 18 19 20  
21 22 23 24 25 26 27  
28 29 30 31   

Looks like it works.
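
The calendar check can also be done in code. The sketch below does not call omit_dates; it independently recomputes which dates should survive, using pandas-native arithmetic (an assumption, not the method used in the function), and compares them with the dates listed in the result.

```python
import pandas as pd

idx = pd.date_range('2017-12-01', '2018-01-05', freq='D')

# Full dates formed by combining the years with the MM-DD list from the example
omit = pd.to_datetime(['%s-%s' % (year, md)
                       for year in ['2016', '2017']
                       for md in ['12-10', '12-25', '12-31', '01-01']])

keep = idx.dayofweek < 5  # weekdays only
for d in omit:
    # keep only dates strictly more than 2 days from each omitted date
    keep &= abs(idx - d) > pd.Timedelta(days=2)

kept = idx[keep].strftime('%Y-%m-%d').tolist()

# Dates copied from the result table in the answer
listed = ['2017-12-01', '2017-12-04', '2017-12-05', '2017-12-06', '2017-12-07',
          '2017-12-13', '2017-12-14', '2017-12-15', '2017-12-18', '2017-12-19',
          '2017-12-20', '2017-12-21', '2017-12-22', '2017-12-28', '2018-01-03',
          '2018-01-04', '2018-01-05']

print(kept == listed)
```

Note that 2018-01-01 and 2018-01-02 are removed because they fall within 2 days of 2017-12-31, not because 2018 is in the list of years.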
