Notice: this content comes from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/37766353/

pandas to_datetime parsing wrong year

Tags: python, datetime, pandas

Asked by dan_g

I'm coming across something that is almost certainly a stupid mistake on my part, but I can't seem to figure out what's going on.

Essentially, I have a series of dates as strings in the format "%d-%b-%y", such as 26-Sep-05. When I go to convert them to datetime, the year is sometimes correct, but sometimes it is not.

E.g.:

import pandas as pd

dates = ['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55']

pd.to_datetime(dates, format="%d-%b-%y")
DatetimeIndex(['2005-09-26', '2005-09-26', '1970-06-15', '1994-12-05',
               '2061-01-09', '2055-02-08'],
              dtype='datetime64[ns]', freq=None)

The last two entries, which come back with the years 2061 and 2055, are wrong. But this works fine for the 15-Jun-70 entry. What's going on here?

Answered by bakkal

That seems to be the behavior of the Python library datetime. I ran a test to see where the cutoff is, and it falls between 68 and 69:

import datetime

datetime.datetime.strptime('31-Dec-68', '%d-%b-%y').date()
# returns datetime.date(2068, 12, 31)

datetime.datetime.strptime('1-Jan-69', '%d-%b-%y').date()
# returns datetime.date(1969, 1, 1)

Two-digit year ambiguity

So it seems that any %y value below 69 is assigned to the 2000s, while 69 and above are assigned to the 1900s.
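As a quick check of that cutoff, the sketch below parses every two-digit value from 00 to 99 with %y and groups the results by century. The variable names are only illustrative; the only library call assumed is the standard datetime.strptime.

```python
from datetime import datetime

# Parse every two-digit year with %y and record the four-digit year that comes back.
parsed = {yy: datetime.strptime(f"{yy:02d}", "%y").year for yy in range(100)}

# Two-digit values that land in the 2000s versus the 1900s.
in_2000s = [yy for yy, year in parsed.items() if year >= 2000]
in_1900s = [yy for yy, year in parsed.items() if year < 2000]

print(min(in_2000s), max(in_2000s))  # range of two-digit years mapped to 20xx
print(min(in_1900s), max(in_1900s))  # range of two-digit years mapped to 19xx
```

If the two printed ranges meet at 68 and 69, the interpreter follows the same convention as the test above.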

The two digits of %y can only range from 00 to 99, which becomes ambiguous as soon as the data spans more than one century.

If there is no overlap, you can process the data manually and annotate the century (removing the ambiguity)

I suggest you process your data manually and specify the century, e.g. you can decide that anything in your data that has the year between 17 and 68 is attributed to 1917 - 1968 (instead of 2017 - 2068).
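One way this manual step might look with pandas is sketched below. The helper name parse_two_digit_year and its pivot parameter are hypothetical, and the sketch assumes the data never contains two dates exactly one century apart.

```python
import pandas as pd

def parse_two_digit_year(values, pivot=17, fmt="%d-%b-%y"):
    """Hypothetical helper: parse with %y, then move years >= 20<pivot> back a century."""
    parsed = pd.Series(pd.to_datetime(values, format=fmt))
    # %y already maps 69-99 to 19xx, so only 20xx years at or above the pivot need fixing.
    too_late = parsed.dt.year >= 2000 + pivot
    parsed[too_late] = parsed[too_late] - pd.DateOffset(years=100)
    return parsed

dates = ['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55']
fixed = parse_two_digit_year(dates, pivot=17)
print(fixed)
```

With pivot=17, the values 17 to 68 end up in 1917 to 1968, while 00 to 16 stay in 2000 to 2016. Choose the pivot from what you know about your own data.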

If you have overlap, you can't resolve it from the two-digit year alone, unless e.g. you have ordered data and a reference

If you have overlap, e.g. you have data from both 2016 and 1916 and both were logged as '16', that's ambiguous and there isn't sufficient information to parse it, unless the data is ordered by date, in which case you can use heuristics to switch the century as you parse.
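A rough sketch of such a heuristic follows. It assumes the records are in chronological order, that consecutive records are less than a century apart, and that you know the century of the first record. The helper parse_ordered and its first_century parameter are made up for illustration.

```python
from datetime import datetime

def parse_ordered(strings, first_century=1900, fmt="%d-%b-%y"):
    """Hypothetical helper for chronologically ordered input."""
    century = first_century
    previous = None
    result = []
    for s in strings:
        parsed = datetime.strptime(s, fmt)
        # Ignore the century strptime guessed and keep only the two-digit year.
        # Note: replace() raises ValueError for 29 February in a non-leap target year.
        candidate = parsed.replace(year=century + parsed.year % 100)
        # A date earlier than its predecessor means the data rolled into the next century.
        if previous is not None and candidate < previous:
            century += 100
            candidate = parsed.replace(year=century + parsed.year % 100)
        result.append(candidate)
        previous = candidate
    return result

ordered = ['3-Mar-16', '9-Jan-61', '15-Jun-70', '26-Sep-05', '1-Feb-16']
result = parse_ordered(ordered, first_century=1900)
print([d.date().isoformat() for d in result])
```

Here the first '16' is read as 1916 and the last one as 2016, because the jump from 70 back to 05 is taken as a change of century.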

Answered by Coquelicot

For anyone looking for a quick and dirty code snippet to fix these cases, this worked for me:

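import pandas as pd

col = 'date'
df[col] = pd.to_datetime(df[col])
# Use a pd.Timestamp as the threshold so it compares cleanly with datetime64 values
future = df[col] > pd.Timestamp(year=2050, month=1, day=1)
# DateOffset moves the date back by exactly 100 calendar years
df.loc[future, col] -= pd.DateOffset(years=100)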

You may need to tune the threshold date closer to the present depending on the earliest dates in your data.
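If the data cannot contain future dates, one possible choice of threshold is today's date. The sketch below works under that assumption. The sample frame and the cutoff variable are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'date': ['26-Sep-05', '15-Jun-70', '9-Jan-61', '8-Feb-55']})
df['date'] = pd.to_datetime(df['date'], format='%d-%b-%y')

# Assumed cutoff: no record can be later than today.
cutoff = pd.Timestamp.today().normalize()
future = df['date'] > cutoff

# Move everything that parsed into the future back by one century.
df.loc[future, 'date'] = df.loc[future, 'date'] - pd.DateOffset(years=100)
print(df)
```

This keeps recent dates such as 2005 intact while pulling 2061 and 2055 back to 1961 and 1955.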

Answered by MaxU

From the docs:

Year 2000 (Y2K) issues: Python depends on the platform's C library, which generally doesn't have year 2000 issues, since all dates and times are represented internally as seconds since the epoch. Function strptime() can parse 2-digit years when given %y format code. When 2-digit years are parsed, they are converted according to the POSIX and ISO C standards: values 69–99 are mapped to 1969–1999, and values 0–68 are mapped to 2000–2068.

Answered by Rupanjan Nayak

Another quick solution to the problem:

import pandas as pd
import numpy as np

dates = pd.DataFrame({'raw': ['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55']})

# Two-digit year taken from the last two characters of each string
tempyear = pd.to_numeric(dates['raw'].str[-2:])
# Pivot at 44: values 44-99 become 1944-1999, values 00-43 become 2000-2043
fullyear = pd.Series(np.where(tempyear >= 44, tempyear + 1900, tempyear + 2000), index=dates.index).astype(str)
# Rebuild each string with a four-digit year and parse it with %Y
dates['pddt'] = pd.to_datetime(dates['raw'].str[:-2] + fullyear, format='%d-%b-%Y')
dates = dates.drop(columns='raw')

The output is as follows. Here I have converted the column from object to pandas datetime format using pd.to_datetime:

    pddt
0   2005-09-26
1   2005-09-26
2   1970-06-15
3   1994-12-05
4   1961-01-09
5   1955-02-08

As mentioned in some other answers, this works best if there is no overlap between the dates of the two centuries.

Answered by Himanshu Arora

You can write a simple function to correct the wrongly parsed year, as shown below:

import datetime

def fix_date(x):
    # Years after 1989 are treated as one century too late; adjust this threshold to your data
    if x.year > 1989:
        year = x.year - 100
    else:
        year = x.year
    return datetime.date(year, x.month, x.day)

df['date_column'] = df['date_column'].apply(fix_date)

Hope this helps.