Python cleaning dates for conversion to year only in Pandas
Original question: http://stackoverflow.com/questions/24272398/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by ccsv
I have a large data set that users entered into a CSV file. I loaded the CSV into a DataFrame with pandas. The column has over 1000 entries; here is a sample:
datestart
5/5/2013
6/12/2013
11/9/2011
4/11/2013
10/16/2011
6/15/2013
6/19/2013
6/16/2013
10/1/2011
1/8/2013
7/15/2013
7/22/2013
7/22/2013
5/5/2013
7/12/2013
7/29/2013
8/1/2013
7/22/2013
3/15/2013
6/17/2013
7/9/2013
3/5/2013
5/10/2013
5/15/2013
6/30/2013
6/30/2013
1/1/2006
00/00/0000
7/1/2013
12/21/2009
8/14/2013
Feb 1 2013
Then I tried converting the dates into years using:
df['year']=df['datestart'].astype('timedelta64[Y]')
But it gave me an error:
ValueError: Value cannot be converted into object Numpy Time delta
Using datetime64:
df['year']=pd.to_datetime(df['datestart']).astype('datetime64[Y]')
It gave:
ValueError: Error parsing datetime string "03/13/2014" at position 2
Since that column was filled in by users, most entries are in the format MM/DD/YYYY, but some were entered like this: Feb 10 2013, and there is one entry like this: 00/00/0000. I am guessing the mixed formats broke the parsing.
Is there a try block, an if statement, or something else I can use to skip over problems like these?
If datetime parsing fails, I will be forced to use a str.extract script, which also works:
year=df['datestart'].str.extract("(?P<month>[0-9]+)(-|\/)(?P<day>[0-9]+)(-|\/)(?P<year>[0-9]+)")
del df['month'], df['day']
and use concat to take the year out.
With df['year']=pd.to_datetime(df['datestart'],coerce=True, errors ='ignore').astype('datetime64[Y]') the error message is:
Message File Name Line Position
Traceback
<module> C:\Users\Desktop\python\Example.py 23
astype C:\Python33\lib\site-packages\pandas\core\generic.py 2062
astype C:\Python33\lib\site-packages\pandas\core\internals.py 2491
apply C:\Python33\lib\site-packages\pandas\core\internals.py 3728
astype C:\Python33\lib\site-packages\pandas\core\internals.py 1746
_astype C:\Python33\lib\site-packages\pandas\core\internals.py 470
_astype_nansafe C:\Python33\lib\site-packages\pandas\core\common.py 2222
TypeError: cannot astype a datetimelike from [datetime64[ns]] to [datetime64[Y]]
Answered by joris
You first have to convert the column of date values to datetimes with to_datetime():
df['datestart'] = pd.to_datetime(df['datestart'], coerce=True)
This should normally parse the different formats flexibly (coerce=True is important here, as it converts invalid dates to NaT).
If you then want the year part of the dates, you can do the following (calling astype directly on the pandas column seems to give an error, but with values you can get the underlying numpy array):
df['datestart'].values.astype('datetime64[Y]')
The problem with this is that it again gives an error when you assign the result to a column, because of the NaT value (this seems to be a bug; you can work around it with df = df.dropna()). Also, when you assign this to a column, it gets converted back to datetime64[ns], as this is how pandas stores datetimes. So I personally think that if you want a column with the years, you are better off doing the following:
df['year'] = pd.DatetimeIndex(df['datestart']).year
This last one will return the year as an integer.
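On more recent pandas versions, to_datetime() no longer accepts the coerce=True keyword; errors='coerce' takes its place. Below is a minimal sketch of the same idea, assuming pandas 1.0 or later. The helper name year_or_na is hypothetical, and the five sample values are taken from the question for illustration only. It parses each entry on its own, so mixed formats such as 5/5/2013 and Feb 1 2013 are handled independently, and unparseable entries like 00/00/0000 become missing values:

```python
import pandas as pd

# Sample values copied from the question, including the two problem entries.
df = pd.DataFrame(
    {"datestart": ["5/5/2013", "10/16/2011", "00/00/0000", "Feb 1 2013", "12/21/2009"]}
)


def year_or_na(text):
    """Return the year of one date string, or pd.NA if it cannot be parsed.

    year_or_na is a hypothetical helper, not part of pandas.
    """
    ts = pd.to_datetime(text, errors="coerce")  # invalid input becomes NaT
    return pd.NA if pd.isna(ts) else ts.year


# Int64 is pandas' nullable integer dtype, so the years stay integers
# even though one entry is missing.
df["year"] = df["datestart"].map(year_or_na).astype("Int64")
print(df)
```

Parsing element by element is slower than a single vectorized to_datetime() call, but for a column of roughly a thousand entries that should not matter. If rows without a usable year are not wanted, df.dropna(subset=['year']) removes them afterwards.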

