Python: cleaning dates and converting them to year only in Pandas

Question

Asked by ccsv

I have a large data set that some users entered into a CSV file. I converted the CSV into a DataFrame with pandas. The column has over 1000 entries; here is a sample:

datestart
5/5/2013
6/12/2013
11/9/2011
4/11/2013
10/16/2011
6/15/2013
6/19/2013
6/16/2013
10/1/2011
1/8/2013
7/15/2013
7/22/2013
7/22/2013
5/5/2013
7/12/2013
7/29/2013
8/1/2013
7/22/2013
3/15/2013
6/17/2013
7/9/2013
3/5/2013
5/10/2013
5/15/2013
6/30/2013
6/30/2013
1/1/2006
00/00/0000
7/1/2013
12/21/2009
8/14/2013
Feb 1 2013

Then I tried converting the dates into years using:

df['year']=df['datestart'].astype('timedelta64[Y]')

But it gave me an error:

ValueError: Value cannot be converted into object Numpy Time delta

Using Datetime64

df['year']=pd.to_datetime(df['datestart']).astype('datetime64[Y]')

It gave:

ValueError: Error parsing datetime string "03/13/2014" at position 2

Since that column was filled in by users, most entries are in the format MM/DD/YYYY, but some were entered like this: Feb 10 2013, and there was one entry like this: 00/00/0000. I am guessing the mixed formats broke the parsing.

Is there a try block, an if statement, or something else I can use to skip over problems like these?

If the datetime approach fails, I will be forced to use a str.extract script, which also works:

year=df['datestart'].str.extract("(?P<month>[0-9]+)(-|\/)(?P<day>[0-9]+)(-|\/)(?P<year>[0-9]+)")


del df['month'], df['day']

and then use concat to take the year out.

With df['year']=pd.to_datetime(df['datestart'],coerce=True, errors ='ignore').astype('datetime64[Y]') the error message is:

Message File Name   Line    Position    
Traceback               
    <module>    C:\Users\Desktop\python\Example.py    23      
    astype  C:\Python33\lib\site-packages\pandas\core\generic.py    2062        
    astype  C:\Python33\lib\site-packages\pandas\core\internals.py  2491        
    apply   C:\Python33\lib\site-packages\pandas\core\internals.py  3728        
    astype  C:\Python33\lib\site-packages\pandas\core\internals.py  1746        
    _astype C:\Python33\lib\site-packages\pandas\core\internals.py  470     
    _astype_nansafe C:\Python33\lib\site-packages\pandas\core\common.py 2222        
TypeError: cannot astype a datetimelike from [datetime64[ns]] to [datetime64[Y]]

Answer 1

Answered by joris

You first have to convert the column with the date values to datetimes with to_datetime():

df['datestart'] = pd.to_datetime(df['datestart'], coerce=True)

This should normally parse the different formats flexibly (coerce=True is important here, as it converts invalid dates to NaT).
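In recent pandas releases the coerce=True keyword is no longer accepted, and errors='coerce' plays the same role. Newer versions may also be stricter about mixing formats within a single call, so the following is only a minimal sketch that parses each known format in its own pass. The sample values come from the question; the two format strings are assumptions about how the users typed the dates.

```python
import pandas as pd

# Sample values taken from the question: two valid formats and one invalid entry.
df = pd.DataFrame({"datestart": ["5/5/2013", "Feb 1 2013", "00/00/0000", "12/21/2009"]})

# First pass: MM/DD/YYYY. Anything that does not match becomes NaT.
slash_dates = pd.to_datetime(df["datestart"], format="%m/%d/%Y", errors="coerce")

# Second pass: "Feb 1 2013" style (assumed format for the remaining entries).
named_dates = pd.to_datetime(df["datestart"], format="%b %d %Y", errors="coerce")

# Prefer the first pass and fill its gaps from the second.
df["parsed"] = slash_dates.fillna(named_dates)
```

Entries that match neither format, such as 00/00/0000, stay NaT and can be inspected or dropped afterwards.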

If you then want the year part of the dates, you can do the following (calling astype directly on the pandas column seems to give an error, but with values you can get the underlying numpy array):

df['datestart'].values.astype('datetime64[Y]')
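To illustrate what the NumPy-level conversion produces, here is a small sketch on a hypothetical, already-clean sample (no NaT involved). The variable names are made up for the example.

```python
import pandas as pd

# Hypothetical, already-clean sample so that no NaT is involved.
dates = pd.to_datetime(pd.Series(["2013-05-05", "2011-11-09"]))

# .values exposes the underlying NumPy datetime64 array, which can be truncated to years.
years = dates.values.astype("datetime64[Y]")

# datetime64[Y] counts years since 1970, so adding 1970 yields plain integers.
year_numbers = years.astype("int64") + 1970
```

The result is still a NumPy array of year-resolution datetimes, which is why a further step is needed to get ordinary integers.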

The problem is that assigning this to a column again raises an error because of the NaT value (this seems to be a bug; you can work around it with df = df.dropna()). Also, when you assign it to a column, it gets converted back to datetime64[ns], as this is how pandas stores datetimes. So if you want a column with the years, I personally think it is better to do the following:

df['year'] = pd.DatetimeIndex(df['datestart']).year

This last one will return the year as an integer.
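When the column still contains NaT, the year typically comes back as a float with NaN for the missing entries. A minimal sketch, assuming a pandas version that provides the nullable Int64 dtype, keeps the years as integers without dropping rows. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical parsed column with one missing date.
parsed = pd.Series(pd.to_datetime(["2013-05-05", None, "2009-12-21"]))

# .dt.year returns floats here because of the NaT; Int64 keeps integers plus <NA>.
years = parsed.dt.year.astype("Int64")
```

This avoids the dropna() step when the rows with invalid dates still carry other useful data.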
