Original URL: http://stackoverflow.com/questions/38812020/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
python-pandas: dealing with NaT values in a date column of a pandas DataFrame
Asked by Satya
I have a DataFrame with a mixed-datatype column. I applied pd.to_datetime(df['DATE'],coerce=True) and got the DataFrame below:
CUSTOMER_name DATE
abc NaT
def NaT
abc 2010-04-15 19:09:08
def 2011-01-25 15:29:37
abc 2010-04-10 12:29:02
Now I want to apply an aggregation function: I want to group by mailid and take the min() of DATE to find the date of each mailid's first transaction.
df['DATE'] = [x.date() for x in df['DATE']]
# Here the NaT values turn into:
CUSTOMER_name DATE
abc 0001-255-255   # how?
def 0001-255-255   # how?
abc 2010-04-15
def 2011-01-25
abc 2010-04-10
# Then, when I group by CUSTOMER_name and apply min() on DATE:
df.groupby('CUSTOMER_name')['DATE'].min()
# CUSTOMER_name DATE
abc 0001-255-255   # I want 2010-04-10
def 0001-255-255   # I want 2011-01-25
Can anyone suggest how to deal with NaT when converting to date(), and how to exclude NaT from the calculation when doing groupby and min()?
If a customer_name has only NaT in the DATE field, I am okay with getting NaN or null values from groupby and min().
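A note for readers on recent pandas releases: the coerce=True keyword used above belongs to older versions, and the equivalent spelling today is errors='coerce'. The sketch below is only an illustration under assumptions: the sample strings are made up, and the explicit format argument is my addition so that anything not matching the pattern becomes NaT.

```python
import pandas as pd

# Made-up sample values: one unparseable entry and two valid timestamps.
raw = pd.Series(['not a date', '2010-04-15 19:09:08', '2011-01-25 15:29:37'])

# errors='coerce' turns values that do not match the format into NaT
# instead of raising an exception.
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S', errors='coerce')

print(parsed)
print(parsed.isna().sum(), "value(s) could not be parsed")
```

Checking parsed.isna() right after the conversion is a cheap way to see how many rows were coerced before any aggregation runs.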
Accepted answer by Ami Tavory
Say you start with something like this:
import pandas as pd

df = pd.DataFrame({
    'CUSTOMER_name': ['abc', 'def', 'abc', 'def', 'abc', 'fff'],
    'DATE': ['NaT', 'NaT', '2010-04-15 19:09:08', '2011-01-25 15:29:37', '2010-04-10 12:29:02', 'NaT']})
# Parse the strings; the 'NaT' entries become missing timestamps.
df.DATE = pd.to_datetime(df.DATE)
(Note that the only difference from your data is the added customer fff, which maps only to NaT.)
Then the following does what you ask:
>>> pd.to_datetime(df.DATE.groupby(df.CUSTOMER_name).min())
CUSTOMER_name
abc 2010-04-10 12:29:02
def 2011-01-25 15:29:37
fff NaT
Name: DATE, dtype: datetime64[ns]
This is because groupby-min already excludes missing data where applicable (albeit changing the format of the results), and the final pd.to_datetime coerces the result back to a datetime.
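To see the skipping behaviour in isolation, here is a minimal sketch. It assumes a reasonably recent pandas, where the grouped min() already comes back as a datetime column, so the outer pd.to_datetime is left out. The variable name first_seen is mine, and pd.NaT objects are used directly in place of the 'NaT' strings.

```python
import pandas as pd

df = pd.DataFrame({
    'CUSTOMER_name': ['abc', 'def', 'abc', 'def', 'abc', 'fff'],
    'DATE': pd.to_datetime([pd.NaT, pd.NaT, '2010-04-15 19:09:08',
                            '2011-01-25 15:29:37', '2010-04-10 12:29:02', pd.NaT]),
})

# min() ignores NaT inside each group; a group holding only NaT yields NaT.
first_seen = df.groupby('CUSTOMER_name')['DATE'].min()
print(first_seen)
```

The customer fff has no valid date at all, so its entry stays missing, which matches what the question says is acceptable.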
To get the date part of the result (which I think is a separate question), use .dt.date:
>>> pd.to_datetime(df.DATE.groupby(df.CUSTOMER_name).min()).dt.date
Out[19]:
CUSTOMER_name
abc 2010-04-10
def 2011-01-25
fff NaN
Name: DATE, dtype: object
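If the customers without any valid date should be left out entirely rather than shown as missing, one option is to drop them before taking the date part. This is a sketch, not part of the original answer: the names first_seen and first_dates are mine, and it assumes a recent pandas where the grouped minimum is already a datetime column.

```python
import datetime
import pandas as pd

df = pd.DataFrame({
    'CUSTOMER_name': ['abc', 'def', 'abc', 'def', 'abc', 'fff'],
    'DATE': pd.to_datetime([pd.NaT, pd.NaT, '2010-04-15 19:09:08',
                            '2011-01-25 15:29:37', '2010-04-10 12:29:02', pd.NaT]),
})

first_seen = df.groupby('CUSTOMER_name')['DATE'].min()

# Aggregate on real timestamps first, then convert to plain dates.
# dropna() removes customers whose only values were NaT.
first_dates = first_seen.dropna().dt.date
print(first_dates)
```

Doing the aggregation on datetime values and converting to date only at the end avoids calling date() on individual NaT entries, which is where the question's odd 0001-255-255 values appeared.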
Answer by MaxU
Here is an alternative solution:
Data:
In [96]: x
Out[96]:
CUSTOMER_name DATE
0 abc T
1 def N
2 abc 2010-04-15 19:09:08
3 def 2011-01-25 15:29:37
4 abc 2010-04-10 12:29:02
5 fff sa
Solution:
In [100]: (x.assign(D=pd.to_datetime(x.DATE, errors='coerce').values.astype('<M8[D]'))
.....: .groupby('CUSTOMER_name')['D']
.....: .min()
.....: .astype('datetime64[ns]')
.....: )
Out[100]:
CUSTOMER_name
abc 2010-04-10
def 2011-01-25
fff NaT
Name: D, dtype: datetime64[ns]
Explanation:
First, let's create a new virtual column D with the time part truncated:
In [97]: x.assign(D=pd.to_datetime(x.DATE, errors='coerce').values.astype('<M8[D]'))
Out[97]:
CUSTOMER_name DATE D
0 abc T NaT
1 def N NaT
2 abc 2010-04-15 19:09:08 2010-04-15
3 def 2011-01-25 15:29:37 2011-01-25
4 abc 2010-04-10 12:29:02 2010-04-10
5 fff sa NaT
Now we can group by CUSTOMER_name and calculate the minimum D for each group:
In [101]: x.assign(D=pd.to_datetime(x.DATE, errors='coerce').values.astype('<M8[D]')).groupby('CUSTOMER_name')['D'].min()
Out[101]:
CUSTOMER_name
abc 1.270858e+18
def 1.295914e+18
fff NaN
Name: D, dtype: float64
Finally, convert the resulting column, which groupby-min returned as float64 above, back to datetime64[ns] dtype:
In [102]: (x.assign(D=pd.to_datetime(x.DATE, errors='coerce').values.astype('<M8[D]'))
.....: .groupby('CUSTOMER_name')['D']
.....: .min()
.....: .astype('datetime64[ns]')
.....: )
Out[102]:
CUSTOMER_name
abc 2010-04-10
def 2011-01-25
fff NaT
Name: D, dtype: datetime64[ns]
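For comparison, here is a variant of the same pipeline that swaps .values.astype('<M8[D]') for the .dt.normalize() accessor, which sets the time part to midnight while keeping a datetime dtype. This is a sketch under assumptions rather than the answer's method: it targets a recent pandas, where the grouped min() stays a datetime column and no final astype is needed, and the explicit format argument is my addition so that the junk values are coerced to NaT.

```python
import pandas as pd

x = pd.DataFrame({
    'CUSTOMER_name': ['abc', 'def', 'abc', 'def', 'abc', 'fff'],
    'DATE': ['T', 'N', '2010-04-15 19:09:08', '2011-01-25 15:29:37',
             '2010-04-10 12:29:02', 'sa'],
})

# Anything that does not match the format becomes NaT.
parsed = pd.to_datetime(x['DATE'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# normalize() truncates each timestamp to midnight; NaT stays NaT.
result = (x.assign(D=parsed.dt.normalize())
           .groupby('CUSTOMER_name')['D']
           .min())
print(result)
```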