python-pandas: Dealing with NaT type values in a date column of a pandas DataFrame

Question

Asked by Satya

I have a dataframe with a mixed-datatype column. I applied pd.to_datetime(df['DATE'], coerce=True) and got the dataframe below:

CUSTOMER_name     DATE
 abc                 NaT
 def                 NaT
 abc               2010-04-15 19:09:08
 def               2011-01-25 15:29:37
 abc               2010-04-10 12:29:02

Now I want to apply an aggregation function (here I want to group by mailid and take the min() of DATE to find that mailid's date of first transaction).

df['DATE'] = [x.date() for x in df['DATE']]
# Here the values become:
 CUSTOMER_name     DATE
 abc               0001-255-255  # how?
 def               0001-255-255  # how?
 abc               2010-04-15
 def               2011-01-25
 abc               2010-04-10
# Then I do a groupby and apply min on DATE:
df.groupby('CUSTOMER_name')['DATE'].min()
# CUSTOMER_name     DATE
 abc               0001-255-255  # I want 2010-04-10
 def               0001-255-255  # I want 2011-01-25

So can anyone suggest how to deal with NaT when converting to date(), and how to exclude NaT from the calculation when doing groupby and min()?

If a customer_name has only NaT in the DATE field, then I am okay with NaN or null values as the result of groupby and min().

Answer 1

Accepted answer by Ami Tavory

Say you start with something like this:

import pandas as pd

df = pd.DataFrame({
    'CUSTOMER_name': ['abc', 'def', 'abc', 'def', 'abc', 'fff'], 
    'DATE': ['NaT', 'NaT', '2010-04-15 19:09:08', '2011-01-25 15:29:37', '2010-04-10 12:29:02', 'NaT']})
# Parse the strings into datetimes; the 'NaT' entries become missing values.
df.DATE = pd.to_datetime(df.DATE)

(Note that the only difference is the added customer fff, which maps to NaT.)

Then the following does what you ask:

>>> pd.to_datetime(df.DATE.groupby(df.CUSTOMER_name).min())
CUSTOMER_name
abc   2010-04-10 12:29:02
def   2011-01-25 15:29:37
fff                   NaT
Name: DATE, dtype: datetime64[ns]

This works because groupby followed by min already excludes missing data where applicable (albeit changing the format of the results), and the final pd.to_datetime coerces the result back to a datetime.
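As a minimal sketch of the point about missing data, the snippet below rebuilds the example frame and takes the per-customer minimum directly. It assumes a reasonably recent pandas version, where the result of min() stays a datetime column so the outer pd.to_datetime may not be needed; the use of None for the missing dates and the name first_seen are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Same shape as the example frame; None stands in for the missing dates.
df = pd.DataFrame({
    'CUSTOMER_name': ['abc', 'def', 'abc', 'def', 'abc', 'fff'],
    'DATE': [None, None, '2010-04-15 19:09:08', '2011-01-25 15:29:37',
             '2010-04-10 12:29:02', None],
})
df['DATE'] = pd.to_datetime(df['DATE'])

# min() skips NaT within each group; a group with no valid dates yields NaT.
first_seen = df.groupby('CUSTOMER_name')['DATE'].min()
print(first_seen)
```

Wrapping the result in pd.to_datetime, as the answer does, should be harmless either way.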

To get the date part of the result (which I think is a separate question), use .dt.date:

>>> pd.to_datetime(df.DATE.groupby(df.CUSTOMER_name).min()).dt.date
Out[19]: 
CUSTOMER_name
abc    2010-04-10
def    2011-01-25
fff           NaN
Name: DATE, dtype: object
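A small sketch of working with the .dt.date result follows. It assumes the same example frame (with None standing in for the missing dates) and shows that the values are plain datetime.date objects in an object-dtype Series, so missing entries are easier to detect with isna()/notna() than by comparing against a particular printed value. The name first_date is hypothetical.

```python
import datetime
import pandas as pd

df = pd.DataFrame({
    'CUSTOMER_name': ['abc', 'def', 'abc', 'def', 'abc', 'fff'],
    'DATE': [None, None, '2010-04-15 19:09:08', '2011-01-25 15:29:37',
             '2010-04-10 12:29:02', None],
})
df['DATE'] = pd.to_datetime(df['DATE'])

# Per-customer minimum, then keep only the calendar date.
first_date = pd.to_datetime(df.DATE.groupby(df.CUSTOMER_name).min()).dt.date

# Entries are datetime.date objects; the all-missing group stays missing.
print(first_date[first_date.notna()])
print(isinstance(first_date.loc['abc'], datetime.date))
```

How the missing entry is displayed (NaN or NaT) can differ between pandas versions, which is why the sketch checks it with notna() instead.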

Answer 2

Answer by MaxU

Here is an alternative solution:

Data:

In [96]: x
Out[96]:
  CUSTOMER_name                 DATE
0           abc                    T
1           def                    N
2           abc  2010-04-15 19:09:08
3           def  2011-01-25 15:29:37
4           abc  2010-04-10 12:29:02
5           fff                   sa

Solution:

In [100]: (x.assign(D=pd.to_datetime(x.DATE, errors='coerce').values.astype('<M8[D]'))
   .....:   .groupby('CUSTOMER_name')['D']
   .....:   .min()
   .....:   .astype('datetime64[ns]')
   .....: )
Out[100]:
CUSTOMER_name
abc   2010-04-10
def   2011-01-25
fff          NaT
Name: D, dtype: datetime64[ns]

Explanation:

First, let's create a new virtual column D with the time part truncated:

In [97]: x.assign(D=pd.to_datetime(x.DATE, errors='coerce').values.astype('<M8[D]'))
Out[97]:
  CUSTOMER_name                 DATE          D
0           abc                    T        NaT
1           def                    N        NaT
2           abc  2010-04-15 19:09:08 2010-04-15
3           def  2011-01-25 15:29:37 2011-01-25
4           abc  2010-04-10 12:29:02 2010-04-10
5           fff                   sa        NaT
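As an alternative to casting the underlying NumPy array to '<M8[D]', the time part can also be dropped with the .dt.normalize() accessor, which keeps the column as a pandas datetime Series. The sketch below is not the answer's method; the explicit format string is an assumption about how the valid rows are written, added so that every non-matching string is coerced to NaT predictably.

```python
import pandas as pd

x = pd.DataFrame({
    'CUSTOMER_name': ['abc', 'def', 'abc', 'def', 'abc', 'fff'],
    'DATE': ['T', 'N', '2010-04-15 19:09:08', '2011-01-25 15:29:37',
             '2010-04-10 12:29:02', 'sa'],
})

# errors='coerce' turns anything that does not match the format into NaT.
parsed = pd.to_datetime(x['DATE'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# normalize() sets the time to midnight and leaves NaT untouched.
x = x.assign(D=parsed.dt.normalize())
print(x)
```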

Now we can group by CUSTOMER_name and calculate the minimum D for each group:

In [101]: x.assign(D=pd.to_datetime(x.DATE, errors='coerce').values.astype('<M8[D]')).groupby('CUSTOMER_name')['D'].min()
Out[101]:
CUSTOMER_name
abc    1.270858e+18
def    1.295914e+18
fff             NaN
Name: D, dtype: float64
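The float64 values in this output are consistent with timestamps expressed as nanoseconds since the Unix epoch (1970-01-01). The short check below illustrates that reading for the two non-missing results; it is a sketch to make the numbers less mysterious, and whether min() returns floats at all likely depends on the pandas version in use.

```python
import pandas as pd

# Midnight timestamps for the two non-missing group minima.
abc_min = pd.Timestamp('2010-04-10')
def_min = pd.Timestamp('2011-01-25')

# Timestamp.value is the number of nanoseconds since 1970-01-01.
print(abc_min.value)            # 1270857600000000000
print(f"{abc_min.value:.6e}")   # 1.270858e+18
print(f"{def_min.value:.6e}")   # 1.295914e+18

# Going the other way: interpret an integer as nanoseconds since the epoch.
print(pd.to_datetime(1270857600000000000, unit='ns'))
```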

Finally, convert the resulting column to the datetime64[ns] dtype:

In [102]: (x.assign(D=pd.to_datetime(x.DATE, errors='coerce').values.astype('<M8[D]'))
   .....:   .groupby('CUSTOMER_name')['D']
   .....:   .min()
   .....:   .astype('datetime64[ns]')
   .....: )
Out[102]:
CUSTOMER_name
abc   2010-04-10
def   2011-01-25
fff          NaT
Name: D, dtype: datetime64[ns]
