Disclaimer: This page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/26788854/

Time: 2020-08-19 01:01:20  Source: igfitidea

Pandas get the age from a date (example: date of birth)

python, pandas

Asked by Dave

How can I calculate the age of a person (based off the dob column) and add a column to the dataframe with the new value?

dataframe looks like the following:

    lname      fname     dob
0    DOE       LAURIE    03011979
1    BOURNE    JASON     06111978
2    GRINCH    XMAS      12131988
3    DOE       JOHN      11121986

I tried doing the following:

now = datetime.now()
df1['age'] = now - df1['dob']

But, received the following error:

TypeError: unsupported operand type(s) for -: 'datetime.datetime' and 'str'

Accepted answer by unutbu

import datetime as DT
import io
import numpy as np
import pandas as pd

pd.options.mode.chained_assignment = 'warn'

content = '''     ssno        lname         fname    pos_title             ser  gender  dob 
0    23456789    PLILEY     JODY        BUDG ANAL             0560  F      031871 
1    987654321   NOEL       HEATHER     PRTG SRVCS SPECLST    1654  F      120852
2    234567891   SONJU      LAURIE      SUPVY CONTR SPECLST   1102  F      010999
3    345678912   MANNING    CYNTHIA     SOC SCNTST            0101  F      081692
4    456789123   NAUERTZ    ELIZABETH   OFF AUTOMATION ASST   0326  F      031387'''

# A regex separator needs a raw string and the Python parsing engine
df = pd.read_csv(io.StringIO(content), sep=r'\s{2,}', engine='python')
df['dob'] = df['dob'].apply('{:06}'.format)

now = pd.Timestamp('now')
df['dob'] = pd.to_datetime(df['dob'], format='%m%d%y')    # 1
df['dob'] = df['dob'].where(df['dob'] < now, df['dob'] -  np.timedelta64(100, 'Y'))   # 2
df['age'] = (now - df['dob']).astype('<m8[Y]')    # 3
print(df)

yields

        ssno    lname      fname            pos_title   ser gender  \
0   23456789   PLILEY       JODY            BUDG ANAL   560      F   
1  987654321     NOEL    HEATHER   PRTG SRVCS SPECLST  1654      F   
2  234567891    SONJU     LAURIE  SUPVY CONTR SPECLST  1102      F   
3  345678912  MANNING    CYNTHIA           SOC SCNTST   101      F   
4  456789123  NAUERTZ  ELIZABETH  OFF AUTOMATION ASST   326      F   

                  dob  age  
0 1971-03-18 00:00:00   43  
1 1952-12-08 18:00:00   61  
2 1999-01-09 00:00:00   15  
3 1992-08-16 00:00:00   22  
4 1987-03-13 00:00:00   27  


  1. It looks like your dob column currently holds strings. First, convert them to Timestamps using pd.to_datetime.
  2. The format '%m%d%y' converts the last two digits to years, but unfortunately assumes 52 means 2052. Since that's probably not Heather Noel's birth year, let's subtract 100 years from dob whenever the dob is greater than now. You may want to subtract a few years from now in the condition df['dob'] < now, since it may be slightly more likely to have a 101-year-old worker than a 1-year-old worker...
  3. You can subtract dob from now to obtain timedelta64[ns]. To convert that to years, use astype('<m8[Y]') or astype('timedelta64[Y]').
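
Newer pandas releases may reject the year-unit operations used above (np.timedelta64(100, 'Y') and astype('<m8[Y]')), since a calendar year is not a fixed length. The three steps can be sketched without them as follows. This is a minimal sketch, not the answer author's code: the three sample values are taken from the data above, the reference date is fixed only so the result is reproducible, and pd.DateOffset plus a calendar comparison are swapped in for the year-unit arithmetic.

```python
import pandas as pd

# Sample six-digit MMDDYY strings taken from the table above.
df = pd.DataFrame({"dob": ["031871", "120852", "010999"]})

# Assumption: a fixed reference date for reproducibility;
# use pd.Timestamp.now() for real data.
now = pd.Timestamp("2014-11-06")

# Step 1: parse the strings into Timestamps.
dob = pd.to_datetime(df["dob"], format="%m%d%y")

# Step 2: dates parsed into the future (e.g. 2052) are moved back one century.
dob = dob.where(dob <= now, dob - pd.DateOffset(years=100))

# Step 3: whole years, minus one if this year's birthday has not happened yet.
had_birthday = (dob.dt.month < now.month) | (
    (dob.dt.month == now.month) & (dob.dt.day <= now.day)
)
df["dob"] = dob
df["age"] = now.year - dob.dt.year - (~had_birthday).astype(int)

print(df)
```

Comparing month and day directly also sidesteps the rounding that comes from treating a year as a fixed number of days.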

Answer by Brandon Humpert

My first thought is that your years are two digits, which is not a great choice in this day and age. In any case, I'm going to assume that all years like 05 are actually 1905. This is probably not correct(!), but coming up with the right rule is going to depend a lot on your data.

from datetime import date

def age(date1, date2):
    naive_yrs = date2.year - date1.year
    if date1.replace(year=date2.year) > date2:
        correction = -1
    else:
        correction = 0
    return naive_yrs + correction

df1['age'] = df1['dob'].map(lambda x: age(date(int('19' + x[-2:]), int(x[:2]), int(x[2:-2])), date.today()))
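
As one possible alternative to always prefixing '19', the century could be chosen so that the birth date is the most recent one not in the future. The sketch below assumes six-digit MMDDYY strings and that nobody in the data is 100 or older; parse_mmddyy and age_on are hypothetical helper names, not part of the answer above. The age comparison uses (month, day) tuples, which avoids calling replace(year=...) on a February 29 birthday in a non-leap year.

```python
from datetime import date


def parse_mmddyy(s, today):
    """Hypothetical helper: parse 'MMDDYY' using the latest century not in the future."""
    month, day, yy = int(s[:2]), int(s[2:4]), int(s[4:])
    year = today.year - today.year % 100 + yy  # try the current century first
    if (year, month, day) > (today.year, today.month, today.day):
        year -= 100  # assumption: a future date must belong to the previous century
    return date(year, month, day)


def age_on(born, today):
    """Whole years between born and today."""
    before_birthday = (today.month, today.day) < (born.month, born.day)
    return today.year - born.year - before_birthday


today = date(2014, 11, 6)  # fixed only to make the example reproducible
born = parse_mmddyy("120852", today)
print(born, age_on(born, today))
```

With a pandas column of such strings, the two helpers could be combined in the same map call used above.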

Answer by nnaqa

I found an easier solution:

import pandas as pd
from datetime import datetime
from datetime import date

d = {'col0': [1, 2, 6], 'col1': [3, 8, 3], 'col2': ['17.02.1979',
          '11.11.1993',
          '01.08.1961']}

df = pd.DataFrame(data=d)

def calculate_age(born):
    born = datetime.strptime(born, "%d.%m.%Y").date()
    today = date.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

df['age'] = df['col2'].apply(calculate_age)  # 'col2' holds the date strings
print(df)

output:

     col0  col1  col2        age
0       1     3  17.02.1979   39
1       2     8  11.11.1993   24
2       6     3  01.08.1961   57

Answer by cs95

# Data setup
df

    lname   fname        dob
0     DOE  LAURIE 1979-03-01
1  BOURNE   JASON 1978-06-11
2  GRINCH    XMAS 1988-12-13
3     DOE    JOHN 1986-11-12

# Make sure to parse all datetime columns in advance
df['dob'] = pd.to_datetime(df['dob'], errors='coerce')

If you want only the year portion of the age, use @unutbu's solution. . .

now = pd.to_datetime('now')
now
# Timestamp('2019-04-14 00:00:43.105892')

(now - df['dob']).astype('<m8[Y]') 

0    40.0
1    40.0
2    30.0
3    32.0
Name: dob, dtype: float64

Another option is to subtract the year portion and account for the month difference using

(now.year - df['dob'].dt.year) - ((now.month - df['dob'].dt.month) < 0)

0    40
1    40
2    30
3    32
Name: dob, dtype: int64
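
Note that the expression above compares months only, so someone whose birthday falls later in the current month is counted as a year older. A sketch that also checks the day is shown below. The first four dates come from the data above; the fifth row is a hypothetical addition that exposes the difference, and the reference date is fixed only for reproducibility.

```python
import pandas as pd

df = pd.DataFrame({
    "dob": pd.to_datetime([
        "1979-03-01", "1978-06-11", "1988-12-13", "1986-11-12",
        "2000-04-20",  # hypothetical row: birthday later in the reference month
    ])
})
now = pd.Timestamp("2019-04-14")  # assumption: fixed reference date

month, day = df["dob"].dt.month, df["dob"].dt.day

# Month-only correction, equivalent to the expression above.
month_only = now.year - df["dob"].dt.year - (month > now.month).astype(int)

# Month-and-day correction: subtract a year if the birthday is still ahead.
before_birthday = (month > now.month) | ((month == now.month) & (day > now.day))
age = now.year - df["dob"].dt.year - before_birthday.astype(int)

print(pd.DataFrame({"month_only": month_only, "age": age}))
```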


If you want the (almost) precise age (including the fractional portion), query total_seconds and divide.

(now - df['dob']).dt.total_seconds() / (60*60*24*365.25)

0    40.120446
1    40.840501
2    30.332630
3    32.418872
Name: dob, dtype: float64