Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/33204500/

Date: 2020-09-14 00:04:20  Source: igfitidea

Pandas - automatically detect date columns **at run time**

Tags: python, pandas, scikit-learn

Asked by anthonybell

I was wondering whether pandas can automatically detect which columns hold datetime values and read those columns in as dates instead of strings.

I have been looking at the API and related Stack Overflow posts, but I can't seem to figure it out.

This is a black-box system that takes in arbitrary CSV schemas in production, so I do not know what the column names will be.

This seems like it would work but you have to know which columns are date fields:

import pandas as pd

#creating the test data
df = pd.DataFrame({'0': ['a', 'b', 'c'], '1': ['2015-12-27','2015-12-28', '2015-12-29'], '2': [11,12,13]})
df.to_csv('test.csv', index=False)

# loading the test data
# (parse_dates=True only attempts to parse the index, not the other columns)
df = pd.read_csv('test.csv', parse_dates=True)
print(df.dtypes)
# prints (object, object, int64) instead of (object, datetime64[ns], int64)
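
For comparison, here is a minimal sketch of the case the question wants to avoid, where the date column's name is known ahead of time. It assumes the same three-column test data, held in an in-memory string rather than test.csv so that the snippet is self-contained. Passing the column name in a list to parse_dates makes read_csv convert that column.

```python
import io
import pandas as pd

# Same data as the test file above, kept in memory (an assumption made
# only to keep this sketch self-contained).
csv_text = "0,1,2\na,2015-12-27,11\nb,2015-12-28,12\nc,2015-12-29,13\n"

# Naming the column explicitly tells read_csv to parse it as dates.
df_named = pd.read_csv(io.StringIO(csv_text), parse_dates=['1'])
print(df_named.dtypes)
```

This only helps when the schema is known, which is exactly what the black-box setting rules out.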

If pandas cannot do this, I am thinking I could write something that:

  1. Finds columns with string type.
  2. Grabs a few unique values and tries to parse them.
  3. If successful, tries to parse the whole column.

Edit: I wrote a simple method, convertDateColumns, that will do this:

import pandas as pd
from dateutil import parser

def convertDateColumns(df):
    # only object (string) columns are candidates for date parsing
    object_cols = df.columns.values[df.dtypes.values == 'object']
    date_cols = [c for c in object_cols if testIfColumnIsDate(df[c], num_tries=3)]

    for col in date_cols:
        try:
            # errors='coerce' turns values that cannot be parsed into NaT
            df[col] = pd.to_datetime(df[col], errors='coerce')
        except ValueError:
            pass

    return df

def testIfColumnIsDate(series, num_tries=4):
    """ Test if a column contains date values.
        This can try a few times for the scenario where a date column may have
        a couple of null or missing values but we still want to parse when
        possible (and convert those null/missing to NaT values)
    """
    if series.dtype != 'object':
        return False

    # collect a handful of distinct values to test
    vals = set()
    for val in series:
        vals.add(val)
        if len(vals) > num_tries:
            break

    for val in list(vals):
        try:
            # skip ints, NaN and any other non-string values
            if not isinstance(val, str):
                continue

            parser.parse(val)
            return True
        except (ValueError, OverflowError):
            pass

    return False

Answered by jakevdp

I would use pd.to_datetime, and catch exceptions on columns that don't work. For example:

import pandas as pd

df = pd.read_csv('test.csv')

for col in df.columns:
    if df[col].dtype == 'object':
        try:
            df[col] = pd.to_datetime(df[col])
        except ValueError:
            pass

df.dtypes
# (object, datetime64[ns], int64)

I believe this is as close to "automatic" as you can get for this application.

Answered by Romain

You can avoid a for loop and use the parameter errors='ignore' to avoid modifying unwanted values. In the code below we apply a to_datetime transformation (ignoring errors) on all object columns (other columns are returned as is).

If 'ignore', then invalid parsing will return the input

df = df.apply(lambda col: pd.to_datetime(col, errors='ignore') 
              if col.dtypes == object 
              else col, 
              axis=0)

df.dtypes

# 0            object
# 1    datetime64[ns]
# 2             int64
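
One caveat: to my knowledge, errors='ignore' is deprecated in recent pandas releases (2.2 and later), which recommend catching the exception explicitly instead. Below is a minimal sketch of the same apply-based idea without that parameter. The helper name to_datetime_or_original is hypothetical (it is not part of pandas), and the sample frame simply mirrors the question's test data.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def to_datetime_or_original(col):
    """Return col parsed as datetimes, or col unchanged if parsing fails.

    Hypothetical helper for illustration; not a pandas function.
    """
    if not (is_object_dtype(col) or is_string_dtype(col)):
        return col  # leave numeric and other non-text columns alone
    try:
        return pd.to_datetime(col)
    except (ValueError, TypeError):
        return col  # not a date column, keep the original values


df = pd.DataFrame({'0': ['a', 'b', 'c'],
                   '1': ['2015-12-27', '2015-12-28', '2015-12-29'],
                   '2': [11, 12, 13]})

converted = df.apply(to_datetime_or_original, axis=0)
print(converted.dtypes)
```

As with errors='ignore', a column is converted only if every value parses; a column with even one unparseable string is left as is.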