Python: can pandas automatically recognize dates?

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/17465045/

Time: 2020-08-19 08:17:25  Source: igfitidea

Can pandas automatically recognize dates?

Tags: python, date, types, dataframe, pandas

Asked by Roman

Today I was pleasantly surprised that, while reading data from a data file (for example), pandas is able to recognize the types of values:

import pandas
df = pandas.read_csv('test.dat', delimiter=r"\s+", names=['col1','col2','col3'])

For example it can be checked in this way:

for i, r in df.iterrows():
    print(type(r['col1']), type(r['col2']), type(r['col3']))

In particular, integers, floats and strings were recognized correctly. However, I have a column that has dates in the following format: 2013-6-4. These dates were recognized as strings (not as Python date objects). Is there a way to "teach" pandas to recognize dates?

Accepted answer by Rutger Kassies

You should add parse_dates=True or parse_dates=['column name'] when reading; that's usually enough to magically parse it. But there are always weird formats which need to be defined manually. In such a case you can also add a date parser function, which is the most flexible way possible.
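
As a minimal sketch of the simplest case, here is parse_dates with a column name, using dates in the format from the question. The inline CSV text and the column names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data; col3 holds dates like the ones in the question.
csv_text = "col1,col2,col3\n1,2.5,2013-6-4\n2,3.5,2013-6-5\n"

# Ask read_csv to parse col3 as dates instead of leaving it as strings.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["col3"])

print(df.dtypes)
print(df["col3"].iloc[0])
```

If the column still comes back as strings, the format is probably one pandas cannot guess, and an explicit format is the next thing to try.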

Suppose you have a column 'datetime' with your string, then:

from datetime import datetime  # pd.datetime is unavailable in recent pandas

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)

This way you can even combine multiple columns into a single datetime column. The following merges a 'date' and a 'time' column into a single 'datetime' column:

from datetime import datetime  # pd.datetime is unavailable in recent pandas

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)
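
A different way to get the same merged column, sketched here as an alternative rather than as the answer's own method, is to read the columns as plain strings and combine them afterwards with pd.to_datetime. The sample data and column names are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data with separate 'date' and 'time' columns.
csv_text = "date,time,value\n2013-06-04,12:30:00,1\n2013-06-05,08:15:30,2\n"

# Read everything as-is, then build the combined column in one vectorized call.
df = pd.read_csv(io.StringIO(csv_text))
df["datetime"] = pd.to_datetime(df["date"] + " " + df["time"],
                                format="%Y-%m-%d %H:%M:%S")

# Drop the source columns once they have been merged.
df = df.drop(columns=["date", "time"])
print(df)
```

This keeps the parsing step separate from the reading step, so it does not depend on how read_csv hands values to a parser function.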

You can find directives (i.e. the letters to be used for different formats) for strptime and strftime in the Python datetime documentation.

Answer by Joop

The pandas read_csv method is great for parsing dates. The complete documentation is at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

You can even have the different date parts in different columns and pass the parameter:

parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a
separate date column. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date
column. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'

The default sensing of dates works great, but it seems to be biased towards North American date formats. If you live elsewhere you might occasionally be caught out by the results. As far as I can remember, 1/6/2000 means 6 January in the USA, as opposed to 1 June where I live. It is smart enough to swing them around if dates like 23/6/2000 are used. It is probably safer to stay with YYYYMMDD variations of dates, though. Apologies to the pandas developers here, but I have not tested it with local dates recently.
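
A small sketch of that ambiguity, using pd.to_datetime and its dayfirst flag on the same strings mentioned above. Depending on the pandas version, the last call may emit a warning about the day-first format.

```python
import pandas as pd

# Ambiguous string: the default reading is month first.
month_first = pd.to_datetime("1/6/2000")

# Same string, read day first.
day_first = pd.to_datetime("1/6/2000", dayfirst=True)

# 23 cannot be a month, so the parser treats it as the day.
unambiguous = pd.to_datetime("23/6/2000")

print(month_first.date())   # 2000-01-06
print(day_first.date())     # 2000-06-01
print(unambiguous.date())   # 2000-06-23
```

read_csv accepts the same dayfirst flag, so day-first files can be handled at read time as well.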

You can use the date_parser parameter to pass a function to convert your format.

date_parser : function
Function to use for converting a sequence of string columns to an array of datetime
instances. The default uses dateutil.parser.parser to do the conversion.

Answer by Sean

Perhaps the pandas interface has changed since @Rutger answered, but in the version I'm using (0.15.2), the date_parser function receives a list of dates instead of a single value. In this case, his code should be updated like so:

from datetime import datetime  # pd.datetime is unavailable in recent pandas

dateparse = lambda dates: [datetime.strptime(d, '%Y-%m-%d %H:%M:%S') for d in dates]

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)
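
Since the parser may be handed a whole sequence of strings, one option is to let pd.to_datetime convert the sequence in a single call instead of looping. This is only a sketch that calls the function directly on a made-up list; it is not taken from the answer.

```python
import pandas as pd

def dateparse(dates):
    """Convert a sequence of date strings in one vectorized call."""
    return pd.to_datetime(dates, format="%Y-%m-%d %H:%M:%S")

# Hypothetical input, shaped like what a list-receiving parser would see.
parsed = dateparse(["2013-06-04 12:30:00", "2013-06-05 08:15:30"])
print(parsed)
```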

Answer by Gaurav

Yes - according to the pandas.read_csv documentation:

Note: A fast-path exists for iso8601-formatted dates.

So if your csv has a column named datetime and the dates look like 2013-01-01T01:01, for example, running this will make pandas (I'm on v0.19.2) pick up the date and time automatically:

df = pd.read_csv('test.csv', parse_dates=['datetime'])

Note that you need to explicitly pass parse_dates; it doesn't work without it.

Verify with:

df.dtypes

You should see that the datatype of the column is datetime64[ns].
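
To see the difference that parse_dates makes, here is a short sketch that reads the same ISO 8601 data twice and checks the column type programmatically. The inline CSV text stands in for test.csv and is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for test.csv with ISO 8601 timestamps.
csv_text = "datetime,value\n2013-01-01T01:01,1\n2013-01-02T02:02,2\n"

plain = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["datetime"])

print(pd.api.types.is_datetime64_any_dtype(plain["datetime"]))   # False: still strings
print(pd.api.types.is_datetime64_any_dtype(parsed["datetime"]))  # True
```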

Answer by Eugene Yarmash

You could use pandas.to_datetime() as recommended in the documentation for pandas.read_csv():

If a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv.

Demo:

>>> D = {'date': '2013-6-4'}
>>> df = pd.DataFrame(D, index=[0])
>>> df
       date
0  2013-6-4
>>> df.dtypes
date    object
dtype: object
>>> df['date'] = pd.to_datetime(df.date, format='%Y-%m-%d')
>>> df
        date
0 2013-06-04
>>> df.dtypes
date    datetime64[ns]
dtype: object
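
When a column contains a few values that cannot be parsed, pd.to_datetime can turn those into NaT instead of failing. A minimal sketch with made-up values, using the errors="coerce" option:

```python
import pandas as pd

# Hypothetical column with one bad value in the middle.
raw = pd.Series(["2013-06-04", "not a date", "2013-06-05"])

# Unparseable entries become NaT; the rest are converted normally.
dates = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(dates.isna().tolist())  # [False, True, False]
```

Counting the NaT values afterwards is a quick way to find out how many rows did not match the expected format.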

Answer by IamTheWalrus

When merging two columns into a single datetime column, the accepted answer generates an error (pandas version 0.20.3), since the columns are sent to the date_parser function separately.

The following works:

from datetime import datetime  # pd.datetime is unavailable in recent pandas

def dateparse(d, t):
    dt = d + " " + t
    return datetime.strptime(dt, '%d/%m/%Y %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

Answer by Mr_and_Mrs_D

If performance matters to you, make sure you time it:

import sys
import timeit
import pandas as pd

print('Python %s on %s' % (sys.version, sys.platform))
print('Pandas version %s' % pd.__version__)

repeat = 3
numbers = 100

def time(statement, _setup=None):
    print (min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers)))

print("Format %m/%d/%y")
setup = """import pandas as pd
import io

data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,07/29/15
x2,07/29/15
x3,07/29/15
x4,07/30/15
x5,07/29/15
x6,07/29/15
x7,07/29/15
y7,08/05/15
x8,08/05/15
z3,08/05/15
''' * 100)"""

time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%m/%d/%y")); data.seek(0)')

print("Format %Y-%m-%d %H:%M:%S")
setup = """import pandas as pd
import io

data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,2016-10-15 00:00:43
x2,2016-10-15 00:00:56
x3,2016-10-15 00:00:56
x4,2016-10-15 00:00:12
x5,2016-10-15 00:00:34
x6,2016-10-15 00:00:55
x7,2016-10-15 00:00:06
y7,2016-10-15 00:00:01
x8,2016-10-15 00:00:00
z3,2016-10-15 00:00:02
''' * 1000)"""

time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%Y-%m-%d %H:%M:%S")); data.seek(0)')

prints:

Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28) 
[Clang 6.0 (clang-600.0.57)] on darwin
Pandas version 0.23.4
Format %m/%d/%y
0.19123052499999993
8.20691274
8.143124389
1.2384357139999977
Format %Y-%m-%d %H:%M:%S
0.5238807110000039
0.9202787830000005
0.9832778819999959
12.002349824999996

So with an iso8601-formatted date (%Y-%m-%d %H:%M:%S is apparently an iso8601-formatted date; I guess the T can be dropped and replaced by a space) you should not specify infer_datetime_format (which apparently does not make a difference with more common formats either), and passing your own parser in just cripples performance. On the other hand, date_parser does make a difference with less standard date formats. Be sure to time before you optimize, as usual.
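
A smaller timing sketch in the same spirit, comparing one vectorized pd.to_datetime call with an explicit format against calling strptime on each element. The data is made up and no timings are quoted here, since they depend on the machine and the pandas version.

```python
import timeit
from datetime import datetime

import pandas as pd

# Hypothetical data in the non-ISO %m/%d/%y format from the benchmark above.
dates = pd.Series(["07/29/15", "07/30/15", "08/05/15"] * 1000)

def vectorized():
    # One call over the whole column with an explicit format.
    return pd.to_datetime(dates, format="%m/%d/%y")

def per_element():
    # Python-level strptime call for every single value.
    return dates.map(lambda s: datetime.strptime(s, "%m/%d/%y"))

print("vectorized :", timeit.timeit(vectorized, number=10))
print("per element:", timeit.timeit(per_element, number=10))
```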

Answer by kamran kausar

When loading a csv file that contains a date column, there are two approaches to make pandas recognize the date column:

  1. Pandas explicitly recognizes the format via the argument date_parser=mydateparser

  2. Pandas implicitly recognizes the format via the argument infer_datetime_format=True
Some of the date column data:

01/01/18
01/02/18

Here we don't know whether the first two digits are the month or the day. So in this case we have to use Method 1: explicitly pass the format

from datetime import datetime  # pd.datetime is unavailable in recent pandas

mydateparser = lambda x: datetime.strptime(x, "%m/%d/%y")
df = pd.read_csv(file_name, parse_dates=['date_col_name'],
                 date_parser=mydateparser)

Method 2: implicitly (automatically) recognize the format

df = pd.read_csv(file_name, parse_dates=['date_col_name'], infer_datetime_format=True)
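
To make the ambiguity concrete, here is a sketch that parses the two sample values above with both possible explicit formats via pd.to_datetime. Which reading is correct depends on the data source; the code only shows that the choice changes the result.

```python
import pandas as pd

raw = ["01/01/18", "01/02/18"]

# Same strings, two explicit interpretations.
month_first = pd.to_datetime(raw, format="%m/%d/%y")
day_first = pd.to_datetime(raw, format="%d/%m/%y")

print(month_first[1].date())  # 2018-01-02
print(day_first[1].date())    # 2018-02-01
```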