pandas: how to automatically detect columns that contain datetime in a pandas DataFrame

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link the original, and attribute it to the original authors (not me): StackOverFlow, original at http://stackoverflow.com/questions/45653275/

Date: 2020-09-14 04:14:03  Source: igfitidea

How to automatically detect columns that contain datetime in a pandas dataframe

python pandas

Asked by Carlos Muñiz

I'm using Pandas' read_sql() function to read multiple SQL tables into DataFrames. This function requires a pre-specified list of column names to be read as datetime objects via the 'parse_dates' parameter, but it does not infer datetimes automatically from varchar columns on the server. Because of this, I get DataFrames in which all columns are of dtype Object.

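For reference, a minimal sketch of the parse_dates behavior described above, using an in-memory SQLite database (the table name 'events' and its contents are made up for illustration):

```python
import sqlite3
import pandas as pd

# Hypothetical table standing in for a server table whose datetime
# values live in a varchar/TEXT column.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE events (col1 TEXT, col2 TEXT)")
conn.execute("INSERT INTO events VALUES ('A', '2017-02-04 10:41:00')")
conn.commit()

# Without parse_dates, col2 comes back as dtype object (plain strings)
df_raw = pd.read_sql('SELECT * FROM events', conn)

# With parse_dates, the explicitly named columns become datetime64[ns]
df = pd.read_sql('SELECT * FROM events', conn, parse_dates=['col2'])
```

The point of the question is avoiding that explicit `parse_dates=['col2']` list.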
   col1    col2
   -----------------------------------
0  A       2017-02-04 10:41:00.0000000
1  B       2017-02-04 10:41:00.0000000
2  C       2017-02-04 10:41:00.0000000
3  D       2017-02-04 10:41:00.0000000
4  E       2017-02-03 06:13:00.0000000

Is there a built-in Pandas function to automatically infer which columns should be datetime64[ns], WITHOUT having to specify the column names?

I've tried:

df.apply(pd.to_datetime(df, infer_datetime_format=True), axis=1)

which results in an error:

to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
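(For context: this error arises because calling pd.to_datetime on an entire DataFrame, rather than on a single Series, makes pandas try to assemble one datetime per row out of columns literally named year, month, day, etc. A small sketch of the two behaviors:)

```python
import pandas as pd

# DataFrame input: pandas assembles datetimes from year/month/day columns
parts = pd.DataFrame({'year': [2017], 'month': [2], 'day': [4]})
assembled = pd.to_datetime(parts)  # one Timestamp per row

# A frame without those columns raises the "at least [year, month, day]"
# error quoted above; to_datetime must instead be applied per column:
df = pd.DataFrame({'col2': ['2017-02-04 10:41:00']})
converted = pd.to_datetime(df['col2'])
```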

I also tried:

pd.to_datetime(df.stack(),  errors='ignore', format='%Y%m%d% H%M%S%f').unstack()

and

pd.to_datetime(df.stack(),  errors='coerce', format='%Y%m%d% H%M%S%f').unstack()

But this does not work.

Any suggestions about how to infer datetime columns automatically after the DataFrame is constructed?

Answered by Bharath

There is currently no built-in to convert object columns to datetime automatically. One simple approach is a list comprehension with a regex pattern that matches the datetime varchar, i.e.

If you have a df (based on @Alexander's df)

df = pd.DataFrame( {'col1': ['A', 'B', 'C', 'D', 'E'],
  'col2': ['2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-03 14:13:00'],
  'col3': [0, 1, 2, 3, 4],
  'col4': ['2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-03 14:13:00']})

data = [pd.to_datetime(df[x])
        if df[x].astype(str).str.match(r'\d{4}-\d{2}-\d{2} \d{2}\:\d{2}\:\d{2}').all()
        else df[x]
        for x in df.columns]

df = pd.concat(data, axis=1, keys=[s.name for s in data])

or with the help of a mask, i.e.

mask = df.astype(str).apply(lambda x : x.str.match(r'\d{4}-\d{2}-\d{2} \d{2}\:\d{2}\:\d{2}').all())
df.loc[:,mask] = df.loc[:,mask].apply(pd.to_datetime)

df.dtypes

Output:

col1            object
col2    datetime64[ns]
col3             int64
col4    datetime64[ns]
dtype: object
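For reference, the mask approach above assembled into a self-contained sketch. Plain column reassignment is used instead of .loc assignment, so the dtype change also applies cleanly on recent pandas versions (a small deviation from the snippet above):

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'B', 'C'],
    'col2': ['2017-02-04 18:41:00', '2017-02-04 18:41:00', '2017-02-03 14:13:00'],
    'col3': [0, 1, 2],
})

# True for columns where every value looks like YYYY-MM-DD HH:MM:SS
pattern = r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}'
mask = df.astype(str).apply(lambda col: col.str.match(pattern).all())

# Convert only the matching columns
for col in df.columns[mask]:
    df[col] = pd.to_datetime(df[col])
```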

If you have mixed date formats then you can use r'(\d{2,4}-\d{2}-\d{2,4})+'. Eg:

ndf = pd.DataFrame({'col3': [0, 1, 2, 3, 4],
  'col4': ['2017-02-04 18:41:00',
       '2017-02-04 18:41:00',
       '2017-02-04 18:41:00',
       '2017-02-04 18:41:00',
       '2017-02-03 14:13:00'],
  'col5': ['2017-02-04',
       '2017-02-04',
       '17-02-2004 14:13:00',
       '17-02-2014',
       '2017-02-03']})

mask = ndf.astype(str).apply(lambda x : x.str.match(r'(\d{2,4}-\d{2}-\d{2,4})+').all())
ndf.loc[:,mask] = ndf.loc[:,mask].apply(pd.to_datetime)

Output :

   col3                col4                col5
0     0 2017-02-04 18:41:00 2017-02-04 00:00:00
1     1 2017-02-04 18:41:00 2017-02-04 00:00:00
2     2 2017-02-04 18:41:00 2004-02-17 14:13:00
3     3 2017-02-04 18:41:00 2014-02-17 00:00:00
4     4 2017-02-03 14:13:00 2017-02-03 00:00:00
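(One caveat with mixed formats: in the output above, '17-02-2004 14:13:00' is parsed day-first only because 17 cannot be a month. A genuinely ambiguous string such as '03-02-2017' is parsed month-first by default unless dayfirst=True is passed; a minimal sketch:)

```python
import pandas as pd

# '17-02-2014' can only be day-first: there is no month 17
a = pd.to_datetime('17-02-2014')

# '03-02-2017' is ambiguous: month-first by default, day-first on request
b = pd.to_datetime('03-02-2017')
c = pd.to_datetime('03-02-2017', dayfirst=True)
```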

Hope it helps

Answered by Alexander

You can identify which columns in your dataframe are of type object and then convert only those columns to datetime, using errors='coerce' so that values which cannot be converted become NaT instead of raising. Use combine_first to fall back to the original values wherever conversion produced NaT, so the dataframe is overwritten only with the timestamp values that did convert to datetimes.

df = pd.DataFrame(
     {'col1': ['A', 'B', 'C', 'D', 'E'],
      'col2': ['2017-02-04 18:41:00',
               '2017-02-04 18:41:00',
               '2017-02-04 18:41:00',
               '2017-02-04 18:41:00',
               '2017-02-03 14:13:00'],
      'col3': [0, 1, 2, 3, 4]})

object_cols = [col for col, col_type in df.dtypes.items() if col_type == 'object']

# Converted values take precedence; the original strings fill back in
# where conversion produced NaT (e.g. col1, which is not datetime-like).
df[object_cols] = df[object_cols].apply(
    pd.to_datetime, errors='coerce').combine_first(df[object_cols])
>>> df
  col1                col2  col3
0    A 2017-02-04 18:41:00     0
1    B 2017-02-04 18:41:00     1
2    C 2017-02-04 18:41:00     2
3    D 2017-02-04 18:41:00     3
4    E 2017-02-03 14:13:00     4

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 3 columns):
col1    5 non-null object
col2    5 non-null datetime64[ns]
col3    5 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 160.0+ bytes