pandas / dask dataframe: how to convert a column with to_datetime

Question

Asked by dleal

I am trying to convert one column of my dataframe to datetime. Following the discussion at https://github.com/dask/dask/issues/863, I tried the following code:

import pandas as pd
import dask.dataframe as dd
df['time'].map_partitions(pd.to_datetime, columns='time').compute()

But I am getting the following error message:

ValueError: Metadata inference failed, please provide `meta` keyword

What exactly should I put under meta? Should I pass a dictionary of all the columns in df, or only the 'time' column? And which type should I use? I have tried dtype and datetime64, but neither has worked so far.

Thank you, I appreciate your guidance.

Update

I will include here the new error messages:

1) Using Timestamp

df['trd_exctn_dt'].map_partitions(pd.Timestamp).compute()

TypeError: Cannot convert input to Timestamp

2) Using datetime and meta

meta = ('time', pd.Timestamp)
df['time'].map_partitions(pd.to_datetime, meta=meta).compute()

TypeError: to_datetime() got an unexpected keyword argument 'meta'

3) Just using date time: gets stuck at 2%

df['trd_exctn_dt'].map_partitions(pd.to_datetime).compute()
[                                        ] | 2% Completed |  2min 20.3s

Also, I would like to be able to specify the date format, as I would in pandas:

pd.to_datetime(df['time'], format='%m%d%Y')

Update 2

After updating to Dask 0.11, I no longer have problems with the meta keyword. Still, I can't get it past 2% on a 2GB dataframe.

df['trd_exctn_dt'].map_partitions(pd.to_datetime, meta=meta).compute()
[                                        ] | 2% Completed |  30min 45.7s

Update 3

It worked better this way:

def parse_dates(df):
    # Called once per partition; each partition is a plain pandas DataFrame.
    return pd.to_datetime(df['time'], format='%m/%d/%Y')

df.map_partitions(parse_dates, meta=meta)

I'm not sure whether this is the right approach.

Answer 1

Answered by MRocklin

Use `astype`

You can use the `astype` method to convert the dtype of a series to a NumPy dtype:

df.time.astype('M8[us]')

There is probably a way to specify a Pandas-style dtype as well (edits welcome).
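
As a rough sketch of the `astype` route, here is the same call on a plain pandas Series. The sample values are made up for illustration, and the sketch assumes the column holds ISO-formatted strings; the spelled-out string 'datetime64[ns]' is shown as one alternative way to name the dtype.

```python
import pandas as pd

# Hypothetical sample data: ISO-formatted date strings.
s = pd.Series(['2016-01-02', '2016-03-04'], name='time')

# NumPy-style dtype code, as in the answer above.
converted = s.astype('M8[us]')

# The spelled-out dtype string also requests a datetime dtype.
converted_ns = s.astype('datetime64[ns]')

# dtype.kind is 'M' for any datetime64 dtype, whatever its resolution.
print(converted.dtype.kind, converted_ns.dtype.kind)
```

The same `astype` call should carry over to a dask Series, since dask applies it partition by partition.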

Use map_partitions and meta

When using black-box methods like map_partitions, dask.dataframe needs to know the type and names of the output. There are a few ways to do this listed in the docstring for map_partitions.

You can supply an empty Pandas object with the right dtype and name:

meta = pd.Series([], name='time', dtype='datetime64[ns]')

Or you can provide a tuple of (name, dtype) for a Series, or a dict for a DataFrame:

meta = ('time', 'datetime64[ns]')

Then everything should be fine:

df.time.map_partitions(pd.to_datetime, meta=meta)

If you were calling map_partitions on df instead, then you would need to provide the dtypes for every output column. That isn't the case in your example, though.

Answer 2

Answered by Arundathi

Dask also comes with its own to_datetime, so this should work as well.

df['time'] = dd.to_datetime(df.time, unit='ns')

The values that unit accepts are the same as for pd.to_datetime in pandas; they are listed in the pandas documentation.

Answer 3

Answered by tmsss

I'm not sure if this is the right approach, but mapping the column worked for me:

df['time'] = df['time'].map(lambda x: pd.to_datetime(x, errors='coerce'))
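
For reference, this is roughly what `errors='coerce'` does, sketched on a plain pandas Series with made-up values: entries that cannot be parsed become NaT instead of raising. The sketch calls `pd.to_datetime` once on the whole Series rather than element by element, which is an illustrative choice and not part of the answer above.

```python
import pandas as pd

# Hypothetical sample data: one valid date and one unparseable entry.
s = pd.Series(['01/02/2016', 'not a date'], name='time')

# errors='coerce' turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(s, format='%m/%d/%Y', errors='coerce')

print(parsed.isna().tolist())  # [False, True]
```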

Answer 4

Answered by citynorman

This worked for me:

ddf["Date"] = ddf["Date"].map_partitions(pd.to_datetime, format='%d/%m/%Y', meta=('Date', 'datetime64[ns]'))

Answer 5

Answered by skibee

If the datetime is in a non-ISO format, then map_partitions yields better results:

import dask
import pandas as pd
from dask.distributed import Client
client = Client()

ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO=ddf['datetime'].astype(str).str.split(' ')
                  .apply(lambda x: x[1] + ' ' + x[0], meta=('object'))))

%%timeit
ddf.datetime = ddf.datetime.astype('M8[s]')
ddf.compute()

11.3 s ± 719 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO=ddf['datetime'].astype(str).str.split(' ')
                  .apply(lambda x: x[1] + ' ' + x[0], meta=('object'))))

%%timeit
ddf.datetime_nonISO = ddf.datetime_nonISO.map_partitions(
    pd.to_datetime, format='%H:%M:%S %Y-%m-%d', meta=('datetime64[s]'))
ddf.compute()

8.78 s ± 599 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO=ddf['datetime'].astype(str).str.split(' ')
                  .apply(lambda x: x[1] + ' ' + x[0], meta=('object'))))

%%timeit
ddf.datetime_nonISO = ddf.datetime_nonISO.astype('M8[s]')
ddf.compute()

1min 8s ± 3.65 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
