dask dataframe how to convert column to to_datetime
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/39584118/
Asked by dleal
I am trying to convert one column of my dataframe to datetime. Following the discussion here https://github.com/dask/dask/issues/863 I tried the following code:
import dask.dataframe as dd
df['time'].map_partitions(pd.to_datetime, columns='time').compute()
But I am getting the following error message:
ValueError: Metadata inference failed, please provide `meta` keyword
What exactly should I put under meta? Should I put a dictionary of ALL the columns in df, or only of the 'time' column? And what type should I put? I have tried dtype and datetime64, but none of them have worked so far.
Thank you, I appreciate your guidance.
Update
I will include the new error messages here:
1) Using Timestamp
df['trd_exctn_dt'].map_partitions(pd.Timestamp).compute()
TypeError: Cannot convert input to Timestamp
2) Using datetime and meta
meta = ('time', pd.Timestamp)
df['time'].map_partitions(pd.to_datetime,meta=meta).compute()
TypeError: to_datetime() got an unexpected keyword argument 'meta'
3) Just using datetime: gets stuck at 2%
In [14]: df['trd_exctn_dt'].map_partitions(pd.to_datetime).compute()
[ ] | 2% Completed | 2min 20.3s
Also, I would like to be able to specify the format of the date, as I would do in pandas:
pd.to_datetime(df['time'], format='%m%d%Y')
Update 2
After updating to Dask 0.11, I no longer have problems with the meta keyword. Still, I can't get it past 2% on a 2GB dataframe.
df['trd_exctn_dt'].map_partitions(pd.to_datetime, meta=meta).compute()
[ ] | 2% Completed | 30min 45.7s
Update 3
It worked better this way:
def parse_dates(df):
    return pd.to_datetime(df['time'], format='%m/%d/%Y')

df.map_partitions(parse_dates, meta=meta)
I'm not sure whether it's the right approach or not.
Answered by MRocklin
Use astype
You can use the astype method to convert the dtype of a series to a NumPy dtype:
df.time.astype('M8[us]')
There is probably a way to specify a Pandas style dtype as well (edits welcome)
Use map_partitions and meta
When using black-box methods like map_partitions, dask.dataframe needs to know the type and names of the output. There are a few ways to do this listed in the docstring for map_partitions.
You can supply an empty Pandas object with the right dtype and name
meta = pd.Series([], name='time', dtype=pd.Timestamp)
Or you can provide a tuple of (name, dtype) for a Series, or a dict for a DataFrame:
meta = ('time', pd.Timestamp)
Then everything should be fine
df.time.map_partitions(pd.to_datetime, meta=meta)
If you were calling map_partitions on df instead, then you would need to provide the dtypes for everything. That isn't the case in your example though.
Answered by tmsss
I'm not sure if this is the right approach, but mapping the column worked for me:
df['time'] = df['time'].map(lambda x: pd.to_datetime(x, errors='coerce'))
Answered by citynorman
This worked for me
ddf["Date"] = ddf["Date"].map_partitions(pd.to_datetime,format='%d/%m/%Y',meta = ('datetime64[ns]'))
Answered by skibee
If the datetime is in a non-ISO format, then map_partitions yields better results:
import dask
import pandas as pd
from dask.distributed import Client
client = Client()
ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO=ddf['datetime'].astype(str).str.split(' ')
                  .apply(lambda x: x[1] + ' ' + x[0], meta=('object'))))
%%timeit
ddf.datetime = ddf.datetime.astype('M8[s]')
ddf.compute()
11.3 s ± 719 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO=ddf['datetime'].astype(str).str.split(' ')
                  .apply(lambda x: x[1] + ' ' + x[0], meta=('object'))))
%%timeit
ddf.datetime_nonISO = (ddf.datetime_nonISO.map_partitions(pd.to_datetime
, format='%H:%M:%S %Y-%m-%d', meta=('datetime64[s]')))
ddf.compute()
8.78 s ± 599 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO=ddf['datetime'].astype(str).str.split(' ')
                  .apply(lambda x: x[1] + ' ' + x[0], meta=('object'))))
%%timeit
ddf.datetime_nonISO = ddf.datetime_nonISO.astype('M8[s]')
ddf.compute()
1min 8s ± 3.65 s per loop (mean ± std. dev. of 7 runs, 1 loop each)