Pandas: Number of unique days in a timestamp Series

Note: This page is an English rendering of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not the translator). Original question: http://stackoverflow.com/questions/33638919/



Tags: python, datetime, pandas, time-series

Asked by marillion

I have a Pandas DataFrame with nearly 3,000,000 rows. One of the columns is called TIMESTAMP and is of the datetime64 type. The timestamp format is given below:

2015-03-31 22:56:45.510

My goal is to calculate the number of days on which data were collected. My initial approach was simple:

(df.TIMESTAMP.max() - df.TIMESTAMP.min()).days
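
A minimal sketch, using hypothetical toy data rather than the original 3,000,000 rows, of why the elapsed span can differ from the number of days with data (the concern raised in the next paragraph):

import pandas as pd

# Toy series: samples on only two calendar days, nine days apart
ts = pd.Series(pd.to_datetime([
    "2015-03-01 08:00:00", "2015-03-01 18:30:00", "2015-03-10 09:15:00",
]))

print((ts.max() - ts.min()).days)   # 9  -> elapsed days between first and last sample
print(ts.dt.normalize().nunique())  # 2  -> days on which data were actually collected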

However, it occurred to me that this may not always be correct, since there is no guarantee that data were collected every day. Instead, I tried counting the unique days in the timestamp series using map and apply, and both take a considerable amount of time for 3,000,000 rows:

%timeit len(df['TIMESTAMP'].map(lambda t: t.date()).unique())
1 loops, best of 3: 41.3 s per loop

%timeit len(df['TIMESTAMP'].apply(lambda t: t.date()).unique())
1 loops, best of 3: 42.3 s per loop

Is there a way to speed up this computation, or an entirely different but better approach?


Thanks!


Answered by Andy Hayden

To get the unique dates you should first normalize (to set each timestamp to midnight of its day; note this is fast), and then use unique:

In [31]: df["Time"].dt.normalize().unique()
Out[31]:
array(['2014-12-31T16:00:00.000000000-0800',
       '2015-01-01T16:00:00.000000000-0800',
       '2015-01-02T16:00:00.000000000-0800',
       '2015-01-04T16:00:00.000000000-0800',
       '2015-01-05T16:00:00.000000000-0800'], dtype='datetime64[ns]')
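
If only the count of distinct days is needed, as in the question, the same normalize idea pairs with nunique (or with len of the array above). A small self-contained sketch, using an illustrative frame and the question's column name rather than the answer's:

import pandas as pd

# Illustrative stand-in for the 3,000,000-row frame
df = pd.DataFrame({"TIMESTAMP": pd.to_datetime([
    "2015-03-31 22:56:45.510", "2015-03-31 23:10:00.000", "2015-04-02 01:00:00.000",
])})

# Number of unique calendar days in the column
print(df["TIMESTAMP"].dt.normalize().nunique())  # 2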


Original answer (I misread the question):

To get the counts you could use normalize and then value_counts:

In [11]: df
Out[11]:
        Time
0 2015-01-01
1 2015-01-02
2 2015-01-03
3 2015-01-03
4 2015-01-05
5 2015-01-06

In [12]: df['Time'].dt.normalize().value_counts()
Out[12]:
2015-01-03    2
2015-01-06    1
2015-01-02    1
2015-01-05    1
2015-01-01    1
Name: Time, dtype: int64
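
The length of that value_counts result is itself the number of unique days. A short sketch, recreating the toy frame above (so the frame construction is an assumption):

import pandas as pd

df = pd.DataFrame({"Time": pd.to_datetime([
    "2015-01-01", "2015-01-02", "2015-01-03",
    "2015-01-03", "2015-01-05", "2015-01-06",
])})

print(len(df["Time"].dt.normalize().value_counts()))  # 5 unique days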

but perhaps the cleaner option is to resample (though I'm not sure if this is less efficient):


In [21]: pd.Series(1, df['Time']).resample("D", how="sum")
Out[21]:
Time
2015-01-01     1
2015-01-02     1
2015-01-03     2
2015-01-04   NaN
2015-01-05     1
2015-01-06     1
Freq: D, dtype: float64
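
Note that the how= keyword used above comes from the pandas of 2015 and has since been removed; in current versions the same idea is written as a method call on the resampler. A sketch under that assumption (empty days may come back as 0 rather than NaN, depending on the version):

import pandas as pd

df = pd.DataFrame({"Time": pd.to_datetime([
    "2015-01-01", "2015-01-02", "2015-01-03",
    "2015-01-03", "2015-01-05", "2015-01-06",
])})

# Modern spelling of resample("D", how="sum")
daily = pd.Series(1, index=df["Time"]).resample("D").sum()
print(daily)

# Days with at least one observation
print((daily > 0).sum())  # 5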

Answered by reptilicus

If your index is a DatetimeIndex, I think you can do something like this:

print(df.groupby(df.index.date).ngroups)  # number of distinct dates in the index
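
A minimal, self-contained sketch of this approach, with an illustrative DatetimeIndex (the data and column name are assumptions, not taken from the question):

import pandas as pd

# Hypothetical frame indexed by timestamps
idx = pd.to_datetime([
    "2015-03-30 08:00:00", "2015-03-30 22:56:45.510", "2015-03-31 01:30:00",
])
df = pd.DataFrame({"value": [10, 20, 30]}, index=idx)

grouped = df.groupby(df.index.date)
print(grouped.size())   # rows per calendar day
print(grouped.ngroups)  # number of unique days -> 2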