Pandas: Number of unique days in a timestamp Series

Note: This page is an English rendering of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not the translator). Original question: http://stackoverflow.com/questions/33638919/



Tags: python, datetime, pandas, time-series

Asked by marillion

I have a Pandas DataFrame with nearly 3,000,000 rows. One of the columns is called TIMESTAMP and is of the datetime64 type. The timestamp format is given below:

2015-03-31 22:56:45.510

My goal is to calculate the number of days on which data were collected. My initial approach was simple:

(df.TIMESTAMP.max() - df.TIMESTAMP.min()).days
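
A minimal sketch, using hypothetical toy data rather than the original 3,000,000 rows, of why the elapsed span can differ from the number of days with data (the concern raised in the next paragraph):

import pandas as pd

# Toy series: samples on only two calendar days, nine days apart
ts = pd.Series(pd.to_datetime([
    "2015-03-01 08:00:00", "2015-03-01 18:30:00", "2015-03-10 09:15:00",
]))

print((ts.max() - ts.min()).days)   # 9  -> elapsed days between first and last sample
print(ts.dt.normalize().nunique())  # 2  -> days on which data were actually collected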

However, it occurred to me that this may not always be correct, since there is no guarantee that data were collected every day. Instead, I tried counting the unique days in the timestamp series using map and apply, and both take a considerable amount of time for 3,000,000 rows:

%timeit len(df['TIMESTAMP'].map(lambda t: t.date()).unique())
1 loops, best of 3: 41.3 s per loop

%timeit len(df['TIMESTAMP'].apply(lambda t: t.date()).unique())
1 loops, best of 3: 42.3 s per loop

Is there a way to speed up this computation, or an entirely different but better approach?


Thanks!


Answered by Andy Hayden

To get the unique dates you should first normalize (to set each timestamp to midnight of its day; note this is fast), and then use unique:

In [31]: df["Time"].dt.normalize().unique()
Out[31]:
array(['2014-12-31T16:00:00.000000000-0800',
       '2015-01-01T16:00:00.000000000-0800',
       '2015-01-02T16:00:00.000000000-0800',
       '2015-01-04T16:00:00.000000000-0800',
       '2015-01-05T16:00:00.000000000-0800'], dtype='datetime64[ns]')
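
If only the count of distinct days is needed, as in the question, the same normalize idea pairs with nunique (or with len of the array above). A small self-contained sketch, using an illustrative frame and the question's column name rather than the answer's:

import pandas as pd

# Illustrative stand-in for the 3,000,000-row frame
df = pd.DataFrame({"TIMESTAMP": pd.to_datetime([
    "2015-03-31 22:56:45.510", "2015-03-31 23:10:00.000", "2015-04-02 01:00:00.000",
])})

# Number of unique calendar days in the column
print(df["TIMESTAMP"].dt.normalize().nunique())  # 2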


Original answer (I misread the question):

To get the counts you could use normalize and then value_counts:

In [11]: df
Out[11]:
        Time
0 2015-01-01
1 2015-01-02
2 2015-01-03
3 2015-01-03
4 2015-01-05
5 2015-01-06

In [12]: df['Time'].dt.normalize().value_counts()
Out[12]:
2015-01-03    2
2015-01-06    1
2015-01-02    1
2015-01-05    1
2015-01-01    1
Name: Time, dtype: int64
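
The length of that value_counts result is itself the number of unique days. A short sketch, recreating the toy frame above (so the frame construction is an assumption):

import pandas as pd

df = pd.DataFrame({"Time": pd.to_datetime([
    "2015-01-01", "2015-01-02", "2015-01-03",
    "2015-01-03", "2015-01-05", "2015-01-06",
])})

print(len(df["Time"].dt.normalize().value_counts()))  # 5 unique days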

but perhaps the cleaner option is to resample (though I'm not sure if this is less efficient):


In [21]: pd.Series(1, df['Time']).resample("D", how="sum")
Out[21]:
Time
2015-01-01     1
2015-01-02     1
2015-01-03     2
2015-01-04   NaN
2015-01-05     1
2015-01-06     1
Freq: D, dtype: float64
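
Note that the how= keyword used above comes from the pandas of 2015 and has since been removed; in current versions the same idea is written as a method call on the resampler. A sketch under that assumption (empty days may come back as 0 rather than NaN, depending on the version):

import pandas as pd

df = pd.DataFrame({"Time": pd.to_datetime([
    "2015-01-01", "2015-01-02", "2015-01-03",
    "2015-01-03", "2015-01-05", "2015-01-06",
])})

# Modern spelling of resample("D", how="sum")
daily = pd.Series(1, index=df["Time"]).resample("D").sum()
print(daily)

# Days with at least one observation
print((daily > 0).sum())  # 5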

Answered by reptilicus

If your index is a DatetimeIndex, I think you can do something like this:

print(df.groupby(df.index.date).ngroups)  # number of distinct dates in the index
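
A minimal, self-contained sketch of this approach, with an illustrative DatetimeIndex (the data and column name are assumptions, not taken from the question):

import pandas as pd

# Hypothetical frame indexed by timestamps
idx = pd.to_datetime([
    "2015-03-30 08:00:00", "2015-03-30 22:56:45.510", "2015-03-31 01:30:00",
])
df = pd.DataFrame({"value": [10, 20, 30]}, index=idx)

grouped = df.groupby(df.index.date)
print(grouped.size())   # rows per calendar day
print(grouped.ngroups)  # number of unique days -> 2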