原文地址: http://stackoverflow.com/questions/32415452/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverflow
Pandas DataFrame.groupby including index
Asked by ???S???
I have a dataset taken from the Windows Event Log. The TimeGenerated column is set as the index. I'd like to get an aggregated view showing the number of events, by EventType (info/warn/err) and by the index value. I could use resample() to set the datetime resolution (day, business day, etc.).
Here's my DataFrame:
log.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 80372 entries, 2015-08-31 12:15:23 to 2015-05-11 04:08:07
Data columns (total 4 columns):
EventID 80372 non-null int64
SourceName 80372 non-null object
EventType 76878 non-null object
EventCategory 80372 non-null int64
dtypes: int64(2), object(2)
memory usage: 3.1+ MB
I can of course group by the EventType, but this drops my index:
log[['EventID', 'EventType']].groupby('EventType').count()
I would have to specify my existing index in the call to groupby(), but how can I reference the index? Or do I have to perform a reset_index() before the groupby() call? Or am I simply going about this all wrong, and is it painfully obvious that I'm a Pandas newbie? ;-)
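To make the index loss concrete, here is a minimal sketch on a made-up miniature log (the real DataFrame has ~80k rows; the tiny data below is invented purely for illustration):

```python
import pandas as pd

# Hypothetical miniature version of the question's event log, with
# TimeGenerated as the index (column names taken from the question):
idx = pd.DatetimeIndex(['2015-08-31 12:15:23', '2015-08-31 12:15:23',
                        '2015-08-31 12:16:00'], name='TimeGenerated')
log = pd.DataFrame({'EventID': [1001, 1002, 1003],
                    'EventType': ['info', 'warn', 'info']}, index=idx)

# Grouping by EventType alone collapses the DatetimeIndex: the result
# is indexed only by EventType, and TimeGenerated is gone.
by_type = log[['EventID', 'EventType']].groupby('EventType').count()
print(by_type)
```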
Version info:
- Python 3.4.2
- pandas 0.16.2
- numpy 1.9.2
Update
To clarify further, what I'd like to achieve is:
- A count of the EventIDs (the number of events)
- By EventType (in axis 1)
- By Timestamp (in axis 0)
Note that the Timestamp is not unique (in the raw DF), as multiple events can occur simultaneously.
One way I've been able to achieve what I wanted is by doing:
temp = log.reset_index()
temp.groupby(['TimeGenerated', 'EventType'])['EventID'].count().unstack().fillna(0)
In that case, my output is:
Which then allows me to resample the count further, e.g.:
temp.resample('MS', how='sum')
This works, but what I don't know is whether performing a reset_index() is necessary to achieve this grouping. Could I have done it in a better (read: more efficient) way?
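As one possible alternative, groupby() also accepts the index object itself as a key, mixed with column names, so the reset_index() round-trip can be skipped. A sketch on a made-up miniature log (invented data; the resample spelling below uses the modern `.resample(...).sum()` API rather than the `how='sum'` form from pandas 0.16):

```python
import pandas as pd

# Hypothetical miniature event log (invented data for illustration):
idx = pd.DatetimeIndex(['2015-08-31 12:15:23', '2015-08-31 12:15:23',
                        '2015-08-31 12:16:00'], name='TimeGenerated')
log = pd.DataFrame({'EventID': [1001, 1002, 1003],
                    'EventType': ['info', 'warn', 'info']}, index=idx)

# Passing the index object directly to groupby() avoids reset_index():
counts = (log.groupby([log.index, 'EventType'])['EventID']
             .count()
             .unstack()
             .fillna(0))

# The result keeps a DatetimeIndex, so it can be resampled directly:
monthly = counts.resample('MS').sum()
print(counts)
print(monthly)
```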
Answered by ???S???
What I was missing is that you can perform a groupby() on one or more levels of your index.
test = log.set_index('EventType', append=True)
test = test.groupby(level=[0, 1])['EventID'].count()
test.unstack().fillna(0)
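A runnable sketch of this approach on a made-up miniature log (the data is invented for illustration; note that `.count()` takes no argument):

```python
import pandas as pd

# Hypothetical miniature event log (invented data for illustration):
idx = pd.DatetimeIndex(['2015-08-31 12:15:23', '2015-08-31 12:15:23',
                        '2015-08-31 12:16:00'], name='TimeGenerated')
log = pd.DataFrame({'EventID': [1001, 1002, 1003],
                    'EventType': ['info', 'warn', 'info']}, index=idx)

# Append EventType as a second index level, then group on both levels:
test = log.set_index('EventType', append=True)
counts = test.groupby(level=[0, 1])['EventID'].count()
wide = counts.unstack().fillna(0)
print(wide)
```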
Alternatively, the suggestion by Brian Pendleton worked as well:
pd.get_dummies(log.EventType)
The difference with this last approach is that it doesn't work as well if you need to add another level in your column axis (e.g. by Hostname). But that wasn't part of the original question of course.
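A sketch of that suggestion on the same kind of made-up miniature log (invented data): get_dummies() produces one indicator column per EventType, still indexed by TimeGenerated, and summing within groups of identical timestamps reproduces the same counts.

```python
import pandas as pd

# Hypothetical miniature event log (invented data for illustration):
idx = pd.DatetimeIndex(['2015-08-31 12:15:23', '2015-08-31 12:15:23',
                        '2015-08-31 12:16:00'], name='TimeGenerated')
log = pd.DataFrame({'EventID': [1001, 1002, 1003],
                    'EventType': ['info', 'warn', 'info']}, index=idx)

# One indicator column per EventType, indexed by TimeGenerated;
# summing within duplicate timestamps gives the per-timestamp counts:
dummies = pd.get_dummies(log['EventType'])
per_ts = dummies.groupby(level=0).sum()
print(per_ts)
```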


