原文地址: http://stackoverflow.com/questions/32415452/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverflow
Pandas DataFrame.groupby including index
Asked by ???S???
I have a dataset taken from the Windows Event Log. The TimeGenerated column is set as the index. I'd like to get an aggregated view showing the number of events, by EventType (info/warn/err) and by the index value. I could use resample() to set the datetime resolution (day, business day, etc.).
Here's my DataFrame:
log.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 80372 entries, 2015-08-31 12:15:23 to 2015-05-11 04:08:07
Data columns (total 4 columns):
EventID 80372 non-null int64
SourceName 80372 non-null object
EventType 76878 non-null object
EventCategory 80372 non-null int64
dtypes: int64(2), object(2)
memory usage: 3.1+ MB
I can of course group by the EventType, but this drops my index:
log[['EventID', 'EventType']].groupby('EventType').count()
I would have to specify my existing index in the call to groupby(), but how can I reference the index? Or do I have to perform a reset_index() before the groupby() call? Or am I simply going about this all wrong, and is it painfully obvious that I'm a Pandas newbie? ;-)
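To make the index loss concrete, here is a minimal sketch on a made-up miniature log (the real DataFrame has ~80k rows; the tiny data below is invented purely for illustration):

```python
import pandas as pd

# Hypothetical miniature version of the question's event log, with
# TimeGenerated as the index (column names taken from the question):
idx = pd.DatetimeIndex(['2015-08-31 12:15:23', '2015-08-31 12:15:23',
                        '2015-08-31 12:16:00'], name='TimeGenerated')
log = pd.DataFrame({'EventID': [1001, 1002, 1003],
                    'EventType': ['info', 'warn', 'info']}, index=idx)

# Grouping by EventType alone collapses the DatetimeIndex: the result
# is indexed only by EventType, and TimeGenerated is gone.
by_type = log[['EventID', 'EventType']].groupby('EventType').count()
print(by_type)
```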
Version info:
- Python 3.4.2
- pandas 0.16.2
- numpy 1.9.2
Update
To clarify further, what I'd like to achieve is:
- A count of the EventIDs (the number of events)
- By EventType (in axis 1)
- By Timestamp (in axis 0)
Note that the Timestamp is not unique (in the raw DF), as multiple events can occur simultaneously.
One way I've been able to achieve what I wanted is by doing:
temp = log.reset_index()
temp.groupby(['TimeGenerated', 'EventType'])['EventID'].count().unstack().fillna(0)
In that case, my output is:
Which then allows me to resample the count further, e.g.:
temp.resample('MS', how='sum')
This works, but what I don't know is whether performing a reset_index() is necessary to achieve this grouping. Could I have done it in a better (read: more efficient) way?
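As one possible alternative, groupby() also accepts the index object itself as a key, mixed with column names, so the reset_index() round-trip can be skipped. A sketch on a made-up miniature log (invented data; the resample spelling below uses the modern `.resample(...).sum()` API rather than the `how='sum'` form from pandas 0.16):

```python
import pandas as pd

# Hypothetical miniature event log (invented data for illustration):
idx = pd.DatetimeIndex(['2015-08-31 12:15:23', '2015-08-31 12:15:23',
                        '2015-08-31 12:16:00'], name='TimeGenerated')
log = pd.DataFrame({'EventID': [1001, 1002, 1003],
                    'EventType': ['info', 'warn', 'info']}, index=idx)

# Passing the index object directly to groupby() avoids reset_index():
counts = (log.groupby([log.index, 'EventType'])['EventID']
             .count()
             .unstack()
             .fillna(0))

# The result keeps a DatetimeIndex, so it can be resampled directly:
monthly = counts.resample('MS').sum()
print(counts)
print(monthly)
```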
Answered by ???S???
What I was missing is that you can perform a groupby() on one or more levels of your index.
test = log.set_index('EventType', append=True)
test = test.groupby(level=[0, 1])['EventID'].count()
test.unstack().fillna(0)
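A runnable sketch of this approach on a made-up miniature log (the data is invented for illustration; note that `.count()` takes no argument):

```python
import pandas as pd

# Hypothetical miniature event log (invented data for illustration):
idx = pd.DatetimeIndex(['2015-08-31 12:15:23', '2015-08-31 12:15:23',
                        '2015-08-31 12:16:00'], name='TimeGenerated')
log = pd.DataFrame({'EventID': [1001, 1002, 1003],
                    'EventType': ['info', 'warn', 'info']}, index=idx)

# Append EventType as a second index level, then group on both levels:
test = log.set_index('EventType', append=True)
counts = test.groupby(level=[0, 1])['EventID'].count()
wide = counts.unstack().fillna(0)
print(wide)
```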
Alternatively, the suggestion by Brian Pendleton worked as well:
pd.get_dummies(log.EventType)
The difference with this last approach is that it doesn't work as well if you need to add another level in your column axis (e.g. by Hostname). But that wasn't part of the original question of course.
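A sketch of that suggestion on the same kind of made-up miniature log (invented data): get_dummies() produces one indicator column per EventType, still indexed by TimeGenerated, and summing within groups of identical timestamps reproduces the same counts.

```python
import pandas as pd

# Hypothetical miniature event log (invented data for illustration):
idx = pd.DatetimeIndex(['2015-08-31 12:15:23', '2015-08-31 12:15:23',
                        '2015-08-31 12:16:00'], name='TimeGenerated')
log = pd.DataFrame({'EventID': [1001, 1002, 1003],
                    'EventType': ['info', 'warn', 'info']}, index=idx)

# One indicator column per EventType, indexed by TimeGenerated;
# summing within duplicate timestamps gives the per-timestamp counts:
dummies = pd.get_dummies(log['EventType'])
per_ts = dummies.groupby(level=0).sum()
print(per_ts)
```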


