Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/45235992/

Time: 2020-09-14 04:03:42  Source: igfitidea

What are levels in a pandas DataFrame?

Tags: python, pandas, dataframe, multi-index

Asked by Raf

I've been reading through the documentation, and many explanations and examples use "levels" as something taken for granted. IMHO the docs lack a fundamental explanation of the data structure and its definitions.

What are levels in a DataFrame? What are levels in a MultiIndex index?

Answered by Andrzej Gis

I stumbled across this question while analyzing the answer to my own question, but I didn't find John's answer satisfying enough. After a few experiments, though, I think I understood the levels and decided to share:

Short answer:

Levels are parts of the index or of the columns.

Long answer:

I think this multi-column DataFrame.groupby example illustrates the index levels quite nicely.

Let's say we have report data on the time logged against issues:

import pandas as pd

# Each row is one time-log entry: the issue, the time spent, and the user.
report = pd.DataFrame([
        [1, 10, 'John'],
        [1, 20, 'John'],
        [1, 30, 'Tom'],
        [1, 10, 'Bob'],
        [2, 25, 'John'],
        [2, 15, 'Bob']], columns=['IssueKey', 'TimeSpent', 'User'])

   IssueKey  TimeSpent  User
0         1         10  John
1         1         20  John
2         1         30   Tom
3         1         10   Bob
4         2         25  John
5         2         15   Bob

The index here has only 1 level (there is only one index value identifying every row). The index is artificial (a running number) and consists of values from 0 to 5.
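
The single level can be checked directly. The following is a minimal sketch, assuming pandas is imported as pd and the report frame is rebuilt exactly as above:

```python
import pandas as pd

report = pd.DataFrame(
    [[1, 10, 'John'], [1, 20, 'John'], [1, 30, 'Tom'],
     [1, 10, 'Bob'], [2, 25, 'John'], [2, 15, 'Bob']],
    columns=['IssueKey', 'TimeSpent', 'User'])

# The default index is a running number with a single level.
print(report.index.nlevels)    # 1
print(report.columns.nlevels)  # 1
print(list(report.index))      # [0, 1, 2, 3, 4, 5]
```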

Say we want to merge (sum) all logs created by the same user on the same issue (to get the total time spent on the issue by that user):

time_logged_by_user = report.groupby(['IssueKey', 'User']).TimeSpent.sum()

IssueKey  User
1         Bob     10
          John    30
          Tom     30
2         Bob     15
          John    25

Now our data index has 2 levels, IssueKey and User (one for each column we grouped by), as multiple users logged time to the same issue. The levels are parts of the index: only together can they identify a row in a DataFrame / Series.
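
To see the two levels from code rather than from the printed output, one option is to inspect the index of the grouped result. This is a small sketch that rebuilds the same data; the variable name index is only illustrative:

```python
import pandas as pd

report = pd.DataFrame(
    [[1, 10, 'John'], [1, 20, 'John'], [1, 30, 'Tom'],
     [1, 10, 'Bob'], [2, 25, 'John'], [2, 15, 'Bob']],
    columns=['IssueKey', 'TimeSpent', 'User'])
time_logged_by_user = report.groupby(['IssueKey', 'User']).TimeSpent.sum()

index = time_logged_by_user.index
print(type(index).__name__)  # MultiIndex
print(index.nlevels)         # 2
print(list(index.names))     # ['IssueKey', 'User']

# The values of one level, taken row by row.
print(index.get_level_values('User').tolist())
# ['Bob', 'John', 'Tom', 'Bob', 'John']
```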

Because the levels together form the index, each index entry is a tuple. This can be nicely observed in the Spyder Variable explorer.
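
The same tuples can also be inspected without Spyder. A minimal sketch, again rebuilding the data so that it runs on its own:

```python
import pandas as pd

report = pd.DataFrame(
    [[1, 10, 'John'], [1, 20, 'John'], [1, 30, 'Tom'],
     [1, 10, 'Bob'], [2, 25, 'John'], [2, 15, 'Bob']],
    columns=['IssueKey', 'TimeSpent', 'User'])
time_logged_by_user = report.groupby(['IssueKey', 'User']).TimeSpent.sum()

# Every index entry is an (IssueKey, User) tuple.
index_entries = time_logged_by_user.index.tolist()
print(index_entries)

# A full tuple identifies exactly one row.
print(time_logged_by_user.loc[(1, 'John')])  # 30

# Only the first level: all users who logged time on issue 1.
issue_1 = time_logged_by_user.loc[1]
print(issue_1)
```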

Having levels lets us aggregate values within groups defined by an index part (level) of our choice. E.g. if we want to assign to every row the max time spent on its issue by any user, we can:

max_time_logged_to_an_issue = time_logged_by_user.groupby(level='IssueKey').transform('max')

IssueKey  User
1         Bob     30
          John    30
          Tom     30
2         Bob     25
          John    25

Now the first 3 rows have the value 30, as they correspond to issue 1 (the User level was ignored in the code above). The same goes for issue 2.
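
As a further sketch of grouping by a level: a level can be referred to by position as well as by name, and grouping on the other level (User) ignores IssueKey instead. The names by_position and per_user are only illustrative:

```python
import pandas as pd

report = pd.DataFrame(
    [[1, 10, 'John'], [1, 20, 'John'], [1, 30, 'Tom'],
     [1, 10, 'Bob'], [2, 25, 'John'], [2, 15, 'Bob']],
    columns=['IssueKey', 'TimeSpent', 'User'])
time_logged_by_user = report.groupby(['IssueKey', 'User']).TimeSpent.sum()

# level=0 is the first level, i.e. the same as level='IssueKey'.
by_name = time_logged_by_user.groupby(level='IssueKey').transform('max')
by_position = time_logged_by_user.groupby(level=0).transform('max')
print(by_name.equals(by_position))  # True

# Grouping by the User level sums each user's time across all issues.
per_user = time_logged_by_user.groupby(level='User').sum()
print(per_user.to_dict())  # {'Bob': 25, 'John': 55, 'Tom': 30}
```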

This can be useful e.g. if we want to find out which users spent the most time on every issue:

issue_owners = time_logged_by_user[time_logged_by_user == max_time_logged_to_an_issue]

IssueKey  User
1         John    30
          Tom     30
2         John    25
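
If the levels are needed as ordinary columns again, reset_index moves them out of the index. This step is not part of the original answer; it is a hedged sketch that rebuilds the result above and uses the illustrative name owners_table:

```python
import pandas as pd

report = pd.DataFrame(
    [[1, 10, 'John'], [1, 20, 'John'], [1, 30, 'Tom'],
     [1, 10, 'Bob'], [2, 25, 'John'], [2, 15, 'Bob']],
    columns=['IssueKey', 'TimeSpent', 'User'])
time_logged_by_user = report.groupby(['IssueKey', 'User']).TimeSpent.sum()
max_time_logged_to_an_issue = (
    time_logged_by_user.groupby(level='IssueKey').transform('max'))
issue_owners = time_logged_by_user[
    time_logged_by_user == max_time_logged_to_an_issue]

# Both index levels become regular columns; the index is a running number again.
owners_table = issue_owners.reset_index()
print(list(owners_table.columns))  # ['IssueKey', 'User', 'TimeSpent']
print(owners_table.index.nlevels)  # 1
```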

Answered by John Zwinck

Usually a DataFrame has a 1D index and columns:

    x y
0   4 1
1   3 9

Here the index is [0, 1] and the columns are ['x', 'y']. But you can have multiple levels in either the index or the columns:

     x  y
     a  b  c
0 7  4  1  3
  8  3  9  5

Here the columns' first level is ['x', 'y', 'y'] and the second level is ['a', 'b', 'c']. The index's first level is [0, 0] and the second level is [7, 8].

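
A frame like the one described can be built with pd.MultiIndex.from_arrays. This is one possible construction, not necessarily how the original example was produced:

```python
import pandas as pd

# Two levels for the columns and two levels for the index.
columns = pd.MultiIndex.from_arrays([['x', 'y', 'y'], ['a', 'b', 'c']])
index = pd.MultiIndex.from_arrays([[0, 0], [7, 8]])
df = pd.DataFrame([[4, 1, 3], [3, 9, 5]], index=index, columns=columns)

print(df.columns.get_level_values(0).tolist())  # ['x', 'y', 'y']
print(df.columns.get_level_values(1).tolist())  # ['a', 'b', 'c']
print(df.index.get_level_values(0).tolist())    # [0, 0]
print(df.index.get_level_values(1).tolist())    # [7, 8]

# Selecting on the first column level leaves the second level ('b', 'c').
y_part = df['y']
print(list(y_part.columns))  # ['b', 'c']
```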