Python 如何在 Pandas 中查询 MultiIndex 索引列值

Question

提问by Vyacheslav Shkolyar

Code example:

代码示例：

In [171]: A = np.array([1.1, 1.1, 3.3, 3.3, 5.5, 6.6])

In [172]: B = np.array([111, 222, 222, 333, 333, 777])

In [173]: C = randint(10, 99, 6)

In [174]: df = pd.DataFrame(zip(A, B, C), columns=['A', 'B', 'C'])

In [175]: df.set_index(['A', 'B'], inplace=True)

In [176]: df
Out[176]: 
          C
A   B      
1.1 111  20
    222  31
3.3 222  24
    333  65
5.5 333  22
6.6 777  74

Now, I want to retrieve A values:
Q1: in range [3.3, 6.6] - expected return value: [3.3, 5.5, 6.6] or [3.3, 3.3, 5.5, 6.6] in case last inclusive, and [3.3, 5.5] or [3.3, 3.3, 5.5] if not.
Q2: in range [2.0, 4.0] - expected return value: [3.3] or [3.3, 3.3]

现在，我想检索 A 值：
Q1：在 [3.3, 6.6] 范围内 - 预期返回值：[3.3, 5.5, 6.6] 或 [3.3, 3.3, 5.5, 6.6] 以防最后包括在内，以及 [3.3, 5.5 ] 或 [3.3, 3.3, 5.5] 如果不是。
Q2：在 [2.0, 4.0] 范围内 - 预期回报值：[3.3] 或 [3.3, 3.3]

Same for any other MultiIndexdimension, for example B values:
Q3: in range [111, 500] with repetitions, as number of data rows in range - expected return value: [111, 222, 222, 333, 333]

与任何其他MultiIndex维度相同，例如 B 值：
Q3：在 [111, 500] 范围内重复，作为范围内的数据行数 - 预期返回值：[111, 222, 222, 333, 333]

More formal:

更正式：

Let us assume T is a table with columns A, B and C. The table includes nrows. Table cells are numbers, for example A double, B and C integers. Let's create a DataFrameof table T, let us name it DF. Let's set columns A and B indexes of DF (without duplication, i.e. no separate columns A and B as indexes, and separate as data), i.e. A and B in this case MultiIndex.

让我们假设 T 是一个包含 A、B 和 C 列的表。该表包括n行。表格单元格是数字，例如 A double、B 和 C 整数。让我们创建一个表 T的DataFrame，让我们将其命名为 DF。让我们设置 DF 的 A 和 B 列索引（没有重复，即没有单独的 A 和 B 列作为索引，并作为数据分开），即 A 和 B 在这种情况下MultiIndex。

Questions:

问题：

How to write a query on the index, for example, to query the index A (or B), say in the labels interval [120.0, 540.0]? Labels 120.0 and 540.0 exist. I must clarify that I am interested only in the list of indices as a response to the query!
How to the same, but in case of the labels 120.0 and 540.0 do not exist, but there are labels by value lower than 120, higher than 120 and less than 540, or higher than 540?
In case the answer for Q1 and Q2 was unique index values, now the same, but with repetitions, as number of data rows in index range.

如何编写对索引的查询，例如，在标签区间 [120.0, 540.0] 中查询索引 A（或 B）？存在标签 120.0 和 540.0。我必须澄清，我只对作为查询响应的索引列表感兴趣！
如何相同，但如果标签120.0和540.0不存在，但有低于120、高于120且低于540或高于540的值的标签？
如果 Q1 和 Q2 的答案是唯一索引值，现在相同，但有重复，作为索引范围内的数据行数。

I know the answers to the above questions in the case of columns which are not indexes, but in the indexes case, after a long research in the web and experimentation with the functionality of pandas, I did not succeed. The only method (without additional programming) I see now is to have a duplicate of A and B as data columns in addition to index.

对于不是索引的列，我知道上述问题的答案，但是在索引的情况下，经过长时间的网络研究和对pandas功能的实验，我没有成功。我现在看到的唯一方法（无需额外编程）是将 A 和 B 的副本作为除索引之外的数据列。

Answer 1

采纳答案by Vyacheslav Shkolyar

To query the dfby the MultiIndexvalues, for example where (A > 1.7) and (B < 666):

要通过MultiIndex值查询df，例如 where (A > 1.7) 和 (B < 666)：

In [536]: result_df = df.loc[(df.index.get_level_values('A') > 1.7) & (df.index.get_level_values('B') < 666)]

In [537]: result_df
Out[537]: 
          C
A   B      
3.3 222  43
    333  59
5.5 333  56

Hence, to get for example the 'A'index values, if still required:

因此，如果仍然需要，例如获取“A”索引值：

In [538]: result_df.index.get_level_values('A')
Out[538]: Index([3.3, 3.3, 5.5], dtype=object)

The problem is, that in large data frames the performance of by indexselection worse by 10% than the sorted regular rows selection. And in repetitive work, looping, the delay accumulated. See example:

问题是，在大型数据帧中，按索引选择的性能比已排序的常规行选择差 10%。而在重复性工作中，循环往复，延迟不断累积。见示例：

In [558]: df = store.select(STORE_EXTENT_BURSTS_DF_KEY)

In [559]: len(df)
Out[559]: 12857

In [560]: df.sort(inplace=True)

In [561]: df_without_index = df.reset_index()

In [562]: %timeit df.loc[(df.index.get_level_values('END_TIME') > 358200) & (df.index.get_level_values('START_TIME') < 361680)]
1000 loops, best of 3: 562 μs per loop

In [563]: %timeit df_without_index[(df_without_index.END_TIME > 358200) & (df_without_index.START_TIME < 361680)]
1000 loops, best of 3: 507 μs per loop

Answer 2

回答by Jeff

With a 'float' like index you always want to use it as a column rather than a direct indexing action. These will all work whether the endpoints exist or not.

对于像“浮动”这样的索引，您总是希望将其用作列而不是直接索引操作。无论端点是否存在，这些都将起作用。

In [11]: df
Out[11]: 
          C
A   B      
1.1 111  81
    222  45
3.3 222  98
    333  13
5.5 333  89
6.6 777  98

In [12]: x = df.reset_index()

Q1

第一季度

In [13]: x.loc[(x.A>=3.3)&(x.A<=6.6)]
Out[13]: 
     A    B   C
2  3.3  222  98
3  3.3  333  13
4  5.5  333  89
5  6.6  777  98

Q2

In [14]: x.loc[(x.A>=2.0)&(x.A<=4.0)]
Out[14]: 
     A    B   C
2  3.3  222  98
3  3.3  333  13

Q3

In [15]: x.loc[(x.B>=111.0)&(x.B<=500.0)]
Out[15]: 
     A    B   C
0  1.1  111  81
1  1.1  222  45
2  3.3  222  98
3  3.3  333  13
4  5.5  333  89

If you want the indices back, just set them. This is a cheap operation.

如果你想要索引回来，只需设置它们。这是一个廉价的操作。

In [16]: x.loc[(x.B>=111.0)&(x.B<=500.0)].set_index(['A','B'])
Out[16]: 
          C
A   B      
1.1 111  81
    222  45
3.3 222  98
    333  13
5.5 333  89

If you REALLY want the actual index values

如果你真的想要实际的索引值

In [5]: x.loc[(x.B>=111.0)&(x.B<=500.0)].set_index(['A','B']).index
Out[5]: 
MultiIndex
[(1.1, 111), (1.1, 222), (3.3, 222), (3.3, 333), (5.5, 333)]

Answer 3

回答by YaOzI

For better readability, we can simply use the query()Method, to avoid the lengthy df.index.get_level_values()and reset_index/set_indexto and fro.

为了更好的可读性，我们可以简单地使用的query()方法，以避免冗长df.index.get_level_values()和reset_index/set_index来回。

Here is the target DataFrame:

这是目标DataFrame：

In [12]: df                                                                    
Out[12]:                                                                       
          C                                                                    
A   B                                                                          
1.1 111  68                                                                    
    222  40                                                                    
3.3 222  20                                                                    
    333  11                                                                    
5.5 333  80                                                                    
6.6 777  51

Answer for Q1(Ain range [3.3, 6.6]):

Q1 的答案（A在范围内[3.3, 6.6]）：

In [13]: df.query('3.3 <= A <= 6.6') # for closed interval                       
Out[13]:                                                                       
          C                                                                    
A   B                                                                          
3.3 222  20                                                                    
    333  11                                                                    
5.5 333  80                                                                    
6.6 777  51                                                                    

In [14]: df.query('3.3 < A < 6.6') # for open interval                         
Out[14]:                                                                       
          C                                                                    
A   B                                                                          
5.5 333  80

and of course one can play around with <, <=, >, >=for any kind of inclusion.

当然，你可以玩弄<, <=, >, >=任何类型的包容。

Similarly, answer for Q2(Ain range [2.0, 4.0]):

同样，Q2 的答案（A在范围内[2.0, 4.0]）：

In [15]: df.query('2.0 <= A <= 4.0')                                        
Out[15]:                                                                    
          C                                                                 
A   B                                                                       
3.3 222  20                                                                 
    333  11

Answer for Q3(Bin range [111, 500]):

Q3 的答案（B在范围内[111, 500]）：

In [16]: df.query('111 <= B <= 500')                                        
Out[16]:                                                                    
          C                                                                 
A   B                                                                       
1.1 111  68                                                                 
    222  40                                                                 
3.3 222  20                                                                 
    333  11                                                                 
5.5 333  80

And moreover, you can COMBINEthe query for col Aand Bvery naturally!

而且，你可以联合收割机的查询栏A和B非常自然！

In [17]: df.query('0 < A < 4 and 150 < B < 400')                            
Out[17]:                                                                    
          C                                                                 
A   B                                                                       
1.1 222  40                                                                 
3.3 222  20                                                                 
    333  11

Python 如何在 Pandas 中查询 MultiIndex 索引列值

提问by Vyacheslav Shkolyar

采纳答案by Vyacheslav Shkolyar

回答by Jeff

回答by YaOzI

相关推荐

最近更新

标签

Python 如何在 Pandas 中查询 MultiIndex 索引列值

提问by Vyacheslav Shkolyar

采纳答案by Vyacheslav Shkolyar

回答by Jeff

回答by YaOzI

相关推荐

Python 2.7：无法导入 matplotlib.pyplot

使用python进行梯度计算

Python 使用 seaborn 的简单线图

Python - 两个字符串之间的区别

相关推荐

最近更新

标签