Python 如何在 Pandas 中查询 MultiIndex 索引列值
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/17921010/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to query MultiIndex index columns values in pandas
提问by Vyacheslav Shkolyar
Code example:
代码示例:
In [171]: A = np.array([1.1, 1.1, 3.3, 3.3, 5.5, 6.6])
In [172]: B = np.array([111, 222, 222, 333, 333, 777])
In [173]: C = randint(10, 99, 6)
In [174]: df = pd.DataFrame(zip(A, B, C), columns=['A', 'B', 'C'])
In [175]: df.set_index(['A', 'B'], inplace=True)
In [176]: df
Out[176]:
C
A B
1.1 111 20
222 31
3.3 222 24
333 65
5.5 333 22
6.6 777 74
Now, I want to retrieve A values:
Q1: in range [3.3, 6.6] - expected return value: [3.3, 5.5, 6.6] or [3.3, 3.3, 5.5, 6.6] in case last inclusive, and [3.3, 5.5] or [3.3, 3.3, 5.5] if not.
Q2: in range [2.0, 4.0] - expected return value: [3.3] or [3.3, 3.3]
现在,我想检索 A 值:
Q1:在 [3.3, 6.6] 范围内 - 预期返回值:[3.3, 5.5, 6.6] 或 [3.3, 3.3, 5.5, 6.6] 以防最后包括在内,以及 [3.3, 5.5 ] 或 [3.3, 3.3, 5.5] 如果不是。
Q2:在 [2.0, 4.0] 范围内 - 预期回报值:[3.3] 或 [3.3, 3.3]
Same for any other MultiIndexdimension, for example B values:
Q3: in range [111, 500] with repetitions, as number of data rows in range - expected return value: [111, 222, 222, 333, 333]
与任何其他MultiIndex维度相同,例如 B 值:
Q3:在 [111, 500] 范围内重复,作为范围内的数据行数 - 预期返回值:[111, 222, 222, 333, 333]
More formal:
更正式:
Let us assume T is a table with columns A, B and C. The table includes nrows. Table cells are numbers, for example A double, B and C integers. Let's create a DataFrameof table T, let us name it DF. Let's set columns A and B indexes of DF (without duplication, i.e. no separate columns A and B as indexes, and separate as data), i.e. A and B in this case MultiIndex.
让我们假设 T 是一个包含 A、B 和 C 列的表。该表包括n行。表格单元格是数字,例如 A double、B 和 C 整数。让我们创建一个表 T的DataFrame,让我们将其命名为 DF。让我们设置 DF 的 A 和 B 列索引(没有重复,即没有单独的 A 和 B 列作为索引,并作为数据分开),即 A 和 B 在这种情况下MultiIndex。
Questions:
问题:
- How to write a query on the index, for example, to query the index A (or B), say in the labels interval [120.0, 540.0]? Labels 120.0 and 540.0 exist. I must clarify that I am interested only in the list of indices as a response to the query!
- How to the same, but in case of the labels 120.0 and 540.0 do not exist, but there are labels by value lower than 120, higher than 120 and less than 540, or higher than 540?
- In case the answer for Q1 and Q2 was unique index values, now the same, but with repetitions, as number of data rows in index range.
- 如何编写对索引的查询,例如,在标签区间 [120.0, 540.0] 中查询索引 A(或 B)?存在标签 120.0 和 540.0。我必须澄清,我只对作为查询响应的索引列表感兴趣!
- 如何相同,但如果标签120.0和540.0不存在,但有低于120、高于120且低于540或高于540的值的标签?
- 如果 Q1 和 Q2 的答案是唯一索引值,现在相同,但有重复,作为索引范围内的数据行数。
I know the answers to the above questions in the case of columns which are not indexes, but in the indexes case, after a long research in the web and experimentation with the functionality of pandas, I did not succeed. The only method (without additional programming) I see now is to have a duplicate of A and B as data columns in addition to index.
对于不是索引的列,我知道上述问题的答案,但是在索引的情况下,经过长时间的网络研究和对pandas功能的实验,我没有成功。我现在看到的唯一方法(无需额外编程)是将 A 和 B 的副本作为除索引之外的数据列。
采纳答案by Vyacheslav Shkolyar
To query the dfby the MultiIndexvalues, for example where (A > 1.7) and (B < 666):
要通过MultiIndex值查询df,例如 where (A > 1.7) 和 (B < 666):
In [536]: result_df = df.loc[(df.index.get_level_values('A') > 1.7) & (df.index.get_level_values('B') < 666)]
In [537]: result_df
Out[537]:
C
A B
3.3 222 43
333 59
5.5 333 56
Hence, to get for example the 'A'index values, if still required:
因此,如果仍然需要,例如获取“A”索引值:
In [538]: result_df.index.get_level_values('A')
Out[538]: Index([3.3, 3.3, 5.5], dtype=object)
The problem is, that in large data frames the performance of by indexselection worse by 10% than the sorted regular rows selection. And in repetitive work, looping, the delay accumulated. See example:
问题是,在大型数据帧中,按索引选择的性能比已排序的常规行选择差 10%。而在重复性工作中,循环往复,延迟不断累积。见示例:
In [558]: df = store.select(STORE_EXTENT_BURSTS_DF_KEY)
In [559]: len(df)
Out[559]: 12857
In [560]: df.sort(inplace=True)
In [561]: df_without_index = df.reset_index()
In [562]: %timeit df.loc[(df.index.get_level_values('END_TIME') > 358200) & (df.index.get_level_values('START_TIME') < 361680)]
1000 loops, best of 3: 562 μs per loop
In [563]: %timeit df_without_index[(df_without_index.END_TIME > 358200) & (df_without_index.START_TIME < 361680)]
1000 loops, best of 3: 507 μs per loop
回答by Jeff
With a 'float' like index you always want to use it as a column rather than a direct indexing action. These will all work whether the endpoints exist or not.
对于像“浮动”这样的索引,您总是希望将其用作列而不是直接索引操作。无论端点是否存在,这些都将起作用。
In [11]: df
Out[11]:
C
A B
1.1 111 81
222 45
3.3 222 98
333 13
5.5 333 89
6.6 777 98
In [12]: x = df.reset_index()
Q1
第一季度
In [13]: x.loc[(x.A>=3.3)&(x.A<=6.6)]
Out[13]:
A B C
2 3.3 222 98
3 3.3 333 13
4 5.5 333 89
5 6.6 777 98
Q2
Q2
In [14]: x.loc[(x.A>=2.0)&(x.A<=4.0)]
Out[14]:
A B C
2 3.3 222 98
3 3.3 333 13
Q3
Q3
In [15]: x.loc[(x.B>=111.0)&(x.B<=500.0)]
Out[15]:
A B C
0 1.1 111 81
1 1.1 222 45
2 3.3 222 98
3 3.3 333 13
4 5.5 333 89
If you want the indices back, just set them. This is a cheap operation.
如果你想要索引回来,只需设置它们。这是一个廉价的操作。
In [16]: x.loc[(x.B>=111.0)&(x.B<=500.0)].set_index(['A','B'])
Out[16]:
C
A B
1.1 111 81
222 45
3.3 222 98
333 13
5.5 333 89
If you REALLY want the actual index values
如果你真的想要实际的索引值
In [5]: x.loc[(x.B>=111.0)&(x.B<=500.0)].set_index(['A','B']).index
Out[5]:
MultiIndex
[(1.1, 111), (1.1, 222), (3.3, 222), (3.3, 333), (5.5, 333)]
回答by YaOzI
For better readability, we can simply use the query()
Method, to avoid the lengthy df.index.get_level_values()
and reset_index
/set_index
to and fro.
为了更好的可读性,我们可以简单地使用的query()
方法,以避免冗长df.index.get_level_values()
和reset_index
/set_index
来回。
Here is the target DataFrame
:
这是目标DataFrame
:
In [12]: df
Out[12]:
C
A B
1.1 111 68
222 40
3.3 222 20
333 11
5.5 333 80
6.6 777 51
Answer for Q1(A
in range [3.3, 6.6]
):
Q1 的答案(A
在范围内[3.3, 6.6]
):
In [13]: df.query('3.3 <= A <= 6.6') # for closed interval
Out[13]:
C
A B
3.3 222 20
333 11
5.5 333 80
6.6 777 51
In [14]: df.query('3.3 < A < 6.6') # for open interval
Out[14]:
C
A B
5.5 333 80
and of course one can play around with <, <=, >, >=
for any kind of inclusion.
当然,你可以玩弄<, <=, >, >=
任何类型的包容。
Similarly, answer for Q2(A
in range [2.0, 4.0]
):
同样,Q2 的答案(A
在范围内[2.0, 4.0]
):
In [15]: df.query('2.0 <= A <= 4.0')
Out[15]:
C
A B
3.3 222 20
333 11
Answer for Q3(B
in range [111, 500]
):
Q3 的答案(B
在范围内[111, 500]
):
In [16]: df.query('111 <= B <= 500')
Out[16]:
C
A B
1.1 111 68
222 40
3.3 222 20
333 11
5.5 333 80
And moreover, you can COMBINEthe query for col A
and B
very naturally!
而且,你可以联合收割机的查询栏A
和B
非常自然!
In [17]: df.query('0 < A < 4 and 150 < B < 400')
Out[17]:
C
A B
1.1 222 40
3.3 222 20
333 11