原文地址: http://stackoverflow.com/questions/18327624/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Find element's index in pandas Series
Asked by sashkello
I know this is a very basic question, but for some reason I can't find an answer. How can I get the index of a certain element of a Series in Python pandas? (The first occurrence would suffice.)
I.e., I'd like something like:
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print(myseries.find(7))  # should output 3
Certainly, it is possible to define such a method with a loop:
def find(s, el):
    # Walk the index labels and return the first one whose value matches.
    for i in s.index:
        if s[i] == el:
            return i
    return None
print(find(myseries, 7))
but I assume there should be a better way. Is there?
Accepted answer by Viktor Kerkez
>>> myseries[myseries == 7]
3 7
dtype: int64
>>> myseries[myseries == 7].index[0]
3
Though I admit there should be a better way to do this, it at least avoids iterating over the object in Python and pushes the work down to the C level.
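The boolean-mask lookup above raises an IndexError when the value is absent, because `.index[0]` is applied to an empty result. A minimal sketch of a guarded version follows; the helper name `first_index_of` is hypothetical and not part of pandas.

```python
import pandas as pd


def first_index_of(s, value):
    """Return the index label of the first match, or None if there is none.

    Hypothetical helper, not a pandas API.
    """
    # Boolean mask computed in vectorized code, then applied to the index.
    matches = s.index[(s == value).to_numpy()]
    return matches[0] if len(matches) > 0 else None


myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])
print(first_index_of(myseries, 7))
print(first_index_of(myseries, 10))
```

Returning None for a missing value mirrors the loop-based `find` in the question; raising an exception instead would be an equally reasonable choice.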
Answer by Jeff
By converting the values to an Index, you can use get_loc:
In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
In [3]: Index(myseries).get_loc(7)
Out[3]: 3
In [4]: Index(myseries).get_loc(10)
KeyError: 10
Duplicate handling:
In [5]: Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)
It returns a boolean array if the matches are non-contiguous:
In [6]: Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False, True, False, False, True, False], dtype=bool)
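Because get_loc can return an integer, a slice, or a boolean mask depending on the data, calling code usually needs to normalize the result. The following is a rough sketch of one way to do that; `positions_of` is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd


def positions_of(values, target):
    """Return a list of integer positions where target occurs.

    Hypothetical helper built on Index.get_loc; raises KeyError if absent.
    """
    loc = pd.Index(values).get_loc(target)
    if isinstance(loc, slice):
        # Contiguous duplicates (monotonic data) come back as a slice.
        return list(range(*loc.indices(len(values))))
    if isinstance(loc, np.ndarray):
        if loc.dtype == bool:
            # Non-contiguous duplicates come back as a boolean mask.
            return np.flatnonzero(loc).tolist()
        # Defensive branch in case an array of positions is returned.
        return loc.tolist()
    # A unique match comes back as a single integer.
    return [int(loc)]


print(positions_of([1, 4, 0, 7, 5], 7))
print(positions_of([1, 1, 2, 2, 3, 4], 2))
print(positions_of([1, 1, 2, 1, 3, 2, 4], 2))
```

Always returning a list keeps the caller's code the same for unique and duplicated values, at the cost of a small wrapper around the single-match case.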
It uses a hash table internally, so it is fast:
In [7]: s = Series(randint(0,10,10000))
In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 μs per loop
In [12]: i = Index(s)
In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 μs per loop
As Viktor points out, creating an index has a one-time overhead (it's incurred when you actually do something with the index, e.g. access is_unique):
In [2]: s = Series(randint(0,10,10000))
In [3]: %timeit Index(s)
100000 loops, best of 3: 9.6 μs per loop
In [4]: %timeit Index(s).is_unique
10000 loops, best of 3: 140 μs per loop
Answer by Alex Spangher
Another way to do this, although equally unsatisfying, is:
s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])
list(s).index(7)
returns: 3
Timing tests on a dataset I'm currently working with (consider it random):
In [64]: %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 μs per loop
In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 μs per loop
In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 μs per loop
Answer by Alon
In [92]: (myseries==7).argmax()
Out[92]: 3
This works if you know in advance that 7 is present. If it is absent, argmax still returns the first entry, so check first with (myseries==7).any().
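The argmax-plus-any() pattern can be wrapped so the check is never forgotten. This is only a sketch: `first_label_via_argmax` is a made-up name, and it works on the underlying NumPy mask so that argmax is a position regardless of pandas version, which is then mapped back to an index label.

```python
import pandas as pd


def first_label_via_argmax(s, value):
    """Return the label of the first match, or None when value is absent.

    Hypothetical helper, not a pandas API.
    """
    mask = (s == value).to_numpy()
    if not mask.any():
        # Without this guard, argmax on an all-False mask would return 0.
        return None
    # argmax gives the position of the first True; map it to a label.
    return s.index[mask.argmax()]


myseries = pd.Series([1, 7, 0, 7, 5], index=["a", "b", "c", "d", "e"])
print(first_label_via_argmax(myseries, 7))
print(first_label_via_argmax(myseries, 99))
```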
Another approach (very similar to the first answer) that also handles multiple 7s (or none) is:
In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']
Answer by Alex
If you use numpy, you can get an array of the indices at which your value is found:
import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)
This returns a one-element tuple containing an array of the indices where 7 is the value in myseries:
(array([3], dtype=int64),)
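Note that np.where reports integer positions, not index labels; the two coincide above only because the index is 0 to 4. A small sketch of mapping positions back to labels, using a string index chosen purely for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative data: a non-default index so positions and labels differ.
myseries = pd.Series([1, 7, 0, 7, 5], index=["a", "b", "c", "d", "e"])

# Integer positions of every match (equivalent to np.where(...)[0]).
positions = np.flatnonzero(myseries.to_numpy() == 7)

# Translate positions into the corresponding index labels.
labels = myseries.index[positions].tolist()

print(positions.tolist())
print(labels)
```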
Answer by Raki Gade
You can use Series.idxmax(). Note that it returns the index label of the maximum value, so it finds 7 here only because 7 is the largest element:
>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
>>>
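To look up an arbitrary value rather than the maximum, idxmax can be applied to the boolean comparison instead of to the series itself. A brief sketch of the difference, using the same data as above:

```python
import pandas as pd

myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])

# idxmax on the values: label of the largest element (7), not a search.
label_of_max = myseries.idxmax()

# idxmax on a boolean mask: label of the first True, i.e. the first 4.
# Caveat: if nothing matches, the mask is all False and the first label
# is returned, so guard with (myseries == 4).any() when unsure.
label_of_four = (myseries == 4).idxmax()

print(label_of_max, label_of_four)
```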
Answer by Ulf Aslak
Often your value occurs at multiple indices:
>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')
Answer by Bill
I'm impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered a series with 25 elements and assumed the general case, where the index can contain any values and you want the index value corresponding to a search value located toward the end of the series.
Here are the speed tests on a 2013 MacBook Pro in Python 3.7 with Pandas version 0.25.3.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: data = [406400, 203200, 101600, 76100, 50800, 25400, 19050, 12700,
...: 9500, 6700, 4750, 3350, 2360, 1700, 1180, 850,
...: 600, 425, 300, 212, 150, 106, 75, 53,
...: 38]
In [4]: myseries = pd.Series(data, index=range(1,26))
In [5]: myseries[21]
Out[5]: 150
In [7]: %timeit myseries[myseries == 150].index[0]
416 μs ± 5.05 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: %timeit myseries[myseries == 150].first_valid_index()
585 μs ± 32.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit myseries.where(myseries == 150).first_valid_index()
652 μs ± 23.3 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [10]: %timeit myseries.index[np.where(myseries == 150)[0][0]]
195 μs ± 1.18 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [11]: %timeit pd.Series(myseries.index, index=myseries)[150]
178 μs ± 9.35 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [12]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]
77.4 μs ± 1.41 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [13]: %timeit myseries.index[list(myseries).index(150)]
12.7 μs ± 42.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [14]: %timeit myseries.index[myseries.tolist().index(150)]
9.46 μs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
@Jeff's answer seems to be the fastest, although it doesn't handle duplicates.
Correction: sorry, I missed one. @Alex Spangher's solution using the list index method is by far the fastest.
Update: Added @EliadL's answer.
Hope this helps.
It is amazing that such a simple operation requires such convoluted solutions, and that many of them are so slow: in some cases over half a millisecond to find a value in a series of 25 elements.
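For readers without IPython's %timeit, a comparison like the one above can be approximated with the standard timeit module. This is only a sketch with made-up data (25 consecutive integers, not the values used above) and three of the methods; absolute numbers will vary with hardware and pandas version.

```python
import timeit

import pandas as pd

# Made-up 25-element series with a 1-based index, as in the setup above.
myseries = pd.Series(range(100, 125), index=range(1, 26))
target = 120  # stored under index label 21

candidates = {
    "boolean mask": lambda: myseries[myseries == target].index[0],
    "Index.get_loc": lambda: myseries.index[pd.Index(myseries).get_loc(target)],
    "tolist().index": lambda: myseries.index[myseries.tolist().index(target)],
}

# Check that every method agrees before timing it.
results = {name: fn() for name, fn in candidates.items()}

for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=1000)
    print(f"{name}: {seconds / 1000 * 1e6:.1f} microseconds per call")
```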
Answer by rmutalik
Another way to do it that hasn't been mentioned yet is the tolist method:
myseries.tolist().index(7)
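This should return the correct position, assuming the value exists in the Series (list.index raises ValueError otherwise). The position equals the index label only when the Series has the default 0-based index.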
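A small sketch of wrapping the tolist approach so that it returns an index label and tolerates missing values; the name `label_via_list` and the sample data are illustrative assumptions.

```python
import pandas as pd


def label_via_list(s, value):
    """Return the index label of the first match using list.index.

    Hypothetical helper; returns None when the value is not present.
    """
    try:
        position = s.tolist().index(value)
    except ValueError:
        return None
    # Convert the list position into the Series' own index label.
    return s.index[position]


prices = pd.Series([10, 20, 30], index=["x", "y", "z"])
print(label_via_list(prices, 20))
print(label_via_list(prices, 99))
```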
Answer by EliadL
This is the most native and scalable approach I could find:
>>> myindex = pd.Series(myseries.index, index=myseries)
>>> myindex[7]
3
>>> myindex[[7, 5, 7]]
7 3
5 4
7 3
dtype: int64
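The reverse-lookup idea can be sketched with explicit .loc access, which avoids any ambiguity between label and positional indexing. The string index and the duplicate example below are illustrative assumptions; with duplicated values, a lookup returns every matching label rather than a single one.

```python
import pandas as pd

myseries = pd.Series([1, 4, 0, 7, 5], index=["a", "b", "c", "d", "e"])

# Swap roles: the values become the index, the old labels become the data.
reverse = pd.Series(myseries.index, index=myseries.to_numpy())

single = reverse.loc[7]                   # one label for a unique value
several = reverse.loc[[7, 5]].tolist()    # labels for several values

# With duplicated values, .loc returns a Series of all matching labels.
dup = pd.Series([1, 7, 0, 7], index=["a", "b", "c", "d"])
reverse_dup = pd.Series(dup.index, index=dup.to_numpy())
all_sevens = reverse_dup.loc[7].tolist()

print(single, several, all_sevens)
```

Building the reversed Series once and reusing it is what makes this approach pay off when many values are looked up.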