Original question: http://stackoverflow.com/questions/41255215/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
pandas - find first occurrence
Asked by sachinruk
Suppose I have a structured dataframe as follows:
df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'],
                   "B": [1] * 5})
The A column has already been sorted. I want to find the index of the first row where df.A != 'a'. The end goal is to use this index to split the dataframe into groups based on A.
Now I realise that there is a groupby functionality. However, the dataframe is quite large and this is a simplified toy example. Since A has already been sorted, it would be faster if I could just find the first index where df.A != 'a'. It is therefore important that whatever method you use stops scanning once the first such element is found.
Answered by piRSquared
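idxmax and argmax return the location of the maximal value (idxmax gives the index label, argmax the integer position), or the first such location if the maximal value occurs more than once. On a boolean mask the maximal value is True, so both return the first True.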
Use idxmax on df.A.ne('a'):
df.A.ne('a').idxmax()
3
or the numpy equivalent:
(df.A.values != 'a').argmax()
3
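One caveat may be worth spelling out: when the mask contains no True at all, every value ties for the maximum, so idxmax and argmax still return the first location rather than signalling "not found". A minimal sketch of a guard, assuming a helper name (first_index_where) that is not part of the original answer:

```python
import pandas as pd

def first_index_where(mask):
    """Return the index label of the first True in a boolean Series, or None."""
    # Hypothetical helper: without this check, idxmax on an all-False mask
    # would return the first label, which looks like a valid hit.
    if not mask.any():
        return None
    return mask.idxmax()

df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5})

print(first_index_where(df.A.ne('a')))  # 3
print(first_index_where(df.A.eq('z')))  # None
```

The extra any() call is another pass over the mask, so this trades a little speed for an unambiguous result.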
However, if A has already been sorted, then we can use searchsorted, which performs a binary search instead of scanning every element:
df.A.searchsorted('a', side='right')
array([3])
Or the numpy equivalent:
df.A.values.searchsorted('a', side='right')
3
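Since the stated goal is to break the dataframe into groups, the position returned by searchsorted can be fed straight into iloc. A small sketch under the assumption that A is sorted and the index is the default one; the names split_at, first_group and rest are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5})

values = df["A"].to_numpy()
# Position just past the last 'a', found by binary search (A must be sorted).
split_at = int(values.searchsorted('a', side='right'))

first_group = df.iloc[:split_at]   # rows where A == 'a'
rest = df.iloc[split_at:]          # everything after the 'a' run

print(split_at)  # 3
```

Because iloc slicing is positional, this works regardless of the index labels, but it silently gives wrong groups if A is not actually sorted.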
Answered by Anna K.
I found that pandas DataFrames have a first_valid_index method that will do the job. It can be used as follows:
df[df.A!='a'].first_valid_index()
3
However, this method seems to be very slow. Even taking the first index of the filtered dataframe is faster:
df.loc[df.A!='a','A'].index[0]
Below I compare the total time (in seconds) of repeating the calculation 100 times for these two options and all the code above:
                            total_time_sec  ratio wrt fastest algo
searchsorted numpy:                 0.0007                    1.00
argmax numpy:                       0.0009                    1.29
for loop:                           0.0045                    6.43
searchsorted pandas:                0.0075                   10.71
idxmax pandas:                      0.0267                   38.14
index[0]:                           0.0295                   42.14
first_valid_index pandas:           0.1181                  168.71
Notice that numpy's searchsorted is the winner and first_valid_index shows the worst performance. Generally, the numpy algorithms are faster. The for loop does not do badly here, but only because the dataframe has very few entries.
For a dataframe with 10,000 entries where the desired entries are close to the end, the results are different: searchsorted still delivers the best performance, while the for loop becomes by far the slowest:
                            total_time_sec  ratio wrt fastest algo
searchsorted numpy:                 0.0007                    1.00
searchsorted pandas:                0.0076                   10.86
argmax numpy:                       0.0117                   16.71
index[0]:                           0.0815                  116.43
idxmax pandas:                      0.0904                  129.14
first_valid_index pandas:           0.1691                  241.57
for loop:                           9.6504                13786.29
The code to produce these results is below:
import timeit

import numpy as np
import pandas as pd

# setup code, executed once per timeit call (not timed)
mysetup = '''import pandas as pd
import numpy as np
df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
'''

# code snippets whose execution time is to be measured
mycode_set = ['''
df[df.A!='a'].first_valid_index()
''']
message = ["first_valid_index pandas:"]

mycode_set.append('''df.loc[df.A!='a','A'].index[0]''')
message.append("index[0]: ")

mycode_set.append('''df.A.ne('a').idxmax()''')
message.append("idxmax pandas: ")

mycode_set.append('''(df.A.values != 'a').argmax()''')
message.append("argmax numpy: ")

mycode_set.append('''df.A.searchsorted('a', side='right')''')
message.append("searchsorted pandas: ")

mycode_set.append('''df.A.values.searchsorted('a', side='right')''')
message.append("searchsorted numpy: ")

mycode_set.append('''for index in range(len(df['A'])):
    if df['A'][index] != 'a':
        ans = index
        break
''')
message.append("for loop: ")

# run every snippet 100 times and record the total time in seconds
total_time_in_sec = []
for mycode in mycode_set:
    total_time_in_sec.append(
        np.round(timeit.timeit(setup=mysetup, stmt=mycode, number=100), 4))

output = pd.DataFrame(total_time_in_sec, index=message,
                      columns=['total_time_sec'])
output["ratio wrt fastest algo"] = np.round(
    output.total_time_sec / output["total_time_sec"].min(), 2)
output = output.sort_values(by="total_time_sec")
print(output)  # the original called display(output), which needs IPython/Jupyter
For the larger dataframe:
mysetup = '''import pandas as pd
import numpy as np
n = 10000
lt = ['a' for _ in range(n)]
b = ['b' for _ in range(5)]
lt[-5:] = b
df = pd.DataFrame({"A":lt,"B":[1]*n})
'''
Answered by Vaishali
If you just want to find the first instance without going through the entire dataframe, you can use a for loop and break at the first match.
import pandas as pd

df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
for index in range(len(df['A'])):
    if df['A'][index] != 'a':
        print(index)
        break  # stop scanning at the first match
Here index is the row number of the first row where df.A != 'a'.
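A possible middle ground between the early-exit loop and a fully vectorized scan is to compare the column in chunks and stop at the first chunk that contains a mismatch. This is only a sketch: the function name first_position_where_not and the chunk_size default are assumptions made for illustration, not something from the answers here.

```python
import numpy as np
import pandas as pd

def first_position_where_not(values, target, chunk_size=4096):
    """Return the position of the first element != target, or -1 if none."""
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        # Vectorized comparison on this chunk only.
        mismatches = np.flatnonzero(chunk != target)
        if mismatches.size:
            return start + int(mismatches[0])  # stop: later chunks are never read
    return -1

small = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5})
large = pd.DataFrame({"A": ['a'] * 9995 + ['b'] * 5, "B": [1] * 10000})

print(first_position_where_not(small["A"].to_numpy(), 'a'))  # 3
print(first_position_where_not(large["A"].to_numpy(), 'a'))  # 9995
```

Whether this beats a plain argmax depends on where the first mismatch sits and on the chunk size, so it would need to be timed on the real data.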
Answered by Alaa M.
For multiple conditions:
Let's say we have:
s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])
And we want to find the first item that differs from both a and c. We do:
n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
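When the list of excluded values grows, chaining one comparison per value gets unwieldy. One alternative, sketched here with an illustrative excluded list, is to build the mask with Series.isin and negate it; the any() guard covers the case where nothing matches, where argmax alone would return 0.

```python
import pandas as pd

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])
excluded = ['a', 'c']  # illustrative list of values to skip

# True wherever the item is not one of the excluded values.
mask = (~s.isin(excluded)).to_numpy()

# Position of the first True, or None if every item is excluded.
n = int(mask.argmax()) if mask.any() else None
print(n)  # 4
```

This has not been timed against the logical_and version, so no speed claim is made for it.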
Timings:
import numpy as np
import pandas as pd
from datetime import datetime

ITERS = 1000

def pandas_multi_condition(s):
    ts = datetime.now()
    for i in range(ITERS):
        n = s[(s != 'a') & (s != 'c')].index[0]
    print(n)
    print(datetime.now() - ts)

def numpy_bitwise_and(s):
    ts = datetime.now()
    for i in range(ITERS):
        n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
    print(n)
    print(datetime.now() - ts)

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])

print('pandas_multi_condition():')
pandas_multi_condition(s)
print()
print('numpy_bitwise_and():')
numpy_bitwise_and(s)
Output:
pandas_multi_condition():
4
0:00:01.144767
numpy_bitwise_and():
4
0:00:00.019013
Answered by André de Mattos Ferraz
You can iterate over the dataframe rows (this is slow) and write your own logic to get the values you want:
def getMaxIndex(df, col):
    # track the largest value seen so far and the index label of its row
    max_value = float("-inf")
    rtn_index = 0
    for index, row in df.iterrows():
        if row[col] > max_value:
            max_value = row[col]
            rtn_index = index
    return rtn_index
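For a numeric column, getMaxIndex appears to compute the same thing as the built-in Series.idxmax, which also returns the label of the first row holding the maximum. A short sketch with made-up values in column B:

```python
import pandas as pd

# Column B holds illustrative values; the maximum 5 occurs at labels 1 and 3.
df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1, 5, 3, 5, 2]})

# idxmax returns the label of the first occurrence of the maximum.
max_label = df["B"].idxmax()
print(max_label)  # 1
```

The built-in avoids the per-row overhead of iterrows, so it is usually the better default unless the custom logic really needs row-by-row access.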

