for loop using iterrows in pandas

Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original URL: http://stackoverflow.com/questions/41297341/

Date: 2020-09-14 02:39:52 | Source: igfitidea

for loop using iterrows in pandas

python pandas

Asked by Gingerbread

I have 2 dataframes as follows:

data1 looks like this:

id          address       
1          11123451
2          78947591

data2 looks like the following:

lowerbound_address   upperbound_address    place
78392888                 89000000            X
10000000                 20000000            Y

I want to create another column in data1 called "place" which contains the place the id is from. For example, in the above case, for id 1, I want the place column to contain Y and for id 2, I want the place column to contain X. There will be many ids coming from the same place. And some ids don't have a match.

I am trying to do it using the following piece of code.

places = []
for index, row in data1.iterrows():
    for idx, r in data2.iterrows():
        if r['lowerbound_address'] <= row['address'] <= r['upperbound_address']:
            places.append(r['place'])

The addresses here are float values.

It's taking forever to run this piece of code. It makes me wonder if my code is correct or if there's a faster way of executing the same.

Any help will be much appreciated. Thank you!

Answered by jezrael

You can first do a cross join with merge, then filter the values by boolean indexing, and finally remove the unnecessary columns with drop:

data1['tmp'] = 1
data2['tmp'] = 1
df = pd.merge(data1, data2, on='tmp', how='outer')
df = df[(df.lowerbound_address <= df.address) & (df.upperbound_address >= df.address)]
df = df.drop(['lowerbound_address','upperbound_address', 'tmp'], axis=1)
print (df)
   id   address place
1   1  11123451     Y
2   2  78947591     X
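As a side note, newer pandas versions (1.2 and later) support a cross join directly via `merge(..., how='cross')`, which removes the need for the dummy `tmp` column. A minimal sketch, recreating the sample frames from the question:

```python
import pandas as pd

# sample data from the question
data1 = pd.DataFrame({'id': [1, 2], 'address': [11123451, 78947591]})
data2 = pd.DataFrame({'lowerbound_address': [78392888, 10000000],
                      'upperbound_address': [89000000, 20000000],
                      'place': ['X', 'Y']})

# cross join: every row of data1 paired with every row of data2
df = data1.merge(data2, how='cross')
# keep only the pairs where the address falls inside the range
df = df[(df.lowerbound_address <= df.address) & (df.address <= df.upperbound_address)]
df = df.drop(['lowerbound_address', 'upperbound_address'], axis=1)
print(df)
```

The boolean filter afterwards is the same as in the solution above; only the join step changes.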

Another solution with itertuples; at the end, create the DataFrame with DataFrame.from_records:

places = []
for row1 in data1.itertuples():
    for row2 in data2.itertuples():
        #print (row1.address)
        if (row2.lowerbound_address <= row1.address <= row2.upperbound_address):
            places.append((row1.id, row1.address, row2.place))    
print (places)
[(1, 11123451, 'Y'), (2, 78947591, 'X')]

df = pd.DataFrame.from_records(places)
df.columns=['id','address','place']
print (df)
   id   address place
0   1  11123451     Y
1   2  78947591     X
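If the ranges in `data2` never overlap, another option (not in the original answer) is `pd.IntervalIndex`, whose `get_indexer` returns -1 for addresses that fall in no range, so unmatched ids are handled naturally. A minimal sketch with the question's sample data plus one unmatched row:

```python
import pandas as pd

# sample data from the question, plus an address (-1) with no match
data1 = pd.DataFrame({'id': [1, 2, 3], 'address': [11123451, 78947591, -1]})
data2 = pd.DataFrame({'lowerbound_address': [78392888, 10000000],
                      'upperbound_address': [89000000, 20000000],
                      'place': ['X', 'Y']})

# build an index of closed intervals [lower, upper]
intervals = pd.IntervalIndex.from_arrays(data2['lowerbound_address'],
                                         data2['upperbound_address'],
                                         closed='both')
# position of the interval containing each address, -1 if none
pos = intervals.get_indexer(data1['address'])
data1['place'] = [data2['place'].iloc[i] if i != -1 else None for i in pos]
print(data1)
```

This requires non-overlapping intervals; `get_indexer` raises if an address falls into more than one.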

Another solution with apply:

def f(x):
    for row2 in data2.itertuples():
        if (row2.lowerbound_address <= x <= row2.upperbound_address):
            return pd.Series([x, row2.place], index=['address','place'])

df = data1.set_index('id')['address'].apply(f).reset_index()
print (df)
   id   address place
0   1  11123451     Y
1   2  78947591     X

EDIT:

Timings:

N = 1000:

If some values are not in range, they are omitted in solutions b and c. Check the last row of df1.

In [73]: %timeit (data1.set_index('id')['address'].apply(f).reset_index())
1 loop, best of 3: 2.06 s per loop

In [74]: %timeit (a(df1a, df2a))
1 loop, best of 3: 82.2 ms per loop

In [75]: %timeit (b(df1b, df2b))
1 loop, best of 3: 3.17 s per loop

In [76]: %timeit (c(df1c, df2c))
100 loops, best of 3: 2.71 ms per loop

Code for timings:

np.random.seed(123)
N = 1000
data1 = pd.DataFrame({'id':np.arange(1,N+1), 
                   'address': np.random.randint(N*10, size=N)}, columns=['id','address'])

#add last row with value out of range
data1.loc[data1.index[-1]+1, ['id','address']] = [data1.index[-1]+1, -1]
data1 = data1.astype(int)
print (data1.tail())

data2 = pd.DataFrame({'lowerbound_address':np.arange(1, N*10,10), 
                      'upperbound_address':np.arange(10,N*10+10, 10),
                      'place': np.random.randint(40, size=N)})

print (data2.tail())
df1a, df1b, df1c = data1.copy(),data1.copy(),data1.copy()
df2a, df2b ,df2c = data2.copy(),data2.copy(),data2.copy()


def a(data1, data2):
    data1['tmp'] = 1
    data2['tmp'] = 1
    df = pd.merge(data1, data2, on='tmp', how='outer')
    df = df[(df.lowerbound_address <= df.address) & (df.upperbound_address >= df.address)]
    df = df.drop(['lowerbound_address','upperbound_address', 'tmp'], axis=1)
    return (df)


def b(data1, data2):
    places = []
    for row1 in data1.itertuples():
        for row2 in data2.itertuples():
            if (row2.lowerbound_address <= row1.address <= row2.upperbound_address):
                places.append((row1.id, row1.address, row2.place))

    # build the DataFrame once, after both loops finish
    df = pd.DataFrame.from_records(places)
    df.columns = ['id', 'address', 'place']
    return df


def f(x):
    # use for ... else to add NaN for values out of range
    # http://stackoverflow.com/q/9979970/2901002
    for row2 in data2.itertuples():
        if (row2.lowerbound_address <= x <= row2.upperbound_address):
            return pd.Series([x, row2.place], index=['address','place'])
    else:
        return pd.Series([x, np.nan], index=['address','place'])


def c(data1,data2):
    data1 = data1.sort_values('address')
    data2 = data2.sort_values('lowerbound_address')
    df = pd.merge_asof(data1, data2, left_on='address', right_on='lowerbound_address')
    df = df.drop(['lowerbound_address','upperbound_address'], axis=1)
    return df.sort_values('id')


print (data1.set_index('id')['address'].apply(f).reset_index())
print (a(df1a, df2a))
print (b(df1b, df2b))
print (c(df1c, df2c))

Only solution c with merge_asof works very nicely with a large DataFrame:

N=1M:

In [84]: %timeit (c(df1c, df2c))
1 loop, best of 3: 525 ms per loop
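One caveat with solution c: merge_asof only matches on the lower bound, so it is correct here only because the generated ranges are contiguous. With gaps between ranges, an address falling in a gap would silently get the place of the nearest lower range. A hedged sketch that adds an explicit upper-bound check afterwards (sample data recreated from the question, plus an in-gap address):

```python
import numpy as np
import pandas as pd

# sample data from the question, plus an address (50000000) that falls
# in the gap between the two ranges
data1 = pd.DataFrame({'id': [1, 2, 3], 'address': [11123451, 78947591, 50000000]})
data2 = pd.DataFrame({'lowerbound_address': [78392888, 10000000],
                      'upperbound_address': [89000000, 20000000],
                      'place': ['X', 'Y']})

left = data1.sort_values('address')
right = data2.sort_values('lowerbound_address')
# matches the largest lowerbound_address <= address (direction='backward')
df = pd.merge_asof(left, right, left_on='address', right_on='lowerbound_address')
# merge_asof never checks the upper bound, so blank out matches past it
df.loc[df['address'] > df['upperbound_address'], 'place'] = np.nan
df = df.drop(['lowerbound_address', 'upperbound_address'], axis=1).sort_values('id')
print(df)
```

With contiguous ranges, as in the timing setup above, the extra filter is a no-op, so it costs little to keep it for safety.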

More about merge_asof in the docs.