for loop using iterrows in pandas
Disclaimer: this page is a translation of a popular StackOverflow question and answer, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me) on StackOverflow.
Original question: http://stackoverflow.com/questions/41297341/
for loop using iterrows in pandas
Asked by Gingerbread
I have 2 dataframes as follows:
data1 looks like this:
id address
1 11123451
2 78947591
data2 looks like the following:
lowerbound_address upperbound_address place
78392888 89000000 X
10000000 20000000 Y
I want to create another column in data1 called "place" which contains the place the id is from. For example, in the above case, for id 1, I want the place column to contain Y and for id 2, I want the place column to contain X. There will be many ids coming from the same place. And some ids don't have a match.
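For reference, the two example frames can be rebuilt like this (a minimal sketch; in the real data the addresses are floats, as noted below):

import pandas as pd

data1 = pd.DataFrame({'id': [1, 2],
                      'address': [11123451, 78947591]})
data2 = pd.DataFrame({'lowerbound_address': [78392888, 10000000],
                      'upperbound_address': [89000000, 20000000],
                      'place': ['X', 'Y']})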
I am trying to do it using the following piece of code.
places = []
for index, row in data1.iterrows():
    for idx, r in data2.iterrows():
        if r['lowerbound_address'] <= row['address'] <= r['upperbound_address']:
            places.append(r['place'])
The addresses here are float values.
It's taking forever to run this piece of code. It makes me wonder if my code is correct or if there's a faster way of executing the same.
Any help will be much appreciated. Thank you!
Answered by jezrael
You can first do a cross join with merge, then filter the values by boolean indexing, and finally remove the unnecessary columns with drop:
data1['tmp'] = 1
data2['tmp'] = 1

df = pd.merge(data1, data2, on='tmp', how='outer')
df = df[(df.lowerbound_address <= df.address) & (df.upperbound_address >= df.address)]
df = df.drop(['lowerbound_address','upperbound_address', 'tmp'], axis=1)
print (df)

   id   address place
1   1  11123451     Y
2   2  78947591     X
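As a side note (not part of the original answer), on pandas 1.2+ the helper tmp column is no longer needed, because merge supports a cross join directly; a sketch using the same data1 and data2:

# pandas >= 1.2: cross join without the helper 'tmp' key
df = pd.merge(data1, data2, how='cross')
df = df[(df.lowerbound_address <= df.address) & (df.upperbound_address >= df.address)]
df = df.drop(['lowerbound_address', 'upperbound_address'], axis=1)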
Another solution uses itertuples and then creates the DataFrame with DataFrame.from_records:
places = []
for row1 in data1.itertuples():
    for row2 in data2.itertuples():
        #print (row1.address)
        if (row2.lowerbound_address <= row1.address <= row2.upperbound_address):
            places.append((row1.id, row1.address, row2.place))

print (places)
[(1, 11123451, 'Y'), (2, 78947591, 'X')]

df = pd.DataFrame.from_records(places)
df.columns = ['id','address','place']
print (df)

   id   address place
0   1  11123451     Y
1   2  78947591     X
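A small side note (not from the original answer): because some ids have no matching range, places only holds the matched rows. One way to keep every id, assuming id is unique in data1, is to left-merge the matches back onto data1:

matches = pd.DataFrame.from_records(places, columns=['id', 'address', 'place'])
# how='left' keeps ids without a matching range; their place becomes NaN
df = data1.merge(matches[['id', 'place']], on='id', how='left')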
Another solution uses apply:
def f(x):
    for row2 in data2.itertuples():
        if (row2.lowerbound_address <= x <= row2.upperbound_address):
            return pd.Series([x, row2.place], index=['address','place'])

df = data1.set_index('id')['address'].apply(f).reset_index()
print (df)

   id   address place
0   1  11123451     Y
1   2  78947591     X
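A vectorized alternative (not in the original answer): in newer pandas an IntervalIndex lookup avoids the Python-level loop entirely, assuming the ranges do not overlap:

import numpy as np
import pandas as pd

# one closed interval per row of data2
intervals = pd.IntervalIndex.from_arrays(data2['lowerbound_address'],
                                         data2['upperbound_address'],
                                         closed='both')
# position of the matching interval for each address, -1 if there is none
pos = intervals.get_indexer(data1['address'])
data1['place'] = np.where(pos >= 0, data2['place'].to_numpy()[pos], np.nan)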
EDIT:
Timings:
N = 1000:
If some values are not in range, they are omitted in solutions b and c. Check the last row of df1.
In [73]: %timeit (data1.set_index('id')['address'].apply(f).reset_index())
1 loop, best of 3: 2.06 s per loop
In [74]: %timeit (a(df1a, df2a))
1 loop, best of 3: 82.2 ms per loop
In [75]: %timeit (b(df1b, df2b))
1 loop, best of 3: 3.17 s per loop
In [76]: %timeit (c(df1c, df2c))
100 loops, best of 3: 2.71 ms per loop
Code for timings:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000

data1 = pd.DataFrame({'id':np.arange(1,N+1),
                      'address': np.random.randint(N*10, size=N)}, columns=['id','address'])
#add last row with value out of range
data1.loc[data1.index[-1]+1, ['id','address']] = [data1.index[-1]+1, -1]
data1 = data1.astype(int)
print (data1.tail())

data2 = pd.DataFrame({'lowerbound_address':np.arange(1, N*10, 10),
                      'upperbound_address':np.arange(10, N*10+10, 10),
                      'place': np.random.randint(40, size=N)})
print (data2.tail())

df1a, df1b, df1c = data1.copy(), data1.copy(), data1.copy()
df2a, df2b, df2c = data2.copy(), data2.copy(), data2.copy()

def a(data1, data2):
    data1['tmp'] = 1
    data2['tmp'] = 1

    df = pd.merge(data1, data2, on='tmp', how='outer')
    df = df[(df.lowerbound_address <= df.address) & (df.upperbound_address >= df.address)]
    df = df.drop(['lowerbound_address','upperbound_address', 'tmp'], axis=1)
    return (df)

def b(data1, data2):
    places = []
    for row1 in data1.itertuples():
        for row2 in data2.itertuples():
            #print (row1.address)
            if (row2.lowerbound_address <= row1.address <= row2.upperbound_address):
                places.append((row1.id, row1.address, row2.place))

    df = pd.DataFrame.from_records(places)
    df.columns = ['id','address','place']
    return (df)

def f(x):
    #use for ... else to add NaN for values out of range
    #http://stackoverflow.com/q/9979970/2901002
    for row2 in data2.itertuples():
        if (row2.lowerbound_address <= x <= row2.upperbound_address):
            return pd.Series([x, row2.place], index=['address','place'])
    else:
        return pd.Series([x, np.nan], index=['address','place'])

def c(data1, data2):
    data1 = data1.sort_values('address')
    data2 = data2.sort_values('lowerbound_address')

    df = pd.merge_asof(data1, data2, left_on='address', right_on='lowerbound_address')
    df = df.drop(['lowerbound_address','upperbound_address'], axis=1)
    return df.sort_values('id')

print (data1.set_index('id')['address'].apply(f).reset_index())
print (a(df1a, df2a))
print (b(df1b, df2b))
print (c(df1c, df2c))
Only solution c with merge_asof works very nicely with a large DataFrame:
N = 1M:
In [84]: %timeit (c(df1c, df2c))
1 loop, best of 3: 525 ms per loop
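One caveat worth adding (not in the original answer): merge_asof only searches backward on the lower bound, so an address that falls past its matched range's upper bound is still assigned to that range. If the ranges can have gaps, such rows can be masked afterwards:

df = pd.merge_asof(data1.sort_values('address'),
                   data2.sort_values('lowerbound_address'),
                   left_on='address', right_on='lowerbound_address')
# clear places where the address overshoots the matched upper bound
df.loc[df['address'] > df['upperbound_address'], 'place'] = np.nan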
More about merge_asof in the docs.