Original question: http://stackoverflow.com/questions/31414481/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
new column with coordinates using geopy pandas
Asked by Dave
I have a df:
import pandas as pd
import numpy as np
import datetime as DT
import hmac
from geopy.geocoders import Nominatim
from geopy.distance import vincenty
df
city_name state_name county_name
0 WASHINGTON DC DIST OF COLUMBIA
1 WASHINGTON DC DIST OF COLUMBIA
2 WASHINGTON DC DIST OF COLUMBIA
3 WASHINGTON DC DIST OF COLUMBIA
4 WASHINGTON DC DIST OF COLUMBIA
5 WASHINGTON DC DIST OF COLUMBIA
6 WASHINGTON DC DIST OF COLUMBIA
7 WASHINGTON DC DIST OF COLUMBIA
8 WASHINGTON DC DIST OF COLUMBIA
9 WASHINGTON DC DIST OF COLUMBIA
I want to get the latitude and longitude coordinates for any one of the columns in the data frame above. The documentation (http://geopy.readthedocs.org/en/latest/#data) is pretty straightforward when working with individual locations.
>>> from geopy.geocoders import Nominatim
>>> geolocator = Nominatim()
>>> location = geolocator.geocode("175 5th Avenue NYC")
>>> print(location.address)
Flatiron Building, 175, 5th Avenue, Flatiron, New York, NYC, New York, ...
>>> print((location.latitude, location.longitude))
(40.7410861, -73.9896297241625)
>>> print(location.raw)
{'place_id': '9167009604', 'type': 'attraction', ...}
However, I want to apply the function to each row in the df and make a new column. I've tried the following:
df['city_coord'] = geolocator.geocode(lambda row: 'state_name' (row))
but I think I'm missing something in my code because I get the following:
city_name state_name county_name coordinates
0 WASHINGTON DC DIST OF COLUMBIA None
1 WASHINGTON DC DIST OF COLUMBIA None
2 WASHINGTON DC DIST OF COLUMBIA None
3 WASHINGTON DC DIST OF COLUMBIA None
4 WASHINGTON DC DIST OF COLUMBIA None
5 WASHINGTON DC DIST OF COLUMBIA None
6 WASHINGTON DC DIST OF COLUMBIA None
7 WASHINGTON DC DIST OF COLUMBIA None
8 WASHINGTON DC DIST OF COLUMBIA None
9 WASHINGTON DC DIST OF COLUMBIA None
I would like something like this hopefully using the Lambda function:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
1 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
2 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
3 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
4 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
5 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
6 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
7 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
8 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
9 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
10 GLYNCO GA GLYNN 31.2224512, -81.5101023
I appreciate any help. After I get the coordinates I'd like to map them. Any recommended resources for mapping coordinates would be greatly appreciated too. Thanks.
Accepted answer by EdChum
You can call apply and pass the function you want to execute on every row, like the following:
In [9]:
geolocator = Nominatim()
df['city_coord'] = df['state_name'].apply(geolocator.geocode)
df
Out[9]:
city_name state_name county_name \
0 WASHINGTON DC DIST OF COLUMBIA
1 WASHINGTON DC DIST OF COLUMBIA
city_coord
0 (District of Columbia, United States of Americ...
1 (District of Columbia, United States of Americ...
You can then access the latitude and longitude attributes:
In [16]:
df['city_coord'] = df['city_coord'].apply(lambda x: (x.latitude, x.longitude))
df
Out[16]:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
1 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
Or do it in a one-liner by calling apply twice:
In [17]:
df['city_coord'] = df['state_name'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
df
Out[17]:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
1 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
Also, your attempt geolocator.geocode(lambda row: 'state_name' (row)) never applied geocode to the rows (the lambda itself was passed to geocode as the query and was never called), which is why you have a column full of None values.
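One thing to watch with the apply calls above: geocode returns None when it finds no match, so a lambda that reads x.latitude would raise an AttributeError on that row. The following is a minimal sketch of a guarded version, not the answer's own code. FakeLocation, fake_geocode and geocode_to_coords are hypothetical names used so the example runs without a network call, the "NOWHERE" row is made up to show the no-match case, and the coordinates are the ones quoted in the answer.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for a geopy Location; only the two attributes used here.
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])

# Hypothetical lookup table standing in for the Nominatim service.
_KNOWN = {"WASHINGTON, DC": FakeLocation(38.8937154, -76.9877934586326)}


def fake_geocode(query):
    """Mimic geolocator.geocode: return a location, or None when nothing matches."""
    return _KNOWN.get(query)


def geocode_to_coords(query):
    """Geocode one query string and return (latitude, longitude), or None."""
    location = fake_geocode(query)  # with real geopy: geolocator.geocode(query)
    if location is None:
        return None
    return (location.latitude, location.longitude)


df = pd.DataFrame({
    "city_name": ["WASHINGTON", "NOWHERE"],
    "state_name": ["DC", "ZZ"],
})

# Build one query string per row from two columns, then geocode each string.
queries = df["city_name"] + ", " + df["state_name"]
df["city_coord"] = queries.apply(geocode_to_coords)
```

To use this against the real service, the assumption is that you would create the geolocator with something like Nominatim(user_agent="your-app-name") (recent geopy releases require a user_agent) and call geolocator.geocode inside geocode_to_coords.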
EDIT
@leb makes an interesting point here: if you have many duplicate values, it will be more performant to geocode each unique value once and then add the results like this:
In [38]:
states = df['state_name'].unique()
d = dict(zip(states, pd.Series(states).apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))))
d
Out[38]:
{'DC': (38.8937154, -76.9877934586326)}
In [40]:
df['city_coord'] = df['state_name'].map(d)
df
Out[40]:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
1 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
So the above gets all the unique values using unique, constructs a dict from them, and then calls map to perform the lookup and add the coords. This will be more efficient than trying to geocode row-wise.
Answer by Leb
Upvote and accept @EdChum's answer; I just wanted to add to it. His method works perfectly, but from personal experience I'd like to share a few things:
When dealing with geocoding, if you have multiple city/state combinations that repeat, it's much faster to send only one to be geocoded and then replicate the result to the other rows.
This is very helpful for large data and can be done in two ways:
- Based on your data only, since the rows seem to be exact duplicates, and only if you want, drop the extra ones and geocode just one of them. This can be done using drop_duplicates.
- If you want to keep all your rows, groupby the city/state combination, apply geocoding to the first row of each group by calling head(1), then copy the result to the remaining rows.
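The idea behind the two options above can be sketched roughly as follows. This version geocodes each distinct city/state pair once via drop_duplicates and then merges the result back so every original row is kept; note that it uses merge rather than groupby with head(1), which is a swapped-in way to reach the same result. fake_geocode and the calls list are hypothetical stand-ins for a real geocoder so the example runs offline, and the coordinates are the ones from the question's desired output.

```python
import pandas as pd

calls = []  # records each query sent to the (fake) geocoder

# Hypothetical coordinates table standing in for the Nominatim service.
_COORDS = {
    "WASHINGTON, DC": (38.8949549, -77.0366456),
    "GLYNCO, GA": (31.2224512, -81.5101023),
}


def fake_geocode(query):
    """Hypothetical stand-in for a real geocoding call; returns (lat, lon) or None."""
    calls.append(query)
    return _COORDS.get(query)


df = pd.DataFrame({
    "city_name": ["WASHINGTON"] * 10 + ["GLYNCO"],
    "state_name": ["DC"] * 10 + ["GA"],
    "county_name": ["DIST OF COLUMBIA"] * 10 + ["GLYNN"],
})

# 1. Keep one row per distinct city/state combination.
unique_places = df[["city_name", "state_name"]].drop_duplicates().copy()

# 2. Geocode only those rows (2 lookups here instead of 11).
unique_places["city_coord"] = (
    unique_places["city_name"] + ", " + unique_places["state_name"]
).apply(fake_geocode)

# 3. Copy the coordinates back onto every original row.
df = df.merge(unique_places, on=["city_name", "state_name"], how="left")
```

After the merge, df still has all 11 rows, while the calls list shows that only two queries were sent.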
The reason is that each time you call Nominatim there's a small latency, even if you query the same city/state several times in a row. This small latency adds up as your data gets large, causing a huge delay in response and possible timeouts.
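To illustrate the latency point, here is a minimal sketch of a wrapper that remembers earlier answers and pauses between real lookups. cached_geocoder, slow_fake_geocode and the min_delay_seconds parameter are hypothetical names made up for this example and are not part of geopy. For real use, geopy ships its own RateLimiter helper in geopy.extra.rate_limiter, which is worth checking first.

```python
import time


def cached_geocoder(geocode_fn, min_delay_seconds=1.0):
    """Wrap geocode_fn so repeated queries hit a local cache, not the service."""
    cache = {}
    last_call = [None]  # time of the previous real lookup, if any

    def geocode(query):
        if query in cache:
            return cache[query]
        # Wait so that real lookups are at least min_delay_seconds apart.
        if last_call[0] is not None:
            elapsed = time.monotonic() - last_call[0]
            if elapsed < min_delay_seconds:
                time.sleep(min_delay_seconds - elapsed)
        result = geocode_fn(query)
        last_call[0] = time.monotonic()
        cache[query] = result  # "no match" (None) results are cached too
        return result

    return geocode


lookups = []  # every query that actually reached the fake service


def slow_fake_geocode(query):
    """Hypothetical stand-in for geolocator.geocode; no network involved."""
    lookups.append(query)
    return (38.8937154, -76.9877934586326) if query == "DC" else None


geocode = cached_geocoder(slow_fake_geocode, min_delay_seconds=0.01)
results = [geocode(q) for q in ["DC", "DC", "GA", "DC", "GA"]]
```

Five queries go in, but only the two distinct ones reach the underlying function. The wrapped geocode could then be passed to apply in place of geolocator.geocode.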
Again, this is all from personally dealing with it. Just keep it in mind for future use if it doesn't benefit you now.

