使用字符串搜索 Pandas 系列会产生 KeyError
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/26005424/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Searching a Pandas series using a string produces a KeyError
提问by Jason
I'm trying to use df[df['col'].str.contains("string")](described in these two SO questions: 1& 2) to select rows based on a partial string match. Here's my code:
我正在尝试使用df[df['col'].str.contains("string")](在这两个 SO 问题中描述:1& 2)基于部分字符串匹配来选择行。这是我的代码:
import requests
import json
import pandas as pd
import datetime
url = "http://api.turfgame.com/v4/zones/all" # get request returns .json
r = requests.get(url)
df = pd.read_json(r.content) # create a df containing all zone info
print df[df['region'].str.contains("Uppsala")].head()
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-23-55bbf5679808> in <module>()
----> 1 print df[df['region'].str.contains("Uppsala")].head()
C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
1670 if isinstance(key, (Series, np.ndarray, list)):
1671 # either boolean or fancy integer index
-> 1672 return self._getitem_array(key)
1673 elif isinstance(key, DataFrame):
1674 return self._getitem_frame(key)
C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\frame.pyc in _getitem_array(self, key)
1714 return self.take(indexer, axis=0, convert=False)
1715 else:
-> 1716 indexer = self.ix._convert_to_indexer(key, axis=1)
1717 return self.take(indexer, axis=1, convert=True)
1718
C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
1083 if isinstance(obj, tuple) and is_setter:
1084 return {'key': obj}
-> 1085 raise KeyError('%s not in index' % objarr[mask])
1086
1087 return indexer
KeyError: '[ nan nan nan ..., nan nan nan] not in index'
I don't understand the which I get a KeyErrorbecause df.columnsreturns:
我不明白我得到的是KeyError因为df.columns返回:
Index([u'dateCreated', u'id', u'latitude', u'longitude', u'name', u'pointsPerHour', u'region', u'takeoverPoints', u'totalTakeovers'], dtype='object')
So the Keyis in the list of columns and opening the page in an internet browser I can find 739 instances of 'Uppsala'.
因此,Key在列列表中并在 Internet 浏览器中打开页面时,我可以找到 739 个“乌普萨拉”实例。
The column in which I'm search was a nested .jsontable that looks like this {"id":200,"name":"Scotland","country":"gb"}. Do I have do something special to search between '{}' characters? Could somebody explain where I've made my mistake(s)?
我在其中搜索的列是一个嵌套.json表,如下所示{"id":200,"name":"Scotland","country":"gb"}。我是否做了一些特殊的事情来在 '{}' 字符之间进行搜索?有人可以解释我在哪里犯了错误吗?
回答by DSM
Looks to me like your regioncolumn contains dictionaries, which aren't really supported as elements, and so .strisn't working. One way to solve the problem is to promote the regiondictionaries to columns in their own right, maybe with something like:
在我看来,您的region专栏包含字典,但实际上并不支持将其作为元素,因此.str无法正常工作。解决该问题的一种方法是将region字典本身提升为列,可能是这样的:
>>> region = pd.DataFrame(df.pop("region").tolist())
>>> df = df.join(region, rsuffix="_region")
after which you have
之后你有
>>> df.head()
dateCreated id latitude longitude name pointsPerHour takeoverPoints totalTakeovers country id_region name_region
0 2013-06-15T08:00:00+0000 14639 55.947079 -3.206477 GrandSquare 1 185 32 gb 200 Scotland
1 2014-06-15T20:02:37+0000 31571 55.649181 12.609056 Stenringen 1 185 6 dk 172 Hovedstaden
2 2013-06-15T08:00:00+0000 18958 54.593570 -5.955772 Hospitality 0 250 1 gb 206 Northern Ireland
3 2013-06-15T08:00:00+0000 18661 53.754283 -1.526638 LanshawZone 0 250 0 gb 202 Yorkshire & The Humber
4 2013-06-15T08:00:00+0000 17424 55.949285 -3.144777 NoDogsZone 0 250 5 gb 200 Scotland
and
和
>>> df[df["name_region"].str.contains("Uppsala")].head()
dateCreated id latitude longitude name pointsPerHour takeoverPoints totalTakeovers country id_region name_region
28 2013-07-16T18:53:48+0000 20828 59.793476 17.775389 MoraStenRast 5 125 536 se 142 Uppsala
59 2013-02-08T21:42:53+0000 14797 59.570418 17.482116 B?lWoods 3 155 555 se 142 Uppsala
102 2014-06-19T12:00:00+0000 31843 59.617637 17.077094 EnaAlle 5 125 168 se 142 Uppsala
328 2012-09-24T20:08:22+0000 11461 59.634438 17.066398 BluePark 6 110 1968 se 142 Uppsala
330 2014-08-28T20:00:00+0000 33695 59.867027 17.710792 EnbackensBro 4 140 59 se 142 Uppsala
(A hack workaround would be df["region"].apply(str).str.contains("Uppsala"), but I think it's best to clean the data right at the start.)
(黑客解决方法是df["region"].apply(str).str.contains("Uppsala"),但我认为最好在开始时清理数据。)

