Note: this page reproduces a popular StackOverflow question and its answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/26005424/

Searching a Pandas series using a string produces a KeyError

Tags: python, python-2.7, pandas, match

Asked by Jason

I'm trying to use df[df['col'].str.contains("string")] (described in these two SO questions: 1 & 2) to select rows based on a partial string match. Here's my code:

import requests
import json
import pandas as pd
import datetime

url = "http://api.turfgame.com/v4/zones/all" # get request returns .json 
r = requests.get(url)
df = pd.read_json(r.content) # create a df containing all zone info

print df[df['region'].str.contains("Uppsala")].head()

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-23-55bbf5679808> in <module>()
----> 1 print df[df['region'].str.contains("Uppsala")].head()

C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   1670         if isinstance(key, (Series, np.ndarray, list)):
   1671             # either boolean or fancy integer index
-> 1672             return self._getitem_array(key)
   1673         elif isinstance(key, DataFrame):
   1674             return self._getitem_frame(key)

C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\frame.pyc in _getitem_array(self, key)
   1714             return self.take(indexer, axis=0, convert=False)
   1715         else:
-> 1716             indexer = self.ix._convert_to_indexer(key, axis=1)
   1717             return self.take(indexer, axis=1, convert=True)
   1718 

C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
   1083                     if isinstance(obj, tuple) and is_setter:
   1084                         return {'key': obj}
-> 1085                     raise KeyError('%s not in index' % objarr[mask])
   1086 
   1087                 return indexer

KeyError: '[ nan  nan  nan ...,  nan  nan  nan] not in index'

I don't understand why I get a KeyError, because df.columns returns:

Index([u'dateCreated', u'id', u'latitude', u'longitude', u'name', u'pointsPerHour', u'region', u'takeoverPoints', u'totalTakeovers'], dtype='object')

So the key is in the list of columns, and when I open the page in a web browser I can find 739 instances of 'Uppsala'.

The column I'm searching was a nested .json table that looks like this: {"id":200,"name":"Scotland","country":"gb"}. Do I have to do something special to search between the '{}' characters? Could somebody explain where I've made my mistake(s)?

Answered by DSM

Looks to me like your region column contains dictionaries, which aren't really supported as elements, and so .str isn't working. One way to solve the problem is to promote the region dictionaries to columns in their own right, maybe with something like:

>>> region = pd.DataFrame(df.pop("region").tolist())
>>> df = df.join(region, rsuffix="_region")

after which you have

>>> df.head()
                dateCreated     id   latitude  longitude         name  pointsPerHour  takeoverPoints  totalTakeovers country  id_region             name_region
0  2013-06-15T08:00:00+0000  14639  55.947079  -3.206477  GrandSquare              1             185              32      gb        200                Scotland
1  2014-06-15T20:02:37+0000  31571  55.649181  12.609056   Stenringen              1             185               6      dk        172             Hovedstaden
2  2013-06-15T08:00:00+0000  18958  54.593570  -5.955772  Hospitality              0             250               1      gb        206        Northern Ireland
3  2013-06-15T08:00:00+0000  18661  53.754283  -1.526638  LanshawZone              0             250               0      gb        202  Yorkshire & The Humber
4  2013-06-15T08:00:00+0000  17424  55.949285  -3.144777   NoDogsZone              0             250               5      gb        200                Scotland

and

>>> df[df["name_region"].str.contains("Uppsala")].head()
                  dateCreated     id   latitude  longitude          name  pointsPerHour  takeoverPoints  totalTakeovers country  id_region name_region
28   2013-07-16T18:53:48+0000  20828  59.793476  17.775389  MoraStenRast              5             125             536      se        142     Uppsala
59   2013-02-08T21:42:53+0000  14797  59.570418  17.482116      B?lWoods              3             155             555      se        142     Uppsala
102  2014-06-19T12:00:00+0000  31843  59.617637  17.077094       EnaAlle              5             125             168      se        142     Uppsala
328  2012-09-24T20:08:22+0000  11461  59.634438  17.066398      BluePark              6             110            1968      se        142     Uppsala
330  2014-08-28T20:00:00+0000  33695  59.867027  17.710792  EnbackensBro              4             140              59      se        142     Uppsala
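
For anyone who wants to try this without calling the API, here is a minimal sketch of the same promote-then-filter idea on a tiny hand-made frame. The three rows are a hypothetical stand-in for the real response (the values are borrowed from the output above), and the variable names flat and uppsala are only for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the API response: "region" holds nested dicts.
df = pd.DataFrame({
    "id": [14639, 20828, 31571],
    "name": ["GrandSquare", "MoraStenRast", "Stenringen"],
    "region": [
        {"id": 200, "name": "Scotland", "country": "gb"},
        {"id": 142, "name": "Uppsala", "country": "se"},
        {"id": 172, "name": "Hovedstaden", "country": "dk"},
    ],
})

# Promote the nested dictionaries to columns of their own.
region = pd.DataFrame(df.pop("region").tolist())

# "id" and "name" exist in both frames, so the ones coming from
# `region` receive the "_region" suffix.
flat = df.join(region, rsuffix="_region")

# name_region now holds plain strings, so .str.contains behaves as expected.
uppsala = flat[flat["name_region"].str.contains("Uppsala")]
print(uppsala)
```

Only the MoraStenRast row should survive the filter, since it is the only one whose region name contains "Uppsala".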

(A hack workaround would be df["region"].apply(str).str.contains("Uppsala"), but I think it's best to clean the data right at the start.)
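
A rough sketch of that workaround on a small hand-made frame (hypothetical data, not the real API response) also shows one reason to prefer cleaning the data first: the match runs against the text of the whole dictionary, keys included, so a search term such as "name" would match every row.

```python
import pandas as pd

# Hypothetical sample rows; "region" holds nested dicts as in the question.
df = pd.DataFrame({
    "name": ["GrandSquare", "MoraStenRast"],
    "region": [
        {"id": 200, "name": "Scotland", "country": "gb"},
        {"id": 142, "name": "Uppsala", "country": "se"},
    ],
})

# Convert each dict to its string form, then do the substring match.
mask = df["region"].apply(str).str.contains("Uppsala")
print(df[mask])

# Caveat: keys are part of the string too, so this matches every row.
matches_key = df["region"].apply(str).str.contains("name")
print(matches_key.all())
```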
