Searching a Pandas Series with a string produces a KeyError

Question

Asked by Jason

I'm trying to use df[df['col'].str.contains("string")] (described in these two SO questions: 1 & 2) to select rows based on a partial string match. Here's my code:

import requests
import json
import pandas as pd
import datetime

url = "http://api.turfgame.com/v4/zones/all" # get request returns .json 
r = requests.get(url)
df = pd.read_json(r.content) # create a df containing all zone info

print df[df['region'].str.contains("Uppsala")].head()

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-23-55bbf5679808> in <module>()
----> 1 print df[df['region'].str.contains("Uppsala")].head()

C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   1670         if isinstance(key, (Series, np.ndarray, list)):
   1671             # either boolean or fancy integer index
-> 1672             return self._getitem_array(key)
   1673         elif isinstance(key, DataFrame):
   1674             return self._getitem_frame(key)

C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\frame.pyc in _getitem_array(self, key)
   1714             return self.take(indexer, axis=0, convert=False)
   1715         else:
-> 1716             indexer = self.ix._convert_to_indexer(key, axis=1)
   1717             return self.take(indexer, axis=1, convert=True)
   1718 

C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
   1083                     if isinstance(obj, tuple) and is_setter:
   1084                         return {'key': obj}
-> 1085                     raise KeyError('%s not in index' % objarr[mask])
   1086 
   1087                 return indexer

KeyError: '[ nan  nan  nan ...,  nan  nan  nan] not in index'

I don't understand why I get a KeyError, because df.columns returns:

Index([u'dateCreated', u'id', u'latitude', u'longitude', u'name', u'pointsPerHour', u'region', u'takeoverPoints', u'totalTakeovers'], dtype='object')

So the key is in the list of columns, and when I open the page in a web browser I can find 739 instances of 'Uppsala'.

The column I'm searching was a nested JSON object that looks like this: {"id":200,"name":"Scotland","country":"gb"}. Do I have to do something special to search between the '{}' characters? Could somebody explain where I've made my mistake(s)?

Answer 1

Answered by DSM

It looks to me like your region column contains dictionaries, which aren't really supported as elements, and so .str isn't working. Each dictionary yields NaN rather than True or False, so the result is not a boolean mask and pandas tries to look the NaN values up as column labels, which is the KeyError you see. One way to solve the problem is to promote the region dictionaries to columns in their own right, maybe with something like:

>>> region = pd.DataFrame(df.pop("region").tolist())
>>> df = df.join(region, rsuffix="_region")

after which you have

>>> df.head()
                dateCreated     id   latitude  longitude         name  pointsPerHour  takeoverPoints  totalTakeovers country  id_region             name_region
0  2013-06-15T08:00:00+0000  14639  55.947079  -3.206477  GrandSquare              1             185              32      gb        200                Scotland
1  2014-06-15T20:02:37+0000  31571  55.649181  12.609056   Stenringen              1             185               6      dk        172             Hovedstaden
2  2013-06-15T08:00:00+0000  18958  54.593570  -5.955772  Hospitality              0             250               1      gb        206        Northern Ireland
3  2013-06-15T08:00:00+0000  18661  53.754283  -1.526638  LanshawZone              0             250               0      gb        202  Yorkshire & The Humber
4  2013-06-15T08:00:00+0000  17424  55.949285  -3.144777   NoDogsZone              0             250               5      gb        200                Scotland

and

>>> df[df["name_region"].str.contains("Uppsala")].head()
                  dateCreated     id   latitude  longitude          name  pointsPerHour  takeoverPoints  totalTakeovers country  id_region name_region
28   2013-07-16T18:53:48+0000  20828  59.793476  17.775389  MoraStenRast              5             125             536      se        142     Uppsala
59   2013-02-08T21:42:53+0000  14797  59.570418  17.482116      B?lWoods              3             155             555      se        142     Uppsala
102  2014-06-19T12:00:00+0000  31843  59.617637  17.077094       EnaAlle              5             125             168      se        142     Uppsala
328  2012-09-24T20:08:22+0000  11461  59.634438  17.066398      BluePark              6             110            1968      se        142     Uppsala
330  2014-08-28T20:00:00+0000  33695  59.867027  17.710792  EnbackensBro              4             140              59      se        142     Uppsala

(A hacky workaround would be df["region"].apply(str).str.contains("Uppsala"), but I think it's best to clean the data right at the start.)
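
For anyone who wants to try this without calling the API, here is a minimal, self-contained sketch of the same idea. The three-row DataFrame is made-up sample data shaped like the zones response (an assumption, not the real feed), and the variable names mask, flat and uppsala are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the zones data: "region" holds dicts, not strings.
df = pd.DataFrame({
    "id": [14639, 20828, 31571],
    "name": ["GrandSquare", "MoraStenRast", "Stenringen"],
    "region": [
        {"id": 200, "name": "Scotland", "country": "gb"},
        {"id": 142, "name": "Uppsala", "country": "se"},
        {"id": 172, "name": "Hovedstaden", "country": "dk"},
    ],
})

# .str methods do not handle dict elements, so this is not a usable
# boolean mask (the traceback in the question shows it full of NaN).
mask = df["region"].str.contains("Uppsala")

# Promote the dicts to columns; overlapping names get the "_region" suffix.
region = pd.DataFrame(df.pop("region").tolist())
flat = df.join(region, rsuffix="_region")

# The region name is now an ordinary string column and can be searched.
uppsala = flat[flat["name_region"].str.contains("Uppsala")]
print(uppsala)
```

With the real data, df would come from pd.read_json as in the question; the pop/join step is the same.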
