pandas: a column containing lists of strings (Python)

Question

Asked by anonuser0428

I have a pandas dataframe like the following:

                                          categories  review_count
0                  [Burgers, Fast Food, Restaurants]           137
1                         [Steakhouses, Restaurants]           176
2  [Food, Coffee & Tea, American (New), Restaurants]           390
...                                          ....              ...
...                                          ....              ...
...                                          ....              ...

From this DataFrame, I would like to extract only those rows where the list in the 'categories' column contains the category 'Restaurants'. So far I have tried: df[[df.categories.isin('Restaurants'),review_count]]

Since the DataFrame also has other columns, I specified the two columns that I want to extract. But I get the error:

TypeError: unhashable type: 'list'

I don't have much idea what this error means as I am very new to pandas. Please let me know how I can achieve my goal of extracting only those rows from the dataFrame wherein the 'categories' column for that row has the string 'Restaurants' as part of the categories_list. Any help would be much appreciated.

Thanks in advance!

Answer 1

Accepted answer by Marius

I think you may have to use a lambda function for this. You can test whether a value in your column isin some sequence, but pandas doesn't seem to provide a function for testing whether the sequence in your column contains some value:

import pandas as pd
categories = [['fast_food', 'restaurant'], ['coffee', 'cafe'], ['burger', 'restaurant']]
counts = [137, 176, 390]
df = pd.DataFrame({'categories': categories, 'review_count': counts})
# Show which rows contain 'restaurant'
df.categories.map(lambda x: 'restaurant' in x)
# Subset the dataframe using this:
df[df.categories.map(lambda x: 'restaurant' in x)]

Output:

Out[11]: 
                categories  review_count
0  [fast_food, restaurant]           137
2     [burger, restaurant]           390
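
To also restrict the result to the two columns mentioned in the question, the same boolean mask can be passed to .loc together with a list of column names. The sketch below reuses the toy data from above; the extra city column is hypothetical and only there to show that other columns get dropped. As far as I can tell, the TypeError in the question comes from isin needing to hash the values in the column, and Python lists are not hashable.

```python
import pandas as pd

df = pd.DataFrame({
    "categories": [["fast_food", "restaurant"], ["coffee", "cafe"], ["burger", "restaurant"]],
    "review_count": [137, 176, 390],
    "city": ["A", "B", "C"],  # hypothetical extra column, not in the original data
})

# Boolean mask: True where the row's list contains the target category
mask = df["categories"].apply(lambda cats: "restaurant" in cats)

# Keep the matching rows and only the two columns of interest
result = df.loc[mask, ["categories", "review_count"]]
print(result)
```

The mask is computed row by row in Python, so this is convenient rather than fast on large frames.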

Answer 2

Answer by Jeremy Low

OK, so I've been trying to figure out an answer to this for quite a while now, but have come up empty (short of writing a small recursive program to expand the list). I think that's because, at first blush anyway, what you're trying to do isn't really that efficient (Jimmy C's comment about the lists being mutable is on point here) and isn't the way you would usually do this in pandas.

A better and (I think) faster way would be to store each category from the nested lists as its own boolean column, so that you'd have:

df
    review_count    Burgers   Fast Food   Restaurants    Steakhouses  Food    CoffeeTea  American (New)
0            137    True      True        True           False        False   False      False
1            176    False     False       True           True         False   False      False
2            390    False     False       True           False        True    True       True

Obviously, this would involve writing a Python program to pull your categories out of their nested lists and export them to a DataFrame, but this one-time hit (for the existing data) may be worthwhile for what you gain when using pandas to analyze the resulting dataframe.

There's a section in Wes's book Python for Data Analysis called "Computing Indicator/Dummy Variables" (around p. 330 or so) which would be a good resource for this sort of operation.
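
For what it's worth, here is a rough sketch of how that one-time conversion might look. It is not the method from the book; it assumes a pandas version that has Series.explode (0.25 or later) and uses the sample rows from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "categories": [
        ["Burgers", "Fast Food", "Restaurants"],
        ["Steakhouses", "Restaurants"],
        ["Food", "Coffee & Tea", "American (New)", "Restaurants"],
    ],
    "review_count": [137, 176, 390],
})

# One row per (original row, category); the index repeats the original row label
exploded = df["categories"].explode()

# One indicator column per category, collapsed back to one row per original row
indicators = pd.get_dummies(exploded).groupby(level=0).max().astype(bool)

# Attach the indicator columns to the remaining data
wide = df[["review_count"]].join(indicators)

# Filtering is now a plain boolean-column lookup
restaurants = wide[wide["Restaurants"]]
print(restaurants)
```

Once the indicator columns exist, selecting a category no longer needs a Python-level lambda per row.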

Sorry, that doesn't really answer your question, and I certainly don't know how feasible it is, but otherwise, you can try rtrwalker's solution, which looks pretty good, but it's the development branch, just FYI.

Answer 3

Answer by rtrwalker

I think in pandas 0.12 you can do things like:

df.query('"Restaurants" in categories')

docs at pandas.DataFrame.query
