Original source: http://stackoverflow.com/questions/26806054/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to use lists as values in a pandas dataframe?
Asked by JD Long
I have a dataframe in which a subset of the columns needs to hold entries with multiple values. Below is a dataframe with a "runtimes" column that holds the runtimes of a program under various conditions:
import pandas
data = [{"condition": "a", "runtimes": [1, 1.5, 2]}, {"condition": "b", "runtimes": [0.5, 0.75, 1]}]
df = pandas.DataFrame(data)
This produces a dataframe:
  condition        runtimes
0         a     [1, 1.5, 2]
1         b  [0.5, 0.75, 1]
How can I work with this dataframe and get pandas to treat its values as numeric lists? For example, how can I calculate the mean of the "runtimes" column across the rows?
df["runtimes"].mean()
This gives the error: "Could not convert [1, 1.5, 2, 0.5, 0.75, 1] to numeric"
It would also be useful to work with these dataframes and to serialize them as CSV files, where a list like [1, 1.5, 2] gets converted into "1,1.5,2" so that it is still a single entry in the CSV file.
Answered by JD Long
It feels like you're trying to make pandas be something it is not. If you always have 3 runtimes, you could make 3 columns. However, the more pandas-esque approach is to normalize your data (no matter how many different trials you have) into something like this:
import pandas as pd
data = [{"condition": "a", "trial": 1, "runtime": 1},
        {"condition": "a", "trial": 2, "runtime": 1.5},
        {"condition": "a", "trial": 3, "runtime": 2},
        {"condition": "b", "trial": 1, "runtime": .5},
        {"condition": "b", "trial": 2, "runtime": .75},
        {"condition": "b", "trial": 3, "runtime": 1}]
df = pd.DataFrame(data)
Then you can compute the mean per condition:
print(df.groupby('condition').mean())
           runtime  trial
condition
a             1.50      2
b             0.75      2
The concept here is to keep the data tabular, with only one value per cell. If you want to apply functions to nested lists, then you should be using lists, not pandas dataframes.
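If the data already lives in a list-valued column, the normalized layout above can probably be reached without retyping it. The following is a minimal sketch, assuming a pandas version that provides DataFrame.explode (0.25 or later); the names long_df and means are only illustrative and not from the original answer.

```python
import pandas as pd

df = pd.DataFrame([
    {"condition": "a", "runtimes": [1, 1.5, 2]},
    {"condition": "b", "runtimes": [0.5, 0.75, 1]},
])

# explode() gives each list element its own row and repeats "condition".
long_df = df.explode("runtimes").reset_index(drop=True)
long_df = long_df.rename(columns={"runtimes": "runtime"})

# The exploded column has object dtype, so cast it before aggregating.
long_df["runtime"] = long_df["runtime"].astype(float)

# Number the trials within each condition: 1, 2, 3, ...
long_df["trial"] = long_df.groupby("condition").cumcount() + 1

# One mean runtime per condition.
means = long_df.groupby("condition")["runtime"].mean()
print(long_df)
print(means)
```

This keeps one value per cell, so the usual groupby aggregations apply directly, and it does not depend on every condition having the same number of trials.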
Answered by Mike
It looks like pandas is trying to add up all the lists in the series and divide by the number of rows. Adding lists concatenates them, and the concatenated result fails the numeric type check. This explains the list in your error message.
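The concatenation effect can be reproduced in plain Python, without pandas. This is only an illustration of how "+" behaves on lists, not a trace of what pandas does internally:

```python
# Two rows of runtimes, as stored in the "runtimes" column.
runtimes = [[1, 1.5, 2], [0.5, 0.75, 1]]

# "Summing" lists with + concatenates them instead of adding numbers.
total = []
for row in runtimes:
    total = total + row

print(total)  # [1, 1.5, 2, 0.5, 0.75, 1]

# Dividing that "sum" by the number of rows is not defined for a list.
try:
    total / len(runtimes)
    division_failed = False
except TypeError as exc:
    division_failed = True
    print("TypeError:", exc)
```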
You could compute the mean like this:
import numpy
df['runtimes'].apply(numpy.mean)
Aside from that, pandas doesn't like working with lists as values. If your data is tabular, consider breaking the list out into three separate columns.
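Breaking the list out into separate columns might look like the sketch below. It assumes every list has the same length; the column names runtime_1, runtime_2, runtime_3 and mean_runtime are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([
    {"condition": "a", "runtimes": [1, 1.5, 2]},
    {"condition": "b", "runtimes": [0.5, 0.75, 1]},
])

# Build one column per list position (assumes equal-length lists).
wide = pd.DataFrame(df["runtimes"].tolist(), index=df.index)
wide.columns = [f"runtime_{i + 1}" for i in wide.columns]  # hypothetical names

# Attach the new numeric columns next to "condition".
result = df[["condition"]].join(wide)

# With plain numeric columns, a row-wise mean works directly.
result["mean_runtime"] = wide.mean(axis=1)
print(result)
```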
Serializing the column would work in a similar way:
df['runtimes'].apply(lambda x: '"' + str(x)[1:-1] + '"')
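For writing an actual CSV file, a variation on the same idea is sketched below. It joins the values with commas and leaves the quoting to to_csv, which by default quotes any field that contains the delimiter, so the surrounding quotes are not added by hand here. The names out, csv_text and restored are illustrative only, and an in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

df = pd.DataFrame([
    {"condition": "a", "runtimes": [1, 1.5, 2]},
    {"condition": "b", "runtimes": [0.5, 0.75, 1]},
])

# Turn each list into a single comma-separated string.
out = df.copy()
out["runtimes"] = out["runtimes"].apply(lambda xs: ",".join(str(x) for x in xs))

# Write to an in-memory buffer; to_csv quotes fields that contain commas.
buffer = io.StringIO()
out.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading back: split each string and convert the pieces to floats.
restored = pd.read_csv(io.StringIO(csv_text))
restored["runtimes"] = restored["runtimes"].apply(
    lambda s: [float(x) for x in s.split(",")]
)
print(restored)
```

Each list stays a single CSV field, and the round trip returns lists of floats rather than the original mix of ints and floats.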

