Original question: http://stackoverflow.com/questions/24027723/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Flatten a Series in pandas, i.e. a series whose elements are lists
Asked by meto
I have a series of the form:
from pandas import Series
s = Series([['a','a','b'],['b','b','c','d'],[],['a','b','e']])
which looks like
0 [a, a, b]
1 [b, b, c, d]
2 []
3 [a, b, e]
dtype: object
I would like to count how many elements I have in total. My naive attempts, such as
s.values.hist()
or
s.values.flatten()
didn't work. What am I doing wrong?
Accepted answer by Marius
s.map(len).sum()
does the trick. s.map(len) applies len() to each element and returns a Series of all the lengths; you can then call sum() on that Series.
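As a minimal, self-contained sketch of the accepted approach (the variable names still_nested, lengths and total are illustrative, not from the answer), the following also shows why s.values.flatten() did not help: s.values is already a one-dimensional object array whose four elements are lists, so there is nothing for flatten() to flatten.

```python
import pandas as pd

# The Series from the question: each element is a Python list.
s = pd.Series([['a', 'a', 'b'], ['b', 'b', 'c', 'd'], [], ['a', 'b', 'e']])

# s.values is a 1-D object array holding four list objects, so flatten()
# returns another 1-D array with the same four lists inside.
still_nested = s.values.flatten()

# Length of each list, then the grand total.
lengths = s.map(len)
total = int(lengths.sum())

print(len(still_nested))   # 4
print(lengths.tolist())    # [3, 4, 0, 3]
print(total)               # 10
```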
Answer by heilala
If we stick with the pandas Series as in the original question, one neat option from pandas version 0.25.0 onwards is the Series.explode() method. It turns each element of a list into its own row and repeats the index value for those rows.
The original Series from the question:
import pandas as pd
s = pd.Series([['a','a','b'],['b','b','c','d'],[],['a','b','e']])
Exploding it gives a Series in which the index is repeated: each index value identifies the original list that the element came from. The empty list becomes a single NaN row.
>>> s.explode()
Out:
0 a
0 a
0 b
1 b
1 b
1 c
1 d
2 NaN
3 a
3 b
3 e
dtype: object
>>> type(s.explode())
Out:
pandas.core.series.Series
To count how often each element occurs, we can now use Series.value_counts():
>>> s.explode().value_counts()
Out:
b 4
a 3
d 1
c 1
e 1
dtype: int64
To include NaN values as well:
>>> s.explode().value_counts(dropna=False)
Out:
b 4
a 3
d 1
c 1
e 1
NaN 1
dtype: int64
Finally, plot the counts as a bar chart using Series.plot():
>>> s.explode().value_counts(dropna=False).plot(kind = 'bar')
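To tie the explode() route back to the original question (the total number of elements), here is a short sketch without the plotting step. It assumes pandas 0.25.0 or later; the names exploded, total and per_value are illustrative.

```python
import pandas as pd

s = pd.Series([['a', 'a', 'b'], ['b', 'b', 'c', 'd'], [], ['a', 'b', 'e']])

# One row per list element; the empty list turns into a single NaN row.
exploded = s.explode()

# count() skips NaN, so the empty list contributes nothing to the total.
total = int(exploded.count())

# Frequency of each distinct value; NaN is dropped by default.
per_value = exploded.value_counts()

print(len(exploded))     # 11 rows, including the NaN placeholder
print(total)             # 10
print(per_value['b'])    # 4
```

Using count() rather than len() matters here, because len() would also count the NaN row that stands in for the empty list.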
Answer by FooBar
Personally, I love having arrays in dataframes, with a separate column for every item. It gives you much more functionality. So, here's my alternative approach:
>>> raw = [['a', 'a', 'b'], ['b', 'b', 'c', 'd'], [], ['a', 'b', 'e']]
>>> df = pd.DataFrame(raw)
>>> df
Out[217]:
      0     1     2     3
0     a     a     b  None
1     b     b     c     d
2  None  None  None  None
3     a     b     e  None
Now, see how many values we have in each row:
>>> df.count(axis=1)
Out[226]:
0 3
1 4
2 0
3 3
Applying sum() to this result gives you the total you wanted.
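Put together as a runnable sketch (the names per_row and total are illustrative), the row-count route looks like this. pd.DataFrame pads the shorter lists with missing values, and count() ignores them.

```python
import pandas as pd

raw = [['a', 'a', 'b'], ['b', 'b', 'c', 'd'], [], ['a', 'b', 'e']]

# Ragged lists are padded with missing values up to the longest row.
df = pd.DataFrame(raw)

# Non-missing values per row, then the overall total.
per_row = df.count(axis=1)
total = int(per_row.sum())

print(per_row.tolist())  # [3, 4, 0, 3]
print(total)             # 10
```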
Second, what you mentioned in a comment: getting the distribution. There may be a cleaner approach, but I still prefer the following over the hint you were given in the comment:
>>> foo = [col.value_counts() for _, col in df.items()]  # iteritems() was removed in pandas 2.0
>>> foo
Out[246]:
[a 2
b 1
dtype: int64, b 2
a 1
dtype: int64, b 1
c 1
e 1
dtype: int64, d 1
dtype: int64]
foo now contains the distribution for every column. The interpretation of columns is still "xth value", such that column 0 contains the distribution of all the "first values" in your arrays.
Next step, "sum them up".
>>> df2 = pd.DataFrame(foo)
>>> df2
Out[266]:
     a    b    c    d    e
0    2    1  NaN  NaN  NaN
1    1    2  NaN  NaN  NaN
2  NaN    1    1  NaN    1
3  NaN  NaN  NaN    1  NaN
>>> df2.sum(axis=0)
Out[264]:
a 3
b 4
c 1
d 1
e 1
dtype: float64
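The whole distribution pipeline from this answer can be sketched as one runnable script. It uses df.items(), which works in current pandas releases where iteritems() is no longer available; the names per_column, counts and totals are illustrative.

```python
import pandas as pd

raw = [['a', 'a', 'b'], ['b', 'b', 'c', 'd'], [], ['a', 'b', 'e']]
df = pd.DataFrame(raw)

# One value_counts() Series per positional column
# ("first values", "second values", and so on).
per_column = [col.value_counts() for _, col in df.items()]

# Each Series becomes a row; a value absent from a column shows up as NaN.
counts = pd.DataFrame(per_column)

# Column-wise sum; NaN entries are skipped by default.
totals = counts.sum(axis=0)

print(totals['a'], totals['b'])  # 3.0 4.0
print(int(totals.sum()))         # 10
```

The sums come back as floats because the intermediate frame contains NaN, which forces a floating-point dtype.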
Note that for these very simple problems the difference between a Series of lists and a dataframe with one column per item is not big, but once you want to do real data work, the latter gives you way more functionality. Moreover, it can potentially be more efficient, since you can use pandas' internal methods.


