Flattening a Series in pandas, i.e. a Series whose elements are lists

Question

Asked by meto

I have a series of the form:

import pandas as pd
s = pd.Series([['a','a','b'],['b','b','c','d'],[],['a','b','e']])

which looks like

0       [a, a, b]
1    [b, b, c, d]
2              []
3       [a, b, e]
dtype: object

I would like to count how many elements I have in total. My naive attempts, such as

s.values.hist()

or

s.values.flatten()

didn't work. What am I doing wrong?

Answer 1

Accepted answer by Marius

s.map(len).sum()

does the trick. s.map(len) applies len() to each element and returns a series of all the lengths; you can then just call sum() on that series.
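
As a minimal runnable sketch of the one-liner above, using the Series from the question (the names lengths, total and alt_total are only illustrative, and s.str.len() is an alternative spelling that is not part of the original answer):

```python
import pandas as pd

s = pd.Series([['a', 'a', 'b'], ['b', 'b', 'c', 'd'], [], ['a', 'b', 'e']])

# Length of each list, one value per row of the Series
lengths = s.map(len)

# Total number of elements across all lists
total = lengths.sum()
print(total)

# Alternative (assumption: not from the answer): the .str accessor
# also computes the length of list elements
alt_total = s.str.len().sum()
```

The empty list simply contributes a length of 0, so it needs no special handling.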

Answer 2

Answered by heilala

If we stick with the pandas Series as in the original question, one neat option from pandas version 0.25.0 onwards is the Series.explode() routine. It turns each element of a list into its own row, repeating the index value for those rows.

The original Series from the question:

s = pd.Series([['a','a','b'],['b','b','c','d'],[],['a','b','e']])

Let's explode it and we get a Series in which the index is repeated. Each index value indicates which original list the element came from, and the empty list becomes a single NaN row.

>>> s.explode()
Out:
0      a
0      a
0      b
1      b
1      b
1      c
1      d
2    NaN
3      a
3      b
3      e
dtype: object

>>> type(s.explode())
Out:
pandas.core.series.Series

To count how often each element occurs, we can now use Series.value_counts():

>>> s.explode().value_counts()
Out:
b    4
a    3
d    1
c    1
e    1
dtype: int64

To also include NaN values:

>>> s.explode().value_counts(dropna=False)
Out:
b      4
a      3
d      1
c      1
e      1
NaN    1
dtype: int64

Finally, plot the counts as a bar chart using Series.plot():

>>> s.explode().value_counts(dropna=False).plot(kind='bar')
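
To tie this back to the original question (the total number of elements), here is a small sketch built on the same explode() call. It assumes pandas 0.25.0 or later; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([['a', 'a', 'b'], ['b', 'b', 'c', 'd'], [], ['a', 'b', 'e']])

# One row per list element; the empty list turns into a single NaN row
exploded = s.explode()

# count() ignores NaN, so the placeholder for the empty list is not counted
total = exploded.count()

# Occurrences of each distinct element (NaN excluded by default)
counts = exploded.value_counts()

print(total)
print(counts)
```

Using count() rather than len() matters here, because len(exploded) would also include the NaN row that stands in for the empty list.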

Answer 3

Answered by FooBar

Personally, I love having arrays in dataframes, with a single column for every item. It gives you much more functionality. So, here's my alternative approach:

>>> raw = [['a', 'a', 'b'], ['b', 'b', 'c', 'd'], [], ['a', 'b', 'e']]
>>> df = pd.DataFrame(raw)
>>> df
Out[217]: 
      0     1     2     3
0     a     a     b  None
1     b     b     c     d
2  None  None  None  None
3     a     b     e  None

Now, see how many values we have in each row:

>>> df.count(axis=1)
Out[226]: 
0    3
1    4
2    0
3    3

Applying sum() here would give you what you wanted.
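
Spelled out as a short sketch (the names per_row and total are illustrative, not from the answer):

```python
import pandas as pd

raw = [['a', 'a', 'b'], ['b', 'b', 'c', 'd'], [], ['a', 'b', 'e']]
df = pd.DataFrame(raw)

# Non-missing values per row; padded cells (None) are not counted
per_row = df.count(axis=1)

# Total number of elements across all original lists
total = per_row.sum()
print(total)
```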

Second, what you mentioned in a comment: getting the distribution. There may be a cleaner approach here, but I still prefer the following over the hint that was given to you in the comment:

>>> foo = [col.value_counts() for _, col in df.items()]  # df.iteritems() was removed in pandas 2.0
>>> foo
Out[246]: 
[a    2
 b    1
 dtype: int64, b    2
 a    1
 dtype: int64, b    1
 c    1
 e    1
 dtype: int64, d    1
 dtype: int64]

foo now contains the distribution for every column. The interpretation of the columns is still "xth value", such that column 0 contains the distribution of all the "first values" in your arrays.

Next step, "sum them up".

>>> df2 = pd.DataFrame(foo)
>>> df2
Out[266]: 
    a   b   c   d   e
0   2   1 NaN NaN NaN
1   1   2 NaN NaN NaN
2 NaN   1   1 NaN   1
3 NaN NaN NaN   1 NaN
>>> df2.sum(axis=0)
Out[264]: 
a    3
b    4
c    1
d    1
e    1
dtype: float64
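
For convenience, the steps of this answer can be gathered into one script. This is only a sketch under a couple of assumptions: it uses df.items() so that it also runs on recent pandas versions, and it adds an astype(int) cast that the original answer does not have, so the result is shown as integers instead of floats. The names per_column and distribution are illustrative.

```python
import pandas as pd

raw = [['a', 'a', 'b'], ['b', 'b', 'c', 'd'], [], ['a', 'b', 'e']]
df = pd.DataFrame(raw)

# One value_counts() Series per positional column ("first values", "second values", ...)
per_column = [col.value_counts() for _, col in df.items()]

# Each Series becomes a row; letters missing from a column show up as NaN
df2 = pd.DataFrame(per_column)

# sum() skips NaN, leaving the overall count per letter
distribution = df2.sum(axis=0).astype(int)
print(distribution)
```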

Note that for these very simple problems the difference between a series of lists and a dataframe with one column per item is not big, but once you want to do real data work, the latter gives you way more functionality. Moreover, it can potentially be more efficient, since you can use pandas' internal methods.
