pandas dataframe, groupby, std

Question

Asked by LetMeSOThat4U

New to pandas here. A (trivial) problem: hosts, operations, execution times. I want to group by host, then by host+operation, calculate std deviation for execution time per host, then by host+operation pair. Seems simple?

It works for grouping by a single column:

df
Out[360]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 132564 entries, 0 to 132563
Data columns (total 9 columns):
datespecial    132564  non-null values
host           132564  non-null values
idnum          132564  non-null values
operation      132564  non-null values
time           132564  non-null values
...
dtypes: float32(1), int64(2), object(6)



byhost = df.groupby('host')


byhost.std()
Out[362]:
                 datespecial         idnum      time
host
ahost1.test  11946.961952  40367.033852  0.003699
host1.test   15484.975077  38206.578115  0.008800
host10.test           NaN  37644.137631  0.018001
...

Nice. Now:

byhostandop = df.groupby(['host', 'operation'])

byhostandop.std()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-364-2c2566b866c4> in <module>()
----> 1 byhostandop.std()

/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in std(self, ddof)
    386         # todo, implement at cython level?
    387         if ddof == 1:
--> 388             return self._cython_agg_general('std')
    389         else:
    390             f = lambda x: x.std(ddof=ddof)

/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_general(self, how, numeric_only)
   1615
   1616     def _cython_agg_general(self, how, numeric_only=True):
-> 1617         new_blocks = self._cython_agg_blocks(how, numeric_only=numeric_only)
   1618         return self._wrap_agged_blocks(new_blocks)
   1619

/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_blocks(self, how, numeric_only)
   1653                 values = com.ensure_float(values)
   1654
-> 1655             result, _ = self.grouper.aggregate(values, how, axis=agg_axis)
   1656
   1657             # see if we can cast the block back to the original dtype

/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, values, how, axis)
    838                 if is_numeric:
    839                     result = lib.row_bool_subset(result,
--> 840                                                  (counts > 0).view(np.uint8))
    841                 else:
    842                     result = lib.row_bool_subset_object(result,

/home/username/anaconda/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.row_bool_subset (pandas/lib.c:16540)()

ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'

Huh?? Why do I get this exception?

Accepted answer by Roman Pekar

It's important to know which versions of pandas and Python you are running. It looks like this exception could arise in pandas versions < 0.10 (see ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'). To avoid it, you can cast your float columns to float64:

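# astype returns a new object, so assign the result back;
# cast only the float column (here 'time'), since string columns such as 'host' cannot be cast to float64
df['time'] = df['time'].astype('float64')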
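If you are not sure which columns are float32, one option is to find them by dtype. This is only a minimal sketch: the data below is made up for illustration (it is not the asker's frame), and the column names simply mirror the ones in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data: 'time' is stored as float32.
df = pd.DataFrame({
    "host": ["h1", "h1", "h2", "h2"],
    "operation": ["read", "read", "write", "write"],
    "time": np.array([0.1, 0.3, 0.2, 0.6], dtype="float32"),
})

# Find every float32 column and cast just those to float64.
float32_cols = df.select_dtypes(include=["float32"]).columns
df = df.astype({col: "float64" for col in float32_cols})

print(df.dtypes)
```

Passing a dict to astype leaves the other columns untouched, so the string columns keep their dtype.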

To calculate std() on selected columns, just select the columns :)

>>> df = pd.DataFrame({'a':range(10), 'b':range(10,20), 'c':list('abcdefghij'), 'g':[1]*3 + [2]*3 + [3]*4})
>>> df
   a   b  c  g
0  0  10  a  1
1  1  11  b  1
2  2  12  c  1
3  3  13  d  2
4  4  14  e  2
5  5  15  f  2
6  6  16  g  3
7  7  17  h  3
8  8  18  i  3
9  9  19  j  3
>>> df.groupby('g')[['a', 'b']].std()
          a         b
g                    
1  1.000000  1.000000
2  1.000000  1.000000
3  1.290994  1.290994
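The same column selection works when grouping by more than one key, which is the case the question is about. A small sketch, assuming hypothetical host/operation/time data (the values are invented so the result is easy to check by hand):

```python
import pandas as pd

# Made-up data; column names follow the question, values are illustrative only.
df = pd.DataFrame({
    "host": ["h1", "h1", "h1", "h1", "h2", "h2"],
    "operation": ["read", "read", "write", "write", "read", "read"],
    "time": [1.0, 3.0, 2.0, 6.0, 5.0, 5.0],
})

# Standard deviation of 'time' per host.
std_by_host = df.groupby("host")["time"].std()

# Standard deviation of 'time' per (host, operation) pair.
std_by_host_op = df.groupby(["host", "operation"])["time"].std()

print(std_by_host)
print(std_by_host_op)
```

The second result is indexed by a (host, operation) MultiIndex, so a single pair can be looked up with std_by_host_op.loc[("h1", "read")].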

update

As far as I can tell, std() is calling aggregation() on the groupby result and runs into a subtle bug (see here - Python Pandas: Using Aggregate vs Apply to define new columns). To avoid this, you can use apply():

byhostandop['time'].apply(lambda x: x.std())
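To see that the apply() route gives the same numbers as std(), here is a small self-contained sketch. The data is invented for illustration, and on a pandas version without the bug I would expect both calls to agree, so the workaround should only matter on the old version.

```python
import numpy as np
import pandas as pd

# Made-up data; every (host, operation) group has two rows so no NaN appears.
df = pd.DataFrame({
    "host": ["h1", "h1", "h1", "h1", "h2", "h2"],
    "operation": ["read", "read", "write", "write", "read", "read"],
    "time": [1.0, 3.0, 2.0, 6.0, 5.0, 7.0],
})

byhostandop = df.groupby(["host", "operation"])

# Workaround from the answer: compute std group by group in Python.
via_apply = byhostandop["time"].apply(lambda x: x.std())

# Direct call, which is what failed in the question.
via_std = byhostandop["time"].std()

same = np.allclose(via_apply.values, via_std.values)
print(same)
```

apply() calls the lambda once per group, so it is slower than std() on large frames, but it sidesteps the aggregation path that raised the error.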
