Confidence Interval in Python dataframe

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/53519823/

Time: 2020-09-14 06:10:54  Source: igfitidea

python pandas confidence-interval

Asked by MasterShifu

I am trying to calculate the mean and the 95% confidence interval of the column "Force" in a large dataset. I need the results per group, using groupby on the "Class" column.

When I calculate the mean and put it in a new dataframe, I get NaN values for all rows. I'm not sure whether I'm going about this the right way. Is there an easier way to do this?

This is the sample dataframe:

import pandas as pd

df = pd.DataFrame({'Class': ['A1', 'A1', 'A1', 'A2', 'A3', 'A3'],
                   'Force': [50, 150, 100, 120, 140, 160]},
                  columns=['Class', 'Force'])

To calculate the confidence interval, I first calculated the mean. This is what I used:

F1_Mean = df.groupby(['Class'])['Force'].mean()

This gave me NaN values for all rows.
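
Editorial note: the question does not show how the mean was put into the new dataframe, so the following is only a guess at one possible cause. The groupby result is indexed by class label, and assigning it to a frame with a default integer index aligns on labels and can leave every row NaN. In this sketch, new_df and the column names Mean and Mean_per_row are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Class': ['A1', 'A1', 'A1', 'A2', 'A3', 'A3'],
                   'Force': [50, 150, 100, 120, 140, 160]})

# The result is indexed by the class labels 'A1', 'A2', 'A3'.
F1_Mean = df.groupby(['Class'])['Force'].mean()

# Assumed reproduction: a new frame with the default 0..5 index.
# Assignment aligns on index labels; none of 0..5 match 'A1'..'A3',
# so every row ends up NaN.
new_df = pd.DataFrame(index=df.index)
new_df['Mean'] = F1_Mean

# transform('mean') returns one value per original row, so the indexes match.
new_df['Mean_per_row'] = df.groupby('Class')['Force'].transform('mean')
print(new_df)
```

If one row per class is enough, F1_Mean.reset_index() turns the grouped result into a regular dataframe without any alignment step.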

Answered by yoonghm

import pandas as pd
import numpy as np
import math

df=pd.DataFrame({'Class': ['A1','A1','A1','A2','A3','A3'], 
                 'Force': [50,150,100,120,140,160] },
                 columns=['Class', 'Force'])
print(df)
print('-'*30)

stats = df.groupby(['Class'])['Force'].agg(['mean', 'count', 'std'])
print(stats)
print('-'*30)

ci95_hi = []
ci95_lo = []

for i in stats.index:
    m, c, s = stats.loc[i]
    ci95_hi.append(m + 1.96*s/math.sqrt(c))
    ci95_lo.append(m - 1.96*s/math.sqrt(c))

stats['ci95_hi'] = ci95_hi
stats['ci95_lo'] = ci95_lo
print(stats)

The output is:

  Class  Force
0    A1     50
1    A1    150
2    A1    100
3    A2    120
4    A3    140
5    A3    160
------------------------------
       mean  count        std
Class                        
A1      100      3  50.000000
A2      120      1        NaN
A3      150      2  14.142136
------------------------------
       mean  count        std     ci95_hi     ci95_lo
Class                                                
A1      100      3  50.000000  156.580326   43.419674
A2      120      1        NaN         NaN         NaN
A3      150      2  14.142136  169.600000  130.400000
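
The loop above can also be written with column arithmetic. The following is a minimal sketch of the same calculation, not a replacement for the answer. The name sem is introduced here for illustration. It keeps the answer's 1.96 factor, which is the normal-approximation value; for groups as small as these, a Student's t critical value would give a wider interval.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Class': ['A1', 'A1', 'A1', 'A2', 'A3', 'A3'],
                   'Force': [50, 150, 100, 120, 140, 160]})

stats = df.groupby('Class')['Force'].agg(['mean', 'count', 'std'])

# Standard error of the mean per class. A single-row class has an
# undefined sample standard deviation, so its interval stays NaN.
sem = stats['std'] / np.sqrt(stats['count'])

# 1.96 is the normal-approximation factor used in the answer above.
stats['ci95_hi'] = stats['mean'] + 1.96 * sem
stats['ci95_lo'] = stats['mean'] - 1.96 * sem
print(stats)
```

Operating on whole columns avoids the Python-level loop and the temporary lists, which matters more as the number of classes grows.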

Answered by Dror Paz

As mentioned in the comments, I could not reproduce your error, but you can check that your numbers are stored as numbers and not as strings. Use df.info() and make sure that the relevant columns are float or int:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
Class    6 non-null object   # <--- non-number column
Force    6 non-null int64    # <--- number (int) column
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes
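
If the check shows that the column is not numeric, one way to convert it is pd.to_numeric. The sketch below assumes a hypothetical variant of the sample data in which Force was read in as strings; whether that is what happened in the original dataset is not known.

```python
import pandas as pd

# Hypothetical variant of the sample data: Force stored as strings.
df = pd.DataFrame({'Class': ['A1', 'A1', 'A1', 'A2', 'A3', 'A3'],
                   'Force': ['50', '150', '100', '120', '140', '160']})

print(df.dtypes)  # Force is not a numeric dtype at this point

# Convert to numbers; values that cannot be parsed become NaN
# instead of raising an error.
df['Force'] = pd.to_numeric(df['Force'], errors='coerce')

means = df.groupby('Class')['Force'].mean()
print(means)
```

With errors='coerce', any NaN that appears after the conversion marks a value that was not a valid number, which can help locate bad rows in a large dataset.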