Python: How can I get descriptive statistics of a NumPy array?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/38583738/

Time: 2020-08-19 21:09:12  Source: igfitidea

How can I get descriptive statistics of a NumPy array?

Tags: python, numpy, multidimensional-array, scipy

Asked by beta

I use the following code to create a numpy-ndarray. The file has 9 columns. I explicitly type each column:

dataset = np.genfromtxt("data.csv", delimiter=",",dtype=('|S1', float, float,float,float,float,float,float,int))

Now I would like to get some descriptive statistics for each column (min, max, stdev, mean, median, etc.). Shouldn't there be an easy way to do this?

I tried this:

from scipy import stats
stats.describe(dataset)

but this returns an error: TypeError: cannot perform reduce with flexible type

How can I get descriptive statistics of the created NumPy array?

Accepted answer by M.T

This is not a pretty solution, but it gets the job done. The problem is that by specifying multiple dtypes, you are essentially creating a 1D array of tuples (actually np.void records), which stats.describe cannot handle because each record mixes several different types, including strings.

This could be resolved either by reading the file in two passes (one for the numeric columns, one for the string column), or by using pandas with read_csv.
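
For the pandas route, a minimal sketch might look like the following. The inline CSV text is a hypothetical stand-in for the real file (one letter column followed by numeric columns), and header=None is an assumption that the file has no header row.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the real file: a letter column, then numbers.
csv_text = "A,0.35,1.0,3\nB,0.71,2.0,4\nC,0.55,3.0,5\n"

# header=None because this sample has no header row; columns are labelled 0..3.
df = pd.read_csv(io.StringIO(csv_text), header=None)

# With mixed column types, describe() summarises the numeric columns only,
# so the letter column (label 0) is left out of the summary.
summary = df.describe()
print(summary)
```

The summary includes count, mean, std, min, max and the quartiles (the 50% row is the median) for each numeric column.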

If you decide to stick to numpy:

import numpy as np
from scipy import stats

# Numeric columns 1-8; unpack=True returns one array per column.
a = np.genfromtxt('sample.txt', delimiter=",", unpack=True, usecols=range(1, 9))
# Column 0 holds the single-character strings.
s = np.genfromtxt('sample.txt', delimiter=",", unpack=True, usecols=0, dtype='|S1')

for arr in a:  # the loop is not strictly needed here, but it looks prettier
    print(stats.describe(arr))
# Output per print:
# DescribeResult(nobs=6, minmax=(0.34999999999999998, 0.70999999999999996), mean=0.54500000000000004, variance=0.016599999999999997, skewness=-0.3049304880932534, kurtosis=-0.9943046886340534)

Note that in this example the final array has dtype float, not int, but it can easily be converted to int (if necessary) using arr.astype(int).
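
The question also asks for the median, which does not appear in the DescribeResult fields shown above. As a rough sketch using plain NumPy reductions, the per-column statistics could be collected as below. The inline sample data and the col_stats name are illustrative assumptions, not part of the original answer.

```python
import io
import numpy as np

# Hypothetical sample: a letter column followed by three numeric columns.
sample = io.StringIO("A,0.35,1,3\nB,0.71,2,4\nC,0.55,3,5\n")

# Read only the numeric columns; the result is a plain 2-D float array.
a = np.genfromtxt(sample, delimiter=",", usecols=range(1, 4))

# axis=0 reduces over the rows, giving one value per column.
col_stats = {
    "min": a.min(axis=0),
    "max": a.max(axis=0),
    "mean": a.mean(axis=0),
    "std": a.std(axis=0, ddof=1),  # ddof=1 gives the sample standard deviation
    "median": np.median(a, axis=0),
}
for name, values in col_stats.items():
    print(name, values)
```

Here unpack=True is left out, so the columns stay on axis 1 and every reduction uses axis=0.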

Answer by hpaulj

The question of how to deal with mixed data from genfromtxt comes up often. People expect a 2d array, and instead get a 1d array that they can't index by column. That's because they get a structured array, with a different dtype for each column.

All the examples in the genfromtxt doc show this:

>>> s = StringIO("1,1.3,abcde")
>>> data = np.genfromtxt(s, dtype=[('myint','i8'),('myfloat','f8'),
... ('mystring','S5')], delimiter=",")
>>> data
array((1, 1.3, 'abcde'),
      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])

But let me demonstrate how to access this kind of data:

In [361]: txt=b"""A, 1,2,3
     ...: B,4,5,6
     ...: """
In [362]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,int,float,int'))
In [363]: data
Out[363]: 
array([(b'A', 1, 2.0, 3), (b'B', 4, 5.0, 6)], 
      dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<f8'), ('f3', '<i4')])

So my array has 2 records (check the shape), which are displayed as tuples in a list.

You access fields by name, not by column number (do I need to add a structured array documentation link?)

In [364]: data['f0']
Out[364]: 
array([b'A', b'B'], 
      dtype='|S1')
In [365]: data['f1']
Out[365]: array([1, 4])
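
Building on the field access shown above, one possible way to get per-column statistics from a structured array is to loop over the field names and skip the non-numeric ones. This is only a sketch: the 'U1,i8,f8,i8' dtype string and the numeric list are choices made for this illustration, and the default field names f0..f3 match the session above.

```python
import numpy as np

txt = ["A,1,2,3", "B,4,5,6"]
data = np.genfromtxt(txt, delimiter=",", dtype="U1,i8,f8,i8")

# Keep only the fields whose dtype is numeric (the 'U1' letter field is dropped).
numeric = [n for n in data.dtype.names if np.issubdtype(data.dtype[n], np.number)]

for n in numeric:
    col = data[n]  # an ordinary 1-D array holding one field
    print(n, col.min(), col.max(), col.mean())
```

Each data[n] is a regular 1-D numeric array, so it can also be passed to stats.describe one field at a time.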

In a case like this, it might be more useful to choose a dtype with 'subarrays'. This is a more advanced dtype topic.

In [367]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,(3)float'))
In [368]: data
Out[368]: 
array([(b'A', [1.0, 2.0, 3.0]), (b'B', [4.0, 5.0, 6.0])], 
      dtype=[('f0', 'S1'), ('f1', '<f8', (3,))])
In [369]: data['f1']
Out[369]: 
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])

The character column is still loaded as S1, but the numbers are now in a 3-column array. Note that a subarray has a single dtype, so the numbers are all float (or all int).

In [371]: from scipy import stats
In [372]: stats.describe(data['f1'])
Out[372]: DescribeResult(nobs=2, 
   minmax=(array([ 1.,  2.,  3.]), array([ 4.,  5.,  6.])),
   mean=array([ 2.5,  3.5,  4.5]), 
   variance=array([ 4.5,  4.5,  4.5]), 
   skewness=array([ 0.,  0.,  0.]), 
   kurtosis=array([-2., -2., -2.]))
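
If the data has already been loaded with one dtype per column, an alternative to re-reading it with a subarray dtype is to convert the numeric fields to a regular 2-D array afterwards. The sketch below uses structured_to_unstructured from numpy.lib.recfunctions (available in NumPy 1.16 and later). It is an addition for illustration, not part of the original answer, and the nums name is hypothetical.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

txt = ["A,1,2,3", "B,4,5,6"]
data = np.genfromtxt(txt, delimiter=",", dtype="U1,i8,f8,i8")

# Select the numeric fields and convert them into one 2-D float array.
nums = rfn.structured_to_unstructured(data[["f1", "f2", "f3"]], dtype=float)

print(nums.shape)
print(nums.mean(axis=0))
```

The resulting array has one row per record and one column per field, so it can be handed to stats.describe in the same way as data['f1'] above.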