Python: How to get descriptive statistics of a NumPy array?

Question

Asked by beta

I use the following code to create a numpy-ndarray. The file has 9 columns. I explicitly type each column:

dataset = np.genfromtxt("data.csv", delimiter=",",dtype=('|S1', float, float,float,float,float,float,float,int))

Now I would like to get some descriptive statistics for each column (min, max, stdev, mean, median, etc.). Shouldn't there be an easy way to do this?

I tried this:

from scipy import stats
stats.describe(dataset)

but this returns an error: TypeError: cannot perform reduce with flexible type

How can I get descriptive statistics of the created NumPy array?

Answer 1

Accepted answer by M.T

This is not a pretty solution, but it gets the job done. The problem is that by specifying multiple dtypes, you are essentially making a 1D-array of tuples (actually np.void), which cannot be described by stats as it includes multiple different types, incl. strings.

This could be resolved by either reading it in two rounds, or using pandas with read_csv.
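
A minimal sketch of the pandas route, assuming the file has no header row. The inline sample data and the names csv_text, df, summary and medians are made up for illustration and are not from the original question:

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv: one letter column, seven floats, one int.
csv_text = """A,0.35,0.1,0.2,0.3,0.4,0.5,0.6,1
B,0.71,0.2,0.3,0.4,0.5,0.6,0.7,2
C,0.50,0.3,0.4,0.5,0.6,0.7,0.8,3
"""

# header=None is an assumption: the file in the question seems to have no header row.
df = pd.read_csv(io.StringIO(csv_text), header=None)

# By default describe() summarizes only the numeric columns:
# count, mean, std, min, quartiles and max.
summary = df.describe()
print(summary)

# The median appears as the "50%" row; it can also be computed directly.
medians = df.median(numeric_only=True)
print(medians)
```

With a real file, the io.StringIO wrapper would be replaced by the file path.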

If you decide to stick to numpy:

import numpy as np
a = np.genfromtxt('sample.txt', delimiter=",",unpack=True,usecols=range(1,9))
s = np.genfromtxt('sample.txt', delimiter=",",unpack=True,usecols=0,dtype='|S1')

from scipy import stats
for arr in a: #do not need the loop at this point, but looks prettier
    print(stats.describe(arr))
#Output per print:
DescribeResult(nobs=6, minmax=(0.34999999999999998, 0.70999999999999996), mean=0.54500000000000004, variance=0.016599999999999997, skewness=-0.3049304880932534, kurtosis=-0.9943046886340534)

Note that in this example the final array has dtype float, not int, but it can easily be converted to int (if necessary) using arr.astype(int).

Answer 2

Answered by hpaulj

The question of how to deal with mixed data from genfromtxt comes up often. People expect a 2d array, and instead get a 1d array that they can't index by column. That's because they get a structured array, with a different dtype for each column.

All the examples in the genfromtxt doc show this:

>>> s = StringIO("1,1.3,abcde")
>>> data = np.genfromtxt(s, dtype=[('myint','i8'),('myfloat','f8'),
... ('mystring','S5')], delimiter=",")
>>> data
array((1, 1.3, 'abcde'),
      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])

But let me demonstrate how to access this kind of data:

In [361]: txt=b"""A, 1,2,3
     ...: B,4,5,6
     ...: """
In [362]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,int,float,int'))
In [363]: data
Out[363]: 
array([(b'A', 1, 2.0, 3), (b'B', 4, 5.0, 6)], 
      dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<f8'), ('f3', '<i4')])

So my array has 2 records (check the shape), which are displayed as tuples in a list.

You access fields by name, not by column number (do I need to add a structured array documentation link?):

In [364]: data['f0']
Out[364]: 
array([b'A', b'B'], 
      dtype='|S1')
In [365]: data['f1']
Out[365]: array([1, 4])
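
Since each field is an ordinary 1d array, per-column statistics can be collected by looping over the field names. The following is only a sketch built on the same sample text; the name field_stats and the choice of statistics are illustrative assumptions, not part of the original answer:

```python
import numpy as np

txt = b"""A, 1,2,3
B,4,5,6
"""
data = np.genfromtxt(txt.splitlines(), delimiter=',', dtype='S1,int,float,int')

# Collect simple statistics for every numeric field, skipping the string field.
field_stats = {}
for name in data.dtype.names:
    column = data[name]
    if not np.issubdtype(column.dtype, np.number):
        continue  # e.g. 'f0' holds bytes and cannot be reduced
    field_stats[name] = {
        "min": column.min(),
        "max": column.max(),
        "mean": column.mean(),
        "median": np.median(column),
        "std": column.std(ddof=1),  # sample standard deviation
    }

for name, values in field_stats.items():
    print(name, values)
```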

In a case like this, it might be more useful to choose a dtype with 'subarrays'. This is a more advanced dtype topic:

In [367]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,(3)float'))
In [368]: data
Out[368]: 
array([(b'A', [1.0, 2.0, 3.0]), (b'B', [4.0, 5.0, 6.0])], 
      dtype=[('f0', 'S1'), ('f1', '<f8', (3,))])
In [369]: data['f1']
Out[369]: 
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])

The character column is still loaded as S1, but the numbers are now in a 3 column array. Note that they all share one type here: all float (or all int).

In [371]: from scipy import stats
In [372]: stats.describe(data['f1'])
Out[372]: DescribeResult(nobs=2, 
   minmax=(array([ 1.,  2.,  3.]), array([ 4.,  5.,  6.])),
   mean=array([ 2.5,  3.5,  4.5]), 
   variance=array([ 4.5,  4.5,  4.5]), 
   skewness=array([ 0.,  0.,  0.]), 
   kurtosis=array([-2., -2., -2.]))
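
The DescribeResult above has no median or standard deviation field, both of which the question asked for. A small sketch of getting them per column with plain NumPy, using the same values as data['f1'] written out directly (the names values, col_median and col_std are illustrative):

```python
import numpy as np

# Same values as data['f1'] in the session above, written out directly.
values = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# axis=0 reduces over the rows, giving one result per column.
col_median = np.median(values, axis=0)

# ddof=1 gives the sample standard deviation, consistent with the
# variance of 4.5 shown in the describe output.
col_std = np.std(values, axis=0, ddof=1)

print(col_median)
print(col_std)
```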
