Original question: http://stackoverflow.com/questions/15806414/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Storing multidimensional arrays in pandas DataFrame columns
Asked by John Salvatier
I'm hoping to use pandas as the main Trace (series of points in parameter space from MCMC) object.
I have a list of dicts of string->array which I would like to store in pandas. The keys in the dicts are always the same, and for each key the shape of the numpy array is always the same, but the shape may be different for different keys and could have a different number of dimensions.
I had been using self.append(dict_list, ignore_index=True), which seems to work well for 1-D values, but for values with more than one dimension pandas stores them as objects, which rules out convenient plotting and other numeric operations. Any suggestions on how to get better behavior?
Sample data
import numpy as np
point = {'x': np.array(-0.47652306228698005),
         'y': np.array([[-0.41809043],
                        [ 0.48407823]])}
points = 10 * [point]
I'd like to be able to do something like
df = DataFrame(points)
or
df = DataFrame()
df.append(points, ignore_index=True)
and have
>> df['x'][1].shape
()
>> df['y'][1].shape
(2,1)
Answered by Eike
It goes a bit against Pandas' philosophy, which seems to see Series as a one-dimensional data structure. You therefore have to create each Series by hand and tell it that its data type is "object", which means that no automatic data conversion is applied.
You can do it like this (reordered IPython session):
In [9]: import numpy as np
   ...: import pandas as pd
In [1]: point = {'x': np.array(-0.47652306228698005),
   ...:          'y': np.array([[-0.41809043],
   ...:                         [ 0.48407823]])}
In [2]: points = 10 * [point]
In [5]: lx = [p["x"] for p in points]
In [7]: ly = [p["y"] for p in points]
In [40]: sx = pd.Series(lx, dtype=np.dtype("object"))  # object dtype: no automatic conversion
In [38]: sy = pd.Series(ly, dtype=np.dtype("object"))
In [43]: df = pd.DataFrame({"x":sx, "y":sy})
In [45]: df['x'][1].shape
Out[45]: ()
In [46]: df['y'][1].shape
Out[46]: (2, 1)
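The same idea can be wrapped in a small helper. This is only a sketch: the name dicts_to_object_frame is made up for illustration and is not a pandas function, and it assumes every dict has the same keys, as stated in the question. Each column is first built as a 1-D NumPy array of dtype object, so that neither NumPy nor pandas tries to merge the stored arrays into extra dimensions.

```python
import numpy as np
import pandas as pd


def dicts_to_object_frame(points):
    """Build a DataFrame whose cells hold the original arrays (hypothetical helper)."""
    keys = list(points[0].keys())  # assumption: all dicts share these keys
    columns = {}
    for key in keys:
        # Pre-allocate a 1-D object array and fill it cell by cell, so each
        # value is stored as a single object in its cell.
        col = np.empty(len(points), dtype=object)
        for i, p in enumerate(points):
            col[i] = p[key]
        columns[key] = pd.Series(col, dtype=object)
    return pd.DataFrame(columns)


point = {'x': np.array(-0.47652306228698005),
         'y': np.array([[-0.41809043],
                        [ 0.48407823]])}
points = 10 * [point]
df = dicts_to_object_frame(points)
print(df.dtypes)
print(df['y'].iloc[1].shape)
```

Because the columns have dtype object, numeric operations and plotting still need a separate conversion step.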
Answered by ankostis
The relatively new library xray[1] (since renamed xarray) has Dataset and DataArray structures that do exactly what you ask.
Here is my take on your problem, written as an IPython session:
>>> import numpy as np
>>> import xray
>>> ## Prepare data:
>>> #
>>> point = {'x': np.array(-0.47652306228698005),
... 'y': np.array([[-0.41809043],
... [ 0.48407823]])}
>>> points = 10 * [point]
>>> ## Convert to Xray DataArrays:
>>> #
>>> list_x = [p['x'] for p in points]
>>> list_y = [p['y'] for p in points]
>>> da_x = xray.DataArray(list_x, [('x', range(len(list_x)))])
>>> da_y = xray.DataArray(list_y, [
... ('x', range(len(list_y))),
... ('y0', range(2)),
... ('y1', [0]),
... ])
These are the two DataArray instances we have built so far:
>>> print(da_x)
<xray.DataArray (x: 10)>
array([-0.47652306, -0.47652306, -0.47652306, -0.47652306, -0.47652306,
-0.47652306, -0.47652306, -0.47652306, -0.47652306, -0.47652306])
Coordinates:
* x (x) int32 0 1 2 3 4 5 6 7 8 9
>>> print(da_y.T) ## Transposed, to save lines.
<xray.DataArray (y1: 1, y0: 2, x: 10)>
array([[[-0.41809043, -0.41809043, -0.41809043, -0.41809043, -0.41809043,
-0.41809043, -0.41809043, -0.41809043, -0.41809043, -0.41809043],
[ 0.48407823, 0.48407823, 0.48407823, 0.48407823, 0.48407823,
0.48407823, 0.48407823, 0.48407823, 0.48407823, 0.48407823]]])
Coordinates:
* x (x) int32 0 1 2 3 4 5 6 7 8 9
* y0 (y0) int32 0 1
* y1 (y1) int32 0
We can now merge these two DataArray objects on their common x dimension into a Dataset:
>>> ds = xray.Dataset({'X':da_x, 'Y':da_y})
>>> print(ds)
<xray.Dataset>
Dimensions: (x: 10, y0: 2, y1: 1)
Coordinates:
* x (x) int32 0 1 2 3 4 5 6 7 8 9
* y0 (y0) int32 0 1
* y1 (y1) int32 0
Data variables:
X (x) float64 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 ...
Y (x, y0, y1) float64 -0.4181 0.4841 -0.4181 0.4841 -0.4181 0.4841 -0.4181 ...
And we can finally access and aggregate data the way you wanted:
>>> ds['X'].sum()
<xray.DataArray 'X' ()>
array(-4.765230622869801)
>>> ds['Y'].sum()
<xray.DataArray 'Y' ()>
array(0.659878)
>>> ds['Y'].sum(axis=1)
<xray.DataArray 'Y' (x: 10, y1: 1)>
array([[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878]])
Coordinates:
* x (x) int32 0 1 2 3 4 5 6 7 8 9
* y1 (y1) int32 0
>>> np.all(ds['Y'].sum(axis=1) == ds['Y'].sum(dim='y0'))
True
>>> ds['X'].sum(dim='y0')
Traceback (most recent call last):
ValueError: 'y0' not found in array dimensions ('x',)
[1] A library for handling N-dimensional data with labels, like pandas does for 2D: http://xray.readthedocs.org/en/stable/data-structures.html#dataset
Answered by hobs
Combining @Eike's answer and @JohnSalvatier's comment seems pretty Pandasonic:
>>> import numpy as np
>>> import pandas as pd
>>> point = {'x': np.array(-0.47652306228698005),
... 'y': np.array([[-0.41809043],
... [ 0.48407823]])}
>>> points = 10 * [point] # this creates a list of 10 point dicts
>>> df = pd.DataFrame().append(points)
>>> df.x
# 0 -0.476523062287
# ...
# 9 -0.476523062287
# Name: x, dtype: object
>>> df.y
# 0 [[-0.41809043], [0.48407823]]
# ...
# 9 [[-0.41809043], [0.48407823]]
# Name: y, dtype: object
>>> df.y[0]
# array([[-0.41809043],
# [ 0.48407823]])
>>> df.y[0].shape
# (2, 1)
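DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the append call in this session fails on current releases. A minimal sketch of the same construction, assuming a recent pandas, passes the list of dicts straight to the DataFrame constructor:

```python
import numpy as np
import pandas as pd

point = {'x': np.array(-0.47652306228698005),
         'y': np.array([[-0.41809043],
                        [ 0.48407823]])}
points = 10 * [point]

# One row per dict, one column per key; the 2-D arrays stay as objects in the cells.
df = pd.DataFrame(points)
print(len(df))
print(df['y'].iloc[0].shape)
```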
To plot (and do all the other cool 2-D Pandas things) you still have to manually convert the column of arrays back to a DataFrame:
>>> dfy = pd.DataFrame([row.T[0] for row in df.y])  # one numeric column per array element
>>> dfy += np.array([[0] * 10, list(range(10))]).T  # add an offset so the rows differ
>>> dfy *= np.array([list(range(10)), list(range(10))]).T
>>> dfy.plot()
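The list comprehension in this conversion relies on every array having shape (n, 1). A slightly more general sketch, assuming only that all arrays in the column share one shape, stacks them with np.stack and flattens the trailing dimensions into one numeric column each. The column names y_0 and y_1 are made up for illustration.

```python
import numpy as np
import pandas as pd

y = np.array([[-0.41809043],
              [ 0.48407823]])
# Object-dtype column holding ten (2, 1) arrays, as in the example data.
col = np.empty(10, dtype=object)
for i in range(10):
    col[i] = y
sy = pd.Series(col, dtype=object)

stacked = np.stack(list(sy))           # shape (10, 2, 1), dtype float64
flat = stacked.reshape(len(sy), -1)    # shape (10, 2): one row per sample
dfy = pd.DataFrame(flat, columns=[f"y_{i}" for i in range(flat.shape[1])])
print(dfy.dtypes)
```

The resulting frame has ordinary float64 columns, so plotting and aggregation behave as for any other numeric DataFrame.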
To store this on disk, use to_pickle:
>>> df.to_pickle('/tmp/sotest.pickle')
>>> df2 = pd.read_pickle('/tmp/sotest.pickle')
>>> df2.y[0].shape
# (2, 1)
If you use to_csv instead, your np.arrays become strings:
>>> df.to_csv('/tmp/sotest.csv')
>>> df2 = pd.read_csv('/tmp/sotest.csv', index_col=0)
>>> df2.y[0]
# '[[-0.41809043]\n [ 0.48407823]]'
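One possible workaround, which is not part of the original answer, is to encode each array as JSON text before writing the CSV and decode it after reading. The sketch below assumes the arrays contain plain floats and uses an in-memory buffer instead of a file; the variable names are illustrative only.

```python
import io
import json

import numpy as np
import pandas as pd

y = np.array([[-0.41809043],
              [ 0.48407823]])
# Store the nested-list form of each array as a JSON string.
df = pd.DataFrame({'y': [json.dumps(y.tolist())] * 3})

buf = io.StringIO()
df.to_csv(buf)                      # cells are written as JSON text
buf.seek(0)
df2 = pd.read_csv(buf, index_col=0)

# Decode each cell back into an array.
restored = [np.array(json.loads(s)) for s in df2['y']]
print(restored[0].shape)
```

This keeps the file human-readable, at the cost of an explicit decode step that pickle does not need.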


