Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/40554179/

Time: 2020-09-14 02:25:31  Source: igfitidea

How to keep column names when converting from pandas to numpy

Tags: python, pandas, numpy

Asked by user48956

According to this post, I should be able to access the names of columns in an ndarray as a.dtype.names.

However, if I convert a pandas DataFrame to an ndarray with df.as_matrix() or df.values, then the dtype.names field is None. Additionally, if I try to assign column names to the ndarray:

import pandas as pd

X = pd.DataFrame(dict(age=[40., 50., 60.], sys_blood_pressure=[140.,150.,160.]))
print(X)
print(type(X.values))     # <class 'numpy.ndarray'>
print(type(X.values[0]))  # <class 'numpy.ndarray'>

m = X.values  # df.as_matrix() was removed in pandas 1.0
m.dtype.names = list(X.columns)

I get

ValueError: there are no fields defined
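
For context, the error follows from how NumPy stores names: only structured dtypes have fields, and a homogeneous float array has none. A minimal sketch of the difference (the variable names `plain` and `structured` are made up for illustration):

```python
import numpy as np

# A homogeneous array has a plain scalar dtype, so it carries no field names.
plain = np.zeros((3, 2), dtype=np.float64)
print(plain.dtype.names)

# A structured array stores one named field per column.
structured = np.zeros(3, dtype=[("age", "<f8"), ("sys_blood_pressure", "<f8")])
print(structured.dtype.names)
```

The first print shows None and the second shows the two field names, which is why assigning names to the plain array's dtype has nothing to attach to.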

UPDATE:

I'm particularly interested in the cases where the matrix only needs to hold a single type (it is an ndarray of a specific numeric type), since I'd also like to use cython for optimization. (I suspect numpy records and structured arrays are more difficult to deal with since they're more freely typed.)

Really, I'd just like to maintain the column_name metadata for arrays passed through a deep tree of scikit-learn predictors. Their .fit(X,y) and .predict(X) API doesn't permit passing additional metadata about the column labels outside of the X and y objects.

Accepted answer by S0AndS0

Yet more methods of converting a pandas.DataFrame to a numpy.array while preserving label/column names

This is mainly for demonstrating how to set dtype/column_dtypes, because sometimes a data source iterator's output will need some pre-normalization.

Method one inserts by column into a zeroed array of predefined height and is loosely based on a Creating Structured Arrays guide that just a bit of web-crawling turned up:

import numpy


def to_tensor(dataframe, columns=None, dtypes=None):
    # Use all columns from the data frame if none were listed when called
    if columns is None or len(columns) == 0:
        columns = list(dataframe.columns)
    # Avoid mutable default arguments; treat a missing `dtypes` as empty
    dtypes = dtypes or {}
    # Build list of dtypes to use, updating from any `dtypes` passed when called
    dtype_list = []
    for column in columns:
        if column not in dtypes:
            dtype_list.append(dataframe[column].dtype)
        else:
            dtype_list.append(dtypes[column])
    # Build dictionary with lists of column names and formats in the same order
    dtype_dict = {
        'names': list(columns),
        'formats': dtype_list
    }
    # Initialize a zero-filled numpy array with column names and formats
    numpy_buffer = numpy.zeros(
        shape = len(dataframe),
        dtype = dtype_dict)
    # Insert values from dataframe columns into the matching numpy fields
    for column in columns:
        numpy_buffer[column] = dataframe[column].to_numpy()
    # Return results of conversion
    return numpy_buffer

Method two is based on user7138814's answer and will likely be more efficient, as it is basically a wrapper for the built-in to_records method available to pandas.DataFrames:

def to_tensor(dataframe, columns=None, dtypes=None, index=False):
    to_records_kwargs = {'index': index}
    if columns is None or len(columns) == 0:  # Default to all `dataframe.columns`
        columns = list(dataframe.columns)
    if dtypes:       # Pull in modifications only for dtypes listed in `columns`
        to_records_kwargs['column_dtypes'] = {
            column: dtype for column, dtype in dtypes.items() if column in columns
        }
    return dataframe[list(columns)].to_records(**to_records_kwargs)

With either of the above one could do...

import pandas

X = pandas.DataFrame(dict(age = [40., 50., 60.], sys_blood_pressure = [140., 150., 160.]))

# Example of overwriting dtype for a column
X_tensor = to_tensor(X, dtypes = {'age': 'int32'})

print("Ages -> {0}".format(X_tensor['age']))
print("SBPs -> {0}".format(X_tensor['sys_blood_pressure']))

... which should output...

Ages -> [40 50 60]
SBPs -> [140. 150. 160.]

... and a full dump of X_tensor should look like the following.

array([(40, 140.), (50, 150.), (60, 160.)],
      dtype=[('age', '<i4'), ('sys_blood_pressure', '<f8')])

Some thoughts

While method two will likely be more efficient than the first, method one (with some modifications) may be more useful for merging two or more pandas.DataFrames into one numpy.array.
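
One way the "some modifications" for merging might look is sketched below. The helper name `merge_to_structured`, its checks, and the example frames `vitals` and `labs` are all made up for illustration and are not part of the original answer; the sketch assumes every frame has the same number of rows and that column names are unique across frames.

```python
import numpy as np
import pandas as pd


def merge_to_structured(*dataframes):
    """Hypothetical helper: pack the columns of several equal-length
    DataFrames into one structured array, one named field per column."""
    if not dataframes:
        raise ValueError("need at least one DataFrame")
    height = len(dataframes[0])
    names, formats = [], []
    for frame in dataframes:
        if len(frame) != height:
            raise ValueError("all DataFrames must have the same number of rows")
        for column in frame.columns:
            name = str(column)
            if name in names:
                raise ValueError(f"duplicate column name: {name!r}")
            names.append(name)
            formats.append(frame[column].dtype)
    # Zeroed buffer of predefined height, as in method one
    merged = np.zeros(height, dtype={"names": names, "formats": formats})
    # Copy each column into the field of the same name
    for frame in dataframes:
        for column in frame.columns:
            merged[str(column)] = frame[column].to_numpy()
    return merged


vitals = pd.DataFrame({"age": [40.0, 50.0, 60.0]})
labs = pd.DataFrame({"glucose": [90, 100, 110]})
merged = merge_to_structured(vitals, labs)
print(merged.dtype.names)
print(merged["glucose"])
```

Rows are matched purely by position here, so any index alignment between the frames would have to happen before the call.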

Additionally (after swinging back through to review), method two will likely face-plant as it's written, with errors about to_records_kwargs not being a mapping if dtypes is not defined; next time I'm feeling Pythonic I may resolve that with an else condition.

Answer by Nickil Maveli

Consider a DF as shown below:

import numpy as np
import pandas as pd

X = pd.DataFrame(dict(one=['Strawberry', 'Fields', 'Forever'], two=[1,2,3]))
X

Provide a list of tuples as data input to the structured array:

arr_ip = [tuple(i) for i in X.to_numpy()]  # as_matrix() was removed in pandas 1.0

Ordered list of field names:

dtyp = np.dtype(list(zip(X.dtypes.index, X.dtypes)))

Here, X.dtypes.index gives you the column names and X.dtypes their corresponding dtypes, which are zipped into a list of tuples and fed as input to np.dtype to construct the structured dtype.

arr = np.array(arr_ip, dtype=dtyp)

gives:

arr
# array([('Strawberry', 1), ('Fields', 2), ('Forever', 3)], 
#       dtype=[('one', 'O'), ('two', '<i8')])

and

arr.dtype.names
# ('one', 'two')

Answer by user7138814

Pandas dataframe also has a handy to_records method. Demo:

X = pd.DataFrame(dict(age=[40., 50., 60.], 
                      sys_blood_pressure=[140.,150.,160.]))
m = X.to_records(index=False)
print(repr(m))

Returns:

rec.array([(40.0, 140.0), (50.0, 150.0), (60.0, 160.0)], 
          dtype=[('age', '<f8'), ('sys_blood_pressure', '<f8')])

This is a "record array", which is an ndarray subclass that allows field access using attributes, e.g. m.age in addition to m['age'].
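
A small sketch of the access styles described above, reusing the same example frame; nothing here goes beyond to_records itself:

```python
import pandas as pd

X = pd.DataFrame({"age": [40.0, 50.0, 60.0],
                  "sys_blood_pressure": [140.0, 150.0, 160.0]})
m = X.to_records(index=False)

print(m.dtype.names)  # the column names survive the conversion
print(m.age)          # attribute access to one field
print(m["age"])       # key access to the same field
print(m[0])           # a single row, returned as a record
```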

You can pass this to a cython function as a regular float array by constructing a view:

m_float = m.view(float).reshape(m.shape + (-1,))
print(repr(m_float))

Which gives:

rec.array([[  40.,  140.],
           [  50.,  150.],
           [  60.,  160.]], 
          dtype=float64)

Note that for this to work, the original DataFrame must have a float dtype for every column. To make sure, use m = X.astype(float, copy=False).to_records(index=False).
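
If the fields are not all float, a plain view is not possible. One alternative, not mentioned in the original answer, is NumPy's structured_to_unstructured helper from numpy.lib.recfunctions, which casts the fields to a common dtype. A minimal sketch, assuming a mixed int32/float64 structured array built by hand (the names `mixed`, `dense` and `column_names` are made up):

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Structured array with mixed field types (int32 and float64)
mixed = np.array(
    [(40, 140.0), (50, 150.0), (60, 160.0)],
    dtype=[("age", "<i4"), ("sys_blood_pressure", "<f8")],
)

# Fields are cast to a common dtype, giving a plain 2-D array
dense = rfn.structured_to_unstructured(mixed)
print(dense.shape, dense.dtype)

# The names have to be carried separately once the array is unstructured
column_names = mixed.dtype.names
print(column_names)
```

Unlike the view, this generally makes a copy when the field types differ, so later changes to the dense array are not reflected in the structured one.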

Answer by user48956

OK, here's where I'm leaning:

import numpy as np
import pandas as pd


class NDArrayWithColumns(np.ndarray):
    def __new__(cls, obj,  columns=None):
        obj = obj.view(cls)
        obj.columns = columns
        return obj

    def __array_finalize__(self, obj):
        if obj is None: return
        self.columns = getattr(obj, 'columns', None)

    @staticmethod
    def from_dataframe(df):
        cols = tuple(df.columns)
        arr = df[list(cols)].to_numpy()  # as_matrix() was removed in pandas 1.0
        return NDArrayWithColumns.from_array(arr,cols)

    @staticmethod
    def from_array(array,columns):
        if isinstance(array,NDArrayWithColumns):
            return array
        return NDArrayWithColumns(array,tuple(columns))

    def __str__(self):
        sup = np.ndarray.__str__(self)
        if self.columns:
            header = ", ".join(self.columns)
            header = "# " + header + "\n"
            return header+sup
        return sup

NAN = float("nan")
X = pd.DataFrame(dict(age=[40., NAN, 60.], sys_blood_pressure=[140.,150.,160.]))
arr = NDArrayWithColumns.from_dataframe(X)
print(arr)
print(arr.columns)
print(arr.dtype)

Gives:

# age, sys_blood_pressure
[[  40.  140.]
 [  nan  150.]
 [  60.  160.]]
('age', 'sys_blood_pressure')
float64

and can also be passed to a typed cython function expecting an ndarray[2,double_t].

UPDATE: this works pretty well except for some oddness when passing the type to ufuncs.
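
One caveat that is easy to check (it may or may not be the oddness meant above): __array_finalize__ copies columns to every derived array, so a slice that drops columns still reports the full tuple. A sketch using a trimmed copy of the class, built from a plain array rather than a DataFrame:

```python
import numpy as np


class NDArrayWithColumns(np.ndarray):
    """Trimmed copy of the class above, kept only to show the caveat."""

    def __new__(cls, obj, columns=None):
        obj = np.asarray(obj).view(cls)
        obj.columns = columns
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        # Every view or slice inherits the parent's columns unchanged
        self.columns = getattr(obj, "columns", None)


arr = NDArrayWithColumns(
    np.array([[40.0, 140.0], [50.0, 150.0], [60.0, 160.0]]),
    columns=("age", "sys_blood_pressure"),
)

first_column = arr[:, :1]        # keeps only the "age" column
print(first_column.shape)
print(first_column.columns)      # still lists both names

# Viewing as a base ndarray drops the stale metadata
plain = first_column.view(np.ndarray)
print(type(plain))
```

Keeping the names in sync would need extra handling (for example, overriding __getitem__), which the class as posted does not attempt.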
