pandas: How to subclass pandas DataFrame?

Question

Asked by Lei

Subclassing pandas classes seems a common need but I could not find references on the subject. (It seems that pandas developers are still working on it: https://github.com/pydata/pandas/issues/60).

There are some SO threads on the subject, but I am hoping that someone here can give a more systematic account of the current best way to subclass pandas.DataFrame so that it satisfies two requirements that I think are fairly general:

import numpy as np
import pandas as pd

class MyDF(pd.DataFrame):
    # how to subclass pandas DataFrame?
    pass

mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
print(type(mydf))  # <class '__main__.MyDF'>

# Requirement 1: Instances of MyDF, when calling standard methods of DataFrame,
# should produce instances of MyDF.
mydf_sub = mydf[['A','C']]
print(type(mydf_sub))  # <class 'pandas.core.frame.DataFrame'>

# Requirement 2: Attributes attached to instances of MyDF, when calling standard
# methods of DataFrame, should still attach to the output.
mydf.myattr = 1
mydf_cp1 = MyDF(mydf)
mydf_cp2 = mydf.copy()
print(hasattr(mydf_cp1, 'myattr'))  # False
print(hasattr(mydf_cp2, 'myattr'))  # False

Are there any significant differences when subclassing pandas.Series? Thank you.

Answer 1

Answered by cjrieds

There is now an official guide on how to subclass Pandas data structures, which includes DataFrame as well as Series.

The guide is available here: https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-subclassing-pandas

The guide mentions this subclassed DataFrame from the Geopandas project as a good example: https://github.com/geopandas/geopandas/blob/master/geopandas/geodataframe.py

As in HYRY's answer, it seems there are two things you're trying to accomplish:

Return instances of the correct type (your type) when calling methods on an instance of your class. For this, you can just add the _constructor property, which should return your type.
Add attributes that will be attached to copies of your object. To do this, you need to store the names of these attributes in a list, as the special _metadata attribute.

Here's an example:

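from pandas import DataFrame

class SubclassedDataFrame(DataFrame):
    # Names listed here are propagated to the results of pandas operations
    _metadata = ['added_property']
    added_property = 1  # This will be passed to copies

    @property
    def _constructor(self):
        # Tells pandas which class to use when building results
        return SubclassedDataFrame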
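
To check that the class above meets both requirements, a minimal usage sketch might look like the following. The column names and the value "property" are made up for illustration, and the class is repeated so the snippet runs on its own.

```python
import pandas as pd


class SubclassedDataFrame(pd.DataFrame):
    # Attribute names listed in _metadata are carried over to results
    _metadata = ["added_property"]
    added_property = 1  # class-level default, as in the answer

    @property
    def _constructor(self):
        return SubclassedDataFrame


# Illustrative data; any DataFrame-compatible input would do
df = SubclassedDataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
df.added_property = "property"  # per-instance value (illustrative)

sub = df[["A", "C"]]  # Requirement 1: column selection keeps the subclass
cp = df.copy()        # Requirement 2: the attribute follows the copy

print(type(sub).__name__, sub.added_property)
print(type(cp).__name__, cp.added_property)
```

Whether an attribute is propagated depends on the operation and the pandas version, so it is worth checking the specific methods you rely on.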

Answer 2

Answered by HYRY

For Requirement 1, just define _constructor:

import pandas as pd
import numpy as np

class MyDF(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDF


mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
print(type(mydf))

mydf_sub = mydf[['A','C']]
print(type(mydf_sub))
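
On the Series part of the question: selecting a single column returns a Series, so _constructor alone does not keep the custom type there. The sketch below follows the hooks described in the official subclassing guide (_constructor_sliced on the DataFrame, _constructor_expanddim on the Series). MySeries is a hypothetical name introduced here, not part of the original answer.

```python
import numpy as np
import pandas as pd


class MySeries(pd.Series):  # hypothetical companion class
    @property
    def _constructor(self):
        # Series operations return MySeries
        return MySeries

    @property
    def _constructor_expanddim(self):
        # Going from Series to DataFrame (e.g. to_frame) returns MyDF
        return MyDF


class MyDF(pd.DataFrame):
    @property
    def _constructor(self):
        # DataFrame operations return MyDF
        return MyDF

    @property
    def _constructor_sliced(self):
        # Selecting a single column returns MySeries
        return MySeries


mydf = MyDF(np.random.randn(3, 4), columns=["A", "B", "C", "D"])
col = mydf["A"]
frame_again = col.to_frame()

print(type(col).__name__)
print(type(mydf[["A", "C"]]).__name__)
print(type(frame_again).__name__)
```

So the main difference for Series seems to be which companion hooks you define, not the basic mechanism.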

I don't think there is a simple solution for Requirement 2. You need to define __init__ or copy, or do something in _constructor, for example:

import pandas as pd
import numpy as np

class MyDF(pd.DataFrame):
    _attributes_ = "myattr1,myattr2"

    def __init__(self, *args, **kw):
        super(MyDF, self).__init__(*args, **kw)
        if len(args) == 1 and isinstance(args[0], MyDF):
            args[0]._copy_attrs(self)

    def _copy_attrs(self, df):
        for attr in self._attributes_.split(","):
            df.__dict__[attr] = getattr(self, attr, None)

    @property
    def _constructor(self):
        def f(*args, **kw):
            df = MyDF(*args, **kw)
            self._copy_attrs(df)
            return df
        return f

mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
print(type(mydf))

mydf_sub = mydf[['A','C']]
print(type(mydf_sub))

mydf.myattr1 = 1
mydf_cp1 = MyDF(mydf)
mydf_cp2 = mydf.copy()
print(mydf_cp1.myattr1, mydf_cp2.myattr1)
