Original URL: http://stackoverflow.com/questions/22155951/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
How to subclass pandas DataFrame?
Asked by Lei
Subclassing pandas classes seems a common need but I could not find references on the subject. (It seems that pandas developers are still working on it: https://github.com/pydata/pandas/issues/60).
There are some SO threads on the subject, but I am hoping that someone here can provide a more systematic account of the current best way to subclass pandas.DataFrame so that it satisfies two requirements that I think are general:
import numpy as np
import pandas as pd
class MyDF(pd.DataFrame):
    # how to subclass pandas DataFrame?
    pass
mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
print(type(mydf))  # <class '__main__.MyDF'>
# Requirement 1: Instances of MyDF, when calling standard methods of DataFrame,
# should produce instances of MyDF.
mydf_sub = mydf[['A','C']]
print(type(mydf_sub))  # <class 'pandas.core.frame.DataFrame'>
# Requirement 2: Attributes attached to instances of MyDF, when calling standard
# methods of DataFrame, should still attach to the output.
mydf.myattr = 1
mydf_cp1 = MyDF(mydf)
mydf_cp2 = mydf.copy()
print(hasattr(mydf_cp1, 'myattr'))  # False
print(hasattr(mydf_cp2, 'myattr'))  # False
Are there any significant differences when subclassing pandas.Series? Thank you.
Answered by cjrieds
There is now an official guide on how to subclass Pandas data structures, which includes DataFrame as well as Series.
The guide is available here: https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-subclassing-pandas
The guide mentions this subclassed DataFrame from the Geopandas project as a good example: https://github.com/geopandas/geopandas/blob/master/geopandas/geodataframe.py
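On the Series part of the question: the same guide describes two further hooks, `_constructor_sliced` (the type returned when a DataFrame operation yields a Series) and `_constructor_expanddim` (the type returned when a Series is expanded to a DataFrame). The following is a minimal sketch of how they might be wired together; the class names `MySeries` and `MyDF` are illustrative, and the exact behaviour may vary between pandas versions.

```python
import pandas as pd


class MySeries(pd.Series):
    @property
    def _constructor(self):
        # Series operations on a MySeries return MySeries
        return MySeries

    @property
    def _constructor_expanddim(self):
        # Expanding to two dimensions (e.g. to_frame) returns MyDF
        return MyDF


class MyDF(pd.DataFrame):
    @property
    def _constructor(self):
        # DataFrame operations on a MyDF return MyDF
        return MyDF

    @property
    def _constructor_sliced(self):
        # Selecting a single column returns MySeries
        return MySeries


df = MyDF({"A": [1, 2, 3], "B": [4, 5, 6]})
col = df["A"]
print(type(col).__name__)
print(type(col.to_frame()).__name__)
```

Without `_constructor_sliced`, selecting a single column from the subclass would give back a plain pandas Series, so the two classes are usually defined as a pair.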
As in HYRY's answer, it seems there are two things you're trying to accomplish:
- When calling methods on an instance of your class, return instances of the correct type (your type). For this, you can just add the _constructor property, which should return your type.
- Add attributes that stay attached to copies of your object. To do this, you need to store the names of these attributes in a list, as the special _metadata attribute.
Here's an example:
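from pandas import DataFrame
class SubclassedDataFrame(DataFrame):
    # Attribute names listed here are carried over to copies
    _metadata = ['added_property']
    added_property = 1  # Class-level default; this will be passed to copies
    @property
    def _constructor(self):
        # Methods that build a new frame use this type for the result
        return SubclassedDataFrame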
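To check both requirements from the question against this pattern, here is a small sketch that sets the attribute on an instance and then inspects a column selection and a copy. The sample data and the value `"hello"` are made up for illustration; the propagation relies on pandas copying the names in `_metadata` onto derived objects, as described in the guide, and details may differ between versions.

```python
import numpy as np
import pandas as pd


class SubclassedDataFrame(pd.DataFrame):
    # Names listed in _metadata are carried over to derived objects
    _metadata = ["added_property"]

    @property
    def _constructor(self):
        return SubclassedDataFrame


df = SubclassedDataFrame(np.random.randn(3, 4), columns=list("ABCD"))
df.added_property = "hello"  # instance-level value, illustrative only

sub = df[["A", "C"]]  # Requirement 1: the result keeps the subclass type
cp = df.copy()        # Requirement 2: the attribute survives the copy

print(type(sub).__name__)
print(cp.added_property)
print(sub.added_property)
```

Attributes that are not listed in `_metadata` are not carried over, so anything that should survive an operation has to be named there.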
Answered by HYRY
For Requirement 1, just define _constructor:
import pandas as pd
import numpy as np
class MyDF(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDF
mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
print(type(mydf))
mydf_sub = mydf[['A','C']]
print(type(mydf_sub))
I think there is no simple solution for Requirement 2. I think you need to define __init__ or copy, or do something in _constructor, for example:
import pandas as pd
import numpy as np
class MyDF(pd.DataFrame):
    # Comma-separated names of the custom attributes to carry over
    _attributes_ = "myattr1,myattr2"
    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
        # When built from another MyDF, copy its custom attributes
        if len(args) == 1 and isinstance(args[0], MyDF):
            args[0]._copy_attrs(self)
    def _copy_attrs(self, df):
        # Write straight into __dict__ so the names are not treated as columns
        for attr in self._attributes_.split(","):
            df.__dict__[attr] = getattr(self, attr, None)
    @property
    def _constructor(self):
        # Return a factory that builds a MyDF and copies the attributes over
        def f(*args, **kw):
            df = MyDF(*args, **kw)
            self._copy_attrs(df)
            return df
        return f
mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
print(type(mydf))
mydf_sub = mydf[['A','C']]
print(type(mydf_sub))
mydf.myattr1 = 1
mydf_cp1 = MyDF(mydf)
mydf_cp2 = mydf.copy()
print(mydf_cp1.myattr1, mydf_cp2.myattr1)

