Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/41130255/

Time: 2020-09-14 02:37:29  Source: igfitidea

For a Pandas DataFrame, what's the difference between using square brackets or a dot to access a column?

python pandas dataframe indexing

Asked by Alberto Segundo

i.e.:

import pandas

d = {'col1': 2, 'col2': 2.5}
df = pandas.DataFrame(data=d, index=[0])

print(df['col2'])
print(df.col2)

The output is the same.

Does this answer apply to this case?

What's the difference between the square bracket and dot notations in Python?

Accepted answer by Julien Marrec

The "dot notation", i.e. df.col2, is the attribute access that's exposed as a convenience.

You may access an index on a Series, column on a DataFrame, and an item on a Panel directly as an attribute:

df['col2'] does the same: it returns a pd.Series of the column.
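
A quick way to check that claim, as a minimal sketch reusing the DataFrame from the question (the variable names by_bracket and by_dot are only illustrative):

```python
import pandas as pd

# Same single-row frame as in the question.
df = pd.DataFrame(data={'col1': 2, 'col2': 2.5}, index=[0])

by_bracket = df['col2']   # item access
by_dot = df.col2          # attribute access

# Both forms give a pandas Series with the same name and values.
same_type = isinstance(by_bracket, pd.Series) and isinstance(by_dot, pd.Series)
same_data = by_bracket.equals(by_dot)
print(same_type, same_data)
```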

A few caveats about attribute access:

  • you cannot add a column (df.new_col = x won't work; worse, it will silently create a new attribute rather than a column - think monkey-patching here)
  • it won't work if you have spaces in the column name or if the column name is an integer.
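
Both caveats can be reproduced in a few lines. This is a minimal sketch with made-up column names ('a', 'my col', and the integer 0); the warning filter is there only because pandas 0.21.0 and later emit a UserWarning for this assignment:

```python
import warnings
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'my col': [3, 4], 0: [5, 6]})

# Assigning to a new attribute name does not add a column.
# The warning is suppressed only to keep this sketch's output clean.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.new_col = [7, 8]

print('new_col' in df.columns)   # False: no such column was created
print(df.new_col)                # [7, 8], stored as a plain attribute

# Names with spaces and integer names have no dot form at all
# (df.my col and df.0 are syntax errors), so brackets are required.
print(df['my col'].tolist())     # [3, 4]
print(df[0].tolist())            # [5, 6]
```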

Answer by BrenBarn

They are the same as long as you're accessing a single column with a simple name, but you can do more with the bracket notation. You can only use df.col if the column name is a valid Python identifier (e.g., it does not contain spaces and other such stuff). Also, you may encounter surprises if your column name clashes with a pandas method name (like sum). With brackets you can select multiple columns (e.g., df[['col1', 'col2']]) or add a new column (df['newcol'] = ...), which can't be done with dot access.
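
A short sketch of those three points, assuming a DataFrame that happens to have a column literally named 'sum' (the data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'sum': [10, 20]})

# The dot form finds the DataFrame.sum method, not the column.
print(callable(df.sum))        # True
# The bracket form always returns the column.
print(df['sum'].tolist())      # [10, 20]

# Brackets also accept a list of names and return a DataFrame.
subset = df[['col1', 'sum']]
print(type(subset).__name__)   # DataFrame

# Brackets can create a new column.
df['newcol'] = df['col1'] * 2
print(list(df.columns))        # ['col1', 'sum', 'newcol']
```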

The other question you linked to applies, but that is a much more general question. Python objects get to define how the . and [] operators apply to them. Pandas DataFrames have chosen to make them the same for this limited case of accessing single columns, with the caveats described above.
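
To illustrate how an object can define both operators, here is a toy sketch. The Record class is hypothetical and is not how pandas is implemented; it only shows the __getitem__ and __getattr__ hooks, and why a field that shares a name with a method is reachable only through brackets:

```python
class Record:
    """Hypothetical toy container, not pandas code."""

    def __init__(self, **fields):
        self._fields = dict(fields)

    def __getitem__(self, key):
        # Called for record[key].
        return self._fields[key]

    def __getattr__(self, name):
        # Called for record.name only when normal attribute lookup fails.
        if name.startswith('_'):
            raise AttributeError(name)
        try:
            return self._fields[name]
        except KeyError:
            raise AttributeError(name) from None

    def keys(self):
        # A regular method; it takes priority over a field named 'keys'.
        return list(self._fields)


rec = Record(col1=2, keys='shadowed')
print(rec['col1'], rec.col1)   # 2 2
print(rec['keys'])             # shadowed
print(callable(rec.keys))      # True: the method wins over the field
```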

Answer by YaOzI

Short answer for differences:

  • [] indexing (square brackets access) has the full functionality to operate on DataFrame column data.
  • Attribute access (dot access), by contrast, is mainly a convenience for accessing existing DataFrame column data, and it has limitations (e.g. special column names, creating a new column).

More explanation: Series and DataFrame are core classes and data structures in pandas, and of course they are Python classes too, so there are some minor distinctions in attribute access between a pandas DataFrame and normal Python objects. But it's well documented and can be easily understood. Just a few points to note:

  1. In Python, users may dynamically add data attributes of their own to an instance object using attribute access.

    >>> class Dog(object):
    ...     pass
    >>> dog = Dog()
    >>> vars(dog)
    {}
    >>> superdog = Dog()
    >>> vars(superdog)
    {}
    >>> dog.legs = 'I can run.'
    >>> superdog.wings = 'I can fly.'
    >>> vars(dog)
    {'legs': 'I can run.'}
    >>> vars(superdog)
    {'wings': 'I can fly.'}
    
  2. In pandas, index and column are closely related to the data structure, so you may access an index on a Series, or a column on a DataFrame, as an attribute.

    >>> import pandas as pd
    >>> import numpy as np
    >>> data = np.random.randint(low=0, high=10, size=(2,2))
    >>> df = pd.DataFrame(data, columns=['a', 'b'])
    >>> df
       a  b
    0  7  6
    1  5  8
    >>> vars(df)
    {'_is_copy': None, 
     '_data': BlockManager
        Items: Index(['a', 'b'], dtype='object')
        Axis 1: RangeIndex(start=0, stop=2, step=1)
        IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64,
     '_item_cache': {}}
    
  3. But pandas attribute access is mainly a convenience for reading from and modifying an existing element of a Series or column of a DataFrame.

    >>> df.a
    0    7
    1    5
    Name: a, dtype: int64
    >>> df.b = [1, 1]
    >>> df
       a  b
    0  7  1
    1  5  1
    
  4. And the convenience is a tradeoff against full functionality. E.g. you can create a DataFrame object with column names ['space bar', '1', 'loc', 'min', 'index'], but you can't access them as attributes, because they are either not valid Python identifiers (1, space bar) or conflict with an existing attribute or method name (loc, min, index).

    >>> data = np.random.randint(0, 10, size=(2, 5))
    >>> df_special_col_names = pd.DataFrame(data, columns=['space bar', '1', 'loc', 'min', 'index'])
    >>> df_special_col_names
       space bar  1  loc  min  index
    0          4  4    4    8      9
    1          3  0    1    2      3
    
  5. In these cases, .loc, .iloc and [] indexing are the defined way to fully access and operate on the index and columns of Series and DataFrame objects.

    >>> df_special_col_names['space bar']
    0    4
    1    3
    Name: space bar, dtype: int64
    >>> df_special_col_names.loc[:, 'min']
    0    8
    1    2
    Name: min, dtype: int64
    >>> df_special_col_names.iloc[:, 1]
    0    4
    1    0
    Name: 1, dtype: int64
    
  6. Another important difference arises when trying to create a new column for a DataFrame. As you can see, df.c = df.a + df.b just creates a new attribute alongside the core data structure, so starting from version 0.21.0, this behavior raises a UserWarning (silent no more).

    >>> df
       a  b
    0  7  1
    1  5  1
    >>> df.c = df.a + df.b
    __main__:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
    >>> df['d'] = df.a + df.b
    >>> df
       a  b  d
    0  7  1  8
    1  5  1  6
    >>> df.c
    0    8
    1    6
    dtype: int64
    >>> vars(df)
    {'_is_copy': None, 
     '_data': 
        BlockManager
        Items: Index(['a', 'b', 'd'], dtype='object')
        Axis 1: RangeIndex(start=0, stop=2, step=1)
        IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64
        IntBlock: slice(2, 3, 1), 1 x 2, dtype: int64, 
     '_item_cache': {},
     'c': 0    8
          1    6
          dtype: int64}
    
  7. Finally, to create a new column for a DataFrame, never use attribute access. The correct way is to use either [] or .loc indexing:

    >>> df
       a  b
    0  7  6
    1  5  8
    >>> df['c'] = df.a + df.b 
    >>> # OR
    >>> df.loc[:, 'c'] = df.a + df.b
    >>> df  # c is a newly added column
       a  b   c
    0  7  6  13
    1  5  8  13
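
The name restrictions from point 4 can also be checked mechanically. The following is only a sketch: dot_accessible is a hypothetical helper, not a pandas function, and it assumes that a name is unsafe for dot access whenever it is not a valid identifier, is a Python keyword, or already exists as an attribute of the DataFrame class.

```python
import keyword
import pandas as pd


def dot_accessible(name):
    """Hypothetical helper: can a column with this name be read as df.<name>?"""
    return (
        isinstance(name, str)
        and name.isidentifier()                  # rules out 'space bar' and '1'
        and not keyword.iskeyword(name)          # rules out names like 'class'
        and not hasattr(pd.DataFrame, name)      # rules out 'loc', 'min', 'index'
    )


columns = ['space bar', '1', 'loc', 'min', 'index', 'a']
for name in columns:
    print(repr(name), dot_accessible(name))
```

Only 'a' passes the check here; every other name in the list has to be reached with [] or .loc.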
    