For a Pandas DataFrame, what is the difference between using square brackets and dot notation to access a column?

Question

Asked by Alberto Segundo

i.e.:

import pandas

d = {'col1': 2, 'col2': 2.5}
df = pandas.DataFrame(data=d, index=[0])

print(df['col2'])
print(df.col2)

The output is the same.

Does this answer apply to this case?

What's the difference between the square bracket and dot notations in Python?

Answer 1

Accepted answer by Julien Marrec

The "dot notation", i.e. df.col2, is the attribute access that's exposed as a convenience.

You may access an index on a Series, column on a DataFrame, and an item on a Panel directly as an attribute:

df['col2'] does the same: it returns a pd.Series of the column.
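
As a quick check of that claim, the sketch below rebuilds the one-row frame from the question and compares the two access styles. The names by_bracket and by_dot are introduced here for illustration only.

```python
import pandas as pd

# Same one-row frame as in the question.
df = pd.DataFrame({"col1": 2, "col2": 2.5}, index=[0])

by_bracket = df["col2"]   # item access
by_dot = df.col2          # attribute access

# Both are Series holding the same column data.
print(type(by_bracket).__name__, type(by_dot).__name__)
print(by_bracket.equals(by_dot))
```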

A few caveats about attribute access:

you cannot add a column (df.new_col = x won't work; worse, it will silently create a new attribute rather than a column, think monkey-patching here)
it won't work if you have spaces in the column name or if the column name is an integer.

Answer 2

Answer by BrenBarn

They are the same as long as you're accessing a single column with a simple name, but you can do more with the bracket notation. You can only use df.col if the column name is a valid Python identifier (e.g., does not contain spaces and other such stuff). Also, you may encounter surprises if your column name clashes with a pandas method name (like sum). With brackets you can select multiple columns (e.g., df[['col1', 'col2']]) or add a new column (df['newcol'] = ...), which can't be done with dot access.
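
The points above can be sketched as follows. The frame and its column names ("sum", "my col", "newcol") are made up for illustration; the column named sum is there only to show the clash with the DataFrame.sum method.

```python
import pandas as pd

# Illustrative frame: one column clashes with a method name, one has a space.
df = pd.DataFrame({"col1": [1, 2], "sum": [10, 20], "my col": [5, 6]})

# Dot access resolves to the DataFrame.sum method, not the column.
dot_result = df.sum
bracket_result = df["sum"]
print(callable(dot_result))          # the bound method is callable
print(bracket_result.tolist())       # the actual column values

# Brackets also handle names with spaces and multi-column selection.
subset = df[["col1", "my col"]]
print(list(subset.columns))

# Brackets create a real new column.
df["newcol"] = df["col1"] * 2
print(list(df.columns))
```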

The other question you linked to applies, but that is a much more general question. Python objects get to define how the . and [] operators apply to them. Pandas DataFrames have chosen to make them the same for this limited case of accessing single columns, with the caveats described above.
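
To illustrate how an object can define both operators, here is a toy container. The Table class is hypothetical and is not how pandas is implemented; it only shows the general mechanism (__getitem__ for brackets, __getattr__ as a fallback for dots) and why a method name wins over a column of the same name.

```python
class Table:
    """Toy column container (hypothetical, not pandas)."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, key):
        # Bracket access: any key works, including names with spaces.
        return self._columns[key]

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so real
        # methods and attributes always take precedence over columns.
        columns = self.__dict__.get("_columns", {})
        if name in columns:
            return columns[name]
        raise AttributeError(name)

    def sum(self):
        return "I am a method, not a column"


t = Table({"col1": [1, 2], "sum": [3], "my col": [4]})
print(t.col1 is t["col1"])   # same object through both routes
print(t["sum"])              # the column
print(t.sum())               # the method shadows the column for dot access
print(t["my col"])           # only reachable with brackets
```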

Answer 3

Answer by YaOzI

Short answer for differences:

[] indexing (square bracket access) has the full functionality to operate on DataFrame column data.
Attribute access (dot access) is mainly a convenience for accessing existing DataFrame column data, and it has limitations (e.g. special column names, creating a new column).

More explanation: Series and DataFrame are core classes and data structures in pandas, and of course they are Python classes too, so there are some minor distinctions in attribute access between a pandas DataFrame and normal Python objects. But it's well documented and can be easily understood. Just a few points to note:

In Python, users may dynamically add data attributes of their own to an instance object using attribute access.

>>> class Dog(object):
...     pass
>>> dog = Dog()
>>> vars(dog)
{}
>>> superdog = Dog()
>>> vars(superdog)
{}
>>> dog.legs = 'I can run.'
>>> superdog.wings = 'I can fly.'
>>> vars(dog)
{'legs': 'I can run.'}
>>> vars(superdog)
{'wings': 'I can fly.'}

In pandas, index and column are closely related to the data structure; you may access an index on a Series or a column on a DataFrame as an attribute.

>>> import pandas as pd
>>> import numpy as np
>>> data = np.random.randint(low=0, high=10, size=(2,2))
>>> df = pd.DataFrame(data, columns=['a', 'b'])
>>> df
   a  b
0  7  6
1  5  8
>>> vars(df)
{'_is_copy': None, 
 '_data': BlockManager
    Items: Index(['a', 'b'], dtype='object')
    Axis 1: RangeIndex(start=0, stop=2, step=1)
    IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64,
 '_item_cache': {}}

But pandas attribute access is mainly a convenience for reading from and modifying an existing element of a Series or column of a DataFrame.
```
>>> df.a
0    7
1    5
Name: a, dtype: int64
>>> df.b = [1, 1]
>>> df
   a  b
0  7  1
1  5  1
```
And the convenience is a tradeoff against full functionality. E.g. you can create a DataFrame object with column names ['space bar', '1', 'loc', 'min', 'index'], but you can't access them as attributes, because they either are not valid Python identifiers (1, space bar) or conflict with an existing method or attribute name (loc, min, index).
```
>>> data = np.random.randint(0, 10, size=(2, 5))
>>> df_special_col_names = pd.DataFrame(data, columns=['space bar', '1', 'loc', 'min', 'index'])
>>> df_special_col_names
   space bar  1  loc  min  index
0          4  4    4    8      9
1          3  0    1    2      3
```
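
The two failure conditions just described can be approximated with a small check. The helper is_dot_accessible is hypothetical (it is not a pandas API) and only a rough sketch: it tests whether a name is a valid, non-keyword Python identifier and whether the DataFrame class already defines something under that name. The extra column "a" is added here as a contrasting case.

```python
import keyword
import pandas as pd


def is_dot_accessible(name):
    """Rough check (hypothetical helper, not part of pandas)."""
    return (
        isinstance(name, str)
        and name.isidentifier()                  # rules out 'space bar' and '1'
        and not keyword.iskeyword(name)          # rules out names like 'class'
        and not hasattr(pd.DataFrame, name)      # rules out 'loc', 'min', ...
    )


columns = ["space bar", "1", "loc", "min", "index", "a"]
report = {name: is_dot_accessible(name) for name in columns}
print(report)
```

Under these assumptions only "a" is reported as safe for dot access; every name from the example above needs [] or .loc.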

In these cases, .loc, .iloc and [] indexing are the defined way to fully access and operate on the index and columns of Series and DataFrame objects.

>>> df_special_col_names['space bar']
0    4
1    3
Name: space bar, dtype: int64
>>> df_special_col_names.loc[:, 'min']
0    8
1    2
Name: min, dtype: int64
>>> df_special_col_names.iloc[:, 1]
0    4
1    0
Name: 1, dtype: int64

Another important difference arises when trying to create a new column for a DataFrame. As you can see, df.c = df.a + df.b just creates a new attribute alongside the core data structure, so starting from version 0.21.0, this behavior raises a UserWarning (silent no more).

>>> df
   a  b
0  7  1
1  5  1
>>> df.c = df.a + df.b
__main__:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
>>> df['d'] = df.a + df.b
>>> df
   a  b  d
0  7  1  8
1  5  1  6
>>> df.c
0    8
1    6
dtype: int64
>>> vars(df)
{'_is_copy': None, 
 '_data': 
    BlockManager
    Items: Index(['a', 'b', 'd'], dtype='object')
    Axis 1: RangeIndex(start=0, stop=2, step=1)
    IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64
    IntBlock: slice(2, 3, 1), 1 x 2, dtype: int64, 
 '_item_cache': {},
 'c': 0    8
      1    6
      dtype: int64}
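
The same behavior can be checked programmatically rather than by reading the console. This is a minimal sketch assuming pandas 0.21.0 or later, as stated above; the variable got_user_warning is introduced here for illustration.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"a": [7, 5], "b": [1, 1]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")   # make sure the warning is not filtered out
    df.c = df.a + df.b                # sets a plain attribute, not a column

got_user_warning = any(issubclass(w.category, UserWarning) for w in caught)
print(got_user_warning)
print("c" in df.columns)              # the attribute did not become a column

df["d"] = df.a + df.b                 # bracket assignment creates a real column
print(list(df.columns))
```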

Finally, to create a new column for a DataFrame, never use attribute access; the correct way is to use either [] or .loc indexing:

>>> df
   a  b
0  7  6
1  5  8
>>> df['c'] = df.a + df.b
>>> # OR
>>> df.loc[:, 'c'] = df.a + df.b
>>> df # c is a newly added column
   a  b   c
0  7  6  13
1  5  8  13
