Pandas selecting by label sometimes returns Series, sometimes returns DataFrame

Source: Stack Overflow, http://stackoverflow.com/questions/20383647/ (CC BY-SA 4.0: reuse requires attribution to the original authors under the same license)
Page date: 2020-08-18 20:18:51 (via igfitidea)

Tags: python, pandas, dataframe, slice, series

Asked by jobevers

In Pandas, when I select a label that has only one entry in the index I get back a Series, but when I select a label that has more than one entry I get back a DataFrame.

Why is that? Is there a way to ensure I always get back a data frame?

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

In [3]: type(df.loc[3])
Out[3]: pandas.core.frame.DataFrame

In [4]: type(df.loc[1])
Out[4]: pandas.core.series.Series

Accepted answer by Dan Allan

Granted, the behavior is inconsistent, but I think it's easy to imagine cases where this is convenient. Anyway, to get a DataFrame every time, just pass a list to loc. There are other ways, but in my opinion this is the cleanest.

In [2]: type(df.loc[[3]])
Out[2]: pandas.core.frame.DataFrame

In [3]: type(df.loc[[1]])
Out[3]: pandas.core.frame.DataFrame

Answer by joris

Your index contains the label 3 three times. For this reason df.loc[3] returns a DataFrame.

The reason is that you don't specify the column. So df.loc[3] selects three items of all columns (here the only column, 0), while df.loc[3,0] will return a Series. Likewise, df.loc[1:2] also returns a DataFrame, because you slice the rows.

Selecting a single row (as in df.loc[1]) returns a Series with the column names as the index.

If you want to be sure to always get a DataFrame, you can slice, as in df.loc[1:1]. Another option is boolean indexing (df.loc[df.index==1]) or the take method (df.take([0]), but this uses positions, not labels!).
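
The options listed in this answer can be compared side by side. The following is only a sketch that reuses the frame from the question; the variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

# Label-based options: each of these returns a DataFrame, even for one row.
by_slice = df.loc[1:1]            # label slice, both endpoints inclusive
by_mask = df.loc[df.index == 1]   # boolean indexing on the index
by_list = df.loc[[1]]             # list of labels

# Position-based: take([0]) means "first row", whatever its label is.
by_position = df.take([0])

# Specifying the column as well drops one dimension.
column_for_3 = df.loc[3, 0]       # Series: label 3 occurs three times
value_for_1 = df.loc[1, 0]        # scalar: label 1 occurs once

for name, obj in [("slice", by_slice), ("mask", by_mask),
                  ("list", by_list), ("take", by_position)]:
    print(name, type(obj).__name__, obj.shape)
print(type(column_for_3).__name__, type(value_for_1).__name__)
```

The label slice works here because the index is sorted; with an unsorted index that contains duplicates, slicing on a duplicated label is not expected to work, so the mask or list forms are probably the safer choice in that situation.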

Answer by eyquem

You wrote in a comment to joris' answer:

"I don't understand the design decision for single rows to get converted into a series - why not a data frame with one row?"

A single row isn't converted into a Series.
It IS a Series. (No, I don't think so, in fact; see the edit.)

The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Panel is a container for DataFrame objects. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.

http://pandas.pydata.org/pandas-docs/stable/overview.html#why-more-than-1-data-structure

The data model of Pandas objects has been chosen like that. The reason certainly lies in some advantages it ensures that I don't know of (I don't fully understand the last sentence of the quotation; maybe it's the reason).

Edit: I no longer agree with what I wrote above.

A DataFrame can't be composed of elements that would be Series, because the following code gives the same type, Series, for a row as for a column:

import pandas as pd

df = pd.DataFrame(data=[11, 12, 13], index=[2, 3, 3])

print('-------- df -------------')
print(df)

# A row selected by a unique label
print('\n------- df.loc[2] --------')
print(df.loc[2])
print('type(df.loc[2]) : ', type(df.loc[2]))

# A column selected by its name
print('\n--------- df[0] ----------')
print(df[0])
print('type(df[0]) : ', type(df[0]))

Result:

-------- df -------------
    0
2  11
3  12
3  13

------- df.loc[2] --------
0    11
Name: 2, dtype: int64
type(df.loc[2]) :  <class 'pandas.core.series.Series'>

--------- df[0] ----------
2    11
3    12
3    13
Name: 0, dtype: int64
type(df[0]) :  <class 'pandas.core.series.Series'>

So it makes no sense to claim that a DataFrame is composed of Series, because what would those Series be: columns or rows? A pointless question and point of view.
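
One way to look at the row/column question is to give the two columns different types. The sketch below assumes a hypothetical frame with an integer column "n" and a string column "s" (names invented for this example); the column Series keeps its own dtype, while the row Series has to hold both values and falls back to object.

```python
import pandas as pd

# Hypothetical frame with one integer column and one string column.
df = pd.DataFrame({"n": [11, 12, 13], "s": ["a", "b", "c"]}, index=[2, 3, 3])

column = df["n"]   # a column: indexed by the row labels, keeps its own dtype
row = df.loc[2]    # a row: indexed by the column names, built on request

print(column.dtype, list(column.index))
print(row.dtype, list(row.index))
```

This suggests that rows and columns are not symmetric, even though both selections come back as Series.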

Then what is a DataFrame?

In the previous version of this answer, I asked this question while trying to answer the "Why is that?" part of the OP's question, and the similar query in one of his comments ("single rows to get converted into a series - why not a data frame with one row?"),
while the "Is there a way to ensure I always get back a data frame?" part has been answered by Dan Allan.

Then, as the Pandas docs cited above say that pandas' data structures are best seen as containers of lower dimensional data, it seemed to me that the why would be found in the characteristics of DataFrame structures.

However, I realized that this cited advice must not be taken as a precise description of the nature of Pandas' data structures.
This advice doesn't mean that a DataFrame is a container of Series.
It says that mentally representing a DataFrame as a container of Series (either rows or columns, depending on which is being considered at a given point in the reasoning) is a good way to think about DataFrames, even if it isn't strictly the case in reality. "Good" meaning that this view makes it possible to use DataFrames efficiently. That's all.

Then what is a DataFrame object?

The DataFrame class produces instances whose particular structure originates in the NDFrame base class, itself derived from the PandasContainer base class, which is also a parent class of the Series class.
Note that this is correct for Pandas up to version 0.12. In the upcoming version 0.13, Series will also derive from the NDFrame class only.

# with pandas 0.12 (Python 2 print syntax)

from pandas import Series
print 'Series  :\n',Series
print 'Series.__bases__  :\n',Series.__bases__

from pandas import DataFrame
print '\nDataFrame  :\n',DataFrame
print 'DataFrame.__bases__  :\n',DataFrame.__bases__

print '\n-------------------'

from pandas.core.generic import NDFrame
print '\nNDFrame.__bases__  :\n',NDFrame.__bases__

from pandas.core.generic import PandasContainer
print '\nPandasContainer.__bases__  :\n',PandasContainer.__bases__

from pandas.core.base import PandasObject
print '\nPandasObject.__bases__  :\n',PandasObject.__bases__

from pandas.core.base import StringMixin
print '\nStringMixin.__bases__  :\n',StringMixin.__bases__

Result:

Series  :
<class 'pandas.core.series.Series'>
Series.__bases__  :
(<class 'pandas.core.generic.PandasContainer'>, <type 'numpy.ndarray'>)

DataFrame  :
<class 'pandas.core.frame.DataFrame'>
DataFrame.__bases__  :
(<class 'pandas.core.generic.NDFrame'>,)

-------------------

NDFrame.__bases__  :
(<class 'pandas.core.generic.PandasContainer'>,)

PandasContainer.__bases__  :
(<class 'pandas.core.base.PandasObject'>,)

PandasObject.__bases__  :
(<class 'pandas.core.base.StringMixin'>,)

StringMixin.__bases__  :
(<type 'object'>,)
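
For pandas versions after the change mentioned above (0.13 and later), a similar inspection can be sketched in Python 3. This assumes that NDFrame is still importable from pandas.core.generic, which is an internal module path and may move between versions.

```python
import pandas as pd
from pandas.core.generic import NDFrame  # internal path; an assumption, may change

for cls in (pd.Series, pd.DataFrame):
    # __mro__ lists the class followed by all of its ancestors, in lookup order.
    print(cls.__name__, "->", [base.__name__ for base in cls.__mro__])

# Both classes are expected to share NDFrame as a common ancestor.
print(issubclass(pd.Series, NDFrame), issubclass(pd.DataFrame, NDFrame))
```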

So my understanding is now that a DataFrame instance has certain methods that have been crafted in order to control the way data are extracted from rows and columns.

How these extraction methods work is described on this page: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing
There we find the method given by Dan Allan, among others.

Why have these extraction methods been crafted the way they were?
Certainly because they were judged to offer the best possibilities and ease of use in data analysis.
That is precisely what this sentence expresses:

The best way to think about the pandas data structures is as flexible containers for lower dimensional data.

The why of how data is extracted from a DataFrame instance doesn't lie in its structure; it lies in the why of this structure. I guess that the structure and functioning of Pandas' data structures have been chiseled to be as intellectually intuitive as possible, and that to understand the details, one must read Wes McKinney's blog.

Answer by Ajit

If the objective is to get a subset of the data set using the index, it is best to avoid using loc or iloc. Instead you should use syntax similar to this:

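import pandas as pd

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

# Boolean mask on the index: a DataFrame for a duplicated label...
result = df[df.index == 3]
isinstance(result, pd.DataFrame) # True

# ...and for a unique label as well
result = df[df.index == 1]
isinstance(result, pd.DataFrame) # True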
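
One practical difference between the mask form and passing a list to loc shows up when the label is missing. The sketch below uses 99 as an arbitrary label that is not in the index.

```python
import pandas as pd

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

# A label that is not in the index: the mask is all False, so the result is empty.
empty = df[df.index == 99]
print(type(empty).__name__, empty.shape)

# The list form of loc raises instead when none of the labels exist.
try:
    df.loc[[99]]
    raised = False
except KeyError:
    raised = True
print("loc[[99]] raised KeyError:", raised)
```

Which behavior is preferable depends on whether a missing label should be treated as an error or as an empty selection.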

Answer by user4422

Use df['columnName'] to get a Series and df[['columnName']] to get a DataFrame.
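
As a small sketch of the difference, assuming a hypothetical frame in which "columnName" is simply a placeholder column name:

```python
import pandas as pd

# "columnName" and "other" are placeholder names for this sketch.
df = pd.DataFrame({"columnName": [10, 20, 30], "other": [1.0, 2.0, 3.0]})

as_series = df["columnName"]    # scalar key   -> Series
as_frame = df[["columnName"]]   # list of keys -> DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```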

Answer by Wouter

If you also select on the index of the dataframe, the result can be either a DataFrame or a Series, or it can be a Series or a scalar (single value).

This function ensures that you always get a list from your selection (if the df, index and column are valid):

import pandas as pd

def get_list_from_df_column(df, index, column):
    # A list for the column key keeps that axis, so this returns a Series
    # (unique index label) or a DataFrame (duplicated index label).
    df_or_series = df.loc[index, [column]]
    # df.loc[index, column] is also possible and returns a series or a scalar
    if isinstance(df_or_series, pd.Series):
        resulting_list = df_or_series.tolist()  # get list from series
    else:
        # use the column key to get a series from the dataframe
        resulting_list = df_or_series[column].tolist()
    return resulting_list
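
A shorter variant is possible by combining this with the list trick from the accepted answer. This is an alternative sketch, not the function above: column_values_as_list is a hypothetical name, and it relies on a list of row labels plus a scalar column key selecting a Series in both the unique and the duplicated case.

```python
import pandas as pd


def column_values_as_list(df, index, column):
    """Hypothetical helper: values of `column` for row label `index`, as a list."""
    # A list of row labels plus a scalar column key selects a Series,
    # whether the label occurs once or several times.
    return df.loc[[index], column].tolist()


df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
print(column_values_as_list(df, 1, 0))
print(column_values_as_list(df, 3, 0))
```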

Answer by Colin Anthony

TL;DR

When using loc

df.loc[:] = DataFrame

df.loc[int] = Series if the label occurs once in the index, DataFrame if it occurs more than once (regardless of the number of columns)

df.loc[:, ["col_name"]] = DataFrame

df.loc[:, "col_name"] = Series

Not using loc

df["col_name"] = Series

df[["col_name"]] = DataFrame
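
These cases can be checked with a short script. The frame below is hypothetical: two columns ("col_name" and "other", names invented here) and a row label, 3, that appears twice.

```python
import pandas as pd

# Hypothetical two-column frame with a duplicated row label.
df = pd.DataFrame({"col_name": [1, 2, 3], "other": [4, 5, 6]}, index=[1, 3, 3])

cases = {
    "df.loc[:]": df.loc[:],
    "df.loc[1]": df.loc[1],          # label occurs once
    "df.loc[3]": df.loc[3],          # label occurs twice
    'df.loc[:, ["col_name"]]': df.loc[:, ["col_name"]],
    'df.loc[:, "col_name"]': df.loc[:, "col_name"],
    'df["col_name"]': df["col_name"],
    'df[["col_name"]]': df[["col_name"]],
}
for expression, result in cases.items():
    print(f"{expression:28} -> {type(result).__name__}")
```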