Note: this page is a Chinese-English side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/44890713/

Date: 2020-08-20 00:35:36  Source: igfitidea

Selection with .loc in python

python pandas dataframe ipython selection

Asked by bugsyb

I saw this code in someone's iPython notebook, and I'm very confused as to how this code works. As far as I understood, pd.loc[] is used as a location based indexer where the format is:

    df.loc[index, column_name]

However, in this case, the first index seems to be a series of boolean values. Could someone please explain to me how this selection works? I tried to read through the documentation but I couldn't find an explanation. Thanks!

    iris_data.loc[iris_data['class'] == 'versicolor', 'class'] = 'Iris-versicolor'

Answered by piRSquared

pd.DataFrame.loc can take one or two indexers. For the rest of the post, I'll represent the first indexer as i and the second indexer as j.

If only one indexer is provided, it applies to the index of the dataframe and the missing indexer is assumed to represent all columns. So the following two examples are equivalent.

  1. df.loc[i]
  2. df.loc[i, :]

Here, : is used to represent all columns.
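
As a quick check of that equivalence, here is a minimal sketch. It assumes the same small example frame this answer uses as its running example, and the variable names (row_only, row_and_all, same) are made up for illustration.

```python
import pandas as pd

# Same small frame this answer uses as its running example.
df = pd.DataFrame([[1, 2], [3, 4]], index=['A', 'B'], columns=['X', 'Y'])

row_only = df.loc['A']        # one indexer: all columns are implied
row_and_all = df.loc['A', :]  # explicit ":" as the column indexer

# Both results are a Series indexed by the column labels.
same = row_only.equals(row_and_all)
print(same)
```

Both forms select row 'A' across every column, so the comparison reports that the two results are equal.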

If both indexers are present, i references index values and j references column values.

Now we can focus on what types of values i and j can assume. Let's use the following dataframe df as our example:

    import pandas as pd
    df = pd.DataFrame([[1, 2], [3, 4]], index=['A', 'B'], columns=['X', 'Y'])

loc has been written such that i and j can be

  1. scalars that should be values in the respective index objects

    df.loc['A', 'Y']
    
    2
    
  2. arrays whose elements are also members of the respective index object (notice that the order of the array I pass to loc is respected)

    df.loc[['B', 'A'], 'X']
    
    B    3
    A    1
    Name: X, dtype: int64
    
    • Notice the dimensionality of the return object when passing arrays. When i is an array, as it was above, loc returns an object whose index holds those values. In this case, because j was a scalar, loc returned a pd.Series object. We could have made it return a dataframe by passing arrays for both i and j, and the array for j could have been just a single-element array.

      df.loc[['B', 'A'], ['X']]
      
         X
      B  3
      A  1
      
  3. boolean arrays whose elements are True or False and whose length matches the length of the respective index. In this case, loc simply grabs the rows (or columns) in which the boolean array is True.

    df.loc[[True, False], ['X']]
    
       X
    A  1
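
The "(or columns)" part of the boolean case is not shown above, so here is a small sketch of it using the same df. The variable names picked_cols and picked_both are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=['A', 'B'], columns=['X', 'Y'])

# Boolean list as j: keep only the columns flagged True (here, 'Y').
picked_cols = df.loc[:, [False, True]]

# Boolean lists for both i and j: row 'B' and column 'X'.
picked_both = df.loc[[False, True], [True, False]]

print(picked_cols)
print(picked_both)
```

Because both indexers are lists here, each result is a dataframe rather than a pd.Series, which matches the dimensionality note above.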
    
In addition to what indexers you can pass to loc, it also enables you to make assignments. Now we can break down the line of code you provided.

    iris_data.loc[iris_data['class'] == 'versicolor', 'class'] = 'Iris-versicolor'

  1. iris_data['class'] == 'versicolor' returns a boolean array.
  2. 'class' is a scalar that represents a value in the columns object.
  3. iris_data.loc[iris_data['class'] == 'versicolor', 'class'] returns a pd.Series object consisting of the 'class' column for all rows where 'class' is 'versicolor'.
  4. When used with an assignment operator:

    iris_data.loc[iris_data['class'] == 'versicolor', 'class'] = 'Iris-versicolor'
    

    We assign 'Iris-versicolor' to all elements in column 'class' where 'class' was 'versicolor'.
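
The four steps above can be walked through on a toy frame. This is only a sketch: the question does not show the real iris_data, so the rows below, the 'sepal_length' column, and the names mask, selected, and n_matches are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-in for iris_data; the notebook's real data is not shown.
iris_data = pd.DataFrame({
    'sepal_length': [5.1, 7.0, 6.4, 6.3],
    'class': ['Iris-setosa', 'versicolor', 'versicolor', 'Iris-virginica'],
})

mask = iris_data['class'] == 'versicolor'   # step 1: boolean array
selected = iris_data.loc[mask, 'class']     # step 3: Series of matching cells
n_matches = len(selected)

# Step 4: assigning through .loc writes into iris_data itself.
iris_data.loc[mask, 'class'] = 'Iris-versicolor'

print(n_matches)
print(iris_data['class'].tolist())
```

Only the two rows flagged by the mask change; the other rows and the other column are left alone.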

Answered by LangeHaare

This is using dataframes from the pandas package. The "index" part can be either a single index, a list of indices, or a list of booleans. You can read about this in the documentation: https://pandas.pydata.org/pandas-docs/stable/indexing.html

So the index part specifies a subset of the rows to pull out, and the (optional) column_name specifies the column you want to work with from that subset of the dataframe. So if you want to update the 'class' column, but only in rows where the class is currently set to 'versicolor', you might do something like what you list in the question:

    iris_data.loc[iris_data['class'] == 'versicolor', 'class'] = 'Iris-versicolor'

Answered by Aashish Kumar

It's a pandas dataframe, and it's using the label-based selection tool df.loc. It takes two inputs, one for the rows and the other for the columns. In the row input, it selects all rows where the value stored in column class is versicolor; in the column input, it selects the column labeled class. It then assigns the value Iris-versicolor to those cells. So basically it replaces every cell of column class whose value is versicolor with Iris-versicolor.

Answered by Def_Os

It's pandas label-based selection, as explained here: https://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label

The boolean array is basically a selection method using a mask.
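
To make the mask idea concrete, here is a minimal sketch on a toy frame. The frame and the names mask and kept are assumptions for illustration, not taken from the question.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=['A', 'B'], columns=['X', 'Y'])

# A comparison on a column produces the mask: one True/False per row.
mask = df['Y'] > 2

# Passing the mask as the row indexer keeps only the rows flagged True.
kept = df.loc[mask]

print(mask.tolist())
print(kept)
```

The mask has the same index as df, so each True or False lines up with exactly one row.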
