pandas pandas数据框的条件过滤

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/44608005/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 03:48:54  来源:igfitidea点击:

Conditional filtering of pandas data frame

pythonpandasfilterslice

提问by NCL

I have a pandas data frame about football results. Each row of the dataframe represents a football match. The information of each match are:

我有一个关于足球成绩的Pandas数据框。数据帧的每一行代表一场足球比赛。每场比赛的信息是:

Day | WinningTeamID | LosingTeamID | WinningPoints | LosingPoints | WinningFouls | ... | 
1          13             1              45                5               3  
1          12             4              21                12              4              

That is, the information are divided based on the game result: winning or losing. I would like to retrieve the data of each game for a specific team (e.g. 12).

即,根据游戏结果划分信息:输赢。我想为特定团队(例如 12)检索每场比赛的数据。

Day | Points | Fouls | ... | 
1       21       4     ...
2       32       6     ...

The simplest way is to scan the whole dataframe, check if a specific teamID is on WinningIDor LosingIDand then, based on that, retrieve the "Losing-columns" or the "Winning-columns". Is there a more "elegant" way of slicing the pandas dataframe? This will simply give me the subset of matches where the team 12 is involved.

最简单的方法是扫描整个数据帧,检查特定 teamID 是否在WinningIDLosingID 上,然后基于此检索“ Losing-columns”或“ Winning-columns”。有没有更“优雅”的方式来切片Pandas数据框?这将简单地为我提供涉及团队 12 的比赛的子集。

df[df[WinningTeamID == 12] | [LosingTeamID == 12]]

How can I filter those data and create the desired dataframe?

如何过滤这些数据并创建所需的数据框?

采纳答案by unutbu

Suppose we could choose the format of the data. What would be ideal? Since we want to collect stats per TeamID, ideally we would have a column of TeamIDs and separate columns for each stat including the outcome.

假设我们可以选择数据的格式。什么是理想的?由于我们想要收集 per 的统计数据TeamID,理想情况下我们会有一列TeamIDs 和每个统计数据的单独列,包括结果。

So the data would look like this:

所以数据看起来像这样:

| Day | Outcome | TeamID | Points | Fouls |
|   1 | Winning |     13 |     45 |     3 |
|   1 | Losing  |      1 |      5 |   NaN |
|   1 | Winning |     12 |     21 |     4 |
|   1 | Losing  |      4 |     12 |   NaN |

Here is how we can manipulate the given data into the desired form:

下面是我们如何将给定的数据处理成所需的形式:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Day': [1, 1], 'LosingPoints': [5, 12], 'LosingTeamID': [1, 4], 'WinningFouls': [3, 4], 'WinningPoints': [45, 21], 'WinningTeamID': [13, 12]})
df = df.set_index(['Day'])
columns = df.columns.to_series().str.extract(r'^(Losing|Winning)?(.*)', expand=True)
columns = pd.MultiIndex.from_arrays([columns[col] for col in columns], 
                                    names=['Outcome', None])
df.columns = columns
df = df.stack(level='Outcome').reset_index()
print(df)

yields

产量

   Day  Outcome  Fouls  Points  TeamID
0    1   Losing    NaN       5       1
1    1  Winning    3.0      45      13
2    1   Losing    NaN      12       4
3    1  Winning    4.0      21      12

Now we can obtain all the stats about TeamID12 using

现在我们可以获得关于TeamID12 的所有统计数据

print(df.loc[df['TeamID']==12])
#    Day  Outcome  Fouls  Points  TeamID
# 3    1  Winning    4.0      21      12


df = df.set_index(['Day'])moves the Daycolumn into the index.

df = df.set_index(['Day'])Day列移动到索引中。

The purpose of placing Dayin the index is to "protect" it from manipulations (primarily the stackcall) that are intended only for columns labeled Losingor Winning. If you had other columns, such as Locationor Officialswhich, like Day, do not pertain to Losingor Winning, then you'd want to include them in the set_indexcall too: e.g. df = df.set_index(['Day', 'Location', 'Officials']).

放置Day在索引中的目的是“保护”它免受stack仅用于标记为Losing或 的列的操作(主要是调用)Winning。如果您有其他列,例如Locationor Officialswhich, like Day,不属于Losingor Winning,那么您也希望将它们包含在set_index调用中:例如 df = df.set_index(['Day', 'Location', 'Officials'])

Try commenting out df = df.set_index(['Day'])from the code above. Then step through the code line-by-line. In particular, compare what df.stack(level='Outcome')looks like with and without the set_indexcall:

尝试df = df.set_index(['Day'])从上面的代码中注释掉。然后逐行执行代码。特别是,比较df.stack(level='Outcome')有和没有set_index呼叫的情况:

With df = df.set_index(['Day']):

df = df.set_index(['Day'])

In [26]: df.stack(level='Outcome')
Out[26]: 
             Fouls  Points  TeamID
Day Outcome                       
1   Losing     NaN       5       1
    Winning    3.0      45      13
    Losing     NaN      12       4
    Winning    4.0      21      12

Without df = df.set_index(['Day']):

没有df = df.set_index(['Day'])

In [29]: df.stack(level='Outcome')
Out[29]: 
           Day  Fouls  Points  TeamID
  Outcome                            
0 NaN      1.0    3.0      45      13
  Losing   NaN    NaN       5       1
  Winning  1.0    3.0      45      13
1 NaN      1.0    4.0      21      12
  Losing   NaN    NaN      12       4
  Winning  1.0    4.0      21      12

Without the set_indexcall you end up with rows that you do not want -- the rows where Outcomeequals NaN.

如果没有set_index调用,您最终会得到不想要的行 -Outcome等于的行NaN



The purpose of

的目的

columns = df.columns.to_series().str.extract(r'^(Losing|Winning)?(.*)', expand=True)
columns = pd.MultiIndex.from_arrays([columns[col] for col in columns], 
                                    names=['Outcome', None])

is to create a multi-level column index (called a MultiIndex) which labels columns Losingor Winningas appropriate. Notice that by separating out the Losingor Winningparts of the labels, the remaining parts of the labels become duplicated.

是创建一个多级列索引(称为 MultiIndex)来标记列LosingWinning视情况而定。请注意,通过分离标签的LosingWinning部分,标签的其余部分将成为重复的。

We end up with a DataFrame, df, with two columns labeled "Points" for example. This allows Pandas to identify these columns as somehow similar.

我们最终得到一个 DataFrame,例如df,有两列标记为“点”。这允许 Pandas 将这些列标识为某种相似。

The big gain -- the reason why we went through the trouble of setting up the MultiIndex is so that these "similar" columns can be "unified" by calling df.stack:

最大的收获——我们之所以遇到设置 MultiIndex 的麻烦,是为了让这些“相似”的列可以通过调用来“统一” df.stack

In [47]: df
Out[47]: 
Outcome Losing        Winning              
        Points TeamID   Fouls Points TeamID
Day                                        
1            5      1       3     45     13
1           12      4       4     21     12

In [48]: df.stack(level="Outcome")
Out[48]: 
             Fouls  Points  TeamID
Day Outcome                       
1   Losing     NaN       5       1
    Winning    3.0      45      13
    Losing     NaN      12       4
    Winning    4.0      21      12


stack, unstack, set_indexand reset_indexare the 4 fundamental DataFrame reshaping operations.

stackunstackset_index并且reset_index是4个基本数据框中整形操作。

  • df.stackmoves a level (or levels) of the column index into the row index.
  • df.unstackmoves a level (or levels) of the row index into the column index.
  • df.set_indexmoves column values into the row index
  • df.reset_indexmoves a level (or levels) of the row index into a column of values
  • df.stack将列索引的一个(或多个)级别移动到行索引中。
  • df.unstack将行索引的一个(或多个)级别移动到列索引中。
  • df.set_index将列值移动到行索引中
  • df.reset_index将行索引的一个(或多个)级别移动到一列值中

Together, these 4 methods allow you to move data in your DataFrame anywhere you want -- in the columns, the row index or the column index.

总之,这 4 种方法允许您将 DataFrame 中的数据移动到您想要的任何位置——在列、行索引或列索引中。

The above code is an example of how to use these tools (well, three of the four) to reshape datainto a desired form.

上面的代码是如何使用这些工具(好吧,四个中的三个)将数据重塑为所需形式的示例。

回答by Ajay Ohri

df.query['WinningTeamID == 12 | LosingTeamID == 12']

回答by Niv Cohen

I think it should be more like:

我觉得应该更像:

df.query('columnX == 15 | columnY == 25')