Python: How do you filter a pandas DataFrame by multiple columns?

Question

Asked by yoshiserry

To filter a dataframe (df) by a single column, say for data that contains both males and females, we might write:

males = df[df['Gender'] == 'Male']

Question 1 - But what if the data spanned multiple years and I wanted to see only males for 2014?

In other languages I might do something like:

if A = "Male" and if B = "2014" then

(except I want to do this and get a subset of the original dataframe in a new dataframe object)

Question 2 - How do I do this in a loop, and create a dataframe object for each unique combination of year and gender (i.e. a df for each of: 2013-Male, 2013-Female, 2014-Male, and 2014-Female)?

for y in year:
    for g in gender:
        df = .....

Answer 1

Accepted answer by zhangxaochen

Use the & operator, and don't forget to wrap each sub-expression in ():

males = df[(df['Gender'] == 'Male') & (df['Year'] == 2014)]
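
The parentheses matter because & binds more tightly than == in Python. Here is a minimal sketch on a made-up four-row frame (the sample data is an assumption for illustration, not the asker's data) that builds the two masks separately and then combines them:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Year': [2013, 2013, 2014, 2014],
})

# Each comparison yields a boolean Series; & combines them element-wise
is_male = df['Gender'] == 'Male'
is_2014 = df['Year'] == 2014
males_2014 = df[is_male & is_2014]

# The same filter written inline; here the parentheses are required
same = df[(df['Gender'] == 'Male') & (df['Year'] == 2014)]

print(males_2014)
```

Naming the masks first can make longer conditions easier to read, at the cost of a few extra lines.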

To store your dataframes in a dict using a for loop:

dic = {}
for g in ['Male', 'Female']:
    dic[g] = {}
    for y in [2013, 2014]:
        # store each filtered DataFrame in a dict of dicts, keyed by gender, then year
        dic[g][y] = df[(df['Gender'] == g) & (df['Year'] == y)]

EDIT:

A demo for your getDF:

def getDF(dic, gender, year):
    return dic[gender][year]

print(getDF(dic, 'Male', 2014))
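
As an alternative for Question 2 (a different technique from the nested loops in this answer), groupby can split the frame by both columns in one pass. It only yields the combinations that actually occur in the data. A sketch with made-up sample data:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Year': [2013, 2013, 2014, 2014, 2014],
})

# Iterating over a two-column groupby yields ((gender, year), sub_frame) pairs
frames = {key: group for key, group in df.groupby(['Gender', 'Year'])}

print(frames[('Male', 2014)])
```

Unlike the nested loops, this needs no hard-coded list of years or genders, but a combination that is absent from the data has no key in the dict.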

Answer 2

Answered by guibor

For more general boolean functions that you would like to use as a filter and that depend on more than one column, you can use:

df = df[df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)]

where f is a function that is applied to every pair of elements (x1, x2) from col_1 and col_2 and returns True or False depending on any condition you want on (x1, x2).
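
To make this concrete, here is a small sketch in which f is a hypothetical predicate (it is not part of the original answer) that keeps rows where col_1 is strictly smaller than col_2:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'col_1': [1, 5, 3, 8], 'col_2': [4, 2, 3, 9]})

def f(x1, x2):
    """Hypothetical predicate: True when the col_1 value is below the col_2 value."""
    return x1 < x2

# apply(..., axis=1) calls the lambda once per row; *x unpacks the two column values
mask = df[['col_1', 'col_2']].apply(lambda x: f(*x), axis=1)
filtered = df[mask]

print(filtered)
```

Row-wise apply is generally slower than vectorized comparisons, so it is probably best kept for conditions that cannot be written column-wise.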

Answer 3

Answered by redreamality

Starting from pandas 0.13, this is the most efficient way:

df.query('Gender=="Male" & Year=="2014" ')
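
A sketch of the same query on made-up data, assuming Year is stored as integers (so it is compared with the number 2014 rather than the string "2014"). It also shows how a local Python variable can be referenced inside the expression with @:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Year': [2013, 2013, 2014, 2014],
})

# Column names are written bare inside the query string
result = df.query('Gender == "Male" and Year == 2014')

# @name looks up a variable from the surrounding Python scope
target_year = 2014
same = df.query('Gender == "Male" and Year == @target_year')

print(result)
```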

Answer 4

Answered by Tom Bug

You can filter by multiple columns (more than two) by using np.logical_and in place of & (or np.logical_or in place of |).

Here's an example function that does the job, if you provide target values for multiple fields. You can adapt it for different types of filtering and whatnot:

def filter_df(df, filter_values):
    """Filter df by matching targets for multiple columns.

    Args:
        df (pd.DataFrame): dataframe
        filter_values (None or dict): Dictionary of the form:
                `{<field>: <target_values_list>}`
            used to filter columns data.
    """
    import numpy as np
    if filter_values is None or not filter_values:
        return df
    return df[
        np.logical_and.reduce([
            df[column].isin(target_values) 
            for column, target_values in filter_values.items()
        ])
    ]

Usage:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 3, 4]})

# keep rows where a is in [1, 2, 3] and b is in [1, 2, 4]
filter_df(df, {
    'a': [1, 2, 3],
    'b': [1, 2, 4]
})
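
To see what the reduce step does on its own, here is a sketch that builds the per-column masks by hand for the same frame and combines them with both np.logical_and.reduce and np.logical_or.reduce (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 3, 4]})

# One boolean mask per column condition
masks = [df['a'].isin([1, 2, 3]), df['b'].isin([1, 2, 4])]

# reduce combines the masks position by position into one boolean array
keep_all = np.logical_and.reduce(masks)  # rows matching every condition
keep_any = np.logical_or.reduce(masks)   # rows matching at least one condition

print(df[keep_all])
print(df[keep_any])
```

With these values only the first two rows satisfy both conditions, while every row satisfies at least one.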

Answer 5

Answered by Bouncner

In case somebody wonders which way of filtering is faster (the accepted answer or the one from @redreamality):

import pandas as pd
import numpy as np

length = 100_000
df = pd.DataFrame()
df['Year'] = np.random.randint(1950, 2019, size=length)
df['Gender'] = np.random.choice(['Male', 'Female'], length)

%timeit df.query('Gender=="Male" & Year=="2014" ')
%timeit df[(df['Gender']=='Male') & (df['Year']==2014)]

Results for 100,000 rows:

6.67 ms ± 557 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.54 ms ± 536 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Results for 10,000,000 rows:

326 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
472 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So the results depend on the size of the data and on the data itself. On my laptop, query() becomes faster beyond about 500k rows. Furthermore, the string comparison in Year=="2014" adds unnecessary overhead (Year==2014 is faster).
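
Since %timeit is an IPython magic, here is a rough sketch of the same comparison for a plain Python script, using the standard timeit module. The helper names with_query and with_mask are made up for illustration, Year is compared as a number, and the timings will differ from machine to machine:

```python
import timeit

import numpy as np
import pandas as pd

length = 100_000
rng = np.random.default_rng(0)  # fixed seed so the data is reproducible
df = pd.DataFrame({
    'Year': rng.integers(1950, 2019, size=length),
    'Gender': rng.choice(['Male', 'Female'], size=length),
})

def with_query():
    return df.query('Gender == "Male" and Year == 2014')

def with_mask():
    return df[(df['Gender'] == 'Male') & (df['Year'] == 2014)]

# Best of 3 repeats, 10 calls each, converted to seconds per call
t_query = min(timeit.repeat(with_query, number=10, repeat=3)) / 10
t_mask = min(timeit.repeat(with_mask, number=10, repeat=3)) / 10

print(f'query: {t_query * 1e3:.2f} ms, mask: {t_mask * 1e3:.2f} ms')
```

Changing length lets you look for the crossover point on your own hardware.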

Answer 6

Answered by Alex

You can create your own filter function using query in pandas. The function below filters df by all of the kwargs parameters. Don't forget to add some validation of the kwargs to adapt the filter function to your own df.

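def filter(df, **kwargs):
    # build one `column==value` clause per keyword argument;
    # !r quotes strings and leaves numbers bare, so numeric columns compare correctly
    query_list = []
    for key, value in kwargs.items():
        query_list.append(f'{key}=={value!r}')
    query = ' & '.join(query_list)
    return df.query(query)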
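
A usage sketch of this idea on made-up data. The name filter_by is hypothetical (chosen here to avoid shadowing the built-in filter), and the sketch assumes that column names are valid Python identifiers and that values are plain str or int objects:

```python
import pandas as pd

def filter_by(df, **kwargs):
    """Hypothetical variant: keep rows matching every column=value keyword."""
    if not kwargs:
        # nothing to filter on, so return the frame unchanged
        return df
    # !r inserts repr(value): strings get quoted, numbers stay bare
    clauses = [f'{column} == {value!r}' for column, value in kwargs.items()]
    return df.query(' and '.join(clauses))

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Year': [2013, 2013, 2014, 2014],
})

print(filter_by(df, Gender='Male', Year=2014))
```

Because the values end up inside a query string, this is only suitable for trusted input. That is one reason the validation mentioned above is worth adding.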
