Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/17071871/

How to select rows from a DataFrame based on column values?

python, pandas, dataframe

Asked by szli

How to select rows from a DataFrame based on values in some column in Python Pandas?

In SQL, I would use:


SELECT *
FROM table
WHERE column_name = some_value

I tried to look at pandas documentation but did not immediately find the answer.


Accepted answer by unutbu

To select rows whose column value equals a scalar, some_value, use ==:


df.loc[df['column_name'] == some_value]

To select rows whose column value is in an iterable, some_values, use isin:


df.loc[df['column_name'].isin(some_values)]

Combine multiple conditions with &:


df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses

df['column_name'] >= A & df['column_name'] <= B

is parsed as


df['column_name'] >= (A & df['column_name']) <= B

which results in a Truth value of a Series is ambiguous error.

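A self-contained sketch of this pitfall (illustrative values, A = 3 and B = 10):

```python
import pandas as pd

df = pd.DataFrame({'column_name': [1, 5, 8, 12]})
A, B = 3, 10

# Correct: parentheses force each comparison to happen before the bitwise &
ok = df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
print(ok['column_name'].tolist())  # [5, 8]

# Without parentheses, & grabs A and df['column_name'] first, producing a
# chained comparison whose implicit bool(Series) raises ValueError
try:
    df['column_name'] >= A & df['column_name'] <= B
    ambiguous = False
except ValueError:
    ambiguous = True
print(ambiguous)  # True
```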



To select rows whose column value does not equal some_value, use !=:

df.loc[df['column_name'] != some_value]

isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:

df.loc[~df['column_name'].isin(some_values)]


For example,


import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
#      A      B  C   D
# 0  foo    one  0   0
# 1  bar    one  1   2
# 2  foo    two  2   4
# 3  bar  three  3   6
# 4  foo    two  4   8
# 5  bar    two  5  10
# 6  foo    one  6  12
# 7  foo  three  7  14

print(df.loc[df['A'] == 'foo'])

yields


     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14


If you have multiple values you want to include, put them in a list (or more generally, any iterable) and use isin:


print(df.loc[df['B'].isin(['one','three'])])

yields


     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14
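Conversely, negating the same isin mask with ~ keeps only the remaining rows; a quick check on the same frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# Complement of the isin selection above: rows where B is neither 'one' nor 'three'
not_in = df.loc[~df['B'].isin(['one', 'three'])]
print(not_in)
#      A    B  C   D
# 2  foo  two  2   4
# 4  foo  two  4   8
# 5  bar  two  5  10
```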


Note, however, that if you wish to do this many times, it is more efficient to make an index first, and then use df.loc:


df = df.set_index(['B'])
print(df.loc['one'])

yields


       A  C   D
B              
one  foo  0   0
one  bar  1   2
one  foo  6  12

or, to include multiple values from the index use df.index.isin:


df.loc[df.index.isin(['one','two'])]

yields


       A  C   D
B              
one  foo  0   0
one  bar  1   2
two  foo  2   4
two  foo  4   8
two  bar  5  10
one  foo  6  12

Answer by DataByDavid

Here is a simple example


from pandas import DataFrame

# Create data set
d = {'Revenue':[100,111,222], 
     'Cost':[333,444,555]}
df = DataFrame(d)


# mask = Return True when the value in column "Revenue" is equal to 111
mask = df['Revenue'] == 111

print(mask)

# Result:
# 0    False
# 1     True
# 2    False
# Name: Revenue, dtype: bool


# Select * FROM df WHERE Revenue = 111
df[mask]

# Result:
#    Cost    Revenue
# 1  444     111

Answer by imolit

tl;dr


The pandas equivalent to


select * from table where column_name = some_value

is

table[table.column_name == some_value]

Multiple conditions:


table[(table.column_name == some_value) | (table.column_name2 == some_value2)]

or


table.query('column_name == some_value | column_name2 == some_value2')
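Note that if some_value is a Python variable rather than a literal, the query string must reference it with an @ prefix. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical table and values, just to illustrate the @ syntax
table = pd.DataFrame({'column_name': [1, 2, 3],
                      'column_name2': [10, 20, 30]})
some_value, some_value2 = 2, 30

# @ tells query to look up the names in the surrounding Python scope
result = table.query('column_name == @some_value | column_name2 == @some_value2')
print(result.index.tolist())  # [1, 2]
```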

Code example


import pandas as pd

# Create data set
d = {'foo':[100, 111, 222], 
     'bar':[333, 444, 555]}
df = pd.DataFrame(d)

# Full dataframe:
df

# Shows:
#    bar   foo 
# 0  333   100
# 1  444   111
# 2  555   222

# Output only the row(s) in df where foo is 222:
df[df.foo == 222]

# Shows:
#    bar  foo
# 2  555  222

In the above code it is the line df[df.foo == 222] that gives the rows based on the column value, 222 in this case.

Multiple conditions are also possible:


df[(df.foo == 222) | (df.bar == 444)]
#    bar  foo
# 1  444  111
# 2  555  222

But at that point I would recommend using the query function, since it's less verbose and yields the same result:

df.query('foo == 222 | bar == 444')

Answer by fredcallaway

I find the syntax of the previous answers to be redundant and difficult to remember. Pandas introduced the query() method in v0.13 and I much prefer it. For your question, you could do df.query('col == val').

Reproduced from http://pandas.pydata.org/pandas-docs/version/0.17.0/indexing.html#indexing-query


In [167]: n = 10

In [168]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [169]: df
Out[169]: 
          a         b         c
0  0.687704  0.582314  0.281645
1  0.250846  0.610021  0.420121
2  0.624328  0.401816  0.932146
3  0.011763  0.022921  0.244186
4  0.590198  0.325680  0.890392
5  0.598892  0.296424  0.007312
6  0.634625  0.803069  0.123872
7  0.924168  0.325076  0.303746
8  0.116822  0.364564  0.454607
9  0.986142  0.751953  0.561512

# pure python
In [170]: df[(df.a < df.b) & (df.b < df.c)]
Out[170]: 
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607

# query
In [171]: df.query('(a < b) & (b < c)')
Out[171]: 
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607

You can also access variables in the environment by prefixing them with an @:

exclude = ('red', 'orange')
df.query('color not in @exclude')
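A self-contained version of that snippet (the data frame is assumed, so a small hypothetical one is built here):

```python
import pandas as pd

# Hypothetical data to make the 'not in @exclude' example runnable
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'orange']})
exclude = ('red', 'orange')

kept = df.query('color not in @exclude')
print(kept['color'].tolist())  # ['green', 'blue']
```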

Answer by TuanDT

To append to this famous question (though a bit too late): You can also do df.groupby('column_name').get_group('column_desired_value').reset_index() to make a new data frame with the specified column having a particular value. E.g.

import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split()})
print("Original dataframe:")
print(df)

b_is_two_dataframe = pd.DataFrame(df.groupby('B').get_group('two').reset_index()).drop('index', axis = 1) 
#NOTE: the final drop is to remove the extra index column returned by groupby object
print('Sub dataframe where B is two:')
print(b_is_two_dataframe)

Running this gives:

Original dataframe:
     A      B
0  foo    one
1  bar    one
2  foo    two
3  bar  three
4  foo    two
5  bar    two
6  foo    one
7  foo  three
Sub dataframe where B is two:
     A    B
0  foo  two
1  foo  two
2  bar  two

Answer by shivsn

Faster results can be achieved using numpy.where.


For example, with unutbu's setup:

In [76]: df.iloc[np.where(df.A.values=='foo')]
Out[76]: 
     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

Timing comparisons:


In [68]: %timeit df.iloc[np.where(df.A.values=='foo')]  # fastest
1000 loops, best of 3: 380 μs per loop

In [69]: %timeit df.loc[df['A'] == 'foo']
1000 loops, best of 3: 745 μs per loop

In [71]: %timeit df.loc[df['A'].isin(['foo'])]
1000 loops, best of 3: 562 μs per loop

In [72]: %timeit df[df.A=='foo']
1000 loops, best of 3: 796 μs per loop

In [74]: %timeit df.query('(A=="foo")')  # slowest
1000 loops, best of 3: 1.71 ms per loop

Answer by piRSquared

There are several ways to select rows from a pandas data frame:


  1. Boolean indexing (df[df['col'] == value] )
  2. Positional indexing (df.iloc[...])
  3. Label indexing (df.xs(...))
  4. df.query(...) API

Below I show you examples of each, with advice on when to use certain techniques. Assume our criterion is column 'A' == 'foo'.

(Note on performance: For each base type, we can keep things simple by using the pandas API or we can venture outside the API, usually into numpy, and speed things up.)




Setup
The first thing we'll need is to identify a condition that will act as our criterion for selecting rows. We'll start with the OP's case column_name == some_value, and include some other common use cases.


Borrowing from @unutbu:


import pandas as pd, numpy as np

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})


1. Boolean indexing


... Boolean indexing requires finding the true value of each row's 'A' column being equal to 'foo', then using those truth values to identify which rows to keep. Typically, we'd name this series, an array of truth values, mask. We'll do so here as well.

mask = df['A'] == 'foo'

We can then use this mask to slice or index the data frame


df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

This is one of the simplest ways to accomplish this task and if performance or intuitiveness isn't an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the mask.




2. Positional indexing


Positional indexing (df.iloc[...]) has its use cases, but this isn't one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.


mask = df['A'] == 'foo'
pos = np.flatnonzero(mask)
df.iloc[pos]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

3. Label indexing


Label indexing can be very handy, but in this case, we are again doing more work for no benefit.

df.set_index('A', append=True, drop=False).xs('foo', level=1)

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

4. df.query() API

pd.DataFrame.query is a very elegant/intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data, the query is very efficient. More so than the standard approach and of similar magnitude as my best suggestion.

df.query('A == "foo"')

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14


My preference is to use the Boolean mask.

Actual improvements can be made by modifying how we create our Boolean mask.

mask alternative 1
Use the underlying numpy array and forgo the overhead of creating another pd.Series

mask = df['A'].values == 'foo'

I'll show more complete time tests at the end, but just take a look at the performance gains we get using the sample data frame. First, we look at the difference in creating the mask


%timeit mask = df['A'].values == 'foo'
%timeit mask = df['A'] == 'foo'

5.84 μs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
166 μs ± 4.45 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Evaluating the mask with the numpy array is ~ 30 times faster. This is partly due to numpy evaluation often being faster. It is also partly due to the lack of overhead necessary to build an index and a corresponding pd.Series object.

Next, we'll look at the timing for slicing with one mask versus the other.

mask = df['A'].values == 'foo'
%timeit df[mask]
mask = df['A'] == 'foo'
%timeit df[mask]

219 μs ± 12.3 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
239 μs ± 7.03 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The performance gains aren't as pronounced. We'll see if this holds up over more robust testing.




mask alternative 2
We could have reconstructed the data frame as well. There is a big caveat when reconstructing a dataframe: you must take care of the dtypes when doing so!

Instead of df[mask] we will do this

pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

If the data frame is of mixed type, which our example is, then when we get df.values the resulting array is of dtype object and consequently, all columns of the new data frame will be of dtype object. This requires the astype(df.dtypes) and kills any potential performance gains.

%timeit df[mask]
%timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

216 μs ± 10.4 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.43 ms ± 39.6 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

However, if the data frame is not of mixed type, this is a very useful way to do it.


Given


np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))

d1

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
6  8  7  6  4  7
7  6  2  6  6  5
8  2  8  7  5  8
9  4  7  6  1  5    


%%timeit
mask = d1['A'].values == 7
d1[mask]

179 μs ± 8.73 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Versus


%%timeit
mask = d1['A'].values == 7
pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)

87 μs ± 5.12 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We cut the time in half.




mask alternative 3
@unutbu also shows us how to use pd.Series.isin to account for each element of df['A'] being in a set of values. This evaluates to the same thing if our set of values is a set of one value, namely 'foo'. But it also generalizes to include larger sets of values if needed. Turns out, this is still pretty fast even though it is a more general solution. The only real loss is in intuitiveness for those not familiar with the concept.

mask = df['A'].isin(['foo'])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

However, as before, we can utilize numpy to improve performance while sacrificing virtually nothing. We'll use np.in1d

mask = np.in1d(df['A'].values, ['foo'])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14
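On newer NumPy (1.13+), np.isin supersedes np.in1d and can be dropped in the same way:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# np.isin has the same semantics as np.in1d for 1-D input
mask = np.isin(df['A'].values, ['foo'])
print(df[mask].index.tolist())  # [0, 2, 4, 6, 7]
```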


Timing
I'll include other concepts mentioned in other posts as well for reference.
Code Below


Each Column in this table represents a different length data frame over which we test each function. Each column shows relative time taken, with the fastest function given a base index of 1.0.


res.div(res.min())

                         10        30        100       300       1000      3000      10000     30000
mask_standard         2.156872  1.850663  2.034149  2.166312  2.164541  3.090372  2.981326  3.131151
mask_standard_loc     1.879035  1.782366  1.988823  2.338112  2.361391  3.036131  2.998112  2.990103
mask_with_values      1.010166  1.000000  1.005113  1.026363  1.028698  1.293741  1.007824  1.016919
mask_with_values_loc  1.196843  1.300228  1.000000  1.000000  1.038989  1.219233  1.037020  1.000000
query                 4.997304  4.765554  5.934096  4.500559  2.997924  2.397013  1.680447  1.398190
xs_label              4.124597  4.272363  5.596152  4.295331  4.676591  5.710680  6.032809  8.950255
mask_with_isin        1.674055  1.679935  1.847972  1.724183  1.345111  1.405231  1.253554  1.264760
mask_with_in1d        1.000000  1.083807  1.220493  1.101929  1.000000  1.000000  1.000000  1.144175

You'll notice that the fastest times seem to be shared between mask_with_values and mask_with_in1d.

res.T.plot(loglog=True)

(figure: log-log plot of the relative timings)

Functions


def mask_standard(df):
    mask = df['A'] == 'foo'
    return df[mask]

def mask_standard_loc(df):
    mask = df['A'] == 'foo'
    return df.loc[mask]

def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]

def mask_with_values_loc(df):
    mask = df['A'].values == 'foo'
    return df.loc[mask]

def query(df):
    return df.query('A == "foo"')

def xs_label(df):
    return df.set_index('A', append=True, drop=False).xs('foo', level=-1)

def mask_with_isin(df):
    mask = df['A'].isin(['foo'])
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]


Testing


res = pd.DataFrame(
    index=[
        'mask_standard', 'mask_standard_loc', 'mask_with_values', 'mask_with_values_loc',
        'query', 'xs_label', 'mask_with_isin', 'mask_with_in1d'
    ],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

for j in res.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in res.index:
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        res.at[i, j] = timeit(stmt, setp, number=50)


Special Timing
Looking at the special case when we have a single non-object dtype for the entire data frame. Code below

spec.div(spec.min())

                     10        30        100       300       1000      3000      10000     30000
mask_with_values  1.009030  1.000000  1.194276  1.000000  1.236892  1.095343  1.000000  1.000000
mask_with_in1d    1.104638  1.094524  1.156930  1.072094  1.000000  1.000000  1.040043  1.027100
reconstruct       1.000000  1.142838  1.000000  1.355440  1.650270  2.222181  2.294913  3.406735

Turns out, reconstruction isn't worth it past a few hundred rows.


spec.T.plot(loglog=True)

(figure: log-log plot of the special-case timings)

Functions


np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))

def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]

def reconstruct(df):
    v = df.values
    mask = np.in1d(df['A'].values, ['foo'])
    return pd.DataFrame(v[mask], df.index[mask], df.columns)

spec = pd.DataFrame(
    index=['mask_with_values', 'mask_with_in1d', 'reconstruct'],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

Testing


for j in spec.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in spec.index:
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        spec.at[i, j] = timeit(stmt, setp, number=50)

Answer by SP001

For selecting only specific columns out of multiple columns for a given value in pandas:


select col_name1, col_name2 from table where column_name = some_value.

Options:


df.loc[df['column_name'] == some_value, ['col_name1', 'col_name2']]

or


df.query('column_name == @some_value')[['col_name1', 'col_name2']]
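A runnable sketch of the .loc option, with hypothetical column names filled in:

```python
import pandas as pd

# Hypothetical column names standing in for col_name1 / col_name2
df = pd.DataFrame({'column_name': ['x', 'y', 'x'],
                   'col_name1': [1, 2, 3],
                   'col_name2': [4, 5, 6]})
some_value = 'x'

# Row condition and column subset in a single .loc call
subset = df.loc[df['column_name'] == some_value, ['col_name1', 'col_name2']]
print(subset)
#    col_name1  col_name2
# 0          1          4
# 2          3          6
```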

Answer by Vahidn

You can also use .apply:


df.apply(lambda row: row[df['B'].isin(['one','three'])])

It actually works row-wise (i.e., applies the function to each row).


The output is


   A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14

The result is the same as using the boolean indexing mentioned by @unutbu:

df[df['B'].isin(['one','three'])]

Answer by Erfan

More flexibility using .query with pandas >= 0.25.0:

August 2019 updated answer


Since pandas >= 0.25.0 we can use the query method to filter dataframes with pandas methods and even column names which have spaces. Normally the spaces in column names would give an error, but now we can solve that using a backtick (`); see GitHub:

# Example dataframe
df = pd.DataFrame({'Sender email':['[email protected]', "[email protected]", "[email protected]"]})

     Sender email
0  [email protected]
1  [email protected]
2    [email protected]

Using .query with method str.endswith:

df.query('`Sender email`.str.endswith("@shop.com")')

Output


     Sender email
1  [email protected]
2    [email protected]


Also we can use local variables by prefixing them with an @ in our query:

domain = 'shop.com'
df.query('`Sender email`.str.endswith(@domain)')

Output


     Sender email
1  [email protected]
2    [email protected]