Python 如何根据列值从 DataFrame 中选择行?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/17071871/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
How to select rows from a DataFrame based on column values?
提问by szli
How to select rows from a DataFrame based on values in some column in Python Pandas?
在 Python Pandas 中,如何根据某些列中的值从 DataFrame 中选择行?
In SQL, I would use:
在 SQL 中,我会使用:
SELECT *
FROM table
WHERE column_name = some_value
I tried to look at pandas documentation but did not immediately find the answer.
我试图查看 Pandas 文档,但没有立即找到答案。
采纳答案by unutbu
To select rows whose column value equals a scalar, some_value, use ==:
要选择列值等于标量 some_value 的行,请使用 ==:
df.loc[df['column_name'] == some_value]
To select rows whose column value is in an iterable, some_values, use isin:
要选择列值在可迭代对象 some_values 中的行,请使用 isin:
df.loc[df['column_name'].isin(some_values)]
Combine multiple conditions with &:
使用 & 组合多个条件:
df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses
注意括号。由于 Python 的运算符优先级规则,& 的绑定比 <= 和 >= 更紧密。因此,最后一个例子中的括号是必要的。没有括号时,
df['column_name'] >= A & df['column_name'] <= B
is parsed as
被解析为
df['column_name'] >= (A & df['column_name']) <= B
which results in a "Truth value of a Series is ambiguous" error.
这会导致 "Truth value of a Series is ambiguous" 错误。
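As an aside, the same inclusive range condition can also be written with Series.between, which needs no explicit parentheses. A small self-contained sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'column_name': [1, 5, 10, 15]})
A, B = 4, 11

# Parenthesized comparisons combined with &
sel1 = df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

# Equivalent form using Series.between (inclusive on both ends by default)
sel2 = df.loc[df['column_name'].between(A, B)]

print(sel1['column_name'].tolist())  # [5, 10]
print(sel1.equals(sel2))             # True
```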
To select rows whose column value does not equal some_value, use !=:
要选择列值不等于 some_value 的行,请使用 !=:
df.loc[df['column_name'] != some_value]
isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:
isin 返回一个布尔 Series,因此要选择值不在 some_values 中的行,请使用 ~ 否定该布尔 Series:
df.loc[~df['column_name'].isin(some_values)]
For example,
例如,
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
# A B C D
# 0 foo one 0 0
# 1 bar one 1 2
# 2 foo two 2 4
# 3 bar three 3 6
# 4 foo two 4 8
# 5 bar two 5 10
# 6 foo one 6 12
# 7 foo three 7 14
print(df.loc[df['A'] == 'foo'])
yields
产量
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
If you have multiple values you want to include, put them in a
list (or more generally, any iterable) and use isin:
如果您有多个要包含的值,请将它们放在一个列表中(或更一般地,任何可迭代对象中)并使用 isin:
print(df.loc[df['B'].isin(['one','three'])])
yields
产量
A B C D
0 foo one 0 0
1 bar one 1 2
3 bar three 3 6
6 foo one 6 12
7 foo three 7 14
Note, however, that if you wish to do this many times, it is more efficient to
make an index first, and then use df.loc:
但是请注意,如果您希望多次执行此操作,更高效的做法是先创建索引,然后使用 df.loc:
df = df.set_index(['B'])
print(df.loc['one'])
yields
产量
A C D
B
one foo 0 0
one bar 1 2
one foo 6 12
or, to include multiple values from the index use df.index.isin:
或者,要包含索引中的多个值,请使用 df.index.isin:
df.loc[df.index.isin(['one','two'])]
yields
产量
A C D
B
one foo 0 0
one bar 1 2
two foo 2 4
two foo 4 8
two bar 5 10
one foo 6 12
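As a side note (not part of the original answer): if you later need 'B' back as an ordinary column, reset_index undoes the set_index step. A minimal sketch with similar toy data:

```python
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo'.split(),
                   'B': 'one one two'.split(),
                   'C': [0, 1, 2]})

indexed = df.set_index(['B'])

# Label lookups are now plain .loc calls
ones = indexed.loc['one']

# reset_index turns the index back into a regular column (placed first)
restored = indexed.reset_index()
print(list(restored.columns))  # ['B', 'A', 'C']
```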
回答by DataByDavid
Here is a simple example
这是一个简单的例子
from pandas import DataFrame
# Create data set
d = {'Revenue': [100, 111, 222],
     'Cost': [333, 444, 555]}
df = DataFrame(d)
# mask = Return True when the value in column "Revenue" is equal to 111
mask = df['Revenue'] == 111
print(mask)
# Result:
# 0 False
# 1 True
# 2 False
# Name: Revenue, dtype: bool
# Select * FROM df WHERE Revenue = 111
df[mask]
# Result:
# Cost Revenue
# 1 444 111
回答by imolit
tl;dr
太长不看
The pandas equivalent to
pandas 中等价于
select * from table where column_name = some_value
is
是
table[table.column_name == some_value]
Multiple conditions:
多重条件:
table[(table.column_name == some_value) | (table.column_name2 == some_value2)]
or
或者
table.query('column_name == some_value | column_name2 == some_value2')
Code example
代码示例
import pandas as pd
# Create data set
d = {'foo': [100, 111, 222],
     'bar': [333, 444, 555]}
df = pd.DataFrame(d)
# Full dataframe:
df
# Shows:
# bar foo
# 0 333 100
# 1 444 111
# 2 555 222
# Output only the row(s) in df where foo is 222:
df[df.foo == 222]
# Shows:
# bar foo
# 2 555 222
In the above code it is the line df[df.foo == 222] that gives the rows based on the column value, 222 in this case.
在上面的代码中,df[df.foo == 222] 这一行根据列值(本例中为 222)给出相应的行。
Multiple conditions are also possible:
多个条件也是可能的:
df[(df.foo == 222) | (df.bar == 444)]
# bar foo
# 1 444 111
# 2 555 222
But at that point I would recommend using the query function, since it's less verbose and yields the same result:
但那时我会推荐使用 query 函数,因为它不那么冗长并且产生相同的结果:
df.query('foo == 222 | bar == 444')
回答by fredcallaway
I find the syntax of the previous answers to be redundant and difficult to remember. Pandas introduced the query() method in v0.13 and I much prefer it. For your question, you could do df.query('col == val').
我发现以前答案的语法冗余且难以记住。Pandas 在 v0.13 中引入了 query() 方法,我更喜欢它。对于你的问题,你可以这样做:df.query('col == val')
Reproduced from http://pandas.pydata.org/pandas-docs/version/0.17.0/indexing.html#indexing-query
转载自http://pandas.pydata.org/pandas-docs/version/0.17.0/indexing.html#indexing-query
In [167]: n = 10
In [168]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
In [169]: df
Out[169]:
a b c
0 0.687704 0.582314 0.281645
1 0.250846 0.610021 0.420121
2 0.624328 0.401816 0.932146
3 0.011763 0.022921 0.244186
4 0.590198 0.325680 0.890392
5 0.598892 0.296424 0.007312
6 0.634625 0.803069 0.123872
7 0.924168 0.325076 0.303746
8 0.116822 0.364564 0.454607
9 0.986142 0.751953 0.561512
# pure python
In [170]: df[(df.a < df.b) & (df.b < df.c)]
Out[170]:
a b c
3 0.011763 0.022921 0.244186
8 0.116822 0.364564 0.454607
# query
In [171]: df.query('(a < b) & (b < c)')
Out[171]:
a b c
3 0.011763 0.022921 0.244186
8 0.116822 0.364564 0.454607
You can also access variables in the environment by prepending an @:
您还可以通过在变量名前加 @ 来访问环境中的变量:
exclude = ('red', 'orange')
df.query('color not in @exclude')
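A runnable version of the same idea, with an assumed toy frame:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'orange']})

# @exclude refers to the local Python variable, not a column
exclude = ('red', 'orange')
result = df.query('color not in @exclude')

print(result['color'].tolist())  # ['blue', 'green']
```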
回答by TuanDT
To append to this famous question (though a bit too late): You can also do df.groupby('column_name').get_group('column_desired_value').reset_index() to make a new data frame with the specified column having a particular value. E.g.
补充这个著名的问题(虽然有点晚):您还可以使用 df.groupby('column_name').get_group('column_desired_value').reset_index() 来创建指定列具有特定值的新数据框。例如:
import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split()})
print("Original dataframe:")
print(df)
b_is_two_dataframe = pd.DataFrame(df.groupby('B').get_group('two').reset_index()).drop('index', axis = 1)
# NOTE: the final drop removes the extra 'index' column created by reset_index
print('Sub dataframe where B is two:')
print(b_is_two_dataframe)
Running this gives:
运行这个给出:
Original dataframe:
A B
0 foo one
1 bar one
2 foo two
3 bar three
4 foo two
5 bar two
6 foo one
7 foo three
Sub dataframe where B is two:
A B
0 foo two
1 foo two
2 bar two
回答by shivsn
Faster results can be achieved using numpy.where.
使用numpy.where 可以获得更快的结果。
For example, with unutbu's setup:
例如,使用 unutbu 的设置:
In [76]: df.iloc[np.where(df.A.values=='foo')]
Out[76]:
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
Timing comparisons:
时序比较:
In [68]: %timeit df.iloc[np.where(df.A.values=='foo')] # fastest
1000 loops, best of 3: 380 μs per loop
In [69]: %timeit df.loc[df['A'] == 'foo']
1000 loops, best of 3: 745 μs per loop
In [71]: %timeit df.loc[df['A'].isin(['foo'])]
1000 loops, best of 3: 562 μs per loop
In [72]: %timeit df[df.A=='foo']
1000 loops, best of 3: 796 μs per loop
In [74]: %timeit df.query('(A=="foo")') # slowest
1000 loops, best of 3: 1.71 ms per loop
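The reason np.where pairs with .iloc here: applied to a boolean array it returns the integer positions of the True entries, which is exactly what positional indexing expects. A small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo']})

# np.where on a boolean array returns a tuple of index arrays
positions = np.where(df['A'].values == 'foo')[0]
print(positions.tolist())                # [0, 2]

# Those integer positions feed straight into .iloc
print(df.iloc[positions]['A'].tolist())  # ['foo', 'foo']
```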
回答by piRSquared
There are several ways to select rows from a pandas data frame:
有几种方法可以从 Pandas 数据框中选择行:
- Boolean indexing (df[df['col'] == value])
- Positional indexing (df.iloc[...])
- Label indexing (df.xs(...))
- df.query(...) API

- 布尔索引 (df[df['col'] == value])
- 位置索引 (df.iloc[...])
- 标签索引 (df.xs(...))
- df.query(...) API
Below I show you examples of each, with advice on when to use certain techniques. Assume our criterion is column 'A' == 'foo'.
下面我将向您展示每种方法的示例,以及何时使用某些技术的建议。假设我们的标准是列 'A' == 'foo'。
(Note on performance: For each base type, we can keep things simple by using the pandas API or we can venture outside the API, usually into numpy, and speed things up.)
(关于性能的注意事项:对于每个基本类型,我们可以通过使用 pandas API 来保持简单,或者我们可以在 API 之外冒险,通常进入numpy,并加快速度。)
Setup
The first thing we'll need is to identify a condition that will act as our criterion for selecting rows. We'll start with the OP's case column_name == some_value, and include some other common use cases.
设置
我们需要做的第一件事是确定一个条件,该条件将作为我们选择行的标准。我们将从 OP 的案例开始column_name == some_value,并包括一些其他常见用例。
Borrowing from @unutbu:
借用@unutbu:
import pandas as pd, numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
1. Boolean indexing
1.布尔索引
... Boolean indexing requires finding the true value of each row's 'A' column being equal to 'foo', then using those truth values to identify which rows to keep. Typically, we'd name this series, an array of truth values, mask. We'll do so here as well.
... 布尔索引需要找出每行 'A' 列的值是否等于 'foo',然后使用这些真值来确定要保留哪些行。通常,我们会把这个由真值组成的 Series 命名为 mask。我们在这里也这样做。
mask = df['A'] == 'foo'
We can then use this mask to slice or index the data frame
然后我们可以使用此掩码对数据框进行切片或索引
df[mask]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
This is one of the simplest ways to accomplish this task and if performance or intuitiveness isn't an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the mask.
这是完成此任务的最简单方法之一,如果性能或直观性不是问题,这应该是您选择的方法。但是,如果性能是一个问题,那么您可能需要考虑另一种创建 mask 的方式。
2. Positional indexing
2. 位置索引
Positional indexing (df.iloc[...]) has its use cases, but this isn't one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.
位置索引 ( df.iloc[...]) 有其用例,但这不是其中之一。为了确定切片的位置,我们首先需要执行与上面相同的布尔分析。这让我们执行一个额外的步骤来完成相同的任务。
mask = df['A'] == 'foo'
pos = np.flatnonzero(mask)
df.iloc[pos]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
3. Label indexing
3. 标签索引
Label indexing can be very handy, but in this case, we are again doing more work for no benefit.
标签索引可能非常方便,但在这种情况下,我们再次做了更多的工作却没有任何好处。
df.set_index('A', append=True, drop=False).xs('foo', level=1)
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
4. df.query() API
4. df.query() API
pd.DataFrame.query is a very elegant/intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data, the query is very efficient. More so than the standard approach and of similar magnitude as my best suggestion.
pd.DataFrame.query 是执行此任务的一种非常优雅/直观的方式,但通常速度较慢。但是,如果您注意下面的时序,对于大数据,query 是非常高效的,比标准方法更高效,并且与我的最佳建议速度相当。
df.query('A == "foo"')
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
My preference is to use the Boolean mask.
我的偏好是使用布尔 mask。
Actual improvements can be made by modifying how we create our Boolean mask.
可以通过修改创建布尔 mask 的方式来获得实际的改进。
mask alternative 1
Use the underlying numpy array and forgo the overhead of creating another pd.Series
mask 替代方案 1
使用底层 numpy 数组并放弃创建另一个 pd.Series 的开销
mask = df['A'].values == 'foo'
I'll show more complete time tests at the end, but just take a look at the performance gains we get using the sample data frame. First, we look at the difference in creating the mask.
我将在最后展示更完整的时间测试,但先看看我们使用示例数据框获得的性能提升。首先,我们看一下创建 mask 时的差异:
%timeit mask = df['A'].values == 'foo'
%timeit mask = df['A'] == 'foo'
5.84 μs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
166 μs ± 4.45 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Evaluating the mask with the numpy array is ~ 30 times faster. This is partly due to numpy evaluation often being faster. It is also partly due to the lack of overhead necessary to build an index and a corresponding pd.Series object.
使用 numpy 数组评估 mask 大约快 30 倍。这部分是由于 numpy 评估通常更快,部分则是因为省去了构建索引和相应 pd.Series 对象所需的开销。
Next, we'll look at the timing for slicing with one mask versus the other.
接下来,我们将比较用这两种 mask 进行切片的时间。
mask = df['A'].values == 'foo'
%timeit df[mask]
mask = df['A'] == 'foo'
%timeit df[mask]
219 μs ± 12.3 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
239 μs ± 7.03 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The performance gains aren't as pronounced. We'll see if this holds up over more robust testing.
性能提升没有那么明显。我们将看看这是否能通过更强大的测试。
mask alternative 2
We could have reconstructed the data frame as well. There is a big caveat when reconstructing a dataframe: you must take care of the dtypes when doing so!
mask 替代方案 2
我们也可以重建数据帧。重建数据帧时有一个很大的警告:这样做时你必须注意 dtypes!
Instead of df[mask] we will do this:
我们将这样做,而不是 df[mask]:
pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)
If the data frame is of mixed type, which our example is, then when we get df.values the resulting array is of dtype object and consequently, all columns of the new data frame will be of dtype object. Thus requiring the astype(df.dtypes) and killing any potential performance gains.
如果数据框是混合类型(我们的例子就是这样),那么当我们取 df.values 时,得到的数组的 dtype 为 object,因此新数据框的所有列都将是 dtype object。这就需要 astype(df.dtypes),从而抵消任何潜在的性能提升。
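To see the dtype issue concretely, here is a small sketch (made-up data) comparing a mixed-type frame with a uniform one:

```python
import numpy as np
import pandas as pd

mixed = pd.DataFrame({'A': ['foo', 'bar'], 'C': [0, 1]})
uniform = pd.DataFrame(np.arange(4).reshape(2, 2), columns=['X', 'Y'])

# A mixed-type frame collapses to a single object array ...
print(mixed.values.dtype)         # object

# ... while a uniform frame keeps a fast numeric dtype
print(uniform.values.dtype.kind)  # 'i' (integer)
```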
%timeit df[mask]
%timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)
216 μs ± 10.4 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.43 ms ± 39.6 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
However, if the data frame is not of mixed type, this is a very useful way to do it.
但是,如果数据框不是混合类型,这是一种非常有用的方法。
Given
给定的
np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))
d1
A B C D E
0 0 2 7 3 8
1 7 0 6 8 6
2 0 2 0 4 9
3 7 3 2 4 3
4 3 6 7 7 4
5 5 3 7 5 9
6 8 7 6 4 7
7 6 2 6 6 5
8 2 8 7 5 8
9 4 7 6 1 5
%%timeit
mask = d1['A'].values == 7
d1[mask]
179 μs ± 8.73 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Versus
相对
%%timeit
mask = d1['A'].values == 7
pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)
87 μs ± 5.12 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
We cut the time in half.
我们把时间减半。
mask alternative 3
@unutbu also shows us how to use pd.Series.isin to account for each element of df['A'] being in a set of values. This evaluates to the same thing if our set of values is a set of one value, namely 'foo'. But it also generalizes to include larger sets of values if needed. Turns out, this is still pretty fast even though it is a more general solution. The only real loss is in intuitiveness for those not familiar with the concept.
mask 替代方案 3
@unutbu 还向我们展示了如何使用 pd.Series.isin 来判断 df['A'] 的每个元素是否存在于一组值中。如果我们的值集合只包含一个值,即 'foo',则计算结果相同。但如果需要,它也可以推广到更大的值集合。事实证明,即使它是一个更通用的解决方案,这仍然相当快。对于不熟悉这个概念的人来说,唯一真正的损失是直观性。
mask = df['A'].isin(['foo'])
df[mask]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
However, as before, we can utilize numpy to improve performance while sacrificing virtually nothing. We'll use np.in1d.
然而,和以前一样,我们可以利用 numpy 来提高性能,同时几乎不牺牲任何东西。我们将使用 np.in1d。
mask = np.in1d(df['A'].values, ['foo'])
df[mask]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
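Note that newer NumPy releases recommend np.isin in place of np.in1d; it is a drop-in replacement here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar'.split()})

# np.isin is the modern spelling of np.in1d
mask = np.isin(df['A'].values, ['foo'])
print(df[mask]['A'].tolist())  # ['foo', 'foo']
```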
Timing
I'll include other concepts mentioned in other posts as well for reference.
Code Below
时间安排
我将包括其他帖子中提到的其他概念以供参考。
下面的代码
Each Column in this table represents a different length data frame over which we test each function. Each column shows relative time taken, with the fastest function given a base index of 1.0.
此表中的每一列代表一个不同长度的数据帧,我们在其上测试每个函数。每列显示所用的相对时间,最快的函数的基本索引为1.0。
res.div(res.min())
10 30 100 300 1000 3000 10000 30000
mask_standard 2.156872 1.850663 2.034149 2.166312 2.164541 3.090372 2.981326 3.131151
mask_standard_loc 1.879035 1.782366 1.988823 2.338112 2.361391 3.036131 2.998112 2.990103
mask_with_values 1.010166 1.000000 1.005113 1.026363 1.028698 1.293741 1.007824 1.016919
mask_with_values_loc 1.196843 1.300228 1.000000 1.000000 1.038989 1.219233 1.037020 1.000000
query 4.997304 4.765554 5.934096 4.500559 2.997924 2.397013 1.680447 1.398190
xs_label 4.124597 4.272363 5.596152 4.295331 4.676591 5.710680 6.032809 8.950255
mask_with_isin 1.674055 1.679935 1.847972 1.724183 1.345111 1.405231 1.253554 1.264760
mask_with_in1d 1.000000 1.083807 1.220493 1.101929 1.000000 1.000000 1.000000 1.144175
You'll notice that the fastest times seem to be shared between mask_with_values and mask_with_in1d.
您会注意到,最快的时间似乎由 mask_with_values 和 mask_with_in1d 共享。
res.T.plot(loglog=True)
Functions
函数
def mask_standard(df):
    mask = df['A'] == 'foo'
    return df[mask]

def mask_standard_loc(df):
    mask = df['A'] == 'foo'
    return df.loc[mask]

def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]

def mask_with_values_loc(df):
    mask = df['A'].values == 'foo'
    return df.loc[mask]

def query(df):
    return df.query('A == "foo"')

def xs_label(df):
    return df.set_index('A', append=True, drop=False).xs('foo', level=-1)

def mask_with_isin(df):
    mask = df['A'].isin(['foo'])
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]
Testing
测试
res = pd.DataFrame(
    index=[
        'mask_standard', 'mask_standard_loc', 'mask_with_values', 'mask_with_values_loc',
        'query', 'xs_label', 'mask_with_isin', 'mask_with_in1d'
    ],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

for j in res.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in res.index:
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        res.at[i, j] = timeit(stmt, setp, number=50)
Special Timing
Looking at the special case when we have a single non-object dtype for the entire data frame.
Code Below
特殊时序
查看整个数据帧只有单一非 object dtype 时的特殊情况。
下面的代码
spec.div(spec.min())
10 30 100 300 1000 3000 10000 30000
mask_with_values 1.009030 1.000000 1.194276 1.000000 1.236892 1.095343 1.000000 1.000000
mask_with_in1d 1.104638 1.094524 1.156930 1.072094 1.000000 1.000000 1.040043 1.027100
reconstruct 1.000000 1.142838 1.000000 1.355440 1.650270 2.222181 2.294913 3.406735
Turns out, reconstruction isn't worth it past a few hundred rows.
事实证明,超过几百行之后,重建就不值得了。
spec.T.plot(loglog=True)
Functions
函数
np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))

def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]

def reconstruct(df):
    v = df.values
    mask = np.in1d(df['A'].values, ['foo'])
    return pd.DataFrame(v[mask], df.index[mask], df.columns)
spec = pd.DataFrame(
    index=['mask_with_values', 'mask_with_in1d', 'reconstruct'],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)
Testing
测试
for j in spec.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in spec.index:
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        spec.at[i, j] = timeit(stmt, setp, number=50)
回答by SP001
For selecting only specific columns out of multiple columns for a given value in pandas:
对于 Pandas 中的给定值,仅从多列中选择特定列:
select col_name1, col_name2 from table where column_name = some_value.
Options:
选项:
df.loc[df['column_name'] == some_value][[col_name1, col_name2]]
or
或者
df.query("column_name == 'some_value'")[[col_name1, col_name2]]
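The same SELECT with specific columns can also be done in a single .loc step, which avoids chained indexing. A sketch with an assumed toy frame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'],
                   'qty': [1, 2],
                   'price': [10, 20]})

# SELECT name, qty FROM df WHERE price = 20
subset = df.loc[df['price'] == 20, ['name', 'qty']]
print(subset.values.tolist())  # [['b', 2]]
```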
回答by Vahidn
You can also use .apply:
你也可以使用 .apply:
df.apply(lambda row: row[df['B'].isin(['one','three'])])
It actually works row-wise (i.e., applies the function to each row).
它实际上是按行工作的(即,将函数应用于每一行)。
The output is
输出是
A B C D
0 foo one 0 0
1 bar one 1 2
3 bar three 3 6
6 foo one 6 12
7 foo three 7 14
The result is the same as using the boolean indexing mentioned by @unutbu:
结果与使用 @unutbu 提到的布尔索引方法相同:
df[df['B'].isin(['one','three'])]
回答by Erfan
More flexibility using .query with pandas >= 0.25.0:
使用 .query 配合 pandas >= 0.25.0 有更大的灵活性:
August 2019 updated answer
2019 年 8 月更新的答案
Since pandas >= 0.25.0 we can use the query method to filter dataframes with pandas methods and even column names which have spaces. Normally the spaces in column names would give an error, but now we can solve that using a backtick (`), see GitHub:
从 pandas >= 0.25.0 起,我们可以使用 query 方法配合 pandas 方法来过滤数据帧,甚至可以使用带有空格的列名。通常列名中的空格会出错,但现在我们可以使用反引号 (`) 解决该问题,请参阅 GitHub:
# Example dataframe
df = pd.DataFrame({'Sender email': ['customer@example.com', 'mary@shop.com', 'john@shop.com']})

           Sender email
0  customer@example.com
1         mary@shop.com
2         john@shop.com
Using .query with the method str.endswith:
使用 .query 配合 str.endswith 方法:
df.query('`Sender email`.str.endswith("@shop.com")')
Output
输出
    Sender email
1  mary@shop.com
2  john@shop.com
Also we can use local variables by prefixing them with an @ in our query:
我们也可以在查询中通过在局部变量前加 @ 来使用它们:
domain = 'shop.com'
df.query('`Sender email`.str.endswith(@domain)')
Output
输出
    Sender email
1  mary@shop.com
2  john@shop.com


