How to select rows that do not start with some str in pandas?
Original question: http://stackoverflow.com/questions/41689722/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by running man
I want to select the rows whose values do not start with certain strings. For example, I have a pandas df, and I want to select the data that does not start with t or c. In this sample, the output should be mext1 and okl1.
import pandas as pd
df=pd.DataFrame({'col':['text1','mext1','cext1','okl1']})
df
col
0 text1
1 mext1
2 cext1
3 okl1
I want this:
col
0 mext1
1 okl1
Answered by Ted Petrou
You can use the str accessor to get string functionality. Its get method grabs the character at a given index of each string.
df[~df.col.str.get(0).isin(['t', 'c'])]
col
1 mext1
3 okl1
It looks like you can also use startswith with a tuple (not a list) of the values you want to exclude.
df[~df.col.str.startswith(('t', 'c'))]
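One case the tuple form does not cover by itself is missing values: without the na argument, a missing entry produces a missing value in the mask, and negating that mask can fail. A minimal sketch, assuming a hypothetical copy of the question's data with one None added, and assuming you want rows with missing values dropped:

```python
import pandas as pd

# Hypothetical data: the question's column plus one missing value.
df = pd.DataFrame({'col': ['text1', 'mext1', None, 'cext1', 'okl1']})

# na=True treats missing values as "starts with", so the negation drops them.
mask = ~df['col'].str.startswith(('t', 'c'), na=True)
result = df[mask]
print(result)
```

Passing na=False instead would keep the rows with missing values.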
Answered by piRSquared
option 1
use str.match with a negative lookahead
df[df.col.str.match('^(?![tc])')]
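The same lookahead idea should extend to prefixes longer than one character. A sketch, where the prefixes list is a hypothetical input and re.escape guards against regex metacharacters in it:

```python
import re
import pandas as pd

df = pd.DataFrame({'col': ['text1', 'mext1', 'cext1', 'okl1']})

# Hypothetical prefixes to exclude; they may be longer than one character.
prefixes = ['te', 'ok']

# Builds '^(?!te|ok)': the match fails when any prefix sits at the start.
pattern = '^(?!' + '|'.join(re.escape(p) for p in prefixes) + ')'
result = df[df['col'].str.match(pattern)]
print(result)
```

Here cext1 is kept because it starts with ce, not te.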
option 2
within query
df.query('col.str[0] not in list("tc")')
option 3
numpy broadcasting
df[~(df.col.str[0].values[:, None] == ['t', 'c']).any(axis=1)]
col
1 mext1
3 okl1
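To see what the broadcasting step does, here is a sketch in plain NumPy. The array names are made up for illustration; first stands in for the first character of each row:

```python
import numpy as np

first = np.array(['t', 'm', 'c', 'o'])   # first characters, shape (4,)
excluded = np.array(['t', 'c'])          # characters to exclude, shape (2,)

# A (4, 1) array compared with a (2,) array broadcasts to a (4, 2) table:
# one row per value, one column per excluded character.
table = first[:, None] == excluded

# A row is excluded if any column matched; negate to get the rows to keep.
keep = ~table.any(axis=1)
print(keep)
```

The resulting boolean array can be used directly as a row mask on the DataFrame.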
time testing
def ted(df):
    return df[~df.col.str.get(0).isin(['t', 'c'])]
def adele(df):
    return df[~df['col'].str.startswith(('t','c'))]
def yohanes(df):
    return df[df.col.str.contains('^[^tc]')]
def pir1(df):
    return df[df.col.str.match('^(?![tc])')]
def pir2(df):
    return df.query('col.str[0] not in list("tc")')
def pir3(df):
    # negate the match and return the result, like the other functions
    return df[~(df.col.str[0].values[:, None] == ['t', 'c']).any(axis=1)]
from string import ascii_lowercase
from timeit import timeit

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

functions = pd.Index(['ted', 'adele', 'yohanes', 'pir1', 'pir2', 'pir3'], name='Method')
lengths = pd.Index([10, 100, 1000, 5000, 10000], name='Length')
# float dtype so the timings can be plotted
results = pd.DataFrame(index=lengths, columns=functions, dtype=float)

for i in lengths:
    a = np.random.choice(list(ascii_lowercase), i)
    df = pd.DataFrame(dict(col=a))
    for j in functions:
        # DataFrame.set_value was removed from pandas; .at sets a single cell
        results.at[i, j] = timeit(
            '{}(df)'.format(j),
            'from __main__ import df, {}'.format(j),
            number=1000
        )

fig, axes = plt.subplots(3, 1, figsize=(8, 12))
results.plot(ax=axes[0], title='All Methods')
results.drop(columns='pir2').plot(ax=axes[1], title='Drop `pir2`')
results[['ted', 'adele', 'pir3']].plot(ax=axes[2], title='Just the fast ones')
fig.tight_layout()
Answered by ade1e
You can use str.startswith and negate it.
df[~df['col'].str.startswith('t') &
~df['col'].str.startswith('c')]
col
1 mext1
3 okl1
Or, the better option, pass multiple characters in a tuple, as per @Ted Petrou:
df[~df['col'].str.startswith(('t','c'))]
col
1 mext1
3 okl1
Answered by Yohanes Gultom
Just another alternative, in case you prefer regex:
df[df.col.str.contains('^[^tc]')]
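One difference from the negative lookahead pattern is worth knowing: the character class [^tc] has to consume a character, so empty strings are not selected, while the lookahead accepts them. A small sketch on a hypothetical Series containing an empty string:

```python
import pandas as pd

# Hypothetical values, including an empty string.
s = pd.Series(['text1', '', 'mext1'])

# '^[^tc]' needs at least one character that is not t or c.
with_class = s.str.contains('^[^tc]')

# '^(?![tc])' only requires that the next character, if any, is not t or c.
with_lookahead = s.str.match('^(?![tc])')

print(with_class.tolist(), with_lookahead.tolist())
```

Which behavior is right depends on whether empty strings count as "not starting with t or c" in your data.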
Answered by Yantao Xie
You can use the apply method.
Taking your question as an example, the code looks like this:
df[df['col'].apply(lambda x: x[0] not in ['t', 'c'])]
I think apply is a more general and flexible method.
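As an illustration of that flexibility, the predicate can also guard against values that x[0] would choke on, such as empty strings or missing values. A sketch, where keep is a hypothetical helper name and the data is made up:

```python
import pandas as pd

# Hypothetical data with a missing value and an empty string.
df = pd.DataFrame({'col': ['text1', 'mext1', None, '', 'okl1']})

def keep(value):
    # Drop non-strings (such as missing values); str.startswith is safe on ''.
    return isinstance(value, str) and not value.startswith(('t', 'c'))

result = df[df['col'].apply(keep)]
print(result)
```

The trade-off is that apply calls a Python function per row, so it will usually be slower than the vectorized str methods on large frames.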