pandas: How to select rows that do not start with some str in pandas?

Note: this page is a Chinese-English translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/41689722/

Time: 2020-09-14 02:48:15  Source: igfitidea

How to select rows that do not start with some str in pandas?

python, pandas, numpy

Asked by running man

I want to select rows whose values do not start with certain strings. For example, I have a pandas df, and I want to select the rows whose values do not start with t or c. In this sample, the output should be mext1 and okl1.

import pandas as pd

df=pd.DataFrame({'col':['text1','mext1','cext1','okl1']})
df

    col
0   text1
1   mext1
2   cext1
3   okl1

I want this:

    col
0   mext1
1   okl1

Answered by Ted Petrou

You can use the str accessor to get string functionality. The get method grabs the character at a given index of each string.

df[~df.col.str.get(0).isin(['t', 'c'])]

     col
1  mext1
3   okl1

It looks like you can use startswith as well, with a tuple (and not a list) of the values you want to exclude.

df[~df.col.str.startswith(('t', 'c'))]
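
A minimal sketch of the tuple form on the question's sample data. One assumption is added that is not in the original question: a missing value is appended to the column, to show how the na argument of str.startswith decides what happens to such rows.

```python
import pandas as pd

# Sample column from the question, plus a missing value (an added assumption).
df = pd.DataFrame({'col': ['text1', 'mext1', 'cext1', 'okl1', None]})

prefixes = ('t', 'c')  # must be a tuple, not a list

# na=False treats missing values as "does not start with a prefix",
# so after negation they are kept; use na=True to drop them instead.
starts = df['col'].str.startswith(prefixes, na=False)
kept = df[~starts]

print(kept['col'].tolist())
```

Without the na argument the mask can contain missing values, and boolean indexing with such a mask may fail, so setting it explicitly is the safer choice when the column is not guaranteed to be complete.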

Answered by piRSquared

option 1
use str.match and a negative lookahead

df[df.col.str.match('^(?![tc])')]
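
The same lookahead idea extends to prefixes longer than one character. The sketch below is illustrative only: the prefixes 'te' and 'ok' are hypothetical, and the column is created with object dtype on the assumption that the pattern is then evaluated by Python's re module (other string backends may use a regex engine without lookahead support).

```python
import re
import pandas as pd

# object dtype is requested explicitly (an assumption, see the note above).
df = pd.DataFrame({'col': ['text1', 'mext1', 'cext1', 'okl1']}, dtype=object)

# Hypothetical multi-character prefixes, chosen only for illustration.
prefixes = ['te', 'ok']

# Build ^(?!te|ok); re.escape guards against regex metacharacters in a prefix.
pattern = '^(?!' + '|'.join(re.escape(p) for p in prefixes) + ')'

# The lookahead consumes no characters, so the match succeeds for every
# string that does not begin with one of the prefixes.
mask = df['col'].str.match(pattern)
print(df[mask]['col'].tolist())
```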

option 2
within query

df.query('col.str[0] not in list("tc")')

option 3
numpy broadcasting

df[~(df.col.str[0].values[:, None] == ['t', 'c']).any(axis=1)]


         col
1  mext1
3   okl1
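
To make the broadcasting step easier to follow, here is a sketch that spells out the intermediate shapes. The variable names (first, excluded, hits, keep) are introduced here for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['text1', 'mext1', 'cext1', 'okl1']})

# First character of every value as a column vector of shape (4, 1).
first = df['col'].str[0].to_numpy(dtype=object)[:, None]

# Excluded characters as a row of shape (2,).
excluded = np.array(['t', 'c'], dtype=object)

# Broadcasting compares each row with each excluded character: shape (4, 2).
hits = first == excluded

# A row is dropped if any comparison in it is True.
keep = ~hits.any(axis=1)
print(df[keep]['col'].tolist())
```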


time testing

def ted(df):
    return df[~df.col.str.get(0).isin(['t', 'c'])]

def adele(df):
    return df[~df['col'].str.startswith(('t','c'))]

def yohanes(df):
    return df[df.col.str.contains('^[^tc]')]

def pir1(df):
    return df[df.col.str.match('^(?![tc])')]

def pir2(df):
    return df.query('col.str[0] not in list("tc")')

def pir3(df):
    # negate the mask and return the result, as the other functions do
    return df[~(df.col.str[0].values[:, None] == ['t', 'c']).any(axis=1)]

functions = pd.Index(['ted', 'adele', 'yohanes', 'pir1', 'pir2', 'pir3'], name='Method')
lengths = pd.Index([10, 100, 1000, 5000, 10000], name='Length')
results = pd.DataFrame(index=lengths, columns=functions, dtype=float)

import numpy as np
import matplotlib.pyplot as plt
from string import ascii_lowercase
from timeit import timeit

for i in lengths:
    a = np.random.choice(list(ascii_lowercase), i)
    df = pd.DataFrame(dict(col=a))
    for j in functions:
        # DataFrame.set_value is no longer available; .at sets a single cell
        results.at[i, j] = timeit(
            '{}(df)'.format(j),
            'from __main__ import df, {}'.format(j),
            number=1000
        )

fig, axes = plt.subplots(3, 1, figsize=(8, 12))
results.plot(ax=axes[0], title='All Methods')
results.drop(columns='pir2').plot(ax=axes[1], title='Drop `pir2`')
results[['ted', 'adele', 'pir3']].plot(ax=axes[2], title='Just the fast ones')
fig.tight_layout()

Answered by ade1e

You can use str.startswith and negate it.

    df[~df['col'].str.startswith('t') & 
       ~df['col'].str.startswith('c')]

    col
1   mext1
3   okl1

Or the better option, with multiple characters in a tuple as per @Ted Petrou:

df[~df['col'].str.startswith(('t','c'))]

    col
1   mext1
3   okl1

Answered by Yohanes Gultom

Just another alternative in case you prefer regex:

df[df.col.str.contains('^[^tc]')]
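
One difference from the startswith approach is worth checking on your own data: the character class needs an actual first character, so empty strings are handled differently. A small sketch, with an empty string added to the question's data as an assumption:

```python
import pandas as pd

# The question's data plus an empty string (an added assumption).
s = pd.Series(['text1', 'mext1', 'cext1', 'okl1', ''])

# ^[^tc] needs one character that is neither t nor c, so '' does not match.
by_class = s[s.str.contains('^[^tc]')]

# Negating startswith keeps '' because it starts with neither prefix.
by_prefix = s[~s.str.startswith(('t', 'c'))]

print(by_class.tolist())
print(by_prefix.tolist())
```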

Answered by Yantao Xie

You can use the apply method.

Taking your question as an example, the code looks like this:

df[df['col'].apply(lambda x: x[0] not in ['t', 'c'])]

I think apply is a more general and flexible method.
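
As a sketch of that flexibility, the lambda can be replaced by a named predicate that also copes with values the one-liner would fail on. The helper name keep_row and the extra rows (an empty string and a missing value) are assumptions made for illustration.

```python
import pandas as pd

# The question's data plus an empty string and a missing value (added assumptions).
df = pd.DataFrame({'col': ['text1', 'mext1', 'cext1', 'okl1', '', None]})

def keep_row(value, prefixes=('t', 'c')):
    """Hypothetical helper: True when value is a string not starting with a prefix."""
    # value[0] would raise IndexError on '' and TypeError on a missing value,
    # so test the type first and let str.startswith handle empty strings.
    return isinstance(value, str) and not value.startswith(prefixes)

result = df[df['col'].apply(keep_row)]
print(result['col'].tolist())
```

The trade-off is speed: apply calls the Python function once per row, so it is usually slower than the vectorized str methods on large frames.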