Python: Remove unwanted parts from strings in a column

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question on StackOverflow: http://stackoverflow.com/questions/13682044/


Remove unwanted parts from strings in a column

Tags: python, string, pandas, dataframe

Asked by Yannan Wang

I am looking for an efficient way to remove unwanted parts from strings in a DataFrame column.


Data looks like:


    time    result
1    09:00   +52A
2    10:00   +62B
3    11:00   +44a
4    12:00   +30b
5    13:00   -110a

I need to trim these data to:


    time    result
1    09:00   52
2    10:00   62
3    11:00   44
4    12:00   30
5    13:00   110

I tried .str.lstrip('+-') and .str.rstrip('aAbBcC'), but got an error:


TypeError: wrapper() takes exactly 1 argument (2 given)

Any pointers would be greatly appreciated!


Accepted answer by eumiro

data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
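
For context, here is a minimal runnable sketch of the accepted answer (the DataFrame construction is an assumed reconstruction of the sample data in the question):

import pandas as pd

# Assumed reconstruction of the question's sample data.
data = pd.DataFrame({'time': ['09:00', '10:00', '11:00', '12:00', '13:00'],
                     'result': ['+52A', '+62B', '+44a', '+30b', '-110a']})

# Strip a leading '+'/'-' and any trailing a/A/b/B/c/C from each value.
data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))

print(data)
#     time result
# 0  09:00     52
# 1  10:00     62
# 2  11:00     44
# 3  12:00     30
# 4  13:00    110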

Answered by Wes McKinney

There's a bug here: you currently cannot pass arguments to str.lstrip and str.rstrip:


http://github.com/pydata/pandas/issues/2411


EDIT: 2012-12-07 this works now on the dev branch:


In [8]: df['result'].str.lstrip('+-').str.rstrip('aAbBcC')
Out[8]: 
1     52
2     62
3     44
4     30
5    110
Name: result

Answered by prl900

In the particular case where you know the number of positions that you want to remove from the dataframe column, you can use string indexing inside a lambda function to get rid of those parts:


Last character:


data['result'] = data['result'].map(lambda x: str(x)[:-1])

First two characters:


data['result'] = data['result'].map(lambda x: str(x)[2:])
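
If the values always consist of exactly one leading sign character and one trailing letter (an assumption based on the question's sample data), the two slices can be combined into a single expression:

data['result'] = data['result'].map(lambda x: str(x)[1:-1])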

Answered by Coder375

I'd use the pandas replace function; it's very simple and powerful because you can use regex. Below I'm using the regex \D to remove any non-digit characters, but obviously you could get quite creative with regex.


data['result'].replace(regex=True,inplace=True,to_replace=r'\D',value=r'')
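
If you would rather not modify the column in place, the same regex replacement can be written as a plain assignment (a small sketch; the result should be identical):

data['result'] = data['result'].replace(to_replace=r'\D', value='', regex=True)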

Answered by tim654321

I often use list comprehensions for these types of tasks because they're often faster.


There can be big differences in performance between the various methods for doing things like this (i.e. modifying every element of a series within a DataFrame). Often a list comprehension can be fastest - see code race below for this task:


import pandas as pd
#Map
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
10000 loops, best of 3: 187 μs per loop
#List comprehension
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in data['result']]
10000 loops, best of 3: 117 μs per loop
#.str
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].str.lstrip('+-').str.rstrip('aAbBcC')
1000 loops, best of 3: 336 μs per loop

Answered by Ted Petrou

A very simple method would be to use the extract method to select all the digits. Simply supply it the regular expression '\d+', which extracts any number of digits.


df['result'] = df.result.str.extract(r'(\d+)', expand=True).astype(int)
df

    time  result
1  09:00      52
2  10:00      62
3  11:00      44
4  12:00      30
5  13:00     110

Answered by cs95

How do I remove unwanted parts from strings in a column?


6 years after the original question was posted, pandas now has a good number of "vectorised" string functions that can succinctly perform these string manipulation operations.


This answer will explore some of these string functions, suggest faster alternatives, and go into a timings comparison at the end.




.str.replace


Specify the substring/pattern to match, and the substring to replace it with.


pd.__version__
# '0.24.1'

df    
    time result
1  09:00   +52A
2  10:00   +62B
3  11:00   +44a
4  12:00   +30b
5  13:00  -110a

df['result'] = df['result'].str.replace(r'\D', '')
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

If you need the result converted to an integer, you can use Series.astype,


df['result'] = df['result'].str.replace(r'\D', '').astype(int)

df.dtypes
time      object
result     int64
dtype: object

If you don't want to modify df in-place, use DataFrame.assign:


df2 = df.assign(result=df['result'].str.replace(r'\D', ''))
df
# Unchanged


.str.extract


Useful for extracting the substring(s) you want to keep.


df['result'] = df['result'].str.extract(r'(\d+)', expand=False)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

With extract, it is necessary to specify at least one capture group. expand=False will return a Series with the captured items from the first capture group.

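To illustrate the difference (a small sketch, assuming the same df as above), expand=True returns a one-column DataFrame, so you would select a column before assigning:

# expand=False returns a Series of the first capture group
df['result'].str.extract(r'(\d+)', expand=False)

# expand=True returns a DataFrame with one column per capture group;
# select column 0 to get a Series back
df['result'].str.extract(r'(\d+)', expand=True)[0]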



.str.split and .str.get


Splitting works assuming all your strings follow this consistent structure.


# df['result'] = df['result'].str.split(r'\D').str[1]
df['result'] = df['result'].str.split(r'\D').str.get(1)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

Not recommended if you are looking for a general solution.

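For intuition on why .str.get(1) picks out the number, here is a small sketch using Python's re.split, which behaves like str.split with a regex pattern on these strings (this assumes the '+52A'-style format from the question):

import re

re.split(r'\D', '+52A')    # ['', '52', ''] -> element 1 is the digits
re.split(r'\D', '-110a')   # ['', '110', '']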



If you are satisfied with the succinct and readable str accessor-based solutions above, you can stop here. However, if you are interested in faster, more performant alternatives, keep reading.




Optimizing: List Comprehensions


In some circumstances, list comprehensions should be favoured over pandas string functions. The reason is that string functions are inherently hard to vectorize (in the true sense of the word), so most string and regex functions are only wrappers around loops with more overhead.


My write-up, Are for-loops in pandas really bad? When should I care?, goes into greater detail.


The str.replace option can be re-written using re.sub:


import re

# Pre-compile your regex pattern for more performance.
p = re.compile(r'\D')
df['result'] = [p.sub('', x) for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

The str.extract example can be re-written using a list comprehension with re.search:


p = re.compile(r'\d+')
df['result'] = [p.search(x)[0] for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

If NaNs or no-matches are a possibility, you will need to re-write the above to include some error checking. I do this using a function.


import numpy as np

def try_extract(pattern, string):
    try:
        m = pattern.search(string)
        return m.group(0)
    except (TypeError, ValueError, AttributeError):
        return np.nan

p = re.compile(r'\d+')
df['result'] = [try_extract(p, x) for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

We can also re-write @eumiro's and @MonkeyButter's answers using list comprehensions:


df['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in df['result']]

And,


df['result'] = [x[1:-1] for x in df['result']]

The same rules for handling NaNs, etc., apply.




Performance Comparison


[Performance comparison plot]

Graphs generated using perfplot. A full code listing is available for your reference; the relevant functions are listed below.


Some of these comparisons are unfair because they take advantage of the structure of the OP's data, but take from it what you will. One thing to note is that every list comprehension function is either faster than or comparable to its equivalent pandas variant.


Functions


def eumiro(df):
    return df.assign(
        result=df['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC')))

def coder375(df):
    return df.assign(
        result=df['result'].replace(r'\D', r'', regex=True))

def monkeybutter(df):
    return df.assign(result=df['result'].map(lambda x: x[1:-1]))

def wes(df):
    return df.assign(result=df['result'].str.lstrip('+-').str.rstrip('aAbBcC'))

def cs1(df):
    return df.assign(result=df['result'].str.replace(r'\D', ''))

def cs2_ted(df):
    # `str.extract` based solution, similar to @Ted Petrou's. so timing together.
    return df.assign(result=df['result'].str.extract(r'(\d+)', expand=False))

# Pre-compiled patterns used by the list comprehension variants
# (defined earlier in this answer with re.compile).
p1 = re.compile(r'\D')
p2 = re.compile(r'\d+')

def cs1_listcomp(df):
    return df.assign(result=[p1.sub('', x) for x in df['result']])

def cs2_listcomp(df):
    return df.assign(result=[p2.search(x)[0] for x in df['result']])

def cs_eumiro_listcomp(df):
    return df.assign(
        result=[x.lstrip('+-').rstrip('aAbBcC') for x in df['result']])

def cs_mb_listcomp(df):
    return df.assign(result=[x[1:-1] for x in df['result']])

Answered by Mr. Prophet

Try this using a regular expression:


import re
data['result'] = data['result'].map(lambda x: re.sub('[-+A-Za-z]', '', x))

Answered by Rishi Bansal

Suppose your DataFrame also has those extra characters in between the digits, as in the last entry:


  result   time
0   +52A  09:00
1   +62B  10:00
2   +44a  11:00
3   +30b  12:00
4  -110a  13:00
5   3+b0  14:00

You can try str.replace to remove characters not only from the start and end but also from in between.


DF['result'] = DF['result'].str.replace(r'\+|a|b|\-|A|B', '')

Output:


  result   time
0     52  09:00
1     62  10:00
2     44  11:00
3     30  12:00
4    110  13:00
5     30  14:00
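
Note that the default of the regex argument to str.replace has changed in newer pandas releases (it is False as of pandas 2.0), so a pattern like the one above may be treated as a literal string there. A hedged sketch for recent versions:

DF['result'] = DF['result'].str.replace(r'\+|a|b|\-|A|B', '', regex=True)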