pandas: replace substring in pandas data frame column

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/32902837/

Time: 2020-09-13 23:57:53  Source: igfitidea

replace substring in pandas data frame column

Tags: pandas, replace, substring, dataframe

Asked by Felix

I am working with a dataframe that contains a column named "raw_parameter_name". In this column I have different string values. Several values follow the pattern "ABCD;MEAN". What I am trying to do is replace each value "ABCD;MEAN" with "ABCD;X-BAR". The substring "ABCD" may vary, but the pattern ";MEAN" that I want to replace is constant. I looked into different options using the "replace" method, but I don't know how to replace only the substring and not the whole string. Please advise. Thank you in advance.

Answered by EdChum

Use str.contains to create a boolean index to mask the series, and then str.replace to replace your substring:

In [172]:
df = pd.DataFrame({'raw_parameter_name':['ABCD;MEAN', 'EFGH;MEAN', '1234;MEAN', 'sdasd;MEAT']})
df

Out[172]:
  raw_parameter_name
0          ABCD;MEAN
1          EFGH;MEAN
2          1234;MEAN
3         sdasd;MEAT

In [173]:
df.loc[df['raw_parameter_name'].str.contains(';MEAN$'), 'raw_parameter_name'] = df['raw_parameter_name'].str.replace('MEAN', 'X-BAR')
df

Out[173]:
  raw_parameter_name
0         ABCD;X-BAR
1         EFGH;X-BAR
2         1234;X-BAR
3         sdasd;MEAT

Here it matches the rows where the substring ';MEAN' occurs at the end of the value; the $ is a terminating symbol that anchors the pattern to the end of the string.
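
To make the role of the $ anchor concrete, here is a minimal sketch that uses only the standard re module. The sample strings are made up for illustration and are not taken from the question's data.

```python
import re

# Hypothetical sample values, chosen to show where the anchor matters.
values = ["ABCD;MEAN", "ABCD;MEANS", "MEAN;ABCD", "sdasd;MEAT"]

# Without the anchor, ';MEAN' may match anywhere in the string.
unanchored = [bool(re.search(";MEAN", v)) for v in values]

# With '$', the match must finish at the end of the string
# (or just before a single trailing newline).
anchored = [bool(re.search(";MEAN$", v)) for v in values]

print(unanchored)  # [True, True, False, False]
print(anchored)    # [True, False, False, False]
```

The second value, "ABCD;MEANS", is the case the anchor guards against: it contains ';MEAN' but does not end with it.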

The boolean mask looks like the following:

In [176]:
df['raw_parameter_name'].str.contains(';MEAN$')

Out[176]:
0     True
1     True
2     True
3    False
Name: raw_parameter_name, dtype: bool

Timings

For a 40,0000 row df, using str.replace is faster than using apply:

In [183]:
import re
%timeit df['raw_parameter_name'].apply(lambda x: re.sub(';MEAN$',';X-BAR',x))
%timeit df['raw_parameter_name'].str.replace('MEAN', 'X-BAR')
1 loops, best of 3: 1.01 s per loop
1 loops, best of 3: 687 ms per loop
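
Outside IPython, where %timeit is not available, a comparable measurement can be sketched with the standard timeit module. The row count (40,000) and the repeat count are arbitrary assumptions, and both calls use the same anchored pattern so that the comparison is like for like. Absolute numbers will depend on the machine and the pandas version.

```python
import re
import timeit

import pandas as pd

# Assumed test data: the four sample values repeated to 40,000 rows.
df = pd.DataFrame({
    "raw_parameter_name": ["ABCD;MEAN", "EFGH;MEAN", "1234;MEAN", "sdasd;MEAT"] * 10_000
})
col = df["raw_parameter_name"]


def with_apply():
    # Python-level loop: one re.sub call per row.
    return col.apply(lambda x: re.sub(";MEAN$", ";X-BAR", x))


def with_str_replace():
    # regex=True is passed explicitly because the default differs
    # across pandas versions.
    return col.str.replace(";MEAN$", ";X-BAR", regex=True)


# Both approaches should produce the same values.
assert with_apply().tolist() == with_str_replace().tolist()

for fn in (with_apply, with_str_replace):
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {seconds:.3f} s")
```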

Answered by Colonel Beauvel

You can use the regex module re, for example:

import pandas as pd
import re

df = pd.DataFrame({"row_parameter_name":['abcd;MEAN','Dogg11;MEAN',';MEAN']})

Out[126]:
  row_parameter_name
0          abcd;MEAN
1        Dogg11;MEAN
2              ;MEAN 

df['row_parameter_name'] = df['row_parameter_name'].apply(lambda x: re.sub(';MEAN$',';X-BAR',x))

In [128]: df
Out[128]:
  row_parameter_name
0         abcd;X-BAR
1       Dogg11;X-BAR
2             ;X-BAR
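
One practical caveat with this approach: re.sub expects a string, so the apply call raises a TypeError if the column contains missing values. A minimal sketch of the failure and one possible workaround, assuming a made-up column with a single missing entry:

```python
import re

import pandas as pd

# Made-up data with a missing value in the column.
s = pd.Series(["abcd;MEAN", None, ";MEAN"])

# apply + re.sub fails when it reaches the missing value.
try:
    s.apply(lambda x: re.sub(";MEAN$", ";X-BAR", x))
    apply_failed = False
except TypeError:
    apply_failed = True

# One workaround: call re.sub only on actual strings and pass
# everything else through unchanged.
safe = s.apply(
    lambda x: re.sub(";MEAN$", ";X-BAR", x) if isinstance(x, str) else x
)

print(apply_failed)
print(safe.tolist())
```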

Answered by Alnilam

You do not have to use re as in the example that was marked correct above. It may have been necessary at one point in time, but it is no longer the best answer to this.

Nor do you need to use str.contains() first.

Instead, just use .str.replace() with the appropriate match and replacement.

In [2]: df = pd.DataFrame({"row_parameter_name":['abcd;MEAN','Nothing;NICE','Dogg11;MEAN',';MEAN','MEANY-MEANY;MEAN']})

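In [3]: df
Out[3]:
  row_parameter_name
0          abcd;MEAN
1       Nothing;NICE
2        Dogg11;MEAN
3              ;MEAN
4   MEANY-MEANY;MEAN

# regex=True is passed explicitly: as of pandas 2.0 the default is regex=False,
# which would treat "MEAN$" as literal text and replace nothing.
In [4]: df.row_parameter_name.str.replace("MEAN$", "X-BAR", regex=True)
Out[4]:
0           abcd;X-BAR
1         Nothing;NICE
2         Dogg11;X-BAR
3               ;X-BAR
4    MEANY-MEANY;X-BAR
Name: row_parameter_name, dtype: object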
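
Whether the pattern is treated as a regular expression depends on the regex argument of Series.str.replace. Its default has changed between pandas releases (it is False as of pandas 2.0), so passing it explicitly is the safer choice. A small sketch of the difference, using two of the sample values above:

```python
import pandas as pd

s = pd.Series(["abcd;MEAN", "MEANY-MEANY;MEAN"])

# regex=True: '$' anchors the match, so only the trailing MEAN is replaced.
as_regex = s.str.replace("MEAN$", "X-BAR", regex=True)

# regex=False: the pattern is the literal text 'MEAN$', which never occurs
# in the data, so nothing is replaced.
as_literal = s.str.replace("MEAN$", "X-BAR", regex=False)

print(as_regex.tolist())    # ['abcd;X-BAR', 'MEANY-MEANY;X-BAR']
print(as_literal.tolist())  # ['abcd;MEAN', 'MEANY-MEANY;MEAN']
```

Note that the anchored regex leaves the earlier MEAN occurrences in "MEANY-MEANY" untouched, which a plain literal replace of "MEAN" would not.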