Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/22520932/

Time: 2020-08-19 01:05:50  Source: igfitidea

Python, remove all non-alphabet chars from string

python regex

Asked by KDecker

I am writing a Python MapReduce word count program. The problem is that the data has many non-alphabet chars strewn about. I found the post "Stripping everything but alphanumeric chars from a string in Python", which shows a nice solution using regex, but I am not sure how to implement it:

def mapfn(k, v):
    print v
    import re, string 
    pattern = re.compile('[\W_]+')
    v = pattern.match(v)
    print v
    for w in v.split():
        yield w, 1

I'm afraid I am not sure how to use the re library, or even regex for that matter. I am not sure how to properly apply the regex pattern to the incoming string v (a line of a book) to get back a new line without any non-alphanumeric chars.

Suggestions?

Accepted answer by limasxgoesto0

Use re.sub:

import re

regex = re.compile('[^a-zA-Z]')
#First parameter is the replacement, second parameter is your input string
regex.sub('', 'ab3d*E')
#Out: 'abdE'

Alternatively, if you only want to remove a certain set of characters (an apostrophe, for instance, might be okay in your input):

regex = re.compile(r'[,.!?]')  # etc.
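
Applied to the mapper from the question, this might look like the sketch below. It assumes Python 3 syntax and a mapper that receives one line of text as v; the name NON_LETTERS is made up for illustration. The original attempt fails because pattern.match(v) returns a match object (or None), not a cleaned string. The sketch also replaces each run of non-letters with a space rather than an empty string, on the assumption that word boundaries should survive for a word count:

```python
import re

# Hypothetical module-level pattern: one or more characters that are not ASCII letters
NON_LETTERS = re.compile(r'[^a-zA-Z]+')

def mapfn(k, v):
    # Replace every run of non-letters with a single space,
    # so "Hello,world" still splits into two words
    cleaned = NON_LETTERS.sub(' ', v)
    for w in cleaned.split():
        yield w, 1

print(list(mapfn(0, "Hello, world! It's 2020.")))
# [('Hello', 1), ('world', 1), ('It', 1), ('s', 1)]
```

Note that the apostrophe splits "It's" into two tokens here; if that is not wanted, the narrower character set above is the alternative.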

Answer by Kevin

You can use the re.sub() function to remove these characters:

>>> import re
>>> re.sub("[^a-zA-Z]+", "", "ABC12abc345def")
'ABCabcdef'

re.sub(MATCH PATTERN, REPLACE STRING, STRING TO SEARCH)

  • "[^a-zA-Z]+" - look for any run of one or more characters that are NOT in a-z or A-Z.
  • "" - replace the matched characters with "" (the empty string), which deletes them.
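
As a small illustration of what the + quantifier changes, here is a sketch using the same input string. When the replacement is the empty string the result is the same with or without +; the difference only shows up with a non-empty replacement such as a space. The variable names are chosen for this example only:

```python
import re

text = "ABC12abc345def"

# Deleting: matching one character at a time or whole runs gives the same result
deleted_single = re.sub(r"[^a-zA-Z]", "", text)
deleted_runs = re.sub(r"[^a-zA-Z]+", "", text)

# Replacing with a space: each matched character vs. each matched run
spaced_single = re.sub(r"[^a-zA-Z]", " ", text)
spaced_runs = re.sub(r"[^a-zA-Z]+", " ", text)

print(deleted_single)       # ABCabcdef
print(deleted_runs)         # ABCabcdef
print(repr(spaced_single))  # 'ABC  abc   def'
print(repr(spaced_runs))    # 'ABC abc def'
```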

Answer by Don

Try:

s = ''.join(filter(str.isalnum, s))

This takes every char from the string, keeps only the alphanumeric ones (letters and digits), and builds a string back from them. To drop digits as well, use str.isalpha instead of str.isalnum.
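
A minimal sketch of the difference between the two predicates, reusing the 'ab3d*E' input from the accepted answer (Python 3 assumed; the variable names are illustrative):

```python
s = "ab3d*E"

# str.isalnum keeps letters and digits
alnum_only = ''.join(filter(str.isalnum, s))

# str.isalpha keeps letters only, which matches the question's title
alpha_only = ''.join(filter(str.isalpha, s))

print(alnum_only)  # ab3dE
print(alpha_only)  # abdE
```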

Answer by Tad

If you prefer not to use regex, you might try:

''.join([i for i in s if i.isalpha()])
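
One caveat worth checking against your data: in Python 3, str.isalpha is Unicode-aware, so it keeps accented letters that the [^a-zA-Z] pattern would remove. The sketch below assumes Python 3.7+ (for str.isascii) and uses a made-up sample string:

```python
import re

s = "Caf\u00e9 42!"  # "Café 42!" with a precomposed e-acute

# Unicode-aware: the accented letter counts as alphabetic
unicode_letters = ''.join(c for c in s if c.isalpha())

# ASCII-only, via the regex from the other answers
ascii_regex = re.sub(r'[^a-zA-Z]', '', s)

# ASCII-only without regex
ascii_plain = ''.join(c for c in s if c.isascii() and c.isalpha())

print(unicode_letters)  # Café
print(ascii_regex)      # Caf
print(ascii_plain)      # Caf
```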

Answer by PirateApp

In the timings below, the regex approach is the fastest of the three:

import timeit

# Try with regex first
t0 = timeit.timeit("""
s = r2.sub('', st)

""", setup = """
import re
r2 = re.compile(r'[^a-zA-Z0-9]', re.MULTILINE)
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)

#Try with join method on filter
t0 = timeit.timeit("""
s = ''.join(filter(str.isalnum, st))

""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""",
number = 1000000)
print(t0)

#Try with only join
t0 = timeit.timeit("""
s = ''.join(c for c in st if c.isalnum())

""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)


Results (total seconds for 1,000,000 runs each):

Method 1, regex:          2.6002226710006653
Method 2, filter + join:  5.739747313000407
Method 3, join:           6.540099570000166
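
Before comparing speeds, it can help to confirm that the three approaches actually return the same string for the benchmark input. The sketch below is one way to do that and then time them; the dictionary layout and the reduced run count are assumptions made for this example, and the timings will vary by machine and Python version:

```python
import re
import timeit

st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
r2 = re.compile(r'[^a-zA-Z0-9]')

# The three approaches from the benchmark, wrapped as zero-argument callables
methods = {
    'regex': lambda: r2.sub('', st),
    'filter + join': lambda: ''.join(filter(str.isalnum, st)),
    'join': lambda: ''.join(c for c in st if c.isalnum()),
}

# Sanity check: every method must produce the same cleaned string
results = {name: fn() for name, fn in methods.items()}
assert len(set(results.values())) == 1, results

# Fewer runs than the original benchmark, to keep the example quick
timings = {name: timeit.timeit(fn, number=10000) for name, fn in methods.items()}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f'{name}: {seconds:.4f}s')
```

The outputs agree here only because the input is pure ASCII; with non-ASCII letters, str.isalnum and the [^a-zA-Z0-9] pattern would diverge.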