Python, remove all non-alphabet chars from string
Source: http://stackoverflow.com/questions/22520932/
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the source URL, and attribute it to the original authors on Stack Overflow.
Asked by KDecker
I am writing a Python MapReduce word count program. The problem is that there are many non-alphabet chars strewn about in the data. I found the post "Stripping everything but alphanumeric chars from a string in Python", which shows a nice solution using regex, but I am not sure how to implement it:
def mapfn(k, v):
    print v
    import re, string
    pattern = re.compile('[\W_]+')
    v = pattern.match(v)
    print v
    for w in v.split():
        yield w, 1
I'm afraid I am not sure how to use the re library, or even regex for that matter. I am not sure how to apply the regex pattern to the incoming string v (a line of a book) so that I get back the line without any non-alphanumeric chars.
Suggestions?
Accepted answer by limasxgoesto0
Use re.sub:
import re
regex = re.compile('[^a-zA-Z]')
#First parameter is the replacement, second parameter is your input string
regex.sub('', 'ab3d*E')
#Out: 'abdE'
Alternatively, if you only want to remove a certain set of characters (an apostrophe might be okay in your input, for example):
regex = re.compile(r'[,.!?]')  # etc.
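Applied to the word count in the question, this could look like the sketch below. It is an adaptation, not the asker's final code, and it assumes Python 3. One detail matters here: the class [^a-zA-Z] also matches spaces, so replacing with '' would glue neighbouring words together. The sketch therefore replaces each run of non-letters with a single space before splitting. The name NON_LETTERS is made up for illustration.

```python
import re

# Hypothetical module-level pattern: one or more characters that are not ASCII letters.
NON_LETTERS = re.compile(r'[^a-zA-Z]+')

def mapfn(k, v):
    # Replace each run of non-letters with a space so word boundaries survive,
    # then split on whitespace and emit (word, 1) pairs.
    cleaned = NON_LETTERS.sub(' ', v)
    for w in cleaned.split():
        yield w, 1

print(list(mapfn(None, "Hello, world! It's 2014.")))
# [('Hello', 1), ('world', 1), ('It', 1), ('s', 1)]
```

Note that the apostrophe splits "It's" into two tokens; whether that is acceptable depends on the data.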
Answer by Kevin
You can use the re.sub() function to remove these characters:
>>> import re
>>> re.sub("[^a-zA-Z]+", "", "ABC12abc345def")
'ABCabcdef'
re.sub(MATCH PATTERN, REPLACE STRING, STRING TO SEARCH)
"[^a-zA-Z]+" - look for any group of characters that are NOT a-z or A-Z.
"" - replace the matched characters with "" (the empty string).
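To see which "groups" the pattern actually picks up, a small sketch with re.findall on the same input may help. The variable names are only illustrative.

```python
import re

text = "ABC12abc345def"

# Each match is a maximal run of characters outside a-z and A-Z.
runs = re.findall(r"[^a-zA-Z]+", text)
print(runs)    # ['12', '345']

# re.sub replaces every such run with the empty string.
result = re.sub(r"[^a-zA-Z]+", "", text)
print(result)  # ABCabcdef
```

Dropping the + gives the same final string; it only changes whether each run is replaced in one step or character by character.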
Answer by Don
Try:
s = ''.join(filter(str.isalnum, s))
This will take every char from the string, keep only alphanumeric ones and build a string back from them.
Answer by Tad
If you prefer not to use regex, you might try:
''.join([i for i in s if i.isalpha()])
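The two non-regex answers are not equivalent: str.isalnum keeps digits, while str.isalpha does not, and in Python 3 both also accept non-ASCII letters such as "é". The sketch below compares them on one sample string. The third variant, which uses str.isascii (Python 3.7+), is an added assumption for the case where only a-z and A-Z should survive.

```python
s = "ab3d*E \u00e9_9"  # sample input containing a digit, symbols, and a non-ASCII letter

# filter + str.isalnum: keeps letters and digits, including non-ASCII letters.
alnum_only = ''.join(filter(str.isalnum, s))

# str.isalpha: keeps letters only, still including non-ASCII letters.
alpha_only = ''.join(c for c in s if c.isalpha())

# Assumed extra restriction: ASCII letters only, matching the regex [^a-zA-Z].
ascii_alpha = ''.join(c for c in s if c.isascii() and c.isalpha())

print(alnum_only)   # ab3dEé9
print(alpha_only)   # abdEé
print(ascii_alpha)  # abdE
```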
Answer by PirateApp
In this benchmark, the compiled regex was the fastest of the three methods:
import timeit

#Try with regex first
t0 = timeit.timeit("""
s = r2.sub('', st)
""", setup = """
import re
r2 = re.compile(r'[^a-zA-Z0-9]', re.MULTILINE)
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)
#Try with join method on filter
t0 = timeit.timeit("""
s = ''.join(filter(str.isalnum, st))
""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""",
number = 1000000)
print(t0)
#Try with only join
t0 = timeit.timeit("""
s = ''.join(c for c in st if c.isalnum())
""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)
Results (total seconds for 1,000,000 runs each):
Method 1, regex: 2.6002226710006653
Method 2, filter + join: 5.739747313000407
Method 3, join: 6.540099570000166
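For anyone who wants to rerun the comparison, here is a compact sketch of the same three methods as plain functions. It first checks that they all produce the same output on the test string, then times each one. The function names and the smaller run count of 10,000 are choices made for this sketch, and absolute timings will vary by machine and Python version.

```python
import re
import timeit

st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
r2 = re.compile(r'[^a-zA-Z0-9]')

def with_regex(s):
    return r2.sub('', s)

def with_filter_join(s):
    return ''.join(filter(str.isalnum, s))

def with_join(s):
    return ''.join(c for c in s if c.isalnum())

candidates = {
    'regex': with_regex,
    'filter + join': with_filter_join,
    'join': with_join,
}

# Sanity check: on this ASCII-only input, all three methods should agree.
outputs = {name: fn(st) for name, fn in candidates.items()}
assert len(set(outputs.values())) == 1

for name, fn in candidates.items():
    elapsed = timeit.timeit(lambda: fn(st), number=10000)
    print(f"{name}: {elapsed:.4f} s")
```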

