Python: remove all non-alphabet characters from a string

Question

Asked by KDecker

I am writing a Python MapReduce word count program. The problem is that there are many non-alphabet characters strewn about in the data. I found the post "Stripping everything but alphanumeric chars from a string in Python", which shows a nice solution using regex, but I am not sure how to implement it:

def mapfn(k, v):
    print v
    import re, string 
    pattern = re.compile('[\W_]+')
    v = pattern.match(v)
    print v
    for w in v.split():
        yield w, 1

I'm afraid I am not sure how to use the re library, or even regex for that matter. I am not sure how to properly apply the regex pattern to the incoming string v (a line of a book) to retrieve a new line without any non-alphanumeric characters.

Suggestions?

Answer 1

Accepted answer by limasxgoesto0

Use re.sub:

import re

regex = re.compile('[^a-zA-Z]')
#First parameter is the replacement, second parameter is your input string
regex.sub('', 'ab3d*E')
#Out: 'abdE'

Alternatively, if you only want to remove a certain set of characters (an apostrophe might be fine in your input, for example), list just those characters:

# Raw string; inside a character class the dot needs no escaping
regex = re.compile(r'[,.!?]')  # etc.
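Applied to the mapper from the question, this might look like the sketch below. It assumes Python 3 and that v is one line of text. The original attempt fails because pattern.match(v) returns a match object (or None), not a cleaned string. Replacing each run of non-letters with a space rather than the empty string is an assumption made here, so that words separated only by punctuation are not fused together; the name NON_LETTERS is illustrative.

```python
import re

# Matches one or more consecutive characters that are not ASCII letters.
NON_LETTERS = re.compile(r'[^a-zA-Z]+')

def mapfn(k, v):
    # Replace each run of non-letters with a single space so that
    # "end.Start" becomes "end Start" rather than "endStart".
    cleaned = NON_LETTERS.sub(' ', v)
    for w in cleaned.split():
        yield w, 1

print(list(mapfn(None, "It's 3 o'clock, really!")))
# [('It', 1), ('s', 1), ('o', 1), ('clock', 1), ('really', 1)]
```

Note that apostrophes split contractions into separate tokens here, which is the case where removing only a chosen set of characters may be the better fit.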

Answer 2

Answer by Kevin

You can use the re.sub() function to remove these characters:

>>> import re
>>> re.sub("[^a-zA-Z]+", "", "ABC12abc345def")
'ABCabcdef'

re.sub(MATCH PATTERN, REPLACE STRING, STRING TO SEARCH)

"[^a-zA-Z]+" - matches any run of one or more characters that are NOT in a-z or A-Z.
"" - replaces each matched run with the empty string.

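To see what the trailing "+" changes, a small sketch using re.subn, which returns the new string together with the number of substitutions made. The variable names are illustrative.

```python
import re

text = "ABC12abc345def"

# With "+", each run of non-letters is replaced in one substitution.
with_plus = re.subn("[^a-zA-Z]+", "", text)
print(with_plus)     # ('ABCabcdef', 2)

# Without "+", every non-letter is replaced individually.
without_plus = re.subn("[^a-zA-Z]", "", text)
print(without_plus)  # ('ABCabcdef', 5)
```

The resulting string is the same either way when the replacement is empty; the difference matters when the replacement is something like a space, where "+" collapses each run into a single space.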
Answer 3

Answer by Don

Try:

s = ''.join(filter(str.isalnum, s))

This takes every character from the string, keeps only the alphanumeric ones (letters and digits), and builds a new string from them.
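Since the question asks for letters only, it may help to compare str.isalnum with str.isalpha on the same input. A minimal sketch, assuming Python 3; the sample string is borrowed from the accepted answer.

```python
s = 'ab3d*E'

# str.isalnum keeps letters and digits.
alnum_only = ''.join(filter(str.isalnum, s))
print(alnum_only)  # ab3dE

# str.isalpha keeps letters only.
alpha_only = ''.join(filter(str.isalpha, s))
print(alpha_only)  # abdE
```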

Answer 4

Answer by Tad

If you prefer not to use regex, you might try:

''.join([i for i in s if i.isalpha()])
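One difference from the regex answers is worth keeping in mind: in Python 3, str.isalpha is Unicode-aware, so accented letters are kept, whereas the class [^a-zA-Z] removes everything outside the ASCII letters. A small sketch of the difference; the sample string and variable names are illustrative, and str.isascii assumes Python 3.7 or later.

```python
import re

s = 'na\u00efve caf\u00e9 42'  # "naïve café 42"

# isalpha() accepts any Unicode letter, so the accented letters survive.
unicode_letters = ''.join([c for c in s if c.isalpha()])

# The ASCII-only character class strips the accented letters.
ascii_regex = re.sub('[^a-zA-Z]', '', s)

# Combining isascii() with isalpha() reproduces the ASCII-only behaviour without regex.
ascii_isalpha = ''.join([c for c in s if c.isascii() and c.isalpha()])

print(unicode_letters)  # naïvecafé
print(ascii_regex)      # navecaf
print(ascii_isalpha)    # navecaf
```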

Answer 5

Answer by PirateApp

In the timings below, regex was the fastest of the three methods tested:

import timeit

#Try with regex first
t0 = timeit.timeit("""
s = r2.sub('', st)

""", setup = """
import re
r2 = re.compile(r'[^a-zA-Z0-9]', re.MULTILINE)
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)

#Try with join method on filter
t0 = timeit.timeit("""
s = ''.join(filter(str.isalnum, st))

""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""",
number = 1000000)
print(t0)

#Try with only join
t0 = timeit.timeit("""
s = ''.join(c for c in st if c.isalnum())

""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)


Results (seconds per 1,000,000 runs):

2.6002226710006653 Method 1 Regex
5.739747313000407 Method 2 Filter + Join
6.540099570000166 Method 3 Join
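These numbers depend on the machine, the Python version, and the input string, so it is worth re-measuring locally. A hedged sketch of one way to do that: it wraps the same three approaches in functions (the function names are illustrative), first checks that they agree on the test string, then takes the best of several timeit.repeat runs. The run counts are kept small so it finishes quickly and are not the ones used above.

```python
import re
import timeit

st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
r2 = re.compile(r'[^a-zA-Z0-9]')

def with_regex():
    return r2.sub('', st)

def with_filter():
    return ''.join(filter(str.isalnum, st))

def with_genexpr():
    return ''.join(c for c in st if c.isalnum())

candidates = [with_regex, with_filter, with_genexpr]

# Sanity check: all three must produce the same output on this input.
results = {f.__name__: f() for f in candidates}
assert len(set(results.values())) == 1

for f in candidates:
    # The minimum over several repeats is the least noisy estimate.
    best = min(timeit.repeat(f, number=10_000, repeat=3))
    print(f"{f.__name__}: {best:.4f} s per 10,000 calls")
```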
