Original source: http://stackoverflow.com/questions/43358857/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
How to remove special characters except space from a file in python?
Asked by pythonlearn
I have a huge corpus of text (line by line) and I want to remove special characters while preserving the spaces and the structure of each string.
hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.
should be:
hello there A Z R T world welcome to python
this should be the next line followed by another million like this
Answered by Chiheb Nexus
You can also use this pattern with regex:
import re
a = '''hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.'''
for k in a.split("\n"):
    print(re.sub(r"[^a-zA-Z0-9]+", ' ', k))
    # Or:
    # final = " ".join(re.findall(r"[a-zA-Z0-9]+", k))
    # print(final)
Output:
hello there A Z R T world welcome to python
this should the next line followed by an other million like this
Edit:
Otherwise, you can store the final lines in a list:
final = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for k in a.split("\n")]
print(final)
Output:
['hello there A Z R T world welcome to python ', 'this should the next line followed by an other million like this ']
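Each string in the list above ends with a space because each input line ends with a special character. A minimal sketch of one way to handle that while working line by line, as the question describes. The helper names clean_line and clean_lines are hypothetical, and the call to strip() is an addition that is not part of the original answer:

```python
import re

# Runs of anything other than ASCII letters and digits.
SPECIAL = re.compile(r"[^a-zA-Z0-9]+")

def clean_line(line):
    """Replace each run of special characters with one space, then trim the ends."""
    return SPECIAL.sub(" ", line).strip()

def clean_lines(lines):
    """Lazily clean an iterable of lines, such as an open file object."""
    for line in lines:
        yield clean_line(line)

# Stand-in for the lines of a file; each one keeps its trailing newline.
sample = [
    "hello? there A-Z-R_T(,**), world, welcome to python.\n",
    "this **should? the next line#followed- by@ an#other %million^ %%like $this.\n",
]
cleaned = list(clean_lines(sample))
for line in cleaned:
    print(line)
```

Because clean_lines is a generator, passing it an open file object should process a large corpus one line at a time instead of loading it all into memory.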
Answered by Eliethesaiyan
I think nfn neil's answer is great, but I would just add a simple regex that removes all non-word characters. Note, however, that it treats the underscore as part of a word.
print(re.sub(r'\W+', ' ', string))
>>> hello there A Z R_T world welcome to python
Answered by ssp4all
A more elegant solution would be
print(re.sub(r"\W+|_", " ", string))
>>> hello there A Z R T world welcome to python this should the next line followed by an other million like this
Here,
re is the regex module in Python
re.sub substitutes every match of the pattern with a space, i.e., " "
r'' marks the pattern as a raw string literal, so the backslash in \W is passed to the regex engine as written
\W matches any non-word character, i.e. special characters such as *&^%$, but not the underscore _, which counts as a word character
+ matches one or more repetitions (unlike *, which matches zero or more)
| is logical OR
_ stands for a literal underscore
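The pieces listed above can be checked against part of the question's first line. This is only a sketch: the variable names are made up for illustration, and the character class [\W_]+ is a variant that does not appear in the answer, shown because it collapses a mixed run such as "-_" into a single space.

```python
import re

text = "hello? there A-Z-R_T(,**), world"

# \W+ alone leaves the underscore in place, because "_" is a word character.
non_word_only = re.sub(r"\W+", " ", text)

# The answer's pattern: a run of non-word characters OR a single underscore.
with_alternation = re.sub(r"\W+|_", " ", text)

# Variant (assumption, not from the answer): one class covering both cases.
with_class = re.sub(r"[\W_]+", " ", text)

print(non_word_only)
print(with_alternation)
print(with_class)

# The two patterns differ when an underscore sits next to other special characters.
adjacent_alternation = re.sub(r"\W+|_", " ", "a-_b")
adjacent_class = re.sub(r"[\W_]+", " ", "a-_b")
print(repr(adjacent_alternation))
print(repr(adjacent_class))
```

On this sample the alternation and the character class give the same result; they only diverge on inputs like "a-_b", where the alternation produces two spaces and the class produces one.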
Answered by wwii
Create a dictionary mapping special characters to None:
d = {c:None for c in special_characters}
Make a translation table using the dictionary. Read the entire text into a variable and use str.translate on the entire text.
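A minimal sketch of this approach in Python 3, assuming string.punctuation stands in for special_characters, which the answer does not define. str.maketrans converts the dictionary's one-character keys into the code points that str.translate expects. Note that deleting characters, rather than replacing them with a space, joins the words on either side, so a second table that maps to spaces is shown for comparison.

```python
import string

# Assumption: treat ASCII punctuation as the "special characters".
special_characters = string.punctuation

# Map every special character to None, which means "delete it".
d = {c: None for c in special_characters}
table = str.maketrans(d)

text = "hello? there A-Z-R_T(,**), world, welcome to python."
deleted = text.translate(table)
print(deleted)

# Mapping to a space instead keeps word boundaries but leaves repeated spaces,
# which split() and join() then collapse.
space_table = str.maketrans({c: " " for c in special_characters})
spaced = " ".join(text.translate(space_table).split())
print(spaced)
```

The second form matches the output the question asks for, while the first turns A-Z-R_T into AZRT.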