How to remove special characters except space from a file in python?
Original question: http://stackoverflow.com/questions/43358857/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by pythonlearn
I have a huge corpus of text (line by line) and I want to remove special characters but preserve the spaces and the structure of each string.
hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.
should be
hello there A Z R T world welcome to python
this should be the next line followed by another million like this
Answer by Chiheb Nexus
You can also do this with a regex pattern:
import re
a = '''hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.'''
for k in a.split("\n"):
    print(re.sub(r"[^a-zA-Z0-9]+", ' ', k))
    # Or:
    # final = " ".join(re.findall(r"[a-zA-Z0-9]+", k))
    # print(final)
Output:
hello there A Z R T world welcome to python
this should the next line followed by an other million like this
Edit:
Alternatively, you can store the resulting lines in a list:
final = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for k in a.split("\n")]
print(final)
Output:
['hello there A Z R T world welcome to python ', 'this should the next line followed by an other million like this ']
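Since the question is about a file, the same pattern can be applied line by line while reading. The following is only a minimal sketch: the helper names clean_line and clean_lines and the file names corpus.txt and cleaned.txt are placeholders that do not come from the answer. It also strips the trailing space that the list output above shows.

```python
import re

# Matches any run of characters that are not ASCII letters or digits.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9]+")


def clean_line(line):
    """Replace each run of special characters with one space, then trim the ends."""
    return NON_ALNUM.sub(" ", line).strip()


def clean_lines(lines):
    """Lazily clean any iterable of lines, for example an open file object."""
    for line in lines:
        yield clean_line(line)


# Hypothetical usage; both file names are placeholders:
# with open("corpus.txt", encoding="utf-8") as src, \
#         open("cleaned.txt", "w", encoding="utf-8") as dst:
#     for cleaned in clean_lines(src):
#         dst.write(cleaned + "\n")
```

Because clean_lines is a generator, a huge corpus is processed one line at a time instead of being loaded into memory at once.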
Answer by Eliethesaiyan
I think nfn neil's answer is great, but I would just add a simple regex that removes all non-word characters. Note, however, that it treats the underscore as part of a word.
print(re.sub(r'\W+', ' ', string))
>>> hello there A Z R_T world welcome to python
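To make the underscore caveat concrete, here is a small sketch on sample strings chosen only for illustration. It also shows one related point that the answer does not mention: in Python 3, \W is Unicode-aware for str patterns, so accented letters count as word characters unless the re.ASCII flag is passed.

```python
import re

sample = "A-Z-R_T(,**), world"

# \W matches anything that is not a letter, digit, or underscore,
# so the underscore in "R_T" survives.
kept_underscore = re.sub(r"\W+", " ", sample)
print(kept_underscore)  # A Z R_T world

# In Python 3, str patterns are Unicode-aware by default, so an accented
# letter counts as a word character and is kept.
accented = "caf\u00e9-bar"
unicode_result = re.sub(r"\W+", " ", accented)

# With re.ASCII, only [a-zA-Z0-9_] count as word characters,
# so the accented letter is replaced as well.
ascii_result = re.sub(r"\W+", " ", accented, flags=re.ASCII)
print(ascii_result)  # caf bar
```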
Answer by ssp4all
A more elegant solution would be:
print(re.sub(r"\W+|_", " ", string))
>>> hello there A Z R T world welcome to python this should the next line followed by another million like this
Here,
re is the regex module in Python
re.sub substitutes every match of the pattern with a space, i.e. " "
r'' marks the pattern as a raw string literal, so backslashes (as in \n or \W) are passed to re unchanged
\W matches every non-word character, i.e. all special characters such as *&^%$, but not the underscore _
+ matches one or more repetitions of the preceding item (similar to *, which matches zero or more)
| is logical OR
_ stands for a literal underscore
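One detail of the pattern above is worth checking: because the underscore sits in its own alternative, an underscore next to other special characters is replaced separately and leaves two spaces. A possible variant, offered here as an assumption rather than as part of the answer, puts the underscore inside a character class so that the whole run collapses into one space.

```python
import re

# Sample chosen for illustration: an underscore directly followed by a hyphen.
sample = "snake_-case"

# r"\W+|_" replaces the underscore and the hyphen as two separate matches,
# which leaves two consecutive spaces.
alternation = re.sub(r"\W+|_", " ", sample)

# r"[\W_]+" treats the underscore as one more special character inside
# the same run, so the whole run becomes a single space.
char_class = re.sub(r"[\W_]+", " ", sample)

print(repr(alternation))  # 'snake  case'
print(repr(char_class))   # 'snake case'
```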
Answer by wwii
Create a dictionary mapping special characters to None:
d = {c:None for c in special_characters}
Make a translation table using the dictionary (in Python 3, str.maketrans accepts such a dictionary). Read the entire text into a variable and use str.translate on the entire text.
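The steps above could look roughly like this in Python 3. This is a sketch under one loud assumption: the answer never defines special_characters, so string.punctuation (ASCII punctuation only) stands in for it here. Mapping to None deletes the characters outright, so a second table that maps them to a space is shown for comparison with the output the question asks for.

```python
import string

# Assumption: "special characters" means ASCII punctuation; the answer
# does not say what special_characters contains.
special_characters = string.punctuation

# Mapping a character to None deletes it during translation.
delete_table = str.maketrans({c: None for c in special_characters})

# Mapping it to a space keeps the words apart instead.
space_table = str.maketrans({c: " " for c in special_characters})

text = "hello? there A-Z-R_T(,**), world"

deleted = text.translate(delete_table)
print(deleted)  # hello there AZRT world

# split() and join() collapse the repeated spaces left by the translation.
# Note that split() also swallows newlines, so apply this per line.
spaced = " ".join(text.translate(space_table).split())
print(spaced)  # hello there A Z R T world
```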

