Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/43358857/

Time: 2020-08-19 22:57:36  Source: igfitidea

How to remove special characters except space from a file in python?

python regex string file string-formatting

Asked by pythonlearn

I have a huge corpus of text (one entry per line) and I want to remove the special characters while keeping the spaces and the line structure of the text.

hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.

should be

hello there A Z R T world welcome to python
this should be the next line followed by another million like this

Answered by Chiheb Nexus

You can also do this with a regex pattern:

import re
a = '''hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.'''

for k in a.split("\n"):
    print(re.sub(r"[^a-zA-Z0-9]+", ' ', k))
    # Or:
    # final = " ".join(re.findall(r"[a-zA-Z0-9]+", k))
    # print(final)

Output:

hello there A Z R T world welcome to python 
this should the next line followed by an other million like this 

Edit:

Alternatively, you can store the cleaned lines in a list:

final = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for k in a.split("\n")]
print(final)

Output:

['hello there A Z R T world welcome to python ', 'this should the next line followed by an other million like this ']
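
Since the question is about a file, the same substitution can be applied while streaming the file line by line instead of holding the whole corpus in memory. The following is a minimal sketch, not part of the answer: the helper names `clean_line` and `clean_file` are hypothetical, and it additionally strips the trailing space that the pattern leaves where a line ends in punctuation.

```python
import re

# Anything that is not an ASCII letter or digit counts as "special" here.
SPECIAL = re.compile(r"[^a-zA-Z0-9]+")

def clean_line(line):
    """Replace each run of special characters with one space and trim the ends."""
    return SPECIAL.sub(" ", line).strip()

def clean_file(src_path, dst_path):
    """Stream src_path line by line and write the cleaned lines to dst_path."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(clean_line(line) + "\n")
```

With hypothetical file names, a call would look like `clean_file("corpus.txt", "cleaned.txt")`. Each input line produces exactly one output line, so the line structure of the corpus is kept.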

Answered by Eliethesaiyan

I think nfn neil's answer is great, but I would just add a simple regex that removes all non-word characters. Note, however, that it treats the underscore as part of a word.

print(re.sub(r'\W+', ' ', string))
>>> hello there A Z R_T world welcome to python
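
To make the difference from an explicit `[^a-zA-Z0-9]` class concrete, here is a short comparison. It is only a sketch: the sample string is made up for illustration, and `.strip()` is added to drop the space left at the end.

```python
import re

sample = "snake_case, café!"  # made-up sample text

# \W treats letters, digits and the underscore as word characters; for str
# patterns in Python 3 this includes non-ASCII letters such as "é".
by_word_class = re.sub(r"\W+", " ", sample).strip()

# The explicit ASCII class also replaces the underscore and accented letters.
by_ascii_class = re.sub(r"[^a-zA-Z0-9]+", " ", sample).strip()

print(by_word_class)   # snake_case café
print(by_ascii_class)  # snake case caf
```

Which behaviour is preferable depends on whether the corpus contains non-ASCII text that should survive the cleanup.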

Answered by ssp4all

A more elegant solution would be

print(re.sub(r"\W+|_", " ", string))

>>> hello there A Z R T world welcome to python this should the next line followed by an other million like this

Here, re is the regex module in Python

re.sub will substitute each match of the pattern with a space, i.e. " "

r'' treats the pattern string as raw, so backslashes such as the one in \W are not processed as string escape sequences

\W matches any non-word character, i.e. all special characters such as *&^%$, but not the underscore _

+ matches one or more repetitions of the preceding item (unlike *, which matches zero or more)

| is logical OR

_ stands for a literal underscore
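
One detail of the `\W+|_` alternation is that an underscore sitting next to other special characters is matched separately, so two spaces can appear in a row. The sketch below compares it with a single character class, `[\W_]+`, which is a variation and not the pattern from the answer; the sample string is made up.

```python
import re

text = "R_T and a_(b)"  # made-up sample; "_(" puts an underscore next to punctuation

# Alternation: "_" and "(" are two separate matches, giving two spaces.
alternation = re.sub(r"\W+|_", " ", text)

# One class covering non-word characters and the underscore: "_(" is one run.
single_class = re.sub(r"[\W_]+", " ", text)

print(repr(alternation))   # 'R T and a  b '
print(repr(single_class))  # 'R T and a b '
```

On the question's sample input both patterns give the same result, because no underscore there is adjacent to another special character.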

Answered by wwii

Create a dictionary mapping special characters to None

# special_characters: any iterable of the characters to delete (not defined in this answer)
d = {c: None for c in special_characters}

Make a translation table using the dictionary. Read the entire text into a variable and use str.translate on the entire text.
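
The steps above can be sketched as follows. Two assumptions are made here: `special_characters` is taken to be `string.punctuation` (the answer does not say how it is defined), and a short in-memory `text` stands in for the file contents. Mapping to None deletes the characters, so `A-Z-R_T` collapses to `AZRT`; mapping each character to a space instead gives output closer to what the question asks for.

```python
import string

# Assumption: "special characters" means ASCII punctuation.
special_characters = string.punctuation

text = "hello? there A-Z-R_T(,**), world, welcome to python."

# Mapping to None deletes the characters outright.
delete_table = str.maketrans({c: None for c in special_characters})
deleted = text.translate(delete_table)

# Mapping to a space keeps word boundaries; split/join collapses the runs.
space_table = str.maketrans({c: " " for c in special_characters})
spaced = " ".join(text.translate(space_table).split())

print(deleted)  # hello there AZRT world welcome to python
print(spaced)   # hello there A Z R T world welcome to python
```

Note that `" ".join(... .split())` would also merge line breaks if applied to a whole multi-line text, so for a corpus with one entry per line it would need to be applied to each line separately.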