How to remove special characters except space from a file in Python?

Question

Asked by pythonlearn

I have a huge corpus of text (line by line) and I want to remove special characters while preserving the spaces and the line structure of the text.

hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.

should be

hello there A Z R T world welcome to python
this should be the next line followed by another million like this

Answer 1

Answered by Chiheb Nexus

You can use this pattern, too, with regex:

import re
a = '''hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.'''

for k in a.split("\n"):
    print(re.sub(r"[^a-zA-Z0-9]+", ' ', k))
    # Or:
    # final = " ".join(re.findall(r"[a-zA-Z0-9]+", k))
    # print(final)

Output:

hello there A Z R T world welcome to python 
this should the next line followed by an other million like this

Edit:

Alternatively, you can store the cleaned lines in a list:

final = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for k in a.split("\n")]
print(final)

Output:

['hello there A Z R T world welcome to python ', 'this should the next line followed by an other million like this ']
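
Since the question is about a huge corpus stored in a file, the same substitution can be applied while streaming the file line by line instead of holding the whole text in memory. The following is only a sketch under assumptions: the helper names clean_line and clean_file, the UTF-8 encoding, and the file paths are hypothetical and not part of the original answer. The strip() call also removes the trailing space visible in the output above.

```python
import re

# Precompile once; this is the pattern used in the answer above.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9]+")

def clean_line(line):
    """Replace each run of non-alphanumeric characters with one space."""
    # strip() drops the space left behind by punctuation at either end
    return NON_ALNUM.sub(" ", line).strip()

def clean_file(src_path, dst_path):
    """Stream src_path to dst_path one line at a time (hypothetical helper)."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # One output line per input line, so the line structure is kept
            dst.write(clean_line(line) + "\n")

print(clean_line("hello? there A-Z-R_T(,**), world, welcome to python."))
# hello there A Z R T world welcome to python
```

Iterating over the file object reads one line at a time, which is usually what you want for a corpus that does not fit comfortably in memory.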

Answer 2

Answered by Eliethesaiyan

I think nfn neil's answer is great, but I would just add a simple regex that removes all non-word characters. Note, however, that it treats the underscore as part of a word.

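import re
string = "hello? there A-Z-R_T(,**), world, welcome to python."
print(re.sub(r'\W+', ' ', string))
# prints: hello there A Z R_T world welcome to python (plus a trailing space)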
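
One more difference between \W and an explicit class such as [^a-zA-Z0-9] is worth keeping in mind: in Python 3, \W is Unicode-aware for str patterns, so accented letters count as word characters unless the re.ASCII flag is passed. The sample text below is made up for illustration and is not from the question.

```python
import re

text = "café_au-lait: 3€"  # hypothetical sample, not from the question

# Unicode-aware \W: é and _ both count as word characters and are kept
print(repr(re.sub(r"\W+", " ", text)))                  # 'café_au lait 3 '

# With re.ASCII only [a-zA-Z0-9_] are word characters, so é is replaced
print(repr(re.sub(r"\W+", " ", text, flags=re.ASCII)))  # 'caf _au lait 3 '

# The explicit class from Answer 1 also replaces the underscore
print(repr(re.sub(r"[^a-zA-Z0-9]+", " ", text)))        # 'caf au lait 3 '
```

Which behaviour is right depends on whether non-English letters in the corpus should survive the cleanup.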

Answer 3

Answered by ssp4all

A more elegant solution would be

print(re.sub(r"\W+|_", " ", string))

>>> hello there A Z R T world welcome to python this should the next line followed by an other million like this

Here, re is the regex module in Python.

re.sub substitutes every match of the pattern with a space, i.e. " ".

r'' makes the pattern a raw string literal, so backslashes such as the one in \W are passed to the regex engine unchanged.

\W matches any non-word character, i.e. special characters such as *&^%$, but not the underscore _, which counts as a word character.

+ matches one or more repetitions of the preceding item (unlike *, which matches zero or more).

| is logical OR (alternation).

_ matches a literal underscore.
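
Two details of this pattern may matter in practice. First, \W+|_ replaces every underscore on its own, so a run that contains underscores leaves more than one space behind; the character class [\W_]+ treats underscores and other non-word characters as a single run. Second, a newline is itself a non-word character, so applying the pattern to the whole text joins all lines into one. The sketch below, using the sample text from the question, cleans each line separately to keep the line structure; the variable names are only illustrative.

```python
import re

# \W+|_ matches each underscore separately: two underscores, two spaces
print(repr(re.sub(r"\W+|_", " ", "snake__case")))   # 'snake  case'

# [\W_]+ collapses the whole run into a single space
print(repr(re.sub(r"[\W_]+", " ", "snake__case")))  # 'snake case'

text = '''hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.'''

# Clean line by line so the newlines survive; strip() removes edge spaces
cleaned = "\n".join(
    re.sub(r"[\W_]+", " ", line).strip() for line in text.split("\n")
)
print(cleaned)
# hello there A Z R T world welcome to python
# this should the next line followed by an other million like this
```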

Answer 4

Answered by wwii

Create a dictionary mapping special characters to None:

d = {c:None for c in special_characters}

Make a translation table using the dictionary. Read the entire text into a variable and use str.translate on the entire text.
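
A minimal sketch of this approach could look like the following. The answer does not define special_characters, so taking it to be string.punctuation is an assumption here. Note that mapping to None deletes the characters, which glues neighbouring words together (A-Z-R_T becomes AZRT); the second variant, which maps to a space and then collapses repeated spaces per line, is an addition for comparison and not part of the original answer.

```python
import string

# Assumption: "special characters" means ASCII punctuation
special_characters = string.punctuation

# Map every special character to None, i.e. delete it
d = {c: None for c in special_characters}
table = str.maketrans(d)

text = '''hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.'''

deleted = text.translate(table)
print(deleted)
# hello there AZRT world welcome to python
# this should the next linefollowed by another million like this

# Variant: map to a space instead, then collapse repeated spaces per line
space_table = str.maketrans({c: " " for c in special_characters})
spaced = "\n".join(
    " ".join(line.translate(space_table).split()) for line in text.split("\n")
)
print(spaced)
# hello there A Z R T world welcome to python
# this should the next line followed by an other million like this
```

str.translate makes a single pass over the text with no regex engine involved, which can be attractive for a large corpus.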
