Strip punctuation with regex - python
Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/18429143/
Asked by user2696287
I need to use regex to strip punctuation at the start and end of a word. It seems like regex would be the best option for this. I don't want punctuation removed from words like 'you're', which is why I'm not using .replace().
Accepted answer by falsetru
You don't need a regular expression for this task. Use str.strip with string.punctuation:
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> '!Hello.'.strip(string.punctuation)
'Hello'
>>> ' '.join(word.strip(string.punctuation) for word in "Hello, world. I'm a boy, you're a girl.".split())
"Hello world I'm a boy you're a girl"
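If a regular expression is still wanted, as the question asks, the same edge-only stripping can be sketched with the re module. This is an illustration rather than part of the accepted answer: strip_edge_punct and EDGE_PUNCT are hypothetical names, and the pattern assumes that "punctuation" means the ASCII characters in string.punctuation.

```python
import re
import string

# Escape the punctuation characters so they are safe inside a character class.
_PUNCT = re.escape(string.punctuation)

# Match a run of punctuation at the very start OR at the very end of the word.
EDGE_PUNCT = re.compile(rf"^[{_PUNCT}]+|[{_PUNCT}]+$")

def strip_edge_punct(word):
    """Remove leading and trailing punctuation, leaving inner characters alone."""
    return EDGE_PUNCT.sub("", word)

sentence = "Hello, world. I'm a boy, you're a girl."
print(" ".join(strip_edge_punct(w) for w in sentence.split()))
# Hello world I'm a boy you're a girl
```

Because the pattern is anchored with ^ and $, the apostrophe inside "you're" is never touched. For this particular job str.strip does the same thing with less machinery.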
Answer by rahul ranjan
You can remove punctuation from a text file or from a particular string as follows:
new_data = []
with open('/home/rahul/align.txt', 'r') as f:
    all_words = f.read().split()

# You can add or remove punctuation characters as you see fit.
# The raw string keeps the backslash literal.
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

# To clean a single string instead of a file, build all_words from it:
# all_words = "My name$#@ is . rahul -~".split()
for word in all_words:
    # Drop every punctuation character from the word
    cleaned = ''.join(ch for ch in word if ch not in punctuations)
    # Skip words that consisted only of punctuation
    if cleaned:
        new_data.append(cleaned)
# Display the unpunctuated words
print(new_data)
P.S. Adjust the indentation as required. Hope this helps!
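For the single-string case mentioned in the comments, one possible sketch uses str.translate instead of a loop. This swaps in a different technique from the answer above: unpunctuate and PUNCTUATIONS are hypothetical names, and the character set is copied from the answer on the assumption that it is the set you want removed.

```python
# Same punctuation set as in the answer above; edit it as needed.
PUNCTUATIONS = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

# Translation table that maps every listed character to None (deletes it).
_DELETE_PUNCT = str.maketrans('', '', PUNCTUATIONS)

def unpunctuate(text):
    """Delete the listed characters, then collapse the leftover whitespace."""
    return ' '.join(text.translate(_DELETE_PUNCT).split())

print(unpunctuate("My name$#@ is . rahul -~"))
# My name is rahul
```

Like the loop version, this removes punctuation everywhere, including the apostrophe in words such as "you're".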
Answer by Shalini Baranwal
I think this function will be helpful and concise in removing punctuation:
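import re

def remove_punct(text):
    # text is expected to be an iterable of words (for example, a list of tokens)
    new_words = []
    for word in text:
        # Remove everything except word characters and whitespace
        w = re.sub(r'[^\w\s]', '', word)
        # \w also matches the underscore, so remove it separately
        w = re.sub(r'_', '', w)
        new_words.append(w)
    return new_words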
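To see how this approach behaves, the sketch below folds the two substitutions into a single pattern and runs it on a few tokens. The name remove_punct_single_pass is a hypothetical helper, not part of the answer above.

```python
import re

# Matches any character that is neither a word character nor whitespace,
# plus the underscore (which \w would otherwise keep).
_PUNCT_OR_UNDERSCORE = re.compile(r'[^\w\s]|_')

def remove_punct_single_pass(words):
    """Return a new list with all punctuation removed from each word."""
    return [_PUNCT_OR_UNDERSCORE.sub('', word) for word in words]

print(remove_punct_single_pass(["Hello,", "snake_case", "you're"]))
# ['Hello', 'snakecase', 'youre']
```

Note that punctuation is removed everywhere in a word, so "you're" becomes "youre". That differs from what the question asks for, which is stripping only at the start and end.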