Python regex to find all sentences in a text?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/3549075/

Date: 2020-08-18 11:41:07  Source: igfitidea

Regex to find all sentences of text?

python regex

Asked by sarevok

I have been trying to teach myself regexes in Python, and I decided to print out all the sentences of a text. I have been tinkering with the regular expressions for the past 3 hours to no avail.

I just tried the following but couldn't do anything.

import re

# The pattern is left exactly as in the question; it is the problem being asked about.
with open('anan.txt') as p:
    process = p.read()
regexMatch = re.findall(r'^[A-Z].+\s+[.!?]$', process, re.I)
print(regexMatch)

My input file is like this:

OMG is this a question ! Is this a sentence ? My.
name is.

This prints no output. But when I remove "My. name is.", it prints "OMG is this a question" and "Is this a sentence" together, as if it only reads the first line.

What is the best regex for finding all sentences in a text file, even when a sentence continues onto a new line, while still reading the entire text? Thanks.

Accepted answer by Jochen Ritzel

Something like this works:

## pattern: an uppercase letter, then anything that is not in (.!?), then one of them
>>> import re
>>> pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)
>>> pat.findall('OMG is this a question ! Is this a sentence ? My. name is.')
['OMG is this a question !', 'Is this a sentence ?', 'My.']

Notice how "name is." is not in the result, because it does not start with an uppercase letter.

Your problem comes from the use of the ^ and $ anchors: without re.M they apply to the whole text, not to each sentence.
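
To make the effect of the anchors concrete, here is a minimal sketch that runs the question's anchored pattern and this answer's unanchored pattern on the same input. The `text` string is an assumption that mirrors the question's two-line input file.

```python
import re

# Assumed sample: the question's input file, with the line break after "My."
text = "OMG is this a question ! Is this a sentence ? My.\nname is."

# The question's pattern: ^ and $ pin the match to the start and end of the text.
anchored = re.findall(r'^[A-Z].+\s+[.!?]$', text, re.I)

# This answer's pattern, with no anchors.
unanchored = re.findall(r'[A-Z][^.!?]*[.!?]', text)

print(anchored)    # []
print(unanchored)  # ['OMG is this a question !', 'Is this a sentence ?', 'My.']
```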

Answer by Arslan

I tried this in Notepad++ and came up with:

.*$

And activate the multiline option:

re.MULTILINE
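
A minimal sketch of what this suggestion does when run in Python rather than Notepad++. The `text` string is an assumption based on the question's input file. With `re.MULTILINE`, `$` matches at the end of every line, so the pattern returns whole lines rather than sentences. `findall` also reports empty matches, which are filtered out here.

```python
import re

# Assumed sample: the question's two-line input.
text = "OMG is this a question ! Is this a sentence ? My.\nname is."

# .*$ with re.MULTILINE matches up to the end of each line.
# Empty matches are dropped so that only the line contents remain.
lines = [m for m in re.findall(r'.*$', text, re.MULTILINE) if m]

print(lines)
# ['OMG is this a question ! Is this a sentence ? My.', 'name is.']
```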

Cheers

Answer by Aaron Digulla

Try the other way around: Split the text at sentence boundaries.

lines = re.split(r'\s*[!?.]\s*', text)

If that doesn't work, add a \ before the . in the character class.
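
A minimal runnable sketch of the splitting approach, assuming the question's sample sentence as `text`. Note two consequences of `re.split`: the punctuation is consumed as the delimiter, and a trailing empty string appears when the text ends with a delimiter, so it is filtered out here.

```python
import re

# Assumed sample: the sentence used elsewhere in this thread.
text = "OMG is this a question ! Is this a sentence ? My. name is."

# Split at each terminator, swallowing the whitespace around it.
parts = re.split(r'\s*[!?.]\s*', text)

# Drop the empty string left after the final terminator.
sentences = [s for s in parts if s]

print(sentences)
# ['OMG is this a question', 'Is this a sentence', 'My', 'name is']
```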

Answer by Daniel Vandersluis

There are two issues in your regex:

  1. Your expression is anchored by ^ and $, which are the "start of line" and "end of line" anchors, respectively. That means that your pattern is looking to match an entire line of your text.
  2. You are searching for \s+ before your punctuation character, which specifies one or more whitespace characters. If you don't have whitespace before your punctuation, the expression will not match.
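
Both issues can be sketched in a few lines. The sample strings below are assumptions made up for illustration, and the `\s*` variant is only one possible way to make the whitespace optional.

```python
import re

# Issue 1: with ^ and $, the pattern must cover a whole line (with re.M)
# or the whole text (without it).
two_lines = "First line ?\nSecond line !"
anchored = r'^[A-Z].+\s+[.!?]$'
whole_text = re.findall(anchored, two_lines)
per_line = re.findall(anchored, two_lines, re.M)
print(whole_text)  # []
print(per_line)    # ['First line ?', 'Second line !']

# Issue 2: \s+ demands whitespace before the terminator, \s* makes it optional.
text = "Is this a sentence ? Yes it is."
required_ws = re.findall(r'[A-Z][^.!?]*\s+[.!?]', text)
optional_ws = re.findall(r'[A-Z][^.!?]*\s*[.!?]', text)
print(required_ws)  # ['Is this a sentence ?']
print(optional_ws)  # ['Is this a sentence ?', 'Yes it is.']
```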

Answer by cji

Edited: now it works with multiline sentences too.

>>> import re
>>> t = "OMG is this a question ! Is this a sentence ? My\n name is."
>>> re.findall(r"[A-Z].*?[\.!?]", t, re.MULTILINE | re.DOTALL)
['OMG is this a question !', 'Is this a sentence ?', 'My\n name is.']

Only one thing left to explain: re.DOTALL makes . match a newline as well.
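
A minimal sketch of the difference, using a shortened sample string chosen for illustration. Without `re.DOTALL` the dot stops at the newline, so a sentence that wraps onto the next line is never completed.

```python
import re

# Assumed sample: one sentence that wraps onto a second line.
t = "My\n name is."

# Default: . does not match "\n", so the match cannot reach the final period.
without_dotall = re.findall(r"[A-Z].*?[.!?]", t)

# re.DOTALL: . also matches "\n", so the sentence is found across the line break.
with_dotall = re.findall(r"[A-Z].*?[.!?]", t, re.DOTALL)

print(without_dotall)  # []
print(with_dotall)     # ['My\n name is.']
```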

Answer by codaddict

You can try:

import re

with open('a') as p:
    process = p.read()
print(process)
regexMatch = re.findall(r'[^.!?]+[.!?]', process)
print(regexMatch)

The regex used here is [^.!?]+[.!?], which tries to match one or more non-delimiter characters followed by a sentence delimiter.
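
A minimal sketch of this pattern on an in-memory string instead of a file; the `text` value is an assumption taken from the question's sample. Since the pattern does not require an uppercase start, it also returns "name is.", but every match after the first keeps its leading space, so the results are stripped here.

```python
import re

# Assumed sample: the sentence used elsewhere in this thread.
text = "OMG is this a question ! Is this a sentence ? My. name is."

# One or more non-delimiters followed by a delimiter.
raw = re.findall(r'[^.!?]+[.!?]', text)

# Remove the whitespace left over from the gap between sentences.
sentences = [s.strip() for s in raw]

print(sentences)
# ['OMG is this a question !', 'Is this a sentence ?', 'My.', 'name is.']
```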

Answer by Ningrong Ye

Thank you cji and Jochen Ritzel.

sentence = re.compile(r"[A-Z].*?[\.!?] ", re.MULTILINE | re.DOTALL)

I think this is the best: just add a space at the end of the pattern, so the terminator must be followed by a space.

SampleReport = 'I image from 08/25 through 12. The patient image 1.2, 23, 34, 45 and 64 from serise 34. image look good to have a tumor in this area.  It has been resected during the interval between scans.  The'

If you use

pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)
pat.findall(SampleReport)

The result will be:

['I image from 08/25 through 12.',
'The patient image 1.',
 'It has been resected during the interval between scans.']

The bug is that it can't handle numbers like 1.2, because it stops at the first period. But this one handles them:

sentence.findall(SampleReport)

Result

['I image from 08/25 through 12. ',
'The patient image 1.2, 23, 34, 45 and 64 from serise 34. ',
 'It has been resected during the interval between scans. ']
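
One caveat is worth sketching: because the pattern consumes a literal space after the terminator, a final sentence with no trailing space is not matched. The `report` string below is a shortened, assumed variant of the sample above. The lookahead pattern is a hypothetical alternative, not something proposed in this thread.

```python
import re

# Assumed sample: ends with a sentence that has no trailing space.
report = ("I image from 08/25 through 12. The patient image 1.2, 23, 34, 45 "
          "and 64 from serise 34. It has been resected.")

# Pattern from this answer: the terminator must be followed by a space.
with_space = re.compile(r"[A-Z].*?[.!?] ", re.DOTALL)

# Hypothetical variant: a lookahead accepts whitespace or the end of the text
# without consuming it.
with_lookahead = re.compile(r"[A-Z].*?[.!?](?=\s|$)", re.DOTALL)

space_result = with_space.findall(report)
lookahead_result = with_lookahead.findall(report)

print(space_result)
# ['I image from 08/25 through 12. ',
#  'The patient image 1.2, 23, 34, 45 and 64 from serise 34. ']
print(lookahead_result)
# ['I image from 08/25 through 12.',
#  'The patient image 1.2, 23, 34, 45 and 64 from serise 34.',
#  'It has been resected.']
```

Both patterns still rely on a capital letter to start a sentence, so text such as abbreviations followed by a space will be split incorrectly.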