Python - RegEx for splitting text into sentences (sentence-tokenizing)

Disclaimer: this page is a mirror/translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/25735644/



Tags: python, regex, nlp, tokenize

Asked by user3590149

I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of a sentence and not at decimals, abbreviations, the title in a name, or when the sentence contains a .com. This is my attempt at a regex that doesn't work.


import re

text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

for stuff in sentences:
        print(stuff)    

Example output of what it should look like


Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. 
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.

Accepted answer by vks

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s

Try this to split your string. You can also check the demo:


http://regex101.com/r/nG1gU7/27

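For instance, applying it with re.split to the question's sample text (a minimal sketch of my own; the lookbehinds here are all fixed-width, so the standard re module accepts the pattern):

import re

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. "
        "Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... "
        "Well, with a probability of .9 it isn't.")

# Split on whitespace that follows '.' or '?', unless the lookbehinds say the
# period belongs to something like "cheapsite.com", "i.e." or "Jr."
for sentence in re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text):
    print(sentence)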

Answer by Jose Varez

If you want to break up sentences at 3 periods (not sure if this is what you want) you can use this regular expression:


import re

text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r'\.{3}', text)

for stuff in sentences:
     print(stuff)    

Answer by smci

Ok so sentence-tokenizers are something I looked at in a little detail, using regexes, nltk, CoreNLP. You end up writing your own and it depends on the application. This stuff is tricky and valuable and people don't just give their tokenizer code away. (Ultimately, tokenization is not a deterministic procedure, it's probabilistic, and also depends very heavily on your corpus or domain, e.g. legal/financial documents vs social-media posts vs Yelp reviews vs biomedical papers...)


In general you can't rely on one single Great White infallible regex, you have to write a function which uses several regexes (both positive and negative); also a dictionary of abbreviations, and some basic language parsing which knows that e.g. 'I', 'USA', 'FCC', 'TARP' are capitalized in English.


To illustrate how easily this can get seriously complicated, let's try to write the functional spec for a deterministic tokenizer just to decide whether a single or multiple period ('.'/'...') indicates end-of-sentence, or something else:


function isEndOfSentence(leftContext, rightContext)


  1. Return False for decimals inside numbers or currency, e.g. 1.23, $1.23, "That's just my $.02". Consider also section references like 1.2.3, European date formats like 09.07.2014, IP addresses like 192.168.1.1, MAC addresses...
  2. Return False (and don't tokenize into individual letters) for known abbreviations e.g. "U.S. stocks are falling"; this requires a dictionary of known abbreviations. Anything outside that dictionary you will get wrong, unless you add code to detect unknown abbreviations like A.B.C. and add them to a list.
  3. Ellipses '...' at ends of sentences are terminal, but in the middle of sentences are not. This is not as easy as you might think: you need to look at the left context and the right context, specifically whether the RHS is capitalized, and again consider capitalized words like 'I' and abbreviations. Here's an example proving ambiguity: She asked me to stay... I left an hour later. (Was that one sentence or two? Impossible to determine.)
  4. You may also want to write a few patterns to detect and reject miscellaneous non-sentence-ending uses of punctuation: emoticons :-), ASCII art, spaced ellipses . . . and other stuff, esp. Twitter. (Making that adaptive is even harder.) How do we tell if @midnight is a Twitter user, the show on Comedy Central, text shorthand, or simply unwanted/junk/typo punctuation? Seriously non-trivial.
  5. After you handle all those negative cases, you could arbitrarily say that any isolated period followed by whitespace is likely to be an end of sentence. (Ultimately, if you really want to buy extra accuracy, you end up writing your own probabilistic sentence-tokenizer which uses weights, and training it on a specific corpus (e.g. legal texts, broadcast media, StackOverflow, Twitter, forum comments etc.)) Then you have to manually review exemplars and training errors. See the Manning and Jurafsky book or the Coursera course [a]. Ultimately you get as much correctness as you are prepared to pay for.
  6. All of the above is clearly specific to English-language abbreviations and US number/time/date formats. If you want to make it country- and language-independent, that's a bigger proposition; you'll need corpora, native-speaking people to label and QA it all, etc.
  7. All of the above is still only ASCII. Allow the input to be Unicode, and things get harder still (and the training set necessarily must be either much bigger or much sparser).

In the simple (deterministic) case, function isEndOfSentence(leftContext, rightContext) would return a boolean, but in the more general sense, it's probabilistic: it returns a float 0.0-1.0 (the confidence level that that particular '.' is a sentence end).

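As a very rough illustration of the deterministic variant (my own sketch, not part of the original answer), covering only points 1 and 2 above plus the naive fallback from point 5:

import re

# Deliberately tiny; a real dictionary would be much larger (see point 2 above).
KNOWN_ABBREVIATIONS = {'mr.', 'mrs.', 'dr.', 'jr.', 'sr.', 'i.e.', 'e.g.', 'u.s.'}

def isEndOfSentence(leftContext, rightContext):
    """Decide whether the '.' sitting between the two contexts ends a sentence."""
    # Point 1: decimals, currency, section numbers such as 1.23, $.02, 1.2.3
    if re.search(r'[\d$]$', leftContext) and re.match(r'\d', rightContext):
        return False
    # Point 2: known abbreviations such as "Mr." or "i.e."
    tokens = leftContext.split()
    if tokens and (tokens[-1].lower() + '.') in KNOWN_ABBREVIATIONS:
        return False
    # Point 5 fallback: whitespace then a capital (or opening quote) is probably a new sentence.
    return bool(re.match(r'\s*["\'\(]?[A-Z]', rightContext))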

References: [a] Coursera video: "Basic Text Processing 2-5 - Sentence Segmentation - Stanford NLP - Professor Dan Jurafsky & Chris Manning" [UPDATE: an unofficial version used to be on YouTube, was taken down]


Answer by walid toumi

Try this:


(?<!\b(?:[A-Z][a-z]|\d|[i.e]))\.(?!\b(?:com|\d+)\b)
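Note that the lookbehind here mixes alternatives of different lengths, which the standard re module rejects ("look-behind requires fixed-width pattern"), so trying it would need something like the third-party regex module, which allows variable-width lookbehinds. A rough sketch, not verified against all of the question's edge cases:

import regex  # third-party module: pip install regex

text = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind?"
parts = regex.split(r'(?<!\b(?:[A-Z][a-z]|\d|[i.e]))\.(?!\b(?:com|\d+)\b)', text)
print(parts)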

Answer by Ali

Naive approach for proper English sentences not starting with non-alphas and not containing quoted parts of speech:


import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
EndPunctuation = re.compile(r'([\.\?\!]\s+)')
NonEndings = re.compile(r'(?:Mrs?|Jr|i\.e)\.\s*$')
parts = EndPunctuation.split(text)
sentence = []
for part in parts:
  if len(part) and len(sentence) and EndPunctuation.match(sentence[-1]) and not NonEndings.search(''.join(sentence)):
    print(''.join(sentence))
    sentence = []
  if len(part):
    sentence.append(part)
if len(sentence):
  print(''.join(sentence))

False positive splitting may be reduced by extending NonEndings a bit. Other cases will require additional code. Handling typos in a sensible way will prove difficult with this approach.

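For instance, extending NonEndings with a few more titles and Latin abbreviations might look like this (illustrative only; tune it to your own corpus):

NonEndings = re.compile(r'(?:Mrs?|Ms|Dr|Prof|Rev|Sr|Jr|St|vs|i\.e|e\.g)\.\s*$')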

You will never reach perfection with this approach. But depending on the task it might just work "enough"...


Answer by Avinash Raj

Try to split the input on the spaces rather than on a dot or ?; if you do it like this, then the dot or ? won't be dropped from the final result.


>>> import re
>>> s = """Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."""
>>> m = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', s)
>>> for i in m:
...     print(i)
... 
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.

Answer by TennisVisuals

I wrote this taking into consideration smci's comments above. It is a middle-of-the-road approach that doesn't require external libraries and doesn't use regex. It allows you to provide a list of abbreviations and accounts for sentences ended by terminators in wrappers, such as a period and quote: [.", ?', .)].


abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior', 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)
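A quick way to exercise the functions above on the question's paragraph (my own call, not part of the original answer; note that the abbreviation lookups are case-sensitive as written, so keys like 'mr.' only match lowercase text):

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. "
        "Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... "
        "Well, with a probability of .9 it isn't.")

for s in find_sentences(text):
    print(s)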

I used Karl's find_all function from this entry: Find all occurrences of a substring in Python


Answer by Mehul Gupta

sent = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)', text)
for s in sent:
    print(s)

Here the regex used is: (?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)


First block: (?<!\w\.\w.): this is a negative lookbehind (?<!) that rejects the split position if it is preceded by a word character (\w), a full stop (\.), another word character (\w) and one more character — e.g. the middle of i.e. or cheapsite.com.


Second block: (?<![A-Z][a-z]\.): this is a negative lookbehind that rejects the split position if it is preceded by an uppercase letter ([A-Z]) followed by a lowercase letter ([a-z]) and then a dot (\.) — e.g. Mr. or Jr.


Third block: (?<=\.|\?): this is a positive lookbehind that requires the split position to be preceded by a dot (\.) OR a question mark (\?).


Fourth block: (\s|[A-Z].*): this pattern matches after the dot OR question mark from the third block. It matches blank space (\s) OR any sequence of characters starting with an upper-case letter ([A-Z].*). This block is important for splitting when the input looks like


Hello world.Hi I am here today.


i.e. whether or not there is a space after the dot.


Answer by Priyank Pathak

I'm not great at regular expressions, but a simpler, "brute force" version of the above is


sentence = re.compile(r"([\'\"][A-Z]|([A-Z][a-z]*\. )|[A-Z])(([a-z]*\.[a-z]*\.)|([A-Za-z0-9]*\.[A-Za-z0-9])|([A-Z][a-z]*\. [A-Za-z]*)|[^\.?]|[A-Za-z])*[\.?]")

which means the acceptable starting units are '[A-Z] or "[A-Z].
Please note that most regular expressions are greedy, so the order is very important when we do | (or). That's why I have written the i.e. regular expression first, and then come forms like Inc.

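Since this pattern matches whole sentences rather than split points, one way to apply the compiled sentence object from above is with finditer (a hedged sketch of my own; untested against every edge case in the question):

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. "
        "Did he mind? Adam Jones Jr. thinks he didn't.")

for m in sentence.finditer(text):  # 'sentence' is the pattern compiled above
    print(m.group(0))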

Answer by Luiz Anísio

My example is based on the example of Ali, adapted to Brazilian Portuguese. Thanks Ali.


import re

ABREVIACOES = ['sra?s?', 'exm[ao]s?', 'ns?', 'nos?', 'doc', 'ac', 'publ', 'ex', 'lv', 'vlr?', 'vls?',
               'exmo(a)', 'ilmo(a)', 'av', 'of', 'min', 'livr?', 'co?ls?', 'univ', 'resp', 'cli', 'lb',
               'dra?s?', r'[a-z]+r\(as?\)', 'ed', 'pa?g', 'cod', 'prof', 'op', 'plan', 'edf?', 'func', 'ch',
               'arts?', 'artigs?', 'artg', 'pars?', 'rel', 'tel', 'res', '[a-z]', 'vls?', 'gab', 'bel',
               'ilm[oa]', 'parc', 'proc', 'adv', 'vols?', 'cels?', 'pp', 'ex[ao]', 'eg', 'pl', 'ref',
               '[0-9]+', 'reg', 'f[ilí]s?', 'inc', 'par', 'alin', 'fts', 'publ?', 'ex', 'v. em', 'v.rev']

ABREVIACOES_RGX = re.compile(r'(?:{})\.\s*$'.format(r'|\s'.join(ABREVIACOES)), re.IGNORECASE)


def sentencas(texto, min_len=5):
    # baseado em https://stackoverflow.com/questions/25735644/python-regex-for-splitting-text-into-sentences-sentence-tokenizing
    texto = re.sub(r'\s\s+', ' ', texto)
    EndPunctuation = re.compile(r'([\.\?\!]\s+)')
    parts = EndPunctuation.split(texto)
    sentencas = []
    sentence = []
    for part in parts:
        txt_sent = ''.join(sentence)
        q_len = len(txt_sent)
        if len(part) and len(sentence) and q_len >= min_len and \
                EndPunctuation.match(sentence[-1]) and \
                not ABREVIACOES_RGX.search(txt_sent):
            sentencas.append(txt_sent)
            sentence = []
        if len(part):
            sentence.append(part)
    if sentence:
        sentencas.append(''.join(sentence))
    return sentencas
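For example (an illustrative call of my own, with a made-up Portuguese snippet):

texto = "O Dr. Silva comprou o site barato.com por 1,5 milhão de reais. Ele se importou? Acho que não."
for s in sentencas(texto):
    print(s)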

Full code in: https://github.com/luizanisio/comparador_elastic
