Python: How can I split a text into sentences?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must keep the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/4576077/

How can I split a text into sentences?

Tags: python, text, split

Asked by Artyom

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

My old regular expression works badly:

re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)

Accepted answer by Ned Batchelder

The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates it does the job:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open("test.txt") as fp:
    data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))

(I haven't tried it!)

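In recent NLTK versions you first need nltk.download('punkt') before the pickle above will load; and because punkt is an unsupervised algorithm, you can also train a tokenizer on your own text. A minimal sketch of that idea (my_corpus.txt is a hypothetical file name, not from the original answer):

import nltk
from nltk.tokenize.punkt import PunktSentenceTokenizer

nltk.download('punkt')  # one-time fetch of the pre-trained punkt models

# alternatively, train an unsupervised Punkt model on your own raw text
train_text = open('my_corpus.txt').read()  # hypothetical corpus file
tokenizer = PunktSentenceTokenizer(train_text)
print(tokenizer.tokenize('Mr. Smith stayed. He never left.'))
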
Answer by Rafe Kettler

For simple cases (where sentences are terminated normally), this should work:

import re
text = open('somefile.txt').read()
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The basic regex is ' *\. +', which matches a period with 0 or more spaces to the left and 1 or more to the right (so that the period in something like 're.split' isn't counted as a sentence break); the pattern in the code above extends this to '?' and '!' and to closing quotes, brackets and parentheses after the terminator.

Obviously, it's not the most robust solution, but it'll do fine in most cases. The only case this won't cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?).

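A rough sketch of that capital-letter check (my addition, not part of the original answer; it is crude because the split above already discarded the terminators):

merged = []
for s in sentences:
    # a fragment starting in lowercase was probably split off after an abbreviation
    if merged and s and not s[0].isupper():
        merged[-1] += ' ' + s
    else:
        merged.append(s)
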
Answer by Marilena Di Bari

@Artyom,

Hi! You could make a new tokenizer for Russian (and some other languages) using this function:

def russianTokenizer(text):
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ') 
    result = result.replace('  ', ' ')  # repeated (here and below) to collapse runs of multiple spaces
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result

and then call it in this way:

text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)
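
With the sample above, the call should return the words and punctuation marks as separate tokens (a quick check, not part of the original answer):

print(tokens)
# ['вы', 'выполняете', 'поиск', ',', 'используя', 'Google', 'SSL', ';']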

Good luck, Marilena.

Answer by vaichidrewar

No doubt NLTK is the most suitable tool for the purpose. But getting started with NLTK is quite painful (though once you install it, you reap the rewards).

So here is simple re-based code, available at http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html

# split up a paragraph into sentences
# using regular expressions

import re


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    # to split by multiple characters,
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

# output:
#   This is a sentence
#   This is an excited sentence
#   And do you think this is a question
#   (plus one empty line from the trailing '?')
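
Note that splitting on the final '?' leaves an empty string at the end of the returned list, which is the blank line in the output above; a one-line filter (my addition, not in the original) cleans that up:

sentences = [s.strip() for s in splitParagraphIntoSentences(p) if s.strip()]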

Answer by TennisVisuals

Here is a middle-of-the-road approach that doesn't rely on any external libraries. I use a list comprehension to exclude overlaps between abbreviations and terminators, as well as overlaps between variations on terminators, for example: '.' vs. '."'

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'that is', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences


def find_sentence_end(paragraph):
    possible_endings, contraction_locations = [], []
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl's find_all function from this entry: Find all occurrences of a substring in Python

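A quick usage sketch (my addition, not from the original answer). Note that the abbreviation lookup is case-sensitive, so only abbreviations written exactly as the keys of the abbreviations dict are protected:

paragraph = "cats vs. dogs was fun. we watched it."
print(find_sentences(paragraph))
# expected: ['cats vs. dogs was fun.', 'we watched it.']
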
Answer by D Greenberg

This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds, and it handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."

# -*- coding: utf-8 -*-
import re

alphabets = r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n", " ")
    # note: replacement backreferences must be escaped (r"\1"),
    # otherwise "\1" is the literal character \x01
    text = re.sub(prefixes, r"\1<prd>", text)
    text = re.sub(websites, r"<prd>\1", text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] ", r" \1<prd> ", text)
    text = re.sub(acronyms + " " + starters, r"\1<stop> \2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", r"\1<prd>\2<prd>\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", r"\1<prd>\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, r" \1<stop> \2", text)
    text = re.sub(" " + suffixes + "[.]", r" \1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", r" \1<prd>", text)
    if "”" in text: text = text.replace(".”", "”.")
    if "\"" in text: text = text.replace(".\"", "\".")
    if "!" in text: text = text.replace("!\"", "\"!")
    if "?" in text: text = text.replace("?\"", "\"?")
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
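
A minimal usage sketch, feeding in the example from the paragraph above (the expected output assumes the escaped backreferences shown in the code):

text = ("Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in "
        "Israel before joining Nike Inc. as an engineer. He also worked at "
        "craigslist.org as a business analyst.")
for s in split_into_sentences(text):
    print(s)
# Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer.
# He also worked at craigslist.org as a business analyst.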

Answer by Hassan Raza

Instead of using regex for splitting the text into sentences, you can also use the nltk library.

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."

>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']
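
If the punkt model is not installed yet, sent_tokenize will raise a LookupError; a one-time download fixes that:

>>> import nltk
>>> nltk.download('punkt')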

ref: https://stackoverflow.com/a/9474645/2877052

Answer by Elf

You can try using spaCy instead of regex. I use it and it does the job.

import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut is gone in spaCy 3; first run: python -m spacy download en_core_web_sm

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    print(sent.text.strip())  # sent.string was removed in spaCy 3; use sent.text
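
If you only need sentence boundaries and want to avoid downloading a statistical model, spaCy also ships a rule-based sentencizer. A minimal sketch using the spaCy 3.x API (not part of the original answer):

import spacy

nlp = spacy.blank('en')        # blank English pipeline, no model download needed
nlp.add_pipe('sentencizer')    # rule-based sentence boundary detector

doc = nlp('''Your text here''')
for sent in doc.sents:
    print(sent.text.strip())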

Answer by kishore

I had to read subtitle files and split them into sentences. After pre-processing (like removing the time information etc. in the .srt files), the variable fullFile contained the full text of the subtitle file. The crude approach below neatly split them into sentences. Probably I was lucky that the sentences always ended (correctly) with a space. Try this first, and if it has any exceptions, add more checks and balances.

# Very approximate way to split the text into sentences - break after ? . and !
import re

fullFile = re.sub(r"(\!|\?|\.) ", r"\1<BRK>", fullFile)
sentences = fullFile.split("<BRK>")
sentFile = open("./sentences.out", "w+")
for line in sentences:
    sentFile.write(line)
    sentFile.write("\n")
sentFile.close()  # note: the original 'sentFile.close;' never actually called close()

Oh well. I now realize that since my content was Spanish, I did not have the issues of dealing with "Mr. Smith" etc. Still, if someone wants a quick and dirty parser...

Answer by amiref

You can also use the sentence tokenization function in NLTK:

from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes.  Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare's quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,' and ‘to thine own self be true' are from the foolish, garrulous and quite disreputable Polonius in Hamlet."

sent_tokenize(sentence)
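
To print the result one sentence per line (punkt should find four sentences in this passage):

for s in sent_tokenize(sentence):
    print(s)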