Disclaimer: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/16032832/

Time: 2020-08-18 21:38:53  Source: igfitidea

Python extract sentence containing word

python regex text-segmentation

Asked by user2187202

I am trying to extract all the sentences containing a specified word from a text.

import re

txt = "I like to eat apple. Me too. Let's go buy some apples."
txt = "." + txt
re.findall(r"\." + ".+" + "apple" + ".+" + r"\.", txt)

but it returns:

[".I like to eat apple. Me too. Let's go buy some apples."]

instead of:

[".I like to eat apple.", "Let's go buy some apples."]

Any help, please?

Accepted answer by Kent

In [3]: re.findall(r"([^.]*?apple[^.]*\.)", txt)
Out[3]: ['I like to eat apple.', " Let's go buy some apples."]
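The pattern above can be read piece by piece. The following is a minimal sketch, not part of the original answer, that writes the same pattern with re.VERBOSE so each part carries a comment; the name SENTENCE_WITH_APPLE is only illustrative.

```python
import re

txt = "I like to eat apple. Me too. Let's go buy some apples."

# Same pattern as above, written with re.VERBOSE so each part can be commented.
SENTENCE_WITH_APPLE = re.compile(r"""
    (            # capture the whole sentence
      [^.]*?     # any run of non-dot characters, as few as needed
      apple      # the literal text we are looking for
      [^.]*      # the rest of the sentence, up to the next dot
      \.         # the dot that ends the sentence
    )
""", re.VERBOSE)

matches = SENTENCE_WITH_APPLE.findall(txt)
print(matches)  # ['I like to eat apple.', " Let's go buy some apples."]
```

Because every repeated part is [^.] rather than ., a match can never run across a sentence-ending dot.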

Answer by Adem Öztaş

You can use str.split:

>>> txt="I like to eat apple. Me too. Let's go buy some apples."
>>> txt.split('. ')
['I like to eat apple', 'Me too', "Let's go buy some apples."]

>>> [ t for t in txt.split('. ') if 'apple' in t]
['I like to eat apple', "Let's go buy some apples."]

Answer by unutbu

In [7]: import re

In [8]: txt=".I like to eat apple. Me too. Let's go buy some apples."

In [9]: re.findall(r'([^.]*apple[^.]*)', txt)
Out[9]: ['I like to eat apple', " Let's go buy some apples"]

But note that @jamylak's split-based solution is faster:

In [10]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
1000000 loops, best of 3: 1.96 us per loop

In [11]: %timeit [s+ '.' for s in txt.split('.') if 'apple' in s]
1000000 loops, best of 3: 819 ns per loop

The speed difference is less, but still significant, for larger strings:

In [24]: txt = txt*10000

In [25]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
100 loops, best of 3: 8.49 ms per loop

In [26]: %timeit [s+'.' for s in txt.split('.') if 'apple' in s]
100 loops, best of 3: 6.35 ms per loop
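The %timeit lines above only work inside IPython. A rough equivalent for a plain Python script, sketched here with the standard timeit module, might look like the following; the helper names with_regex and with_split and the number=10000 setting are assumptions of this sketch, and the absolute timings will differ by machine and Python version.

```python
import re
import timeit

txt = ".I like to eat apple. Me too. Let's go buy some apples."

def with_regex(text):
    # Regex approach from this answer.
    return re.findall(r'([^.]*apple[^.]*)', text)

def with_split(text):
    # Split-based approach from jamylak's answer.
    return [s + '.' for s in text.split('.') if 'apple' in s]

# Time each approach; number=10000 is an arbitrary choice for this sketch.
regex_time = timeit.timeit(lambda: with_regex(txt), number=10000)
split_time = timeit.timeit(lambda: with_split(txt), number=10000)
print(f"regex: {regex_time:.4f} s, split: {split_time:.4f} s")
```

Note that the two approaches do not return identical strings: the regex version omits the trailing dot, while the split version adds it back.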

Answer by jamylak

No need for regex:

>>> txt = "I like to eat apple. Me too. Let's go buy some apples."
>>> [sentence + '.' for sentence in txt.split('.') if 'apple' in sentence]
['I like to eat apple.', " Let's go buy some apples."]
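If the leading space in the second result is unwanted, the same idea can be wrapped in a small helper. This is only a sketch; the function name sentences_containing is made up for illustration, and it assumes that every '.' ends a sentence.

```python
def sentences_containing(text, needle):
    """Return the '.'-terminated sentences of text that contain needle."""
    # str.split('.') leaves leading spaces and a trailing empty piece;
    # strip() removes the spaces, and the empty piece never contains
    # a non-empty needle, so it is filtered out.
    return [piece.strip() + '.' for piece in text.split('.') if needle in piece]

txt = "I like to eat apple. Me too. Let's go buy some apples."
result = sentences_containing(txt, 'apple')
print(result)  # ['I like to eat apple.', "Let's go buy some apples."]
```

Because it splits on every dot, text with abbreviations or decimal numbers (for example "e.g." or "3.14") would be cut in the wrong places, and sentences ending in ? or ! are not handled.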

Answer by poke

r"\."+".+"+"apple"+".+"+"\."

This line is a bit odd; why concatenate so many separate strings? You could just use r'\..+apple.+\.'.

Anyway, the problem with your regular expression is its greediness. By default, x+ will match x as often as it possibly can. So your .+ will match as many characters (any characters) as possible, including dots and apples.

What you want to use instead is a non-greedy expression; you can usually do this by adding a ? at the end: .+?

This gives the following result:

['.I like to eat apple. Me too.']

As you can see, you no longer get both apple sentences, but you still get "Me too." along with the first one. That is because the .+? after apple has to match at least one character, so it consumes the . that follows apple, which makes it impossible not to capture the following sentence as well.

A working regular expression would be this: r'\.[^.]*?apple[^.]*?\.'

Here you no longer match arbitrary characters, only characters that are not dots. We also allow matching no characters at all (because after the apple in the first sentence there are no non-dot characters). Using that expression results in this:

['.I like to eat apple.', ". Let's go buy some apples."]
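To see the three behaviours side by side, here is a small sketch that runs the greedy pattern from the question, the lazy variant, and the negated-character-class pattern on the same input. The dictionary keys are just labels chosen for this example.

```python
import re

# Input from the question, including the leading dot it prepends.
txt = ".I like to eat apple. Me too. Let's go buy some apples."

patterns = {
    "greedy": r"\..+apple.+\.",
    "lazy": r"\..+?apple.+?\.",
    "negated class": r"\.[^.]*?apple[^.]*?\.",
}

# Run every pattern against the same text and collect the matches.
results = {name: re.findall(pattern, txt) for name, pattern in patterns.items()}
for name, found in results.items():
    print(f"{name}: {found}")
```

Only the negated-class pattern stops at the sentence-ending dot; the other two can cross it because . matches any character, including dots.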

Answer by YJ. Yang

Strictly speaking, the example in the question is "extract sentence containing substring" rather than "extract sentence containing word". The "extract sentence containing word" problem can be solved in Python as follows:

A word can be at the beginning, in the middle, or at the end of a sentence. Without limiting it to the example in the question, here is a general function for searching for a word in a sentence:

import re

def searchWordinSentence(word, sentence):
    # Match the word in the middle, at the beginning, or at the end of the sentence.
    pattern = re.compile(' ' + word + ' |^' + word + ' | ' + word + '$')
    return bool(pattern.search(sentence))

For the example in the question, we can use it like this:

txt = "I like to eat apple. Me too. Let's go buy some apples."
word = "apple"
print([t for t in txt.split('. ') if searchWordinSentence(word, t)])

The corresponding output is:

['I like to eat apple']
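A space-delimited pattern like the one above misses a word that stands alone or sits next to punctuation. One possible alternative, sketched here with the \b word-boundary anchor from the re module, is shown below; the function name contains_word is hypothetical and not from the original answer.

```python
import re

def contains_word(word, sentence):
    """Return True if word occurs in sentence as a whole word."""
    # re.escape guards against regex metacharacters in word;
    # \b matches at the boundary between a word character and a non-word one.
    return re.search(r'\b' + re.escape(word) + r'\b', sentence) is not None

txt = "I like to eat apple. Me too. Let's go buy some apples."
matches = [t for t in txt.split('. ') if contains_word('apple', t)]
print(matches)  # ['I like to eat apple']
```

Here "apples" is rejected because the s after apple is a word character, so there is no boundary at that position. This sketch assumes the search word itself starts and ends with a word character; otherwise \b would not behave as intended.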