Python regular expression match whole word
Original URL: http://stackoverflow.com/questions/15863066/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by user2161049
I'm having trouble finding the correct regular expression for the scenario below:
Let's say:
a = "this is a sample"
I want to match whole word - for example match "hi" should return False since "hi" is not a word and "is" should return True since there is no alpha character on the left and on the right side.
Accepted answer by georg
Try
re.search(r'\bis\b', your_string)
From the docs:
\b Matches the empty string, but only at the beginning or end of a word.
Note that the re module uses a naive definition of "word" as a "sequence of alphanumeric or underscore characters", where "alphanumeric" depends on locale or unicode options.
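To make that definition of "word" concrete, here is a small sketch. The helper name is_whole_word and the sample strings are my own, not from the question. It assumes Python 3, where str patterns are Unicode-aware by default and the re.ASCII flag restricts word characters to [a-zA-Z0-9_].

```python
import re

def is_whole_word(text, flags=0):
    """Hypothetical helper: True if "is" occurs in text as a whole word."""
    return re.search(r"\bis\b", text, flags) is not None

# Underscores and digits are word characters, so there is no boundary
# between "is" and "_" or between "is" and "3".
print(is_whole_word("this is_a sample"))   # False
print(is_whole_word("this is3 a sample"))  # False

# Punctuation is not a word character, so a boundary exists next to it.
print(is_whole_word("what it is."))        # True
print(is_whole_word("is-a relationship"))  # True

# By default "\u00e9" (e with acute accent) is a word character...
print(is_whole_word("is\u00e9"))            # False
# ...but under re.ASCII it is not, so a boundary appears after "is".
print(is_whole_word("is\u00e9", re.ASCII))  # True
```

Whether this "naive" definition is acceptable depends on the data: identifiers such as is_a or is3 are treated as single words.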
Also note that without the raw string prefix, \b is seen as "backspace" instead of regex word boundary.
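The raw-string point is easy to check directly. The following is a minimal sketch; the variable names are mine. It compares the same pattern written with and without the r prefix.

```python
import re

text = "this is a sample"

raw_pattern = r"\bis\b"    # backslash-b reaches the regex engine: word boundary
plain_pattern = "\bis\b"   # Python turns \b into the backspace character (0x08)

print(len(raw_pattern))    # 6 characters: \ b i s \ b
print(len(plain_pattern))  # 4 characters: BS i s BS

print(re.search(raw_pattern, text) is not None)    # True
print(re.search(plain_pattern, text) is not None)  # False: no backspaces in text

# The non-raw pattern only matches text that really contains backspaces.
print(re.search(plain_pattern, "\x08is\x08") is not None)  # True
```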
Answer by keir
The trouble with regex is that if the string you want to search for in another string contains regex characters, it gets complicated. Any string with brackets will fail.
This code will find a word
word = "is"
srchedStr = "this is a sample"
if srchedStr.find(" " + word + " ") >= 0 or \
        srchedStr.endswith(" " + word):
    pass  # <do stuff>
The first part of the conditional searches for the text with a space on each side, and the second part catches the end-of-string situation. Note that endswith returns a boolean, whereas find returns an integer.
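If the concern is regex metacharacters in the search term, one alternative (not part of this answer) is to pass the term through re.escape before adding the word boundaries. The sketch below wraps both approaches in hypothetical functions, find_with_spaces and find_with_escape, and uses a sample string of my own to show where they differ.

```python
import re

def find_with_spaces(word, text):
    """The space-based check from this answer, wrapped in a function."""
    return text.find(" " + word + " ") >= 0 or text.endswith(" " + word)

def find_with_escape(word, text):
    """Alternative sketch: regex word boundaries around an escaped term."""
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None

text = "this is a sample (draft)."

# A case where the two checks agree.
print(find_with_spaces("is", text), find_with_escape("is", text))  # True True

# The space-based check misses the first word and words next to punctuation.
print(find_with_spaces("this", text), find_with_escape("this", text))    # False True
print(find_with_spaces("draft", text), find_with_escape("draft", text))  # False True

# Unescaped, "." is a wildcard and "s.mple" matches "sample"; escaped, it does not.
print(re.search(r"\bs.mple\b", text) is not None)  # True
print(find_with_escape("s.mple", text))            # False

# An unescaped, unbalanced bracket is not a valid pattern at all.
try:
    re.search(r"\b" + "(draft" + r"\b", text)
except re.error as exc:
    print("invalid pattern:", exc)
```

One caveat: \b only helps when the search term itself starts and ends with a word character. A term such as "(draft)" would not match even when escaped, because there is no word boundary between a space and "(".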
Answer by Om Prakash
Try using the "word boundary" assertion, \b, with the regex module, re:
>>> import re
>>> x = "this is a sample"
>>> y = "this isis a sample."
>>> regex = re.compile(r"\bis\b")  # For ignore case: re.compile(r"\bis\b", re.IGNORECASE)
>>> regex.findall(y)
[]
>>> regex.findall(x)
['is']
From the re documentation:
\b matches the empty string, but only at the beginning or end of a word...
For example, r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
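The documentation's examples can be checked with a short loop. This is a sketch; the list names are my own.

```python
import re

pattern = re.compile(r"\bfoo\b")

should_match = ["foo", "foo.", "(foo)", "bar foo baz"]
should_not_match = ["foobar", "foo3"]

# Prints True for the first four strings and False for the last two.
for s in should_match + should_not_match:
    print(repr(s), pattern.search(s) is not None)
```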
Answer by bballdave025
I think that the answers given did not completely achieve the behavior the OP wanted. Specifically, they do not produce the desired boolean output. The answers given do help illustrate the concept, and I think they are excellent. Perhaps I can illustrate what I mean by explaining why I think the OP chose these particular examples.
The string given was,
a = "this is a sample"
The OP then stated,
I want to match whole word - for example match "hi" should return False since "hi" is not a word ...
As I understand, the reference is to the search token "hi" as it is found in the word "this". If someone were to search the string a for the word "hi", they should receive False as the response.
The OP continues,
... and "is" should return True since there is no alpha character on the left and on the right side.
In this case, the reference is to the search token "is" as it is found in the word "is". I hope this helps clarify things as to why we use word boundaries. The other answers have the behavior of "don't return a word unless that word is found by itself -- not inside of other words." The "word boundary" assertion, \b, does this job nicely.
Only the word "is" has been used in examples up to this point. I think that these answers are correct, but I think that there is more of the question's fundamental meaning that needs to be addressed. The behavior of other search strings should be noted to understand the concept. In other words, we need to generalize the (excellent) answer by @georg using re.search(r"\bis\b", your_string). The same r"\bis\b" concept is also used in the answer by @OmPrakash, who started the generalizing discussion by showing
>>> y = "this isis a sample."
>>> regex = re.compile(r"\bis\b")  # For ignore case: re.compile(r"\bis\b", re.IGNORECASE)
>>> regex.findall(y)
[]
Let's say the method which should exhibit the behavior I've discussed is named
find_only_whole_word(search_string, input_string)
The following behavior should then be expected.
>>> a = "this is a sample"
>>> find_only_whole_word("hi", a)
False
>>> find_only_whole_word("is", a)
True
Once again, this is how I understand the OP's question. The answer from @georg is a step towards that behavior, but it's a little hard to interpret/implement. To wit:
>>> import re
>>> a = "this is a sample"
>>> re.search(r"\bis\b", a)
<_sre.SRE_Match object; span=(5, 7), match='is'>
>>> re.search(r"\bhi\b", a)
>>>
There is no output from the second command. The useful answer from @OmPrakash shows output, but not True or False.
Here's a more complete sampling of the behavior to be expected.
>>> find_only_whole_word("this", a)
True
>>> find_only_whole_word("is", a)
True
>>> find_only_whole_word("a", a)
True
>>> find_only_whole_word("sample", a)
True
# Use "ample", part of the word, "sample": (s)ample
>>> find_only_whole_word("ample", a)
False
# (t)his
>>> find_only_whole_word("his", a)
False
# (sa)mpl(e)
>>> find_only_whole_word("mpl", a)
False
# Any random word
>>> find_only_whole_word("applesauce", a)
False
>>>
This can be accomplished by the following code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
#@file find_only_whole_word.py

import re

def find_only_whole_word(search_string, input_string):
    # Create a raw string with word boundaries from the user's search_string
    raw_search_string = r"\b" + search_string + r"\b"

    match_output = re.search(raw_search_string, input_string)
    ##As noted by @OmPrakash, if you want to ignore case, uncomment
    ##the next two lines
    #match_output = re.search(raw_search_string, input_string,
    #                         flags=re.IGNORECASE)

    no_match_was_found = ( match_output is None )
    if no_match_was_found:
        return False
    else:
        return True
##endof:  find_only_whole_word(search_string, input_string)
A simple demonstration follows. Run the Python interpreter from the same directory where you saved the file, find_only_whole_word.py.
>>> from find_only_whole_word import find_only_whole_word
>>> a = "this is a sample"
>>> find_only_whole_word("hi", a)
False
>>> find_only_whole_word("is", a)
True
>>> find_only_whole_word("cucumber", a)
False
# The excellent example from @OmPrakash
>>> find_only_whole_word("is", "this isis a sample")
False
>>>
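As a possible refinement, the function can be condensed and the ignore-case choice turned into a parameter instead of commented-out lines. This is only a sketch: the name find_whole_word and the ignore_case parameter are my own, and escaping the search term with re.escape is an assumption that the caller wants a literal word rather than a regex pattern.

```python
import re

def find_whole_word(search_string, input_string, ignore_case=False):
    """Return True if search_string occurs in input_string as a whole word.

    Hypothetical variant of find_only_whole_word: the search term is
    escaped, so it is treated as literal text rather than as a pattern.
    """
    flags = re.IGNORECASE if ignore_case else 0
    pattern = r"\b" + re.escape(search_string) + r"\b"
    # re.search returns a match object or None; compare with None for a bool.
    return re.search(pattern, input_string, flags) is not None

a = "this is a sample"
print(find_whole_word("hi", a))                    # False
print(find_whole_word("is", a))                    # True
print(find_whole_word("IS", a))                    # False
print(find_whole_word("IS", a, ignore_case=True))  # True
print(find_whole_word("is", "this isis a sample")) # False
```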

