Python BeautifulSoup - search by text inside a tag
Note: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/31958637/
Asked by Eldamir
Observe the following problem:
import re
from bs4 import BeautifulSoup as BS

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    Edit
</a>
""")

# This returns the <a> element
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")

# This returns None
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)
For some reason, BeautifulSoup will not match the text when the <i> tag is there as well. Finding the tag and showing its text produces:
>>> a2 = soup.find(
...     'a',
...     href="/customer-menu/1/accounts/1/update"
... )
>>> print(repr(a2.text))
'\n Edit\n'
Right. According to the docs, soup uses the match function of the regular expression, not the search function, so I need to provide the DOTALL flag:
pattern = re.compile('.*Edit.*')
pattern.match('\n Edit\n') # Returns None
pattern = re.compile('.*Edit.*', flags=re.DOTALL)
pattern.match('\n Edit\n') # Returns MatchObject
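As a quick check of the regex behaviour on its own, outside BeautifulSoup, here is a minimal sketch comparing match and search on the same string. The variable names are illustrative only and do not come from the question.

```python
import re

text = "\n Edit\n"

# Without DOTALL, "." cannot match "\n", so match() fails at position 0.
plain = re.compile(".*Edit.*")
no_dotall_match = plain.match(text)

# search() tries later start positions, so it finds "Edit" without DOTALL.
no_dotall_search = plain.search(text)

# With DOTALL, "." also matches newlines, so match() succeeds from position 0.
dotall = re.compile(".*Edit.*", flags=re.DOTALL)
dotall_match = dotall.match(text)

print(no_dotall_match)                # None
print(repr(no_dotall_search.group(0)))  # ' Edit'
print(dotall_match.group(0) == text)  # True
```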
Alright. Looks good. Let's try it with soup
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")

soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*", flags=re.DOTALL)
)  # Still returns None... Why?!
Edit
My solution, based on geckon's answer: I implemented these helpers:
import re

MATCH_ALL = r'.*'


def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)


def find_by_text(soup, text, tag, **kwargs):
    """
    Find the tag in soup that matches all provided kwargs, and contains the
    text.

    If no match is found, return None.
    If more than one match is found, raise ValueError.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        if element.find(text=like(text)):
            matches.append(element)
    if len(matches) > 1:
        # Tags must be converted to str before they can be joined.
        raise ValueError("Too many matches:\n" + "\n".join(str(m) for m in matches))
    elif len(matches) == 0:
        return None
    else:
        return matches[0]
Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')
Accepted answer by geckon
The problem is that your <a> tag, with the <i> tag inside it, doesn't have the string attribute you expect it to have. First, let's take a look at what the text="" argument of find() does.
NOTE: The text argument is the old name; since BeautifulSoup 4.4.0 it is called string.
From the docs:
Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the tags whose .string is “Elsie”:
soup.find_all("a", string="Elsie") # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
Now let's take a look at what Tag's string attribute is (from the docs again):
If a tag has only one child, and that child is a NavigableString, the child is made available as .string:
title_tag.string # u'The Dormouse's story'
(...)
If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None:
print(soup.html.string) # None
This is exactly your case. Your <a> tag contains both text and an <i> tag. Therefore, find gets None when it looks at the tag's .string while searching for a string, and thus it can't match.
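To make the difference concrete, here is a minimal sketch that compares .string on an <a> tag with a single text child against one that also contains an <i> tag. The shortened "/x" href and the variable names are placeholders, not taken from the question, and "html.parser" is chosen only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# One child (a text node): .string is that text.
plain = BeautifulSoup('<a href="/x">Edit</a>', "html.parser").a

# Two children (an <i> tag and a text node): .string is None.
nested = BeautifulSoup(
    '<a href="/x"><i class="fa fa-edit"></i> Edit</a>', "html.parser"
).a

print(repr(plain.string))       # 'Edit'
print(repr(nested.string))      # None
print(repr(nested.get_text()))  # ' Edit'
```

The text is still reachable through get_text(); only the .string shortcut is unavailable once the tag has more than one child.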
How to solve this?
Maybe there is a better solution but I would probably go with something like this:
import re
from bs4 import BeautifulSoup as BS

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")

links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")

thelink = None  # stays None if no link contains "Edit"
for link in links:
    # With no tag name, find() searches the text nodes inside the link.
    if link.find(text=re.compile("Edit")):
        thelink = link
        break

print(thelink)
I think there are not too many links pointing to /customer-menu/1/accounts/1/update, so it should be fast enough.
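The reason the inner find() call succeeds where the original one failed can be seen in a small sketch: when no tag name is given, find() looks at the text nodes below the link rather than at the link's .string. This version uses the newer string= keyword mentioned in the note above; the variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="/customer-menu/1/accounts/1/update">'
    '<i class="fa fa-edit"></i> Edit</a>',
    "html.parser",
)
link = soup.find("a", href="/customer-menu/1/accounts/1/update")

# No tag name: the search runs over the text nodes inside `link`.
node = link.find(string=re.compile("Edit"))

print(repr(node))        # ' Edit'
print(node.parent.name)  # a
```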
Answer by styvane
You can pass .find a function that returns True if the a tag's text contains "Edit":
In [51]: def Edit_in_text(tag):
   ....:     return tag.name == 'a' and 'Edit' in tag.text
   ....:

In [52]: soup.find(Edit_in_text, href="/customer-menu/1/accounts/1/update")
Out[52]:
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
EDIT:
You can use the .get_text() method instead of the text attribute in your function, which gives the same result:
def Edit_in_text(tag):
    return tag.name == 'a' and 'Edit' in tag.get_text()
Answer by Amr
In one line, using a lambda:
soup.find(lambda tag: tag.name == "a" and "Edit" in tag.text)
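The same predicate also works with find_all when more than one link may match. Here is a self-contained sketch; the second "delete" link is a hypothetical addition so that the filter has something to exclude, and "html.parser" is just one parser choice.

```python
from bs4 import BeautifulSoup

# The "delete" link is hypothetical; it is not part of the original question.
html = """
<a href="/customer-menu/1/accounts/1/update"><i class="fa fa-edit"></i> Edit</a>
<a href="/customer-menu/1/accounts/1/delete"><i class="fa fa-trash"></i> Delete</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only <a> tags whose combined text contains "Edit".
edit_links = soup.find_all(
    lambda tag: tag.name == "a" and "Edit" in tag.get_text()
)
hrefs = [link["href"] for link in edit_links]

print(hrefs)  # ['/customer-menu/1/accounts/1/update']
```

Because the predicate checks get_text(), it sees the text of all descendants, so the nested <i> tag does not get in the way.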