Python BeautifulSoup - search by text inside a tag

Notice: this page reproduces a popular StackOverflow question and its answers (originally presented as a Chinese-English parallel translation), provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/31958637/

Time: 2020-08-19 10:48:48  Source: igfitidea

BeautifulSoup - search by text inside a tag

Tags: python, regex, beautifulsoup

Asked by Eldamir

Observe the following problem:

import re
from bs4 import BeautifulSoup as BS

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    Edit
</a>
""")

# This returns the <a> element
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")

# This returns None
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)

For some reason, BeautifulSoup will not match the text when the <i> tag is there as well. Finding the tag and showing its text produces:

>>> a2 = soup.find(
        'a',
        href="/customer-menu/1/accounts/1/update"
    )
>>> print(repr(a2.text))
'\n Edit\n'

Right. According to the Docs, soup uses the match function of the regular expression, not the search function. So I need to provide the DOTALL flag:

pattern = re.compile('.*Edit.*')
pattern.match('\n Edit\n')  # Returns None

pattern = re.compile('.*Edit.*', flags=re.DOTALL)
pattern.match('\n Edit\n')  # Returns MatchObject

Alright. Looks good. Let's try it with soup

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")

soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*", flags=re.DOTALL)
)  # Still returns None... Why?!

Edit

My solution, based on geckon's answer: I implemented these helpers:

import re

MATCH_ALL = r'.*'


def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)


def find_by_text(soup, text, tag, **kwargs):
    """
    Find the tag in soup that matches all provided kwargs, and contains the
    text.

    If no match is found, return None.
    If more than one match is found, raise ValueError.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        if element.find(text=like(text)):
            matches.append(element)
    if len(matches) > 1:
        # str() is needed because str.join() rejects Tag objects
        raise ValueError("Too many matches:\n" + "\n".join(str(m) for m in matches))
    elif len(matches) == 0:
        return None
    else:
        return matches[0]

Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')

Accepted answer by geckon

The problem is that your <a> tag, with the <i> tag inside, doesn't have the string attribute you expect it to have. First let's take a look at what the text="" argument for find() does.

NOTE: The text argument is an old name; since BeautifulSoup 4.4.0 it's called string.

From the docs:

Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the tags whose .string is “Elsie”:

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

Now let's take a look at what Tag's string attribute is (from the docs again):

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

title_tag.string
# u'The Dormouse's story'

(...)

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None:

print(soup.html.string)
# None

This is exactly your case. Your <a> tag contains both a text node and an <i> tag. Therefore, find gets None when it looks at the tag's .string, and thus it can't match.
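
To make the .string behaviour concrete, here is a minimal sketch. The variable names plain and mixed and the shortened href are only for illustration, and the "html.parser" argument is an assumption about which parser to use:

```python
from bs4 import BeautifulSoup

# A link whose only child is a text node.
plain = BeautifulSoup('<a href="/x">Edit</a>', "html.parser").a

# A link with two children: an <i> tag and a text node.
mixed = BeautifulSoup(
    '<a href="/x"><i class="fa fa-edit"></i> Edit</a>', "html.parser"
).a

print(plain.string)            # Edit
print(mixed.string)            # None
print(repr(mixed.get_text()))  # ' Edit'
```

So a filter that compares against .string has nothing to compare for the second link, while get_text() still sees the word.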

How to solve this?

Maybe there is a better solution but I would probably go with something like this:

import re
from bs4 import BeautifulSoup as BS

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")

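links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")

thelink = None  # stays None if no link contains the text
for link in links:
    # link.find(text=...) searches the strings nested anywhere inside the link
    if link.find(text=re.compile("Edit")):
        thelink = link
        break

print(thelink)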

I think there are not too many links pointing to /customer-menu/1/accounts/1/update, so it should be fast enough.
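
The same idea can be wrapped in a small function and written with the newer string= keyword. This is only a sketch: first_link_containing is a hypothetical helper name, the second "Delete" link is made up to show the filtering, and "html.parser" is an assumed parser choice:

```python
import re
from bs4 import BeautifulSoup

HREF = "/customer-menu/1/accounts/1/update"

# Two links share the same href; only one contains the word "Edit".
soup = BeautifulSoup("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-trash"></i> Delete
</a>
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""", "html.parser")


def first_link_containing(soup, href, pattern):
    """Return the first <a href=...> that has a nested string matching pattern."""
    for link in soup.find_all("a", href=href):
        # With no tag name given, find(string=...) looks at the text nodes
        # nested inside the link rather than at link.string.
        if link.find(string=pattern):
            return link
    return None


edit_link = first_link_containing(soup, HREF, re.compile("Edit"))
print(edit_link)
```

Returning None when nothing matches keeps the caller from having to deal with an undefined variable.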

Answer by styvane

You can pass .find a function that returns True if the tag is an <a> whose text contains "Edit":

In [51]: def Edit_in_text(tag):
   ....:     return tag.name == 'a' and 'Edit' in tag.text
   ....: 

In [52]: soup.find(Edit_in_text, href="/customer-menu/1/accounts/1/update")
Out[52]: 
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>


EDIT:

You can use the .get_text() method instead of the text attribute in your function, which gives the same result:

def Edit_in_text(tag):
    return tag.name == 'a' and 'Edit' in tag.get_text()

Answer by Amr

In one line, using a lambda:

soup.find(lambda tag: tag.name == "a" and "Edit" in tag.text)
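
Note that a lambda which only checks the tag name and text matches any <a> containing "Edit", whatever its href. A possible variation, assuming you also want the href condition from the question, is to test it inside the function. The "Edit profile" link and the TARGET name below are made up for the example:

```python
from bs4 import BeautifulSoup

TARGET = "/customer-menu/1/accounts/1/update"

soup = BeautifulSoup("""
<a href="/profile/edit">Edit profile</a>
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""", "html.parser")

# tag.get("href") returns None for tags without an href, so the
# comparison is safe for every element the filter is called on.
link = soup.find(
    lambda tag: tag.name == "a"
    and tag.get("href") == TARGET
    and "Edit" in tag.get_text()
)
print(link)
```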