Python BeautifulSoup - search by text inside a tag

Question

Asked by Eldamir

Observe the following problem:

import re
from bs4 import BeautifulSoup as BS

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    Edit
</a>
""")

# This returns the <a> element
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")

# This returns None
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)

For some reason, BeautifulSoup will not match the text when the <i> tag is there as well. Finding the tag and showing its text produces:

>>> a2 = soup.find(
        'a',
        href="/customer-menu/1/accounts/1/update"
    )
>>> print(repr(a2.text))
'\n Edit\n'

Right. According to the docs, Beautiful Soup uses the regular expression's match function, not its search function. So I need to provide the DOTALL flag:

pattern = re.compile('.*Edit.*')
pattern.match('\n Edit\n')  # Returns None

pattern = re.compile('.*Edit.*', flags=re.DOTALL)
pattern.match('\n Edit\n')  # Returns MatchObject
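The difference between match and search, and the effect of DOTALL, can be checked with the standard library alone. This is a small sketch on the same sample string; the variable names are only illustrative.

```python
import re

sample = "\n Edit\n"

plain = re.compile(".*Edit.*")
dotall = re.compile(".*Edit.*", flags=re.DOTALL)

# match() is anchored at position 0. Without DOTALL, "." cannot
# consume the leading "\n", so the pattern fails there.
match_plain = plain.match(sample)

# search() tries later start positions, so it succeeds on the second line.
search_plain = plain.search(sample)

# With DOTALL, "." also matches "\n", so match() succeeds from position 0.
match_dotall = dotall.match(sample)

print(match_plain)                 # None
print(repr(search_plain.group()))  # ' Edit'
print(repr(match_dotall.group()))  # '\n Edit\n'
```

So the regular expression itself behaves as expected once DOTALL is set; whatever still fails afterwards has to come from what the pattern is compared against.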

Alright, looks good. Let's try it with soup:

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")

soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*", flags=re.DOTALL)
)  # Still returns None... Why?!

Edit

My solution, based on geckon's answer: I implemented these helpers:

import re

MATCH_ALL = r'.*'


def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)


def find_by_text(soup, text, tag, **kwargs):
    """
    Find the tag in soup that matches all provided kwargs, and contains the
    text.

    If no match is found, return None.
    If more than one match is found, raise ValueError.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        if element.find(text=like(text)):
            matches.append(element)
    if len(matches) > 1:
        raise ValueError("Too many matches:\n" + "\n".join(str(m) for m in matches))
    elif len(matches) == 0:
        return None
    else:
        return matches[0]

Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')

Answer 1

Accepted answer by geckon

The problem is that your <a> tag, with the <i> tag inside it, doesn't have the string attribute you expect it to have. First, let's take a look at what the text="" argument for find() does.

NOTE: The text argument is an old name; since BeautifulSoup 4.4.0 it is called string.

From the docs:

Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the tags whose .string is “Elsie”:
soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

Now let's take a look at what Tag's string attribute is (from the docs again):

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:
title_tag.string
# u'The Dormouse's story'

(...)

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None:
print(soup.html.string)
# None

This is exactly your case. Your <a> tag contains both text and a tag. Therefore, find gets None when it looks up the tag's string, and so there is nothing for the pattern to match.
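The behaviour of .string can be seen directly by comparing the two <a> elements from the question. This is a minimal sketch; passing "html.parser" explicitly is my choice here and is not part of the original code.

```python
from bs4 import BeautifulSoup

plain_html = """
<a href="/customer-menu/1/accounts/1/update">
    Edit
</a>
"""
mixed_html = """
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
"""

plain = BeautifulSoup(plain_html, "html.parser").a
mixed = BeautifulSoup(mixed_html, "html.parser").a

# One child, and it is a NavigableString: .string returns it.
plain_string = plain.string

# Three children (whitespace, <i>, " Edit\n"): .string is None.
mixed_string = mixed.string

# get_text() concatenates all descendant strings, so "Edit" is still there.
mixed_text = mixed.get_text()

print(repr(plain_string))
print(mixed_string)
print("Edit" in mixed_text)
```

The text is present in both cases; only the .string shortcut disappears once the tag has more than one child.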

How to solve this?

Maybe there is a better solution but I would probably go with something like this:

import re
from bs4 import BeautifulSoup as BS

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")

links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")

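thelink = None  # stays None if no link contains "Edit"
for link in links:
    # Search the link's descendant strings rather than its .string attribute
    if link.find(text=re.compile("Edit")):
        thelink = link
        break

print(thelink)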

I think there are not too many links pointing to /customer-menu/1/accounts/1/update, so it should be fast enough.
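A possible variation on the same idea goes the other way round: find the text node first, then walk up to the enclosing <a>. This is only a sketch, not the approach given in the answer; it uses the newer string= argument name and an explicit "html.parser", both of which are my assumptions about the environment.

```python
import re
from bs4 import BeautifulSoup

html = """
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
"""
soup = BeautifulSoup(html, "html.parser")

# With no tag name given, find(string=...) returns a NavigableString.
text_node = soup.find(string=re.compile("Edit"))

link = None
if text_node is not None:
    # Climb to the nearest enclosing <a> with the expected href.
    link = text_node.find_parent(
        "a", href="/customer-menu/1/accounts/1/update"
    )

print(link)
```

Note that this only inspects the first text node containing "Edit". If the page has such text outside the wanted link, looping over find_all(string=...) would be needed instead.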

Answer 2

Answered by styvane

You can pass .find a function that returns True if the <a> tag's text contains "Edit":

In [51]: def Edit_in_text(tag):
   ....:     return tag.name == 'a' and 'Edit' in tag.text
   ....: 

In [52]: soup.find(Edit_in_text, href="/customer-menu/1/accounts/1/update")
Out[52]: 
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>

EDIT:

You can use the .get_text() method instead of .text in your function, which gives the same result:

def Edit_in_text(tag):
    return tag.name == 'a' and 'Edit' in tag.get_text()
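If the same check is needed for different tags or texts, the filter function can be generated by a small factory. The helper name has_text is hypothetical, and the explicit "html.parser" is an assumption; the rest follows the function-filter approach shown above.

```python
from bs4 import BeautifulSoup


def has_text(tag_name, needle):
    """Build a filter for find()/find_all() (hypothetical helper).

    The returned function accepts a Tag and is true when the tag has the
    given name and `needle` occurs anywhere in its combined text.
    """
    def predicate(tag):
        return tag.name == tag_name and needle in tag.get_text()
    return predicate


soup = BeautifulSoup("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""", "html.parser")

found = soup.find(
    has_text("a", "Edit"),
    href="/customer-menu/1/accounts/1/update",
)
missing = soup.find(has_text("a", "Delete"))

print(found)
print(missing)
```

The match is a plain substring test, so it is case-sensitive and would also accept text such as "Editor".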

Answer 3

Answered by Amr

In one line, using a lambda:

soup.find(lambda tag:tag.name=="a" and "Edit" in tag.text)
