Python BeautifulSoup web scraping find_all(): finding an exact match

Question

Asked by user2436815

I'm using Python and BeautifulSoup for web scraping.

Let's say I have the following HTML to scrape:

<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>

Using BeautifulSoup, I want to find ONLY the products whose class attribute is exactly class="product" (Products 1 and 2), not the 'special' products.

If I do the following:

result = soup.find_all('div', {'class': 'product'})

The result includes ALL the products (1, 2, 3, and 4).

What should I do to find products whose class EXACTLY matches 'product'?

The code I ran:

from bs4 import BeautifulSoup
import re

text = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

soup = BeautifulSoup(text, 'html.parser')
result = soup.findAll(attrs={'class': re.compile(r"^product$")})
print(result)

Output:

[<div class="product">Product 1</div>, <div class="product">Product 2</div>, <div class="product special">Product 3</div>, <div class="product special">Product 4</div>]

Answer 1

Answered by crunch

You can use CSS selectors like so:

result = soup.select('div.product.special')

css-selectors
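
Note that `div.product.special` selects the elements carrying both classes (Products 3 and 4). For the exact match asked about in the question, an attribute selector may be a closer fit. The following is a minimal sketch, assuming the standard-library `html.parser` and a Beautiful Soup version whose `select()` supports attribute selectors; `[class="product"]` compares the whole attribute value rather than individual class names:

```python
from bs4 import BeautifulSoup

text = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

soup = BeautifulSoup(text, "html.parser")

# [class="product"] tests the full attribute string,
# so class="product special" is excluded.
exact = soup.select('div[class="product"]')
exact_texts = [tag.get_text() for tag in exact]
print(exact_texts)
```

This comparison is on the literal attribute string, so it would not match an element written as class="special product" against `[class="product special"]`.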

Answer 2

Answered by Martijn Pieters

In BeautifulSoup 4, the class attribute (and several other attributes, such as accesskey and the headers attribute on table cell elements) is treated as a set; you match against individual elements listed in the attribute. This follows the HTML standard.

As such, you cannot limit the search to just one class.

You'll have to use a custom function here to match against the class instead:

result = soup.find_all(lambda tag: tag.name == 'div' and 
                                   tag.get('class') == ['product'])

I used a lambda to create an anonymous function; each tag is matched on name (must be 'div'), and the class attribute must be exactly equal to the list ['product'], i.e. it has just the one value.

Demo:

>>> from bs4 import BeautifulSoup
>>> text = """
... <body>
...     <div class="product">Product 1</div>
...     <div class="product">Product 2</div>
...     <div class="product special">Product 3</div>
...     <div class="product special">Product 4</div>
... </body>"""
>>> soup = BeautifulSoup(text, 'html.parser')
>>> soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['product'])
[<div class="product">Product 1</div>, <div class="product">Product 2</div>]
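
If the same check is needed for several class combinations, the lambda can be factored into a small helper. This is only a sketch: `has_exact_classes` is a hypothetical name, not part of BeautifulSoup, and it compares classes as sets so that the order in the attribute does not matter (a choice that differs slightly from the list comparison above).

```python
from bs4 import BeautifulSoup


def has_exact_classes(tag, name, classes):
    """Hypothetical helper: True if `tag` is a <name> element whose
    classes are exactly `classes`, ignoring order."""
    # tag.get() returns a list for class, or the default when it is absent.
    return tag.name == name and set(tag.get("class", [])) == set(classes)


text = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

soup = BeautifulSoup(text, "html.parser")

plain = soup.find_all(lambda tag: has_exact_classes(tag, "div", ["product"]))
special = soup.find_all(
    lambda tag: has_exact_classes(tag, "div", ["special", "product"])
)

plain_texts = [tag.get_text() for tag in plain]
special_texts = [tag.get_text() for tag in special]
print(plain_texts)
print(special_texts)
```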

For completeness' sake, here are all such set attributes, from the BeautifulSoup source code:

# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'.  When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
cdata_list_attributes = {
    "*" : ['class', 'accesskey', 'dropzone'],
    "a" : ['rel', 'rev'],
    "link" :  ['rel', 'rev'],
    "td" : ["headers"],
    "th" : ["headers"],
    "td" : ["headers"],
    "form" : ["accept-charset"],
    "object" : ["archive"],

    # These are HTML5 specific, as are *.accesskey and *.dropzone above.
    "area" : ["rel"],
    "icon" : ["sizes"],
    "iframe" : ["sandbox"],
    "output" : ["for"],
    }
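
The effect of this table can be observed directly on a parsed tag. The snippet below is a small illustration, assuming the standard-library `html.parser`; the `id="p3"` attribute is added here only to contrast a single-valued attribute with the multi-valued class:

```python
from bs4 import BeautifulSoup

html = '<div class="product special" id="p3">Product 3</div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.div

class_value = div["class"]  # multi-valued attribute: parsed into a list
id_value = div["id"]        # ordinary attribute: stays a single string
print(class_value, id_value)

# On output, the list is joined back into a space-separated string.
rendered = str(div)
print(rendered)
```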

Answer 3

Answered by Tarun Kiran

Change your code from

result = soup.findAll(attrs={'class': re.compile(r"^product$")})

to

result = soup.find_all(attrs={'class': 'product'})

The result is a list, so access its elements by index.

Answer 4

Answered by DeafaltCoder

soup.findAll(attrs={'class': re.compile(r"^product$")})

This regex is tested against each class value individually, so it still matches tags that carry other classes in addition to product.
