Python BeautifulSoup web scraping with find_all(): finding an exact match
Original question: http://stackoverflow.com/questions/22726860/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
BeautifulSoup webscraping find_all( ): finding exact match
Asked by user2436815
I'm using Python and BeautifulSoup for web scraping.
Let's say I have the following HTML code to scrape:
<body>
<div class="product">Product 1</div>
<div class="product">Product 2</div>
<div class="product special">Product 3</div>
<div class="product special">Product 4</div>
</body>
Using BeautifulSoup, I want to find ONLY the products with the attribute class="product" (only Products 1 and 2), not the 'special' products.
If I do the following:
result = soup.find_all('div', {'class': 'product'})
the result includes ALL the products (1, 2, 3, and 4).
What should I do to find products whose class EXACTLY matches 'product'?
The code I ran:
from bs4 import BeautifulSoup
import re
text = """
<body>
<div class="product">Product 1</div>
<div class="product">Product 2</div>
<div class="product special">Product 3</div>
<div class="product special">Product 4</div>
</body>"""
soup = BeautifulSoup(text, 'html.parser')
result = soup.findAll(attrs={'class': re.compile(r"^product$")})
print(result)
Output:
[<div class="product">Product 1</div>, <div class="product">Product 2</div>, <div class="product special">Product 3</div>, <div class="product special">Product 4</div>]
Answered by crunch
You can use CSS selectors like so:
result = soup.select('div.product.special')
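Note that the selector above matches div elements that carry both the product and special classes (Products 3 and 4). For the exact-match case asked about, here is a minimal sketch, assuming a Beautiful Soup version whose select() supports attribute selectors and :not() (4.7 and later, which delegate to Soup Sieve). The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<body>
<div class="product">Product 1</div>
<div class="product">Product 2</div>
<div class="product special">Product 3</div>
<div class="product special">Product 4</div>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# Attribute selector: the whole class attribute must be exactly "product".
exact = soup.select('div[class="product"]')

# Alternative: has the class "product" but not the class "special".
not_special = soup.select("div.product:not(.special)")

exact_texts = [tag.get_text() for tag in exact]
not_special_texts = [tag.get_text() for tag in not_special]
print(exact_texts)
print(not_special_texts)
```

The two selectors differ in intent: the first rejects any additional class at all, while the second only rejects the special class.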
Answered by Martijn Pieters
In BeautifulSoup 4, the class attribute (and several other attributes, such as accesskey and the headers attribute on table cell elements) is treated as a set; you match against individual elements listed in the attribute. This follows the HTML standard.
As such, you cannot limit the search to just one class.
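To make the point about multi-valued attributes concrete, the short sketch below prints what Beautiful Soup stores for class on each tag. It assumes the built-in html.parser backend; the sample markup is a shortened version of the question's HTML.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="product">Product 1</div>'
    '<div class="product special">Product 3</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Each class attribute is parsed into a list of separate values.
class_values = [div["class"] for div in soup.find_all("div")]
print(class_values)

# Searching for "product" succeeds on any tag whose list contains it.
matched = soup.find_all("div", {"class": "product"})
print(len(matched))
```

Because the search is applied to each value in the list, the tag with class="product special" is matched as well.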
You'll have to use a custom function here to match against the class instead:
result = soup.find_all(lambda tag: tag.name == 'div' and
                       tag.get('class') == ['product'])
I used a lambda to create an anonymous function; each tag is matched on name (must be 'div'), and the class attribute must be exactly equal to the list ['product'], i.e. have just the one value.
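If the same kind of filter is needed in several places, the lambda can be wrapped in a small named helper. The sketch below is one possible way to do that; has_exact_class is a hypothetical name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def has_exact_class(name, classes):
    """Return a filter for tags called `name` whose class list equals `classes`.

    Hypothetical helper for illustration; not a Beautiful Soup API.
    """
    def matcher(tag):
        return tag.name == name and tag.get("class") == classes
    return matcher


html = """
<body>
<div class="product">Product 1</div>
<div class="product">Product 2</div>
<div class="product special">Product 3</div>
<div class="product special">Product 4</div>
</body>"""
soup = BeautifulSoup(html, "html.parser")

only_product = soup.find_all(has_exact_class("div", ["product"]))
only_special = soup.find_all(has_exact_class("div", ["product", "special"]))

only_product_texts = [tag.get_text() for tag in only_product]
only_special_texts = [tag.get_text() for tag in only_special]
print(only_product_texts)
print(only_special_texts)
```

List equality is order-sensitive, so class="special product" would not equal ['product', 'special']; comparing set(tag.get('class', [])) against a set would be one way to ignore order.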
Demo:
>>> from bs4 import BeautifulSoup
>>> text = """
... <body>
... <div class="product">Product 1</div>
... <div class="product">Product 2</div>
... <div class="product special">Product 3</div>
... <div class="product special">Product 4</div>
... </body>"""
>>> soup = BeautifulSoup(text, 'html.parser')
>>> soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['product'])
[<div class="product">Product 1</div>, <div class="product">Product 2</div>]
For completeness' sake, here are all such set attributes, from the BeautifulSoup source code:
# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'. When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
cdata_list_attributes = {
"*" : ['class', 'accesskey', 'dropzone'],
"a" : ['rel', 'rev'],
"link" : ['rel', 'rev'],
"td" : ["headers"],
"th" : ["headers"],
"td" : ["headers"],
"form" : ["accept-charset"],
"object" : ["archive"],
# These are HTML5 specific, as are *.accesskey and *.dropzone above.
"area" : ["rel"],
"icon" : ["sizes"],
"iframe" : ["sandbox"],
"output" : ["for"],
}
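As a rough illustration of the table above, the sketch below parses an a tag whose rel attribute is listed as multi-valued and whose id attribute is not. It also assumes a Beautiful Soup release that accepts the multi_valued_attributes constructor argument described in the library documentation; older releases may not have it.

```python
from bs4 import BeautifulSoup

html = '<a id="home link" rel="nofollow noopener" href="/">Home</a>'

soup = BeautifulSoup(html, "html.parser")
rel_value = soup.a["rel"]  # listed for "a", so split into a list
id_value = soup.a["id"]    # not listed, so kept as one string
print(rel_value)
print(id_value)

# Assumption: this keyword exists in the installed version.
# Passing None turns the list parsing off for every attribute.
plain_soup = BeautifulSoup(html, "html.parser", multi_valued_attributes=None)
plain_rel_value = plain_soup.a["rel"]
print(plain_rel_value)
```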
Answered by Tarun Kiran
Change your code from
result = soup.findAll(attrs={'class': re.compile(r"^product$")})
to
result = soup.find_all(attrs={'class': 'product'})
The result is a list, so access its elements by index.
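A minimal sketch of what indexing that result looks like, assuming the HTML from the question. Note that the plain string filter still returns every tag that has product among its classes, so the exact matches have to be picked out of the list afterwards; the list comprehension here is one possible way to do so.

```python
from bs4 import BeautifulSoup

html = """
<body>
<div class="product">Product 1</div>
<div class="product">Product 2</div>
<div class="product special">Product 3</div>
<div class="product special">Product 4</div>
</body>"""
soup = BeautifulSoup(html, "html.parser")

result = soup.find_all(attrs={"class": "product"})
print(len(result))

# Individual elements are reached by index, as with any list.
first = result[0]
print(first.get_text())

# Keep only the tags whose class list is exactly ['product'].
exact = [tag for tag in result if tag.get("class") == ["product"]]
exact_texts = [tag.get_text() for tag in exact]
print(exact_texts)
```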
Answered by DeafaltCoder
soup.findAll(attrs={'class': re.compile(r"^product$")})
This code matches anything that doesn't have product at the end of its class.
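The output shown in the question suggests that the pattern is tried against each class value separately, so ^product$ succeeds on the product value of class="product special" too. The sketch below compares that behaviour with a filter function that applies the same pattern to the whole attribute string; whole_class_matches is a hypothetical name used only for illustration.

```python
import re

from bs4 import BeautifulSoup

html = """
<body>
<div class="product">Product 1</div>
<div class="product">Product 2</div>
<div class="product special">Product 3</div>
<div class="product special">Product 4</div>
</body>"""
soup = BeautifulSoup(html, "html.parser")

pattern = re.compile(r"^product$")

# Passed as an attribute filter, the pattern is tested per class value.
per_value = soup.find_all(attrs={"class": pattern})
print(len(per_value))


def whole_class_matches(tag):
    """Apply the pattern to the space-joined class attribute."""
    classes = tag.get("class")
    return classes is not None and pattern.match(" ".join(classes)) is not None


whole_string = soup.find_all(whole_class_matches)
whole_string_texts = [tag.get_text() for tag in whole_string]
print(whole_string_texts)
```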

