Beautiful Soup cannot find a CSS class if the object has other classes, too
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/1242755/
Asked by endolith
If a page has <div class="class1"> and <p class="class1">, then soup.findAll(True, 'class1') will find them both.
If it has <p class="class1 class2">, though, it will not be found. How do I find all objects with a certain class, regardless of whether they have other classes, too?
Accepted answer by Kugel
In case anybody else comes across this question: BeautifulSoup now supports this.
Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.
In [1]: import bs4
In [2]: soup = bs4.BeautifulSoup('<div class="foo bar"></div>')
In [3]: soup(attrs={'class': 'bar'})
Out[3]: [<div class="foo bar"></div>]
Also, you don't have to type findAll anymore: calling the soup object directly, as in soup(...) above, runs the same search.
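For readers on a current Beautiful Soup 4 release, the session above can be sketched roughly as follows. The sample markup and the choice of the built-in "html.parser" are assumptions made for illustration; they are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: two elements carry the class "bar",
# one carries the unrelated class "foobar".
html = '<div class="foo bar"></div><p class="bar"></p><span class="foobar"></span>'

# Naming the parser explicitly avoids the "no parser specified" warning.
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is shorthand for soup.find_all(...).
matches = soup(attrs={"class": "bar"})
names = [tag.name for tag in matches]
print(names)
```

The string "bar" is compared against each individual class token, so the span with class "foobar" is not returned.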
Answer by endolith
Unfortunately, BeautifulSoup treats this as a single class with a space in it, 'class1 class2', rather than two classes, ['class1', 'class2']. A workaround is to use a regular expression to search for the class instead of a string.
This works:
import re
soup.findAll(True, {'class': re.compile(r'\bclass1\b')})
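To see what the word-boundary pattern accepts, here is a small standard-library sketch that runs it against a few class attribute strings. The sample values and the has_class helper are hypothetical, added only for illustration. Note that \b treats a hyphen as a boundary, so a class such as "class1-wide" also matches; splitting on whitespace is a stricter check.

```python
import re

pattern = re.compile(r'\bclass1\b')

# Hypothetical class attribute values to test against the pattern.
class_values = ["class1", "class1 class2", "class2 class1", "class10", "class1-wide"]
results = {value: bool(pattern.search(value)) for value in class_values}
for value, matched in results.items():
    print(value, matched)


def has_class(value, name):
    """Stricter check: compare against whitespace-separated class tokens."""
    return name in value.split()


print(has_class("class1-wide", "class1"))
```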
Answer by aehlke
You should use lxml. It works with multiple class values separated by spaces ('class1 class2').
Despite its name, lxml is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Bicking agrees and prefers lxml over BeautifulSoup.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
You can even use CSS selectors with lxml, so it's far easier to use than BeautifulSoup. Try playing around with it in an interactive Python console.
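As a rough sketch of class matching in lxml, the example below uses the usual XPath idiom for testing a whitespace-separated class token. The sample markup is hypothetical, and the example assumes the lxml package is installed. If the separate cssselect package is also available, doc.cssselect('.class1') expresses the same query as a CSS selector.

```python
import lxml.html

# Hypothetical markup: "class12" must not be mistaken for "class1".
doc = lxml.html.fromstring(
    '<div>'
    '<div class="class1">a</div>'
    '<p class="class1 class2">b</p>'
    '<p class="class12">c</p>'
    '</div>'
)

# Pad the class attribute with spaces so only whole tokens match.
xpath = "//*[contains(concat(' ', normalize-space(@class), ' '), ' class1 ')]"
tags = [el.tag for el in doc.xpath(xpath)]
print(tags)
```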
Answer by alan_wang
It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:
For example:
soup.find_all("a", class_="class1")
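A slightly fuller sketch of the class_ keyword follows, assuming Beautiful Soup 4 with the built-in "html.parser" and some made-up markup. It also shows that the class attribute is exposed as a list of tokens, and that a CSS selector can require several classes at once.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = (
    '<a class="class1">one</a>'
    '<a class="class1 class2">two</a>'
    '<a class="class2">three</a>'
    '<p class="class1">four</p>'
)
soup = BeautifulSoup(html, "html.parser")

# class_ matches any <a> whose class list contains "class1".
by_keyword = [tag.get_text() for tag in soup.find_all("a", class_="class1")]

# The class attribute is multi-valued, so it comes back as a list.
first_class2 = soup.find("a", class_="class2")["class"]

# CSS selector: <a> elements carrying both classes.
both = [tag.get_text() for tag in soup.select("a.class1.class2")]

print(by_keyword, first_class2, both)
```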