Warning: the content below is taken from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/41687476/


Using Beautiful Soup to find specific class

python, html, web-scraping, beautifulsoup

Asked by SFBA26

I am trying to use Beautiful Soup to scrape housing price data from Zillow.

I get the web page by property id, e.g. http://www.zillow.com/homes/for_sale/18429834_zpid/

When I try the find_all() function, I do not get any results:

results = soup.find_all('div', attrs={"class":"home-summary-row"})

However, if I take the HTML and cut it down to just the bits I want, e.g.:

<html>
    <body>
        <div class=" status-icon-row for-sale-row home-summary-row">
        </div>
        <div class=" home-summary-row">
            <span class=""> ,342,144 </span>
        </div>
    </body>
</html>

I get 2 results, both <div>s with the class home-summary-row. So, my question is, why do I not get any results when searching the full page?
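
For reference, BeautifulSoup treats class as a multi-valued attribute, so a single class name matches elements that carry several classes; both lookups below find the two <div>s in the cut-down HTML (a minimal, self-contained sketch):

from bs4 import BeautifulSoup

snippet = """<html><body>
<div class=" status-icon-row for-sale-row home-summary-row"></div>
<div class=" home-summary-row"><span class=""> ,342,144 </span></div>
</body></html>"""

soup = BeautifulSoup(snippet, "html.parser")
# "class" is multi-valued, so matching one class name is enough
print(len(soup.find_all("div", attrs={"class": "home-summary-row"})))  # 2
print(len(soup.select("div.home-summary-row")))  # 2, equivalent CSS selector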



Working example:

from bs4 import BeautifulSoup
import requests

zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
response = requests.get(url)
html = response.content
#html = '<html><body><div class=" status-icon-row for-sale-row home-summary-row"></div><div class=" home-summary-row"><span class=""> ,342,144 </span></div></body></html>'
soup = BeautifulSoup(html, "html5lib")

results = soup.find_all('div', attrs={"class":"home-summary-row"})
print(results)
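
A quick sanity check worth adding when a page unexpectedly yields nothing: confirm the request itself succeeded before suspecting the parser (a small sketch using the standard requests API):

response = requests.get(url)
response.raise_for_status()  # raises if the server returned an HTTP error
print(response.status_code, len(response.content))  # check what actually came back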

Accepted answer by Soviut

According to the W3.org Validator, there are a number of issues with the HTML such as stray closing tags and tags split across multiple lines. For example:

<a 
href="http://www.zillow.com/danville-ca-94526/sold/"  title="Recent home sales" class=""  data-za-action="Recent Home Sales"  >

This kind of markup can make it much more difficult for BeautifulSoup to parse the HTML.

You may want to try running something to clean up the HTML, such as removing the line breaks and trailing spaces from the end of each line. BeautifulSoup can also clean up the HTML tree for you:

from bs4 import BeautifulSoup
tree = BeautifulSoup(bad_html, "html.parser")  # bad_html: the raw page markup
good_html = tree.prettify()
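
For the first suggestion, a pre-parse cleanup can be as simple as stripping each line and rejoining, which collapses tags that were split across multiple lines (a sketch; bad_html is assumed to hold the raw markup):

cleaned = " ".join(line.strip() for line in bad_html.splitlines())
tree = BeautifulSoup(cleaned, "html.parser")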

Answered by alecxe

Your HTML is not well-formed, and in cases like this, choosing the right parser is crucial. In BeautifulSoup, there are currently 3 available HTML parsers, each of which handles broken HTML differently:

  • html.parser (built-in, no additional modules needed)
  • lxml (the fastest, requires lxml to be installed)
  • html5lib (the most lenient, requires html5lib to be installed)

The Differences between parsers documentation page describes the differences in more detail. In your case, to demonstrate the difference:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> 
>>> zpid = "18429834"
>>> url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
>>> response = requests.get(url)
>>> html = response.content
>>> 
>>> len(BeautifulSoup(html, "html5lib").find_all('div', attrs={"class":"home-summary-row"}))
0
>>> len(BeautifulSoup(html, "html.parser").find_all('div', attrs={"class":"home-summary-row"}))
3
>>> len(BeautifulSoup(html, "lxml").find_all('div', attrs={"class":"home-summary-row"}))
3

As you can see, in your case, both html.parser and lxml do the job, but html5lib does not.

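If you would rather not hard-code a parser, bs4 raises FeatureNotFound when the requested one is not installed, so a small fallback wrapper is possible (a minimal sketch):

from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup):
    # prefer lxml for speed and leniency; fall back to the stdlib parser
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(markup, "html.parser")

results = make_soup(html).find_all("div", attrs={"class": "home-summary-row"})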

Answered by RobBenz

import requests
from bs4 import BeautifulSoup

zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"

r = requests.get(url)

soup = BeautifulSoup(r.content, "lxml")

g_data = soup.find_all("div", {"class": "home-summary-row"})

print(g_data[1].text)

#for item in g_data:
#    print(item("span")[0].text)
#    print('\n')

I got this working too -- but it looks like someone beat me to it.

Going to post anyway.
