Warning: the content below is taken from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/41687476/


Using Beautiful Soup to find specific class

python, html, web-scraping, beautifulsoup

Asked by SFBA26

I am trying to use Beautiful Soup to scrape housing price data from Zillow.

I get the web page by property id, e.g. http://www.zillow.com/homes/for_sale/18429834_zpid/

When I try the find_all() function, I do not get any results:

results = soup.find_all('div', attrs={"class":"home-summary-row"})

However, if I take the HTML and cut it down to just the bits I want, e.g.:

<html>
    <body>
        <div class=" status-icon-row for-sale-row home-summary-row">
        </div>
        <div class=" home-summary-row">
            <span class=""> ,342,144 </span>
        </div>
    </body>
</html>

I get 2 results, both <div>s with the class home-summary-row. So, my question is, why do I not get any results when searching the full page?
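
For reference, BeautifulSoup treats class as a multi-valued attribute, so a single class name matches elements that carry several classes; both lookups below find the two <div>s in the cut-down HTML (a minimal, self-contained sketch):

from bs4 import BeautifulSoup

snippet = """<html><body>
<div class=" status-icon-row for-sale-row home-summary-row"></div>
<div class=" home-summary-row"><span class=""> ,342,144 </span></div>
</body></html>"""

soup = BeautifulSoup(snippet, "html.parser")
# "class" is multi-valued, so matching one class name is enough
print(len(soup.find_all("div", attrs={"class": "home-summary-row"})))  # 2
print(len(soup.select("div.home-summary-row")))  # 2, equivalent CSS selector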



Working example:

from bs4 import BeautifulSoup
import requests

zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
response = requests.get(url)
html = response.content
#html = '<html><body><div class=" status-icon-row for-sale-row home-summary-row"></div><div class=" home-summary-row"><span class=""> ,342,144 </span></div></body></html>'
soup = BeautifulSoup(html, "html5lib")

results = soup.find_all('div', attrs={"class":"home-summary-row"})
print(results)
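
A quick sanity check worth adding when a page unexpectedly yields nothing: confirm the request itself succeeded before suspecting the parser (a small sketch using the standard requests API):

response = requests.get(url)
response.raise_for_status()  # raises if the server returned an HTTP error
print(response.status_code, len(response.content))  # check what actually came back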

Accepted answer by Soviut

According to the W3.org Validator, there are a number of issues with the HTML such as stray closing tags and tags split across multiple lines. For example:

<a 
href="http://www.zillow.com/danville-ca-94526/sold/"  title="Recent home sales" class=""  data-za-action="Recent Home Sales"  >

This kind of markup can make it much more difficult for BeautifulSoup to parse the HTML.

You may want to try running something to clean up the HTML, such as removing the line breaks and trailing spaces from the end of each line. BeautifulSoup can also clean up the HTML tree for you:

from bs4 import BeautifulSoup
tree = BeautifulSoup(bad_html, "html.parser")  # bad_html: the raw page markup
good_html = tree.prettify()
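
For the first suggestion, a pre-parse cleanup can be as simple as stripping each line and rejoining, which collapses tags that were split across multiple lines (a sketch; bad_html is assumed to hold the raw markup):

cleaned = " ".join(line.strip() for line in bad_html.splitlines())
tree = BeautifulSoup(cleaned, "html.parser")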

Answered by alecxe

Your HTML is not well-formed, and in cases like this, choosing the right parser is crucial. In BeautifulSoup, there are currently 3 available HTML parsers, each of which handles broken HTML differently:

  • html.parser (built-in, no additional modules needed)
  • lxml (the fastest, requires lxml to be installed)
  • html5lib (the most lenient, requires html5lib to be installed)

The Differences between parsers documentation page describes the differences in more detail. In your case, to demonstrate the difference:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> 
>>> zpid = "18429834"
>>> url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
>>> response = requests.get(url)
>>> html = response.content
>>> 
>>> len(BeautifulSoup(html, "html5lib").find_all('div', attrs={"class":"home-summary-row"}))
0
>>> len(BeautifulSoup(html, "html.parser").find_all('div', attrs={"class":"home-summary-row"}))
3
>>> len(BeautifulSoup(html, "lxml").find_all('div', attrs={"class":"home-summary-row"}))
3

As you can see, in your case, both html.parser and lxml do the job, but html5lib does not.

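If you would rather not hard-code a parser, bs4 raises FeatureNotFound when the requested one is not installed, so a small fallback wrapper is possible (a minimal sketch):

from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup):
    # prefer lxml for speed and leniency; fall back to the stdlib parser
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(markup, "html.parser")

results = make_soup(html).find_all("div", attrs={"class": "home-summary-row"})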

Answered by RobBenz

import requests
from bs4 import BeautifulSoup

zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"

r = requests.get(url)

soup = BeautifulSoup(r.content, "lxml")

g_data = soup.find_all("div", {"class": "home-summary-row"})

print(g_data[1].text)

#for item in g_data:
#    print(item("span")[0].text)
#    print('\n')

I got this working too -- but it looks like someone beat me to it.

Going to post anyway.
