Using Beautiful Soup in Python to find a specific class
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must keep the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/41687476/
Using Beautiful Soup to find specific class
Asked by SFBA26
I am trying to use Beautiful Soup to scrape housing price data from Zillow.
I get the web page by property id, e.g. http://www.zillow.com/homes/for_sale/18429834_zpid/
When I try the find_all() function, I do not get any results:
results = soup.find_all('div', attrs={"class":"home-summary-row"})
However, if I take the HTML and cut it down to just the bits I want, e.g.:
<html>
<body>
<div class=" status-icon-row for-sale-row home-summary-row">
</div>
<div class=" home-summary-row">
<span class=""> ,342,144 </span>
</div>
</body>
</html>
I get 2 results, both <div>s with the class home-summary-row. So, my question is, why do I not get any results when searching the full page?
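That two-match result is easy to reproduce, because class is a multi-valued attribute in Beautiful Soup: filtering on a single class name also matches elements that carry additional classes. A minimal sketch using the cut-down snippet above:

```python
from bs4 import BeautifulSoup

# The cut-down snippet from the question: one div with three classes,
# one div with just home-summary-row.
snippet = '''
<div class=" status-icon-row for-sale-row home-summary-row"></div>
<div class=" home-summary-row"><span class=""> ,342,144 </span></div>
'''

# class is multi-valued, so filtering on one class name matches both
# divs, including the one that also has status-icon-row and for-sale-row.
results = BeautifulSoup(snippet, "html.parser").find_all(
    "div", attrs={"class": "home-summary-row"})
print(len(results))  # -> 2
```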
Working example:
from bs4 import BeautifulSoup
import requests
zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
response = requests.get(url)
html = response.content
#html = '<html><body><div class=" status-icon-row for-sale-row home-summary-row"></div><div class=" home-summary-row"><span class=""> ,342,144 </span></div></body></html>'
soup = BeautifulSoup(html, "html5lib")
results = soup.find_all('div', attrs={"class":"home-summary-row"})
print(results)
Accepted answer by Soviut
According to the W3.org Validator, there are a number of issues with the HTML such as stray closing tags and tags split across multiple lines. For example:
<a
href="http://www.zillow.com/danville-ca-94526/sold/" title="Recent home sales" class="" data-za-action="Recent Home Sales" >
This kind of markup can make it much more difficult for BeautifulSoup to parse the HTML.
You may want to try running something to clean up the HTML, such as removing the line breaks and trailing spaces from the end of each line. BeautifulSoup can also clean up the HTML tree for you:
from bs4 import BeautifulSoup

# bad_html holds the raw page source
tree = BeautifulSoup(bad_html, "html.parser")
good_html = tree.prettify()
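A minimal sketch of that kind of pre-cleaning, assuming the problem is tags broken across lines (the raw string below is a hypothetical fragment modeled on the <a> tag above, not Zillow's actual markup):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a tag split across two lines, modeled on
# the validator example above.
raw = '''<a
href="http://www.zillow.com/danville-ca-94526/sold/" title="Recent home sales" class="">Recent home sales</a>'''

# Join the lines back into one, stripping trailing/leading whitespace,
# before handing the markup to the parser.
cleaned = " ".join(line.strip() for line in raw.splitlines())
soup = BeautifulSoup(cleaned, "html.parser")
print(soup.a["title"])  # -> Recent home sales
```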
Answered by alecxe
Your HTML is not well-formed, and in cases like this choosing the right parser is crucial. In BeautifulSoup, there are currently 3 available HTML parsers, which work and handle broken HTML differently:

- html.parser (built-in, no additional modules needed)
- lxml (the fastest, requires lxml to be installed)
- html5lib (the most lenient, requires html5lib to be installed)
The "Differences between parsers" documentation page describes the differences in more detail. In your case, to demonstrate the difference:
>>> from bs4 import BeautifulSoup
>>> import requests
>>>
>>> zpid = "18429834"
>>> url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
>>> response = requests.get(url)
>>> html = response.content
>>>
>>> len(BeautifulSoup(html, "html5lib").find_all('div', attrs={"class":"home-summary-row"}))
0
>>> len(BeautifulSoup(html, "html.parser").find_all('div', attrs={"class":"home-summary-row"}))
3
>>> len(BeautifulSoup(html, "lxml").find_all('div', attrs={"class":"home-summary-row"}))
3
As you can see, in your case, both html.parser and lxml do the job, but html5lib does not.
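The same divergence shows up on a tiny fragment. The <a></p> example below is the classic one from the documentation page (wrapped in try/except so that optional parsers that are not installed are skipped rather than crashing):

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A dangling </p> with no matching <p> -- the classic fragment from
# the "Differences between parsers" documentation page.
fragment = "<a></p>"

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        # html.parser ignores the stray end tag, lxml additionally wraps
        # the result in <html><body>, and html5lib invents an empty <p>.
        print(parser, "->", BeautifulSoup(fragment, parser))
    except FeatureNotFound:
        print(parser, "-> not installed, skipped")
```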
Answered by RobBenz
import requests
from bs4 import BeautifulSoup
zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
g_data = soup.find_all("div", {"class": "home-summary-row"})
print(g_data[1].text)
#for item in g_data:
#    print(item("span")[0].text)
#    print('\n')
I got this working too -- but it looks like someone beat me to it. Going to post anyways.