How can I parse a website using Selenium and BeautifulSoup in Python?
Original question: http://stackoverflow.com/questions/13960326/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
How can I parse a website using Selenium and BeautifulSoup in Python?
Asked by twitch after coffee
I'm new to programming and have figured out how to navigate to where I need to go using Selenium. I'd like to parse the data now, but I'm not sure where to start. Can someone hold my hand for a sec and point me in the right direction?
Any help is appreciated.
Accepted answer by RocketDonkey
Assuming you are on the page you want to parse, Selenium stores the source HTML in the driver's page_source attribute. You would then load the page_source into BeautifulSoup as follows:
In [8]: from bs4 import BeautifulSoup
In [9]: from selenium import webdriver
In [10]: driver = webdriver.Firefox()
In [11]: driver.get('http://news.ycombinator.com')
In [12]: html = driver.page_source
In [13]: soup = BeautifulSoup(html, 'html.parser')
In [14]: for tag in soup.find_all('title'):
   ....:     print(tag.text)
   ....:
Hacker News
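The same pattern extends beyond the title tag once the HTML is in a soup. Here is a minimal sketch, assuming the HTML string would normally come from driver.page_source. A hard-coded sample stands in for it so the snippet runs without a browser, and the function name parse_rows, the sample markup, and the class names are all hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for driver.page_source
SAMPLE_HTML = """
<html><head><title>Example Page</title></head>
<body>
<table>
<tr class="row"><td class="title">First story</td><td class="score">10</td></tr>
<tr class="row"><td class="title">Second story</td><td class="score">25</td></tr>
</table>
</body></html>
"""


def parse_rows(html):
    """Return a (title, score) tuple for each table row with class 'row'."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr", class_="row"):
        # Look up each cell by its class and strip surrounding whitespace
        title = tr.find("td", class_="title").get_text(strip=True)
        score = int(tr.find("td", class_="score").get_text(strip=True))
        rows.append((title, score))
    return rows


if __name__ == "__main__":
    print(parse_rows(SAMPLE_HTML))
```

With a live browser you would pass driver.page_source instead of SAMPLE_HTML and adjust the tag and class names to match the real page.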
Answer by Vor
Are you sure you want to use Selenium? For this kind of task I used PyQt4; it's very powerful, and you can do whatever you want with it.
I can give you some sample code that I just wrote; just change the URL and you're good to go:
#!/usr/bin/env python2.7
# Written for Python 2.7 with PyQt4 (note the use of unicode()).
import signal
import sys

from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *
from bs4 import BeautifulSoup


class Browser(QWebView):
    def __init__(self):
        QWebView.__init__(self)
        # Report progress while loading, and parse once the page is done.
        self.loadProgress.connect(self._progress)
        self.loadFinished.connect(self._loadFinished)
        self.frame = self.page().currentFrame()

    def _progress(self, progress):
        print(str(progress) + "%")

    def _loadFinished(self):
        print("Load Finished")
        # Grab the HTML as rendered by WebKit and hand it to BeautifulSoup.
        html = unicode(self.frame.toHtml()).encode('utf-8')
        soup = BeautifulSoup(html, 'html.parser')
        print(soup.prettify())
        self.close()


if __name__ == "__main__":
    app = QApplication(sys.argv)
    br = Browser()
    url = QUrl('http://web site that can contain javascript.com')
    br.load(url)
    br.show()
    # Restore the default SIGINT handler so Ctrl+C stops the Qt event loop.
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    sys.exit(app.exec_())
Answer by root
As your question isn't particularly concrete, here's a simple example. To do something more useful, read the BS docs. You will also find plenty of examples of Selenium (and BS) usage here on SO.
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://webpage.com')
soup = BeautifulSoup(browser.page_source, 'html.parser')
# do something useful
# prints all the links with corresponding text
for link in soup.find_all('a'):
    print(link.get('href', None), link.get_text())
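Links scraped this way are often relative, so it can help to resolve them against the page URL. A rough sketch of one way to do that, assuming Python 3 and using a hard-coded HTML snippet in place of browser.page_source; extract_links, the sample markup, and the URLs are made up for illustration:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for browser.page_source
SAMPLE_HTML = (
    '<a href="/about">About</a> '
    '<a>No href</a> '
    '<a href="http://other.example/x">Other</a>'
)


def extract_links(html, base_url):
    """Return (absolute_url, link_text) pairs for every <a> that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that carry no href attribute at all
    for a in soup.find_all("a", href=True):
        absolute = urljoin(base_url, a["href"])
        links.append((absolute, a.get_text(strip=True)))
    return links


if __name__ == "__main__":
    for url, text in extract_links(SAMPLE_HTML, "http://webpage.com/index.html"):
        print(url, text)
```

In a real session you could call extract_links(browser.page_source, browser.current_url) so that relative links resolve against the page Selenium is actually on.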

