How to parse a website using Selenium and BeautifulSoup in Python?

Question

Asked by twitch after coffee

New to programming and figured out how to navigate to where I need to go using Selenium. I'd like to parse the data now but not sure where to start. Can someone hold my hand a sec and point me in the right direction?

Any help appreciated -

Answer 1

Accepted answer by RocketDonkey

Assuming you are on the page you want to parse, Selenium stores the source HTML in the driver's page_source attribute. You would then load the page_source into BeautifulSoup as follows:

from bs4 import BeautifulSoup
from selenium import webdriver

# Start a browser session and load the page
driver = webdriver.Firefox()
driver.get('http://news.ycombinator.com')

# page_source holds the HTML of the page the driver is currently on
html = driver.page_source
driver.quit()

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')

for tag in soup.find_all('title'):
    print(tag.text)  # prints: Hacker News
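
Once the HTML is in a string, the parsing step no longer depends on the browser, so it can be tried out on a small snippet first. Here is a minimal sketch of that idea. The SAMPLE_HTML fragment, its "headline" class, and the extract_headlines helper are all made up for illustration and are not from the answer above; in practice you would pass driver.page_source instead of the sample string.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source (made-up markup)
SAMPLE_HTML = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1 class="headline">First story</h1>
    <h1 class="headline">Second story</h1>
    <h1>Unrelated heading</h1>
  </body>
</html>
"""


def extract_headlines(html):
    """Return the text of every <h1 class="headline"> in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ (with the underscore) filters on the CSS class attribute
    return [tag.get_text(strip=True)
            for tag in soup.find_all("h1", class_="headline")]


if __name__ == "__main__":
    print(extract_headlines(SAMPLE_HTML))
```

Keeping the parsing in a function that takes a plain string means you can get the selectors right without launching Firefox on every run.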

Answer 2

Answer by Vor

Are you sure you want to use Selenium? For this kind of job I used PyQt4; it's very powerful, and you can do whatever you want with it.

Here is some sample code that I just wrote; just change the URL and you're good to go:

#!/usr/bin/env python2.7
# Written for Python 2.7 with PyQt4 (note the use of unicode() below).

import signal
import sys

from bs4 import BeautifulSoup
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebView


class Browser(QWebView):
    def __init__(self):
        QWebView.__init__(self)
        # Report progress while loading, and parse once loading has finished
        self.loadProgress.connect(self._progress)
        self.loadFinished.connect(self._loadFinished)
        self.frame = self.page().currentFrame()

    def _progress(self, progress):
        print(str(progress) + "%")

    def _loadFinished(self, ok):
        print("Load Finished")
        # The frame's current HTML, including changes made by scripts during loading
        html = unicode(self.frame.toHtml()).encode('utf-8')
        soup = BeautifulSoup(html, 'html.parser')
        print(soup.prettify())
        self.close()


if __name__ == "__main__":
    # Restore the default SIGINT handler so Ctrl+C can stop the Qt event loop
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    app = QApplication(sys.argv)
    br = Browser()
    url = QUrl('http://web site that can contain javascript.com')
    br.load(url)
    br.show()
    sys.exit(app.exec_())

Answer 3

Answer by root

As your question isn't particularly concrete, here's a simple example. To do something more useful, read the BS docs. You will also find plenty of examples of Selenium (and BS) usage here on SO.

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://webpage.com')

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.quit()

# Do something useful: print every link with its corresponding text
for link in soup.find_all('a'):
    print(link.get('href'), link.get_text())
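
Links on real pages are often relative, and some anchors have no href at all. As a rough sketch of one way to handle both, the snippet below resolves each href against a base URL and skips anchors without one. The SAMPLE_HTML string and the extract_links helper are invented for illustration; with Selenium you could pass browser.page_source and browser.current_url instead.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for browser.page_source (made-up markup)
SAMPLE_HTML = """
<a href="/about">About</a>
<a href="http://example.com/docs">Docs</a>
<a name="top">No href here</a>
"""


def extract_links(html, base_url):
    """Return (absolute_url, link_text) pairs for every <a> that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True keeps only anchors that actually carry an href attribute
    for a in soup.find_all("a", href=True):
        links.append((urljoin(base_url, a["href"]), a.get_text(strip=True)))
    return links


if __name__ == "__main__":
    for url, text in extract_links(SAMPLE_HTML, "http://webpage.com/"):
        print(url, text)
```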
