Notice: this page is a side-by-side Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/13960326/

Time: 2020-08-18 10:01:09  Source: igfitidea

How can I parse a website using Selenium and BeautifulSoup in Python?

python selenium beautifulsoup

Asked by twitch after coffee

I'm new to programming and have figured out how to navigate to where I need to go using Selenium. I'd like to parse the data now, but I'm not sure where to start. Can someone hold my hand for a sec and point me in the right direction?

Any help is appreciated.

Accepted answer by RocketDonkey

Assuming you are on the page you want to parse, Selenium stores the source HTML in the driver's page_source attribute. You would then load the page_source into BeautifulSoup as follows:

In [8]: from bs4 import BeautifulSoup

In [9]: from selenium import webdriver

In [10]: driver = webdriver.Firefox()

In [11]: driver.get('http://news.ycombinator.com')

In [12]: html = driver.page_source

In [13]: soup = BeautifulSoup(html, 'html.parser')

In [14]: for tag in soup.find_all('title'):
   ....:     print(tag.text)
   ....:
Hacker News
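
The same steps can be wrapped in a small helper. This is only a sketch under assumptions: `soup_from_driver` and `page_title` are hypothetical names (not part of Selenium or BeautifulSoup), and `html.parser` is just one of the parsers BeautifulSoup accepts. The helper relies only on the driver exposing `page_source`.

```python
from bs4 import BeautifulSoup


def soup_from_driver(driver, parser="html.parser"):
    """Build a BeautifulSoup tree from the page the driver currently shows.

    `driver` is assumed to be a Selenium WebDriver; any object with a
    `page_source` attribute works. Hypothetical helper, not a library API.
    """
    return BeautifulSoup(driver.page_source, parser)


def page_title(driver):
    """Return the text of the first <title> tag, or None if there is none."""
    soup = soup_from_driver(driver)
    title = soup.find("title")
    return title.get_text(strip=True) if title is not None else None
```

With a real browser this would be called after `driver.get(...)`, for example `page_title(driver)`, followed by `driver.quit()` once the page is no longer needed.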

Answer by Vor

Are you sure you want to use Selenium? For this purpose I used PyQt4; it's very powerful, and you can do whatever you want with it.

Here is some sample code that I just wrote; just change the URL and you're good to go:

#! /usr/bin/env python2.7

from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *
from bs4 import BeautifulSoup
import sys, signal

class Browser(QWebView):
    def __init__(self):
        QWebView.__init__(self)
        self.loadProgress.connect(self._progress)
        self.loadFinished.connect(self._loadFinished)
        self.frame = self.page().currentFrame()

    def _progress(self, progress):
        print str(progress) + "%"

    def _loadFinished(self):
        print "Load Finished"
        # Grab the rendered HTML (after JavaScript has run) and parse it.
        html = unicode(self.frame.toHtml()).encode('utf-8')
        soup = BeautifulSoup(html, 'html.parser')
        print soup.prettify()
        self.close()

if __name__ == "__main__":
    # Restore the default SIGINT handler so Ctrl+C can stop the Qt event loop.
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    app = QApplication(sys.argv)
    br = Browser()
    url = QUrl('http://web site that can contain javascript.com')
    br.load(url)
    br.show()
    sys.exit(app.exec_())

Answer by root

As your question isn't particularly concrete, here's a simple example. To do something more useful, read the BS docs. You will also find plenty of examples of Selenium (and BS) usage here on SO.

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://webpage.com')

soup = BeautifulSoup(browser.page_source, 'html.parser')

# do something useful
# prints all the links with corresponding text

for link in soup.find_all('a'):
    print(link.get('href'), link.get_text())
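
Links collected this way are often relative, so a common next step is to resolve them against the page URL. A minimal sketch, assuming a hypothetical helper named `extract_links` and using `urljoin` from the standard library; it takes the HTML string (for example `browser.page_source`) and the page URL (for example `browser.current_url`):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return (absolute_url, link_text) pairs for every <a> that has an href.

    Hypothetical helper for illustration; not part of Selenium or BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that have no href attribute at all.
    for a in soup.find_all("a", href=True):
        absolute = urljoin(base_url, a["href"])  # resolve relative links
        links.append((absolute, a.get_text(strip=True)))
    return links
```

Returning a list of tuples instead of printing makes the result easy to filter or save later.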