Disclaimer: this page is translated from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/22739514/

Time: 2020-08-19 01:33:00  Source: igfitidea

How to get HTML with JavaScript-rendered source code using Selenium

Tags: javascript, python, selenium

Asked by MacSanhe

I run a query on a web page and get a result URL. If I right-click and view the HTML source, I can see the HTML generated by JS. If I simply use urllib, Python cannot get the JS-generated code. So I found a solution that uses Selenium. Here's my code:

from selenium import webdriver

url = 'http://www.archives.com/member/Default.aspx?_act=VitalSearchResult&lastName=Smith&state=UT&country=US&deathYear=2004&deathYearSpan=10&location=UT&activityID=9b79d578-b2a7-4665-9021-b104999cf031&RecordType=2'
# raw string so the backslashes in the Windows path are never read as escapes
driver = webdriver.PhantomJS(executable_path=r'C:\python27\scripts\phantomjs.exe')
driver.get(url)
print(driver.page_source)

>>> <html><head></head><body></body></html>

Obviously, this is not right!

Here's the source code I need, as shown when I right-click and view the source (I want the INFORMATION part):

</script></div><div class="searchColRight"><div id="topActions" class="clearfix 
noPrint"><div id="breadcrumbs" class="left"><a title="Results Summary"
href="Default.aspx?    _act=VitalSearchR ...... <<INFORMATION I NEED>> ... 
to view the entire record.</p></div><script xmlns:msxsl="urn:schemas-microsoft-com:xslt">

        jQuery(document).ready(function() {
            jQuery(".ancestry-information-tooltip").actooltip({
href: "#AncestryInformationTooltip", orientation: "bottomleft"});
        });

So my question is: how do I get the information generated by JS?

Accepted answer by Victory

You will need to get the document via JavaScript; you can use Selenium's execute_script function.

from time import sleep  # this should go at the top of the file

sleep(5)  # crude wait so the page's JavaScript has time to run
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
print(html)

That will get everything inside the <html> tag.
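
As a hedged variation on the snippet above: innerHTML leaves out the <html> element's own opening and closing tags, so if the whole document is wanted, document.documentElement.outerHTML can be returned instead. The helper name get_rendered_html and its default delay are illustrative assumptions, not part of the answer; the function only relies on the driver's execute_script method.

```python
from time import sleep


def get_rendered_html(driver, delay=5.0):
    """Return the current DOM serialized as HTML after a fixed delay.

    `driver` is assumed to be a Selenium WebDriver that has already
    navigated to the page (anything exposing execute_script works).
    `delay` is a crude, hypothetical default; tune it for the page.
    """
    sleep(delay)  # give the page's JavaScript time to build the DOM
    # documentElement is the <html> element; outerHTML includes the tag itself
    return driver.execute_script("return document.documentElement.outerHTML")
```

With the driver from the question, this would be called as html = get_rendered_html(driver) right after driver.get(url).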

Answer by Robbie Wareham

I think you are getting the source code before the JavaScript has rendered the dynamic HTML.

Initially, try putting a few seconds of sleep between navigating to the page and getting the page source.

If this works, then you can change to a different wait strategy.
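
One possible shape for such a wait strategy, sketched under assumptions: instead of sleeping for a fixed time, poll the page source until a caller-supplied check passes or a timeout expires. The function name wait_for_rendered_source, the is_ready callback, and the default timings are all hypothetical. Selenium also ships its own explicit-wait helper, WebDriverWait, which serves the same purpose.

```python
import time


def wait_for_rendered_source(driver, is_ready, timeout=10.0, poll_interval=0.5):
    """Poll driver.page_source until is_ready(html) is true.

    `driver` is assumed to be a Selenium WebDriver that has already
    called get(); `is_ready` is a caller-supplied predicate on the HTML.
    Raises TimeoutError if the predicate never passes within `timeout`.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if is_ready(html):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "page did not render within {:.1f} s".format(timeout)
            )
        time.sleep(poll_interval)  # back off before reading the DOM again
```

For the page in the question, the check might be lambda html: "searchColRight" in html, since that class appears in the markup the asker wants.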

Answer by Darius M.

It's not necessary to use that workaround; you can use this instead:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://www.google.com/')
html = driver.find_element_by_tag_name('html').get_attribute('innerHTML')
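
One caveat, offered as a sketch and not a definitive port: Selenium 4 dropped the find_element_by_* shorthands and the bundled PhantomJS driver, so on a current install the same lookup goes through find_element with a locator strategy. The helper name html_inner_html is hypothetical; the string "tag name" is the value of By.TAG_NAME from selenium.webdriver.common.by.

```python
def html_inner_html(driver):
    """Return the innerHTML of the <html> element.

    `driver` is assumed to be a Selenium 4 WebDriver that has already
    loaded a page, for example one created with webdriver.Chrome().
    """
    # "tag name" is the value of By.TAG_NAME; passing By.TAG_NAME is equivalent
    root = driver.find_element("tag name", "html")
    return root.get_attribute("innerHTML")
```

It would be called as html = html_inner_html(driver) after driver.get(...).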

Answer by Vida

I ran into the same problem and finally solved it with desired_capabilities.

from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy
from selenium.webdriver.common.proxy import ProxyType

proxy = Proxy(
     {
          'proxyType': ProxyType.MANUAL,
          'httpProxy': 'ip_or_host:port'
     }
)
desired_capabilities = webdriver.DesiredCapabilities.PHANTOMJS.copy()
proxy.add_to_capabilities(desired_capabilities)
driver = webdriver.PhantomJS(desired_capabilities=desired_capabilities)
driver.get('test_url')
print(driver.page_source)

Answer by Harry1992

Try Dryscrape. This browser fully supports heavy JavaScript code; I hope it works for you.

Answer by kuo chang

I had the same problem getting JavaScript-rendered source code from a web page, and I solved it using Victory's suggestion above.

*First: execute_script

from selenium import webdriver

driver = webdriver.Chrome()
driver.get(urls)  # urls holds the address of the page to load
innerHTML = driver.execute_script("return document.body.innerHTML")
# print(driver.page_source)

*Second: parse the HTML using BeautifulSoup (you can install it with pip: pip install beautifulsoup4)

import bs4    # BeautifulSoup
import re
from time import sleep

sleep(1)      # wait one second; to help, this must run before the execute_script call above
root = bs4.BeautifulSoup(innerHTML, "lxml")  # parse the HTML (the "lxml" parser needs the lxml package)
viewcount = root.find_all("span", attrs={'class': 'short-view-count style-scope yt-view-count-renderer'})  # find the elements you need

*Third: print out the value you need

for span in viewcount:
    print(span.string)

*Full code

import re
from time import sleep

import bs4
import lxml  # not used directly; BeautifulSoup's "lxml" parser needs it installed
from selenium import webdriver

urls = "http://www.archives.com/member/Default.aspx?_act=VitalSearchResult&lastName=Smith&state=UT&country=US&deathYear=2004&deathYearSpan=10&location=UT&activityID=9b79d578-b2a7-4665-9021-b104999cf031&RecordType=2"

driver = webdriver.PhantomJS()
## driver = webdriver.Chrome()
driver.get(urls)
sleep(1)  # give the page's JavaScript time to run before reading the DOM
innerHTML = driver.execute_script("return document.body.innerHTML")
## print(driver.page_source)

root = bs4.BeautifulSoup(innerHTML, "lxml")
# adjust the tag and class below to match the elements on your target page
viewcount = root.find_all("span", attrs={'class': 'short-view-count style-scope yt-view-count-renderer'})

for span in viewcount:
    print(span.string)

driver.quit()