Python: how to use Selenium to get the HTML source after JavaScript has rendered it

Question

Asked by MacSanhe

I run a query on a web page and get a result URL. If I right-click and view the HTML source, I can see the HTML generated by JS. If I simply use urllib, Python cannot get the JS-generated code. I have seen some solutions that use Selenium. Here is my code:

from selenium import webdriver
url = 'http://www.archives.com/member/Default.aspx?_act=VitalSearchResult&lastName=Smith&state=UT&country=US&deathYear=2004&deathYearSpan=10&location=UT&activityID=9b79d578-b2a7-4665-9021-b104999cf031&RecordType=2'
driver = webdriver.PhantomJS(executable_path=r'C:\python27\scripts\phantomjs.exe')
driver.get(url)
print(driver.page_source)

>>> <html><head></head><body></body></html>

Obviously, that is not right!

Here is the source code I need, as shown in the right-click "view source" window (I want the INFORMATION part):

</script></div><div class="searchColRight"><div id="topActions" class="clearfix 
noPrint"><div id="breadcrumbs" class="left"><a title="Results Summary"
href="Default.aspx?    _act=VitalSearchR ...... <<INFORMATION I NEED>> ... 
to view the entire record.</p></div><script xmlns:msxsl="urn:schemas-microsoft-com:xslt">

        jQuery(document).ready(function() {
            jQuery(".ancestry-information-tooltip").actooltip({
href: "#AncestryInformationTooltip", orientation: "bottomleft"});
        });

So my question is: how do I get the information generated by JS?

Answer 1

Accepted answer by Victory

You will need to get the document via JavaScript. You can use Selenium's execute_script function:

from time import sleep # this should go at the top of the file

sleep(5)
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
print(html)

That will get everything inside the <html> tag.
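Note that innerHTML returns only the children of the <html> element, not the tag itself. If the outer tag is wanted too, a small variation is possible. The following is a minimal sketch, not part of the original answer: get_rendered_html is a hypothetical helper name, and it assumes only that the driver object exposes Selenium's execute_script method.

```python
def get_rendered_html(driver):
    """Return the live DOM serialized as HTML, including the <html> tag itself."""
    # document.documentElement is the root <html> element. outerHTML includes
    # the element's own tag, whereas innerHTML returns only its children.
    return driver.execute_script("return document.documentElement.outerHTML")


# Usage (assumes `driver` is a Selenium WebDriver that has already loaded a page):
# html = get_rendered_html(driver)
```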

Answer 2

Answered by Robbie Wareham

I think you are getting the source code before the JavaScript has rendered the dynamic HTML.

As a first step, try adding a sleep of a few seconds between navigating to the page and getting the page source.

If this works, you can then switch to a different wait strategy.
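One possible wait strategy is to poll until something expected from the rendered page shows up, rather than sleeping for a fixed time. The sketch below (Python 3) is an assumption about what that could look like, not part of the original answer: wait_for_page_source is a hypothetical helper, and the "searchColRight" marker in the usage comment is taken from the HTML shown in the question. Selenium also ships its own WebDriverWait class (in selenium.webdriver.support.ui) built around the same polling idea.

```python
import time


def wait_for_page_source(driver, marker, timeout=10.0, poll_interval=0.5):
    """Poll driver.page_source until `marker` appears, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        source = driver.page_source
        if marker in source:
            return source  # the rendered content has arrived
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "%r not found in page source after %s s" % (marker, timeout)
            )
        time.sleep(poll_interval)  # give the page's JavaScript more time


# Usage (assumes `driver` is a Selenium WebDriver and `url` is the result URL):
# driver.get(url)
# html = wait_for_page_source(driver, "searchColRight")
```

Compared with a fixed sleep, this returns as soon as the content is present and fails loudly if it never appears.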

Answer 3

Answered by Darius M.

It is not necessary to use that workaround. You can use this instead:

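from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://www.google.com/')
# Newer Selenium releases spell this driver.find_element(By.TAG_NAME, 'html')
html = driver.find_element_by_tag_name('html').get_attribute('innerHTML')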

Answer 4

Answered by Vida

I ran into the same problem and finally solved it with desired_capabilities.

from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy
from selenium.webdriver.common.proxy import ProxyType

proxy = Proxy(
     {
          'proxyType': ProxyType.MANUAL,
          'httpProxy': 'ip_or_host:port'
     }
)
desired_capabilities = webdriver.DesiredCapabilities.PHANTOMJS.copy()
proxy.add_to_capabilities(desired_capabilities)
driver = webdriver.PhantomJS(desired_capabilities=desired_capabilities)
driver.get('test_url')
print(driver.page_source)

Answer 5

Answered by Harry1992

Try Dryscrape. This browser fully supports heavy JS code. I hope it works for you.

Answer 6

Answered by kuo chang

I had the same problem getting JavaScript-generated source code from the Internet, and I solved it using Victory's suggestion above.

*First: execute_script

driver = webdriver.Chrome()
driver.get(urls)
innerHTML = driver.execute_script("return document.body.innerHTML")
# print(driver.page_source)

*Second: parse the HTML using BeautifulSoup (you can install BeautifulSoup with pip)

import bs4    # BeautifulSoup
import re
from time import sleep

sleep(1)      # wait one second
root = bs4.BeautifulSoup(innerHTML, "lxml")   # parse the HTML using BeautifulSoup
# find the values you need
viewcount = root.find_all("span", attrs={'class': 'short-view-count style-scope yt-view-count-renderer'})

*Third: print out the value you need

for span in viewcount:
    print(span.string)

*Full code

from time import sleep

import bs4
from selenium import webdriver

urls = "http://www.archives.com/member/Default.aspx?_act=VitalSearchResult&lastName=Smith&state=UT&country=US&deathYear=2004&deathYearSpan=10&location=UT&activityID=9b79d578-b2a7-4665-9021-b104999cf031&RecordType=2"

driver = webdriver.PhantomJS()
## driver = webdriver.Chrome()
driver.get(urls)
sleep(1)  # give the page's JavaScript a moment to run before reading the DOM
innerHTML = driver.execute_script("return document.body.innerHTML")
## print(driver.page_source)

# Parse the rendered HTML; the "lxml" parser requires the lxml package to be installed.
root = bs4.BeautifulSoup(innerHTML, "lxml")
viewcount = root.find_all("span", attrs={'class': 'short-view-count style-scope yt-view-count-renderer'})

for span in viewcount:
    print(span.string)

driver.quit()
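If BeautifulSoup and lxml are not available, a similar extraction can be approximated with Python 3's standard html.parser module. This is a swapped-in technique rather than what the answer uses, and only a rough sketch: SpanTextCollector is a hypothetical class name, the sample HTML is invented for illustration, and the matching rule here (the span's class attribute must contain all of the wanted classes) is an assumption that may not match BeautifulSoup's behavior exactly.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of every <span> whose class attribute contains all wanted classes."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes.split())
        self.depth = 0      # > 0 while inside a matching <span>
        self.texts = []     # one entry per matching <span>

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            self.depth += 1  # nested <span> inside a match
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


# Invented sample input; in practice this would be the innerHTML string from Selenium.
sample_html = (
    '<div><span class="short-view-count style-scope yt-view-count-renderer">'
    "1.2M views</span><span class=\"other\">ignored</span></div>"
)
collector = SpanTextCollector("short-view-count style-scope yt-view-count-renderer")
collector.feed(sample_html)
for text in collector.texts:
    print(text)
```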
