pandas Python解析JavaScript生成的HTML表格

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/25062365/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 22:18:51  来源:igfitidea点击:

Python Parsing HTML Table Generated by JavaScript

javascriptpythonhtmlpandasbeautifulsoup

提问by user2643394

I'm trying to scrape a table from the NYSE website (http://www1.nyse.com/about/listed/IPO_Index.html) into a pandas dataframe. In order to do so, I have a setup like this:

我正在尝试将 NYSE 网站 ( http://www1.nyse.com/about/listed/IPO_Index.html) 中的表格抓取到 Pandas 数据框中。为了做到这一点,我有一个这样的设置:

def htmltodf(url):
page = requests.get(url)
soup = BeautifulSoup(page.text)

tables = soup.findAll('table')
test = pandas.io.html.read_html(str(tables))

return(test)            #return dataframe type object

However, when I run this on the page, all of the table returned in the list are essentially empty. When I further investigated, I found that the table is generated by javascript. When using the developer tools in my web browser, I see that the table looks like any other HTML table with the tags, etc. However, a view of the source code revealed something like this instead:

但是,当我在页面上运行它时,列表中返回的所有表基本上都是空的。当我进一步调查时,我发现该表是由 javascript 生成的。在我的 Web 浏览器中使用开发人员工具时,我看到该表看起来像任何其他带有标签等的 HTML 表。 但是,源代码的视图显示如下:

<script language="JavaScript">

.
.
.

<script>
var year = [["ICC","21st Century Oncology Holdings, Inc.","22 May  2014","/about/listed/icc.html" ],
... more entries here ...
,["ZOES","Zoe's Kitchen, Inc.","11 Apr 2014","/about/listed/zoes.html" ]] ;

    if(year.length != 0) 
    {   

    document.write ("<table width='619' border='0' cellspacing='0' cellpadding='0'><tr><td><span class='fontbold'>");
    document.write ('2014' + " IPO Showcase"); 
    document.write ("</span></td></tr></table>"); 
    }  
</script>

Therefore, when my HTML parser goes to look for the table tag, all it can find is the if condition, and no proper tags below that would indicate content. How can I scrape this table? Is there a tag that I can search for instead of table that will reveal the content? Because the code is not in traditional html table form, how do I read it in with pandas--do I have to manually parse the data?

因此,当我的 HTML 解析器去寻找 table 标签时,它所能找到的只是 if 条件,并且下面没有合适的标签来指示内容。我怎样才能刮这张桌子?是否有我可以搜索的标签而不是显示内容的表格?因为代码不是传统的html表格形式,如何用pandas读入——需要手动解析数据吗?

采纳答案by alecxe

In this case, you need something to run that javascript code for you.

在这种情况下,您需要一些东西来为您运行该 javascript 代码。

One option here would be to use selenium:

这里的一种选择是使用selenium

from pandas.io.html import read_html
from selenium import webdriver


driver = webdriver.Firefox()
driver.get('http://www1.nyse.com/about/listed/IPO_Index.html')

table = driver.find_element_by_xpath('//div[@class="sp5"]/table//table/..')
table_html = table.get_attribute('innerHTML')

df = read_html(table_html)[0]
print df

driver.close()

prints:

印刷:

                                                    0        1          2   3
0                                                Name   Symbol        NaT NaN
1                       Performance Sports Group Ltd.      PSG 2014-06-20 NaN
2                           Century Communities, Inc.      CCS 2014-06-18 NaN
3                        Foresight Energy Partners LP     FELP 2014-06-18 NaN
...
79  EGShares TCW EM Long Term Investment Grade Bon...     LEMF 2014-01-08 NaN
80  EGShares TCW EM Short Term Investment Grade Bo...     SEMF 2014-01-08 NaN

[81 rows x 4 columns]