
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/10052465/

Time: 2020-10-26 08:33:58  Source: igfitidea

Scraping javascript-generated data using Python

javascript python screen-scraping web-scraping

Asked by trigger

I want to scrape some data from the following URL using Python: http://www.hankyung.com/stockplus/main.php?module=stock&mode=stock_analysis_infomation&itemcode=078340

The page shows a summary of company information.

What I want to scrape is not shown on the first page. Clicking the tab named "????" opens the financial statements, and clicking the tab named "?????" opens "Cash Flow".

I want to scrape the "Cash Flow" data.


However, the cash flow data is generated by JavaScript from a different URL. That URL is hidden: http://stock.kisline.com/compinfo/financial/main.action?vhead=N&vfoot=N&vstay=&omit=&vwidth=

The cash flow data is generated by submitting some option values and a cookie to this URL.
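
A minimal sketch of that idea, using only Python's standard library: a cookie-aware opener visits the URL once so the server can set its cookie, then POSTs the option values as form data. The form field name `stockcode` and the assumption that the cookie is set by the same URL are illustrative guesses, not taken from the site; the real field names have to be read from the requests the browser actually sends.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# URL quoted in the question.
CASH_FLOW_URL = (
    "http://stock.kisline.com/compinfo/financial/main.action"
    "?vhead=N&vfoot=N&vstay=&omit=&vwidth="
)


def make_opener():
    """Return an opener that stores cookies and resends them on later requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def build_post(url, options):
    """Build a POST request carrying the option values as form data."""
    data = urllib.parse.urlencode(options).encode("ascii")
    return urllib.request.Request(url, data=data)


def fetch_cash_flow(opener, stock_code):
    """Fetch the cash flow page for one stock code and return the raw bytes."""
    # Hypothetical field name: replace with the fields the site really expects.
    options = {"stockcode": stock_code}
    # The first request lets the server store its cookie in the jar
    # (an assumption about where the cookie comes from).
    opener.open(CASH_FLOW_URL, timeout=30).close()
    with opener.open(build_post(CASH_FLOW_URL, options), timeout=30) as resp:
        return resp.read()
```

Calling `fetch_cash_flow(make_opener(), "078340")` would then return the response body as bytes, which can be decoded and parsed separately.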

As you may have noticed, itemcode=078340 in the first link is the stock code. There are as many as 1680 stocks for which I want to gather cash flow data, so I want to do this in a loop.
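
The loop itself could look roughly like the sketch below. It rebuilds the first link for each stock code and hands it to a caller-supplied function. The names `summary_url`, `scrape_all`, and `scrape_one` are hypothetical, and the one-second pause is an arbitrary choice. Codes are kept as strings so that leading zeros such as in `078340` survive.

```python
import time
import urllib.parse

BASE_URL = "http://www.hankyung.com/stockplus/main.php"


def summary_url(item_code):
    """Build the summary-page URL for one stock code."""
    query = urllib.parse.urlencode({
        "module": "stock",
        "mode": "stock_analysis_infomation",  # spelled as in the original URL
        "itemcode": item_code,
    })
    return f"{BASE_URL}?{query}"


def scrape_all(item_codes, scrape_one, delay=1.0):
    """Call scrape_one(url) for every code; collect results and failures."""
    results, failures = {}, {}
    for code in item_codes:
        try:
            results[code] = scrape_one(summary_url(code))
        except Exception as exc:  # keep going if one stock fails
            failures[code] = exc
        time.sleep(delay)  # pause between requests to go easy on the server
    return results, failures
```

Collecting failures separately means a single bad stock code does not abort a run over all 1680 stocks, and the failed codes can be retried afterwards.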

Is there a good way to scrape the cash flow data? I tried Scrapy, but it is difficult to combine with the other scraping code I am already using.

Answered by Niklas B.

There's also dryscrape (a library written by me, so the recommendation is a bit biased, obviously :), which uses a fast WebKit-based in-memory browser to navigate around. It understands JavaScript, too, but is a lot more lightweight than Selenium.

Answered by Mikko Ohtamaa

If you need to scrape page content that is updated with AJAX, and you do not control the AJAX interface, I would use the Selenium browser automation tool for the task:

http://code.google.com/p/selenium/


  • Selenium has Python bindings

  • It launches a real browser instance, so it can do and scrape exactly what you see with your own eyes

  • Get the HTML document content after the AJAX updates through the Selenium API

  • Use lxml with XPath or CSS selectors to extract the relevant parts of the document

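
The steps above can be sketched as follows. This is an illustration under assumptions rather than a tested scraper for the site in the question: it assumes Firefox with its driver is installed, that the data ends up in an HTML table, and that waiting for a CSS selector is enough to know the AJAX update has finished. The function names are made up for the example.

```python
def fetch_rendered_html(url, css_selector, timeout=10):
    """Load url in a real browser; return the HTML once css_selector appears."""
    # Imported here so parse_table_rows below can be used without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # any installed browser driver would do
    try:
        driver.get(url)
        # Wait until the AJAX-inserted element exists in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


def parse_table_rows(html, xpath="//table//tr"):
    """Return each matched table row as a list of stripped cell texts."""
    import lxml.html

    doc = lxml.html.fromstring(html)
    rows = []
    for tr in doc.xpath(xpath):
        cells = [cell.text_content().strip() for cell in tr.xpath("./th|./td")]
        if cells:  # skip rows without any header or data cells
            rows.append(cells)
    return rows
```

A call such as `parse_table_rows(fetch_rendered_html(url, "table"))` would then give the table contents as nested lists; the `"table"` selector and the default XPath are placeholders to adapt to the real page.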