Scraping JavaScript-generated data with Python

Question

Asked by trigger

I want to scrape some data from the following URL using Python: http://www.hankyung.com/stockplus/main.php?module=stock&mode=stock_analysis_infomation&itemcode=078340

The page is a summary of company information.

The data I want to scrape is not shown on the first page. Clicking the tab named "????" opens the financial statements, and clicking the tab named "?????" opens "Cash Flow".

I want to scrape the "Cash Flow" data.

However, the cash flow data is generated by JavaScript from a separate URL. That hidden URL is: http://stock.kisline.com/compinfo/financial/main.action?vhead=N&vfoot=N&vstay=&omit=&vwidth=

The cash flow data is generated by submitting some option values and a cookie to this URL.

As you may have noticed, itemcode=078340 in the first link is the stock code. There are as many as 1680 stocks whose cash flow data I want to gather, so I want to do this in a loop.

Is there a good way to scrape the cash flow data? I tried Scrapy, but it is difficult to integrate with the other scraping code I am already using.

Answer 1

Answered by Niklas B.

There's also dryscrape (a library written by me, so the recommendation is obviously a bit biased :), which uses a fast WebKit-based in-memory browser to navigate around. It understands JavaScript too, but is a lot more lightweight than Selenium.
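
As a rough illustration of this approach, here is a minimal sketch that builds the summary-page URL from the question and loads it in a dryscrape session. The helper names `build_url` and `fetch_rendered_html` are made up for this example. It only returns the rendered HTML of the page it visits; reaching the "Cash Flow" tab would need selectors for that page, which are not given in the question and are not guessed here.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.hankyung.com/stockplus/main.php"


def build_url(itemcode):
    """Return the company-summary URL from the question for one stock code."""
    query = urlencode({
        "module": "stock",
        "mode": "stock_analysis_infomation",  # spelled as in the original URL
        "itemcode": itemcode,
    })
    return BASE_URL + "?" + query


def fetch_rendered_html(url):
    """Load `url` in dryscrape's in-memory WebKit and return the rendered HTML."""
    # Third-party dependency, imported lazily so build_url works without it.
    import dryscrape

    session = dryscrape.Session()
    session.set_attribute("auto_load_images", False)  # skip images for speed
    session.visit(url)
    return session.body()


if __name__ == "__main__":
    html = fetch_rendered_html(build_url("078340"))
    print(len(html))
```

On a machine without a display, dryscrape typically needs an X server such as Xvfb to be running first.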

Answer 2

Answered by Mikko Ohtamaa

If you need to scrape page content that is updated with AJAX and you are not in control of the AJAX interface, I would use the Selenium browser automation tool for the task:

http://code.google.com/p/selenium/

Selenium has Python bindings.
It launches a real browser instance, so it can do and scrape exactly what you see with your own eyes.
Get the HTML document content after the AJAX updates through the Selenium API.
Use lxml with XPath / CSS selectors to parse the relevant parts out of the document.
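
The steps above could be sketched roughly as follows, looping over stock codes as the question asks. This is only an outline under several assumptions: `stock_url`, `table_rows`, and `scrape_all` are hypothetical helper names, the `//table//tr` XPath is a generic placeholder rather than the real location of the cash flow table, Firefox is an arbitrary browser choice, and the fixed sleep is a crude stand-in for waiting on the AJAX update. Clicking through to the "Cash Flow" tab is not shown because its selector is not known.

```python
import time

URL_TEMPLATE = (
    "http://www.hankyung.com/stockplus/main.php"
    "?module=stock&mode=stock_analysis_infomation&itemcode={code}"
)


def stock_url(code):
    """Return the summary-page URL from the question for one stock code."""
    return URL_TEMPLATE.format(code=code)


def table_rows(html):
    """Return every table row in `html` as a list of stripped cell texts."""
    import lxml.html  # third-party, imported lazily

    doc = lxml.html.fromstring(html)
    rows = []
    for tr in doc.xpath("//table//tr"):  # placeholder XPath (assumption)
        cells = [cell.text_content().strip() for cell in tr.xpath("./th|./td")]
        if cells:
            rows.append(cells)
    return rows


def scrape_all(itemcodes, delay=1.0):
    """Visit each stock page in one browser instance and collect its table rows."""
    from selenium import webdriver  # third-party, imported lazily

    driver = webdriver.Firefox()
    results = {}
    try:
        for code in itemcodes:
            driver.get(stock_url(code))
            time.sleep(delay)  # crude wait for the AJAX content (assumption)
            results[code] = table_rows(driver.page_source)
    finally:
        driver.quit()  # always close the browser, even on errors
    return results


if __name__ == "__main__":
    data = scrape_all(["078340"])
    print(data["078340"][:5])
```

Reusing a single browser instance for all codes avoids paying the browser startup cost 1680 times.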
