Reading dynamically generated web pages using Python

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/13960567/

Time: 2020-08-18 10:01:23  Source: igfitidea

Reading dynamically generated web pages using python

python web-scraping

Asked by Ajay Nair

I am trying to scrape a website using Python and Beautiful Soup. On some sites, the image links are visible in the browser but do not appear in the page source. However, using Chrome Inspect or Fiddler, I can see the corresponding code. What I see in the source code is:

<div id="cntnt"></div>

But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content within Python as well? I am using the regular urllib in Python, and I am able to get the source but without the generated part.

I am not a web developer, hence I am not able to describe the behaviour in better terms. Please feel free to ask for clarification if my question seems vague!

Answered by ppsreejith

The content of the website may be generated after load via JavaScript. To obtain the generated content via Python, refer to this answer.

Answered by ivan_pozdeev

A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you instead need a headless browser that would also generate the DOM and load and run the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.

Keep in mind when choosing that some previously major products among those are now abandoned.
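
To make the difference concrete, here is a minimal sketch. It assumes the container id `cntnt` from the question; the helper names `image_links` and `render_with_headless_browser` are made up for illustration, and the second one assumes Selenium 4 plus a local Chrome install (one possible headless browser, not the only one). The parser shows why the raw source yields nothing: the div is empty until scripts run.

```python
from html.parser import HTMLParser


class ContainerImageCollector(HTMLParser):
    """Collect the src of every <img> nested inside the element with a given id."""

    # Void elements never get a closing tag, so they must not change the depth.
    VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0   # 0 means we are currently outside the container
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            if tag == "img" and attrs.get("src"):
                self.srcs.append(attrs["src"])
            if tag not in self.VOID_TAGS:
                self.depth += 1
        elif attrs.get("id") == self.container_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID_TAGS:
            self.depth -= 1


def image_links(html, container_id="cntnt"):
    """Return image URLs found inside the container (hypothetical helper)."""
    parser = ContainerImageCollector(container_id)
    parser.feed(html)
    parser.close()
    return parser.srcs


def render_with_headless_browser(url, wait_css="#cntnt img", timeout=10):
    """Return the DOM after scripts have run.

    Assumption: Selenium 4 and Chrome are installed. The imports are local so
    the rest of this sketch still loads where Selenium is not available.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the script-generated images actually exist in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css)))
        return driver.page_source
    finally:
        driver.quit()


# What a regular scraper sees: the container is present but empty.
print(image_links('<div id="cntnt"></div>'))  # → []
```

With a real page, `image_links(render_with_headless_browser(url))` would be the call to try; the URL and the CSS selector to wait on depend entirely on the site.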

Answered by TheHeadlessSourceMan

TRY THIS FIRST!

Perhaps the data technically could be in the JavaScript itself, and all this JavaScript engine business is needed. (Some GREAT links here!)

But from experience, my first guess is that the JS is pulling the data in via an Ajax request. If you can get your program to simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!

It will take a little detective work though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any/all XMLHttpRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
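
Once a promising response turns up in the traffic log, digging the links out of the JSON can be as simple as walking the structure. A rough sketch follows; the `sample_response` payload, its field names, and the `find_image_urls` helper are all invented for illustration, since the real shape depends on the site.

```python
import json

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".webp")


def find_image_urls(node):
    """Recursively yield string values that look like image URLs."""
    if isinstance(node, dict):
        for value in node.values():
            yield from find_image_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_image_urls(item)
    elif isinstance(node, str):
        # Ignore any query string when checking the file extension.
        if node.lower().split("?", 1)[0].endswith(IMAGE_SUFFIXES):
            yield node


# Hypothetical payload, standing in for a response copied from the network log.
sample_response = """
{"status": "ok",
 "items": [
   {"title": "First", "image": {"url": "https://example.com/img/1.jpg?w=200"}},
   {"title": "Second", "thumbs": ["https://example.com/img/2.png"]}
 ]}
"""

urls = list(find_image_urls(json.loads(sample_response)))
print(urls)
# → ['https://example.com/img/1.jpg?w=200', 'https://example.com/img/2.png']
```

Walking every value is deliberately crude: it avoids guessing the key names up front, at the cost of possibly picking up image URLs you do not care about.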

Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
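
Re-creating the request with the standard library might look roughly like this. The endpoint URL and the helper names `build_ajax_request` and `fetch_json` are placeholders, the User-Agent string is just an example of a browser-like value, and the `X-Requested-With` header is an assumption: some servers check for it on Ajax calls, many do not.

```python
import json
import urllib.request

# Example of a browser-like User-Agent; any current browser string should do.
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_ajax_request(url, user_agent=BROWSER_USER_AGENT):
    """Build a GET request that resembles the browser's XMLHttpRequest."""
    headers = {
        "User-Agent": user_agent,
        "Accept": "application/json",
        # Assumption: mimics what many JS libraries send with Ajax calls.
        "X-Requested-With": "XMLHttpRequest",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_json(url, timeout=10):
    """Send the request and decode the JSON body (performs network I/O)."""
    with urllib.request.urlopen(build_ajax_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))


# Placeholder endpoint; substitute the URL seen in the network traffic log.
request = build_ajax_request("https://example.com/api/images?page=1")
print(request.get_method(), request.full_url)
```

If the server still refuses, compare the remaining request headers (cookies, Referer) in the traffic log against what your program sends.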
