Python: Reliably detect page load or time out, Selenium 2
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/18729483/
Reliably detect page load or time out, Selenium 2
Asked by zwol
I am writing a generic web-scraper using Selenium 2 (version 2.33 Python bindings, Firefox driver). It is supposed to take an arbitrary URL, load the page, and report all of the outbound links. Because the URL is arbitrary, I cannot make any assumptions whatsoever about the contents of the page, so the usual advice (wait for a specific element to be present) is inapplicable.
I have code which is supposed to poll document.readyState until it reaches "complete" or a 30s timeout has elapsed, and then proceed:
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.support.ui import WebDriverWait

def readystate_complete(d):
    # AFAICT Selenium offers no better way to wait for the document to be loaded,
    # if one is in ignorance of its contents.
    return d.execute_script("return document.readyState") == "complete"

def load_page(driver, url):
    try:
        driver.get(url)
        WebDriverWait(driver, 30).until(readystate_complete)
    except WebDriverException:
        pass

    links = []
    try:
        for elt in driver.find_elements_by_xpath("//a[@href]"):
            try:
                links.append(elt.get_attribute("href"))
            except WebDriverException:
                pass
    except WebDriverException:
        pass
    return links
This sort-of works, but on about one page out of five, the .until call hangs forever. When this happens, usually the browser has not in fact finished loading the page (the "throbber" is still spinning), but tens of minutes can go by and the timeout does not trigger. But sometimes the page does appear to have loaded completely and the script still does not go on.
What gives? How do I make the timeout work reliably? Is there a better way to request a wait-for-page-to-load (if one cannot make any assumptions about the contents)?
Note: The obsessive catching-and-ignoring of WebDriverException has proven necessary to ensure that it extracts as many links from the page as possible, whether or not JavaScript inside the page is doing funny stuff with the DOM (e.g. I used to get "stale element" errors in the loop that extracts the HREF attributes).
NOTE: There are a lot of variations on this question, both on this site and elsewhere, but they've all either got a subtle but critical difference that makes the answers (if any) useless to me, or I've tried the suggestions and they don't work. Please answer exactly the question I have asked.
Answered by Joe Coder
If the page is still loading indefinitely, I'm guessing the readyState never reaches "complete". If you're using Firefox, you can force the page loading to halt by calling window.stop():
from selenium.common.exceptions import TimeoutException

try:
    driver.get(url)
    WebDriverWait(driver, 30).until(readystate_complete)
except TimeoutException:
    # force Firefox to stop loading the page so the script can continue
    driver.execute_script("window.stop();")
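For context, here is a minimal sketch (not from the answer) of how this fallback could be folded into the questioner's load_page, assuming readystate_complete as defined in the question:

from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.webdriver.support.ui import WebDriverWait

def load_page(driver, url, timeout=30):
    try:
        driver.get(url)
        WebDriverWait(driver, timeout).until(readystate_complete)
    except TimeoutException:
        # give up waiting and stop the page so link extraction can proceed
        driver.execute_script("window.stop();")
    except WebDriverException:
        pass
    links = []
    for elt in driver.find_elements_by_xpath("//a[@href]"):
        try:
            links.append(elt.get_attribute("href"))
        except WebDriverException:
            pass
    return links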
Answered by Lukus
I had a similar situation when I wrote the screenshot system using Selenium for a fairly well-known website service, and I had the same predicament: I could not know anything about the page being loaded.
After speaking with some of the Selenium developers, the answer was that various WebDriver implementations (Firefox Driver versus IEDriver, for example) make different choices about when a page is considered loaded and WebDriver should return control.
If you dig deep into the Selenium code, you can find the spots that try to make the best choices, but since there are a number of things that can cause the state being checked to fail (for example, multiple frames where one doesn't complete in a timely manner), there are cases where the driver obviously just does not return.
I was told, "it's an open-source project", and that it probably won't/can't be corrected for every possible scenario, but that I could make fixes and submit patches where applicable.
In the long run, that was a bit much for me to take on, so, similar to you, I created my own timeout process. Since I use Java, I created a new Thread that, upon reaching the timeout, tries several things to get WebDriver to return; at times just pressing certain keys to get the browser to respond has worked. If it does not return, then I kill the browser and try again.
Starting the driver again has handled most cases for us, as if the second load of the browser allowed it to be in a more settled state (mind you we are launching from VMs and the browser constantly wants to check for updates and run certain routines when it hasn't been launched recently).
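For a Python analogue of that watchdog approach, a rough sketch might look like the following; the function name, timeout, and retry count are illustrative and not from the answer, and it assumes quitting the hung browser from another thread is acceptable in your setup:

import threading
from selenium import webdriver

def load_with_watchdog(url, timeout=60, retries=1):
    """Hypothetical kill-and-retry wrapper around a blocking driver.get()."""
    for attempt in range(retries + 1):
        driver = webdriver.Firefox()
        worker = threading.Thread(target=driver.get, args=(url,))
        worker.daemon = True
        worker.start()
        worker.join(timeout)          # wait for get() to return, up to the timeout
        if not worker.is_alive():
            return driver             # the page load came back in time
        # get() is still hanging: kill this browser and retry with a fresh one
        try:
            driver.quit()
        except Exception:
            pass
    raise RuntimeError("page did not load after retries: %s" % url)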
Another piece of this is that we launch a known URL first and confirm some aspects of the browser, and that we are in fact able to interact with it, before continuing. With these steps together the failure rate is pretty low, about 3% across thousands of tests on all browsers/versions/OSs (FF, IE, Chrome, Safari, Opera, iOS, Android, etc.). A sketch of that kind of sanity check follows below.
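A minimal sketch of such a check (the URL and the specific assertions here are placeholders, not what the answerer used):

from selenium.common.exceptions import WebDriverException

def browser_is_responsive(driver, known_url="https://example.com"):
    """Load a known-good page and confirm the browser still answers basic calls."""
    try:
        driver.get(known_url)
        driver.execute_script("return document.readyState")
        return "html" in driver.page_source.lower()
    except WebDriverException:
        return False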
Last but not least, for your case, it sounds like you only really need to capture the links on the page, not have full browser automation. There are other approaches I might take toward that, namely cURL and Linux tools.
Answered by Erki M.
As far as I know, your readystate_complete is not doing anything, as driver.get() is already checking for that condition. Anyway, I have seen it not working in many cases. One thing you could try is to route your traffic through a proxy and use that to watch for any network traffic. E.g. browsermob has a wait_for_traffic_to_stop method:
def wait_for_traffic_to_stop(self, quiet_period, timeout):
    """
    Waits for the network to be quiet
    :Args:
     - quiet_period - number of seconds the network needs to be quiet for
     - timeout - max number of seconds to wait
    """
    r = requests.put('%s/proxy/%s/wait' % (self.host, self.port),
                     {'quietPeriodInMs': quiet_period, 'timeoutInMs': timeout})
    return r.status_code
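For illustration, wiring BrowserMob Proxy into the scraper might look roughly like this. This is a sketch assuming the browsermob-proxy Python client; the binary path and URL are placeholders, and the wait values are assumed to be milliseconds, judging by the quietPeriodInMs/timeoutInMs parameter names above:

from browsermobproxy import Server
from selenium import webdriver

server = Server("/path/to/browsermob-proxy")   # placeholder path to the proxy launcher
server.start()
proxy = server.create_proxy()

profile = webdriver.FirefoxProfile()
profile.set_proxy(proxy.selenium_proxy())
driver = webdriver.Firefox(firefox_profile=profile)

driver.get("http://example.com")
# wait until the network has been quiet for 2 seconds, giving up after 30 seconds
proxy.wait_for_traffic_to_stop(2000, 30000)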
Answered by kenorb
Here is a solution proposed by Tommy Beadle (using the staleness approach):
import contextlib
from selenium.webdriver import Remote
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import staleness_of

class MyRemote(Remote):
    @contextlib.contextmanager
    def wait_for_page_load(self, timeout=30):
        old_page = self.find_element_by_tag_name('html')
        yield
        WebDriverWait(self, timeout).until(staleness_of(old_page))
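Usage could look something like this (the Remote endpoint and capabilities below are placeholders):

driver = MyRemote(command_executor='http://127.0.0.1:4444/wd/hub',
                  desired_capabilities={'browserName': 'firefox'})
with driver.wait_for_page_load(timeout=30):
    driver.find_element_by_link_text('my link').click()
# the with-block only exits once the old <html> element has gone stale,
# i.e. a new document has replaced the one present before the click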
Answered by kenorb
The "recommended" (however still ugly) solution could be to use explicit wait:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

old_value = browser.find_element_by_id('thing-on-old-page').text
browser.find_element_by_link_text('my link').click()
WebDriverWait(browser, 3).until(
    expected_conditions.text_to_be_present_in_element(
        (By.ID, 'thing-on-new-page'), 'expected new text'
    )
)
The naive attempt would be something like this:
def wait_for(condition_function):
    start_time = time.time()
    while time.time() < start_time + 3:
        if condition_function():
            return True
        else:
            time.sleep(0.1)
    raise Exception(
        'Timeout waiting for {}'.format(condition_function.__name__)
    )

def click_through_to_new_page(link_text):
    browser.find_element_by_link_text('my link').click()

    def page_has_loaded():
        page_state = browser.execute_script(
            'return document.readyState;'
        )
        return page_state == 'complete'

    wait_for(page_has_loaded)
Another, better one would be (credits to @ThomasMarks):
def click_through_to_new_page(link_text):
    link = browser.find_element_by_link_text('my link')
    link.click()

    def link_has_gone_stale():
        try:
            # poll the link with an arbitrary call
            link.find_elements_by_id('doesnt-matter')
            return False
        except StaleElementReferenceException:
            return True

    wait_for(link_has_gone_stale)
And the final example includes comparing page ids as below (which could be bulletproof):
class wait_for_page_load(object):

    def __init__(self, browser):
        self.browser = browser

    def __enter__(self):
        self.old_page = self.browser.find_element_by_tag_name('html')

    def page_has_loaded(self):
        new_page = self.browser.find_element_by_tag_name('html')
        return new_page.id != self.old_page.id

    def __exit__(self, *_):
        wait_for(self.page_has_loaded)
And now we can do:
with wait_for_page_load(browser):
    browser.find_element_by_link_text('my link').click()
The above code samples are from Harry's blog.