等待页面使用 Selenium WebDriver for Python 加载

Disclaimer: this page is a translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/26566799/


Wait until page is loaded with Selenium WebDriver for Python

python, selenium, execute-script

Asked by apogne

I want to scrape all the data from a page implemented with infinite scroll. The following Python code works:


for i in range(100):
    # Scroll to the bottom to trigger the infinite scroll...
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # ...then wait a fixed 5 seconds for the new content to load.
    time.sleep(5)

This means that every time I scroll down to the bottom, I need to wait 5 seconds, which is generally enough for the page to finish loading the newly generated content. But this may not be time-efficient: the page may finish loading the new content in well under 5 seconds. How can I detect whether the page has finished loading the new content each time I scroll down? If I can detect this, I can scroll down again as soon as the page has finished loading, which is more time-efficient.


Accepted answer by Zeinab Abbasimazar

The webdriver will wait for a page to load by default via the .get() method.


As you may be looking for some specific element, as @user227215 said, you should use WebDriverWait to wait for an element located on your page:


from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

browser = webdriver.Firefox()
browser.get("url")
delay = 3  # seconds
try:
    # Wait up to `delay` seconds for the element to be present in the DOM.
    myElem = WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")

I have used it for checking alerts. You can use any of the other locator strategies to find your element.

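For instance, a minimal sketch of waiting for a JavaScript alert with the built-in alert_is_present condition, reusing the browser and delay variables from the snippet above:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# alert_is_present() returns the Alert object once one appears, or raises TimeoutException.
alert = WebDriverWait(browser, delay).until(EC.alert_is_present())
alert.accept()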

EDIT 1:


I should mention that the webdriver will wait for a page to load by default. It does not wait for loading inside frames or for AJAX requests. It means that when you use .get('url'), your browser will wait until the page is completely loaded and then go to the next command in the code. But when you are posting an AJAX request, webdriver does not wait, and it's your responsibility to wait an appropriate amount of time for the page or a part of the page to load; so there is a module named expected_conditions.

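As an illustrative sketch of waiting for AJAX-loaded content with expected_conditions (the element id below is a placeholder, not something from the question; browser is the WebDriver from the snippet above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait until the container that the AJAX call fills in is actually visible.
WebDriverWait(browser, 10).until(
    EC.visibility_of_element_located((By.ID, 'ajax-results'))
)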

Answered by kenorb

Find three methods below:


readyState


Checking page readyState (not reliable):


def page_has_loaded(self):
    self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
    # document.readyState becomes 'complete' once the document has finished loading.
    page_state = self.driver.execute_script('return document.readyState;')
    return page_state == 'complete'
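
A possible way to run the same check from a plain script (rather than a page-object method) is to hand it to WebDriverWait as a lambda; a sketch, assuming driver is an initialized WebDriver:

from selenium.webdriver.support.ui import WebDriverWait

# until() re-evaluates the lambda until it returns True or the timeout expires.
WebDriverWait(driver, 10).until(
    lambda d: d.execute_script('return document.readyState;') == 'complete'
)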

The wait_for helper function is good, but unfortunately click_through_to_new_page is open to a race condition: we may manage to execute the script in the old page before the browser has started processing the click, and page_has_loaded just returns true straight away.

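To make the race concrete, here is a rough, self-contained sketch (wait_for and the 'Next' link are hypothetical stand-ins for the helpers described above, and driver is assumed to be an initialized WebDriver):

import time

def page_has_loaded(driver):
    return driver.execute_script('return document.readyState;') == 'complete'

def wait_for(condition, driver, timeout=10):
    # Minimal polling helper in the spirit of the wait_for mentioned above.
    end = time.time() + timeout
    while time.time() < end:
        if condition(driver):
            return True
        time.sleep(0.2)
    raise TimeoutError("condition not met within {} seconds".format(timeout))

# The race: the click is processed asynchronously, so document.readyState can
# still be 'complete' for the *old* document when wait_for first checks it.
driver.find_element_by_link_text('Next').click()   # hypothetical link
wait_for(page_has_loaded, driver)                   # may pass before navigation even starts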

id


Comparing new page ids with the old one:


def page_has_loaded_id(self):
    self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
    try:
        # old_page is the <html> WebElement captured before the navigation was triggered;
        # once a new document has loaded, the freshly located <html> element gets a new id.
        new_page = self.driver.find_element_by_tag_name('html')
        return new_page.id != old_page.id
    except NoSuchElementException:
        return False
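
A hedged sketch of how that comparison is typically wired up: capture the old <html> element before triggering the navigation, then poll until its id differs (the link text is a placeholder, and browser is assumed to be an initialized WebDriver):

from selenium.webdriver.support.ui import WebDriverWait

old_page = browser.find_element_by_tag_name('html')
browser.find_element_by_link_text('Next').click()   # whatever triggers the navigation
# WebElement.id is Selenium's internal element reference; it changes with the new document.
WebDriverWait(browser, 10).until(
    lambda d: d.find_element_by_tag_name('html').id != old_page.id
)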

It's possible that comparing ids is not as effective as waiting for stale reference exceptions.


staleness_of


Using the staleness_of method:


import contextlib

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import staleness_of

@contextlib.contextmanager
def wait_for_page_load(self, timeout=10):
    self.log.debug("Waiting for page to load at {}.".format(self.driver.current_url))
    old_page = self.find_element_by_tag_name('html')  # <html> of the page we are leaving
    yield
    # The old <html> element goes stale once the browser has navigated to the new page.
    WebDriverWait(self, timeout).until(staleness_of(old_page))
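
A sketch of how the context manager is used (page stands for an object that mixes in wait_for_page_load and exposes the driver's find_element_* methods, as in Harry's blog; the link is a hypothetical example):

# The navigation is triggered inside the with-block; on exit, WebDriverWait blocks
# until the old <html> element has gone stale, i.e. a new page has loaded.
with page.wait_for_page_load(timeout=10):
    page.find_element_by_link_text('Next').click()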


For more details, check Harry's blog.


Answered by David Cullen

Trying to pass find_element_by_id to the constructor for presence_of_element_located (as shown in the accepted answer) caused NoSuchElementException to be raised. I had to use the syntax in fragles' comment:


from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('url')
timeout = 5
try:
    # Build the expected condition first, then hand it to WebDriverWait.
    element_present = EC.presence_of_element_located((By.ID, 'element_id'))
    WebDriverWait(driver, timeout).until(element_present)
except TimeoutException:
    print("Timed out waiting for page to load")

This matches the example in the documentation. Here is a link to the documentation for By.


Answered by J0ANMM

As mentioned in the answer from David Cullen, I've always seen recommendations to use a line like the following one:


element_present = EC.presence_of_element_located((By.ID, 'element_id'))
WebDriverWait(driver, timeout).until(element_present)

It was difficult for me to find anywhere all the possible locators that can be used with the By syntax, so I thought it would be useful to provide the list here (a short usage sketch follows the list). According to Web Scraping with Python by Ryan Mitchell:


ID

Used in the example; finds elements by their HTML id attribute

CLASS_NAME

Used to find elements by their HTML class attribute. Why is this function CLASS_NAME and not simply CLASS? Using the form object.CLASS would create problems for Selenium's Java library, where .class is a reserved method. In order to keep the Selenium syntax consistent between different languages, CLASS_NAME was used instead.

CSS_SELECTOR

Finds elements by their class, id, or tag name, using the #idName, .className, tagName convention.

LINK_TEXT

Finds HTML tags by the text they contain. For example, a link that says "Next" can be selected using (By.LINK_TEXT, "Next").

PARTIAL_LINK_TEXT

Similar to LINK_TEXT, but matches on a partial string.

NAME

Finds HTML tags by their name attribute. This is handy for HTML forms.

TAG_NAME

Finds HTML tags by their tag name.

XPATH

Uses an XPath expression ... to select matching elements.
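
A short usage sketch tying a few of these locator strategies to WebDriverWait (the selectors are placeholders for illustration only, and driver is assumed to be an initialized WebDriver):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(driver, 10)
# Each call blocks until an element matching the locator is present in the DOM.
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#results .row')))
wait.until(EC.presence_of_element_located((By.LINK_TEXT, 'Next')))
wait.until(EC.presence_of_element_located((By.XPATH, '//div[@class="row"]')))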


Answered by Carl

From selenium/webdriver/support/wait.py


driver = ...
from selenium.webdriver.support.wait import WebDriverWait

# until() accepts any callable that takes the driver and returns a truthy value.
element = WebDriverWait(driver, 10).until(
    lambda x: x.find_element_by_id("someId"))

Answered by Rao

How about putting WebDriverWait in a while loop and catching the exceptions?


from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

browser = webdriver.Firefox()
browser.get("url")
delay = 3  # seconds
while True:
    try:
        # presence_of_element_located expects a (By, value) locator tuple, not a WebElement.
        WebDriverWait(browser, delay).until(
            EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
        print("Page is ready!")
        break  # leave the loop once the element is present
    except TimeoutException:
        print("Loading took too much time! Trying again...")

Answered by raffaem

On a side note: instead of scrolling down 100 times, you can check whether there are no more modifications to the DOM (we are in the case where the bottom of the page is AJAX lazy-loaded):


import logging
import time

def scrollDown(driver, value):
    driver.execute_script("window.scrollBy(0," + str(value) + ")")

# Scroll down the page until the DOM stops changing
def scrollDownAllTheWay(driver):
    old_page = driver.page_source
    while True:
        logging.debug("Scrolling loop")
        for i in range(2):
            scrollDown(driver, 500)
            time.sleep(2)
        new_page = driver.page_source
        if new_page != old_page:
            old_page = new_page
        else:
            # Page source did not change since the last pass: nothing left to lazy-load.
            break
    return True
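
Example usage, assuming driver is an initialized WebDriver that has already navigated to the page:

scrollDownAllTheWay(driver)
html = driver.page_source   # now contains all of the lazily loaded content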

Answered by seeiespi

Have you tried driver.implicitly_wait? It is like a setting for the driver, so you only call it once in the session, and it basically tells the driver to wait the given amount of time until each command can be executed.


driver = webdriver.Chrome()
driver.implicitly_wait(10)

So if you set a wait time of 10 seconds, it will execute the command as soon as possible, waiting up to 10 seconds before giving up. I've used this in similar scroll-down scenarios, so I don't see why it wouldn't work in your case. Hope this is helpful.

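A minimal sketch of combining an implicit wait with the scroll-down use case (the element id is a placeholder for whatever marks the newly loaded content, and "url" is the page being scraped):

from selenium import webdriver

driver = webdriver.Chrome()
driver.implicitly_wait(10)   # applies to every subsequent find_element* call
driver.get("url")

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# This lookup now retries for up to 10 seconds before raising NoSuchElementException.
new_item = driver.find_element_by_id("newly-loaded-content")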

Be sure to use a lowercase 'w' in implicitly_wait.


Answered by ahmed abdelmalek

Here I did it using a rather simple approach:


from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Firefox()
browser.get("url")
searchTxt = ''
while not searchTxt:
    try:
        # Keep retrying until the element finally appears in the DOM.
        searchTxt = browser.find_element_by_name('NAME OF ELEMENT')
        searchTxt.send_keys("USERNAME")
    except NoSuchElementException:
        continue