Java Selenium - driver.getPageSource() differs from the source viewed in the browser

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/19358658/

Date: 2020-08-12 16:27:58  Source: igfitidea

Selenium - driver.getPageSource() differs from the source viewed in the browser

java firefox selenium webdriver selenium-webdriver

Asked by roger_that

I am trying to use Selenium to save the source code of a given URL to an HTML file, but for some reason I am not getting the exact source code that the browser shows.

Below is my Java code to capture the source in an HTML file:

private static void getHTMLSourceFromURL(String url, String fileName) {

    WebDriver driver = new FirefoxDriver();
    driver.get(url);

    try {
        Thread.sleep(5000);   // fixed wait so that the page can finish loading

        List<String> pageSource = new ArrayList<String>(Arrays.asList(driver.getPageSource().split("\n")));

        // use the fileName parameter; originalFile was not defined in this snippet
        writeTextToFile(pageSource, fileName);

    } catch (InterruptedException e) {
        e.printStackTrace();
    }

    System.out.println("quitting webdriver");
    driver.quit();
}

/**
 * creates file with fileName and writes the content
 * 
 * @param content
 * @param fileName
 */
private static void writeTextToFile(List<String> content, String fileName) {
    PrintWriter pw = null;
    String outputFolder = ".";
    File output = null;
    try {
        File dir = new File(outputFolder + '/' + "HTML Sources");
        if (!dir.exists()) {
            boolean success = dir.mkdirs();
            if (success == false) {
                try {
                    throw new Exception(dir + " could not be created");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }

        output = new File(dir + "/" + fileName);
        if (!output.exists()) {
            try {
                output.createNewFile();
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }
        pw = new PrintWriter(new FileWriter(output, true));
        for (String line : content) {
            pw.print(line);
            pw.print("\n");
        }
    } catch (IOException ioe) {
        ioe.printStackTrace();
    } finally {
        if (pw != null) {   // pw is still null if the FileWriter could not be opened
            pw.close();
        }
    }

}
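
The file-writing helper above could be shortened considerably with java.nio.file. The following is a minimal sketch, not the asker's code: the class name SourceWriter and the method writeLines are hypothetical, and the page source is replaced by a hard-coded string so that the example runs without a browser. It writes with an explicit UTF-8 charset rather than relying on the platform default that FileWriter picks up.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

final class SourceWriter {

    // Hypothetical helper: appends the lines to <baseDir>/HTML Sources/<fileName>,
    // creating the directory and the file if they do not exist yet.
    static Path writeLines(Path baseDir, String fileName, List<String> lines) throws IOException {
        Path dir = baseDir.resolve("HTML Sources");
        Files.createDirectories(dir);   // does nothing if the directory already exists
        Path output = dir.resolve(fileName);
        // Explicit UTF-8 so the result does not depend on the platform default charset.
        return Files.write(output, lines, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for driver.getPageSource(); a real run would pass the WebDriver output.
        String pageSource = "<html>\n<body>caf\u00e9</body>\n</html>";
        Path baseDir = Files.createTempDirectory("page-sources");
        Path written = writeLines(baseDir, "example.html", Arrays.asList(pageSource.split("\n")));
        System.out.println("wrote " + written.toAbsolutePath());
    }
}
```

One difference from the original helper: Files.write ends each line with the system line separator rather than always "\n".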

Can someone shed some light on why this happens? How does WebDriver render the page, and how does the browser show the source?

Answered by Madusudanan

There are several places you can get the source from. You can try

String pageSource=driver.findElement(By.tagName("body")).getText();

and see what comes up. Note that getText() returns the element's visible text rather than its HTML markup.

Generally you do not need to wait for the page to load. Selenium does that automatically, unless parts of the page are loaded separately through JavaScript/Ajax.
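
For content that does arrive later through JavaScript/Ajax, a common alternative to a fixed Thread.sleep is to poll a condition until it holds or a timeout expires; in Selenium, WebDriverWait is the class intended for this. The sketch below only illustrates the polling idea with the standard library, so it runs without a browser. The class Poller and the method waitUntil are hypothetical names, and the condition is a stand-in for a real check against the page.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

final class Poller {

    // Hypothetical helper: re-checks the condition every `interval` until it holds
    // or `timeout` has elapsed. Returns true if the condition held in time.
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration interval)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;   // timed out without the condition becoming true
            }
            Thread.sleep(interval.toMillis());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Stand-in condition that becomes true after about 300 ms. With Selenium it
        // could instead ask the browser whether the content of interest has appeared.
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 300,
                Duration.ofSeconds(5), Duration.ofMillis(50));
        System.out.println("ready = " + ready);
    }
}
```

Compared with sleeping for a fixed five seconds, polling returns as soon as the condition holds and fails explicitly when it never does.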

You might want to add the differences you are seeing, so that we can understand what you really mean.

WebDriver does not render the page on its own; it just returns the page as the browser sees it.

Answered by mikemelon

I encountered the same problem. I used this code to solve it:

......
String javascript = "return arguments[0].innerHTML";
String pageSource = (String) ((JavascriptExecutor) driver)
    .executeScript(javascript, driver.findElement(By.tagName("html")));
pageSource = "<html>"+pageSource +"</html>";
System.out.println(pageSource);
//FileUtils.write(new File("e:\test.html"), pageSource,);
......

Getting the innerHTML property through JavaScript finally worked, and the question marks disappeared.

Answered by Indigenuity

The "source" code you get from Selenium seems not to be the source at all. It seems to be the HTML for the current DOM. The source code you see in the browser is the HTML as given by the server, before JavaScript makes any dynamic changes to it. If the DOM changes at all, the browser's source view doesn't reflect those changes, but Selenium's output will. If you want to see the current DOM in a browser, use the developer tools, not the source view.
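
One way to see this distinction outside the browser is to fetch the page with a plain HTTP client, which never runs JavaScript, and compare the result with what driver.getPageSource() returns. The following is a minimal sketch using java.net.http (Java 11 or later); the class RawSource and its methods are hypothetical names, and the URL is taken from the command line. A server may answer a plain client differently than a real browser (cookies, user agent), so treat the output as an approximation of the browser's source view.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class RawSource {

    // Builds a plain GET request for the given URL.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    // Returns the response body as sent by the server; no script is executed on it.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        return client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java RawSource.java <url>");
            return;
        }
        String raw = fetch(args[0]);
        System.out.println(raw.length() + " characters of server-sent HTML");
    }
}
```

Saving this output next to the getPageSource() output and diffing the two files shows which parts of the page were added or rewritten after loading.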