JavaScript: How to download a CSV file using PhantomJS
Original question: http://stackoverflow.com/questions/31564215/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by MrD
When I browse website A in a normal browser (Chrome) and click a link on it, Chrome immediately downloads a report as a CSV file.
When I check the server response headers, I get the following:
Cache-Control:private,max-age=31536000
Connection:Keep-Alive
Content-Disposition:attachment; filename="report.csv"
Content-Encoding:gzip
Content-Language:de-DE
Content-Type:text/csv; charset=UTF-8
Date:Wed, 22 Jul 2015 12:44:30 GMT
Expires:Thu, 21 Jul 2016 12:44:30 GMT
Keep-Alive:timeout=15, max=75
Pragma:cache
Server:Apache
Transfer-Encoding:chunked
Vary:Accept-Encoding
Now I want to download and parse this file using PhantomJS. I set the page's onResourceReceived listener to see whether Phantom receives/downloads the file.
clientRequests.phantomPage.onResourceReceived = function(response) {
    console.log('Response (#' + response.id + ', stage "' + response.stage + '"): ' + JSON.stringify(response));
};
When I ask Phantom to download the file (via page.open('URL OF THE FILE')), I can see in the Phantom log that the file is downloaded. Here is an excerpt from the log:
"contentType": "text/csv; charset=UTF-8",
"headers": {
"name": "Date",
"value": "Wed, 22 Jul 2015 12:57:41 GMT"
},
"name": "Content-Disposition",
"value": "attachment; filename=\"report.csv\"",
"status":200,"statusText":"OK"
I received the file and its content, but how do I access the file data? When I print the current PhantomJS page object, I get the HTML of page A, which I don't want. I want the CSV file, which I need to parse using JavaScript.
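The log excerpt in the question shows that the download can at least be recognised from its Content-Disposition header. Here is a minimal sketch of that check, assuming (as in the PhantomJS documentation) that response.headers is an array of { name, value } objects. The function name findAttachment is made up for illustration. As far as I know, PhantomJS does not hand over the response body in this callback, so this only tells you which URL to fetch by other means.

```javascript
// Sketch: detect an attachment in a PhantomJS onResourceReceived response.
// Assumes response.headers is an array of { name, value } objects.
// findAttachment is a hypothetical helper name, not part of PhantomJS.
function findAttachment(response) {
  var headers = response.headers || [];
  for (var i = 0; i < headers.length; i++) {
    var header = headers[i];
    if (String(header.name).toLowerCase() !== 'content-disposition') continue;
    var value = String(header.value);
    if (!/^\s*attachment\b/i.test(value)) continue;
    // Pull out filename="..." (quoted) or filename=token (unquoted), if present.
    var match = /filename\s*=\s*(?:"([^"]*)"|([^;\s]+))/i.exec(value);
    return {
      url: response.url,
      contentType: response.contentType,
      filename: match ? (match[1] !== undefined ? match[1] : match[2]) : null
    };
  }
  return null; // no attachment header, so not a download
}
```

You could call it from the onResourceReceived listener (for example only when response.stage is 'end') and remember the returned url for a later request.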
Accepted answer by MrD
After days of investigation, I found a few possible solutions:
- In your evaluate function, you can make an AJAX call to download and encode the file, then return the content to the Phantom script
- You can use one of the custom Phantom libraries available on GitHub
If you need to download a file using PhantomJS, then run away from PhantomJS and use CasperJS. CasperJS is based on PhantomJS, but it has a much better and more intuitive syntax and program flow.
Here is a good post explaining "Why CasperJS is better than PhantomJS". It includes a section about file download.
How to download a CSV file using CasperJS (this works even when the server sends the header Content-Disposition:attachment; filename='file.csv')
Here is a sample CSV file available for download: http://captaincoffee.com.au/dump/items.csv
To download this file using CasperJS, run the following code:
var casper = require('casper').create();

// Open the site and print the page title
casper.start("http://captaincoffee.com.au/dump/", function() {
    this.echo(this.getTitle());
});

casper.then(function() {
    var url = 'http://captaincoffee.com.au/dump/csv.csv';
    // base64encode() fetches the URL (here with GET) and returns its contents as a base64 string
    require('utils').dump(this.base64encode(url, 'get'));
});

casper.run();
The code above downloads the CSV file at http://captaincoffee.com.au/dump/csv.csv and prints its contents as a base64 string. This way you don't even have to save the data to a file; you already have it as a base64 string.
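To get from that base64 string back to readable CSV text, something like the following sketch could be used. It assumes the file is UTF-8 (the Content-Type header in the question says charset=UTF-8) and that either atob (browser-like environments) or Buffer (Node.js) is available. The name base64ToText is only illustrative.

```javascript
// Sketch: turn a base64 string (e.g. the one printed above) back into text.
// Assumes UTF-8 content; base64ToText is a hypothetical helper name.
function base64ToText(b64) {
  var binary;
  if (typeof atob === 'function') {
    binary = atob(b64); // browser-like environments: one character per byte
  } else {
    binary = Buffer.from(b64, 'base64').toString('binary'); // Node.js fallback
  }
  // Reinterpret the byte string as UTF-8 (escape() percent-encodes each byte).
  return decodeURIComponent(escape(binary));
}

var csvText = base64ToText('YSxiCjEsMgo='); // base64 of a tiny two-line CSV
```

If the file is not valid UTF-8, decodeURIComponent will throw, so for other encodings you would need a different final step.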
If you explicitly want to save the file to the file system, you can use the download function available in CasperJS.
Answer by Matthew Lock
I found a solution for PhantomJS. Reading through this discussion, I found a jsfiddle which downloads a URL via jQuery's ajax method and encodes the file as base64.
The file I wanted to download was plain text (CSV), so I removed the encoding functions. My target page also already included jQuery, so I didn't need to inject jQuery into it.
My code assumes you have already used PhantomJS to open the page you want to download the file from, and that the page has jQuery in it. In my case I had to log in to the site first in order to get the download link.
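var fs = require('fs');
var page = this; // the PhantomJS page that is already open (and logged in)

// Runs inside the page, so it can use the page's own jQuery
var result = page.evaluate(function() {
    var out;
    $.ajax({
        'async' : false, // synchronous, so out is set before we return
        'url' : 'fullurltodownload.csv',
        'success' : function(data, status, xhr) {
            out = data;
        }
    });
    return out;
});

// Save the CSV text returned from the page
fs.write('mydownloadedfile.csv', result);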
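Since the question also asks about parsing the CSV in JavaScript, here is a rough sketch of a small parser that could be applied to the text returned above. It is not from the original answer. It assumes quoted fields use doubled quotes ("") for escaping, and the delimiter is a parameter because a German-locale export like the one in the question may use ';' instead of ','.

```javascript
// Sketch: parse CSV text into an array of rows (each an array of strings).
// Handles quoted fields, doubled quotes ("") and CRLF/LF line endings.
// parseCsv is a hypothetical helper, not part of PhantomJS or jQuery.
function parseCsv(text, delimiter) {
  delimiter = delimiter || ',';
  var rows = [], row = [], field = '', inQuotes = false;
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    if (inQuotes) {
      if (ch === '"') {
        if (text.charAt(i + 1) === '"') { field += '"'; i++; } // escaped quote
        else { inQuotes = false; }                              // closing quote
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === delimiter) {
      row.push(field); field = '';
    } else if (ch === '\n' || ch === '\r') {
      if (ch === '\r' && text.charAt(i + 1) === '\n') i++; // treat CRLF as one break
      row.push(field); field = '';
      rows.push(row); row = [];
    } else {
      field += ch;
    }
  }
  // Flush the last record when the text does not end with a line break.
  if (field !== '' || row.length > 0) { row.push(field); rows.push(row); }
  return rows;
}
```

For example, parseCsv(result) would give one array per line of the downloaded file, with the header row first. Note that an empty line comes back as a row with a single empty string.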
Answer by Silas S. Brown
The previous 2 answers assume you can know in advance the URL of the final CSV file. That won't be the case if the link goes to an HTML page that does a JavaScript-computed redirect to the file and you don't want to evaluate that JavaScript outside of PhantomJS. Your options then are:
- Put PhantomJS behind an upstream proxy, and use that proxy to intercept the download URL (and its expected Cookie and Referer headers). You'd have to be careful to positively identify the real download URL, and not some random data 'blob', if the page also makes binary XMLHttpRequests;
- Instead of PhantomJS, use Headless Chrome, which can automatically save downloaded files (or Firefox with PyVirtualDisplay, which can also be set to do this, or wait for Headless Firefox), and monitor the downloads directory. You'd have to figure out for yourself when the download has completed. Alternatively, use an upstream proxy to monitor it for completion, but Headless Chrome/Firefox cannot currently be set to ignore SSL certificates, so if the site goes "secure" it is much more difficult to monitor the requests of Headless Chrome/Firefox than those of PhantomJS, at least until Chromium issue 721739 is fixed. You could watch a CONNECT request, but if it is kept alive you will have no way of knowing for sure that a transfer has finished;
- Put PhantomJS behind an upstream proxy that changes all unknown content types to text/plain and deletes Content-Disposition headers, so you can read the file from PhantomJS in the normal way. That should work for a CSV file, but won't work for binaries with 0-bytes in them.
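The header rewrite described in the third option could look roughly like the sketch below. It only shows the transformation itself, not a proxy. It assumes the proxy exposes response headers as a plain object (as Node's http module does), and the KNOWN_TYPES list is a made-up example of what counts as a "known" content type.

```javascript
// Sketch: the header rewrite an upstream proxy could apply to each response.
// KNOWN_TYPES is a hypothetical allow-list; adjust it to what the pages need.
var KNOWN_TYPES = ['text/html', 'text/css', 'text/plain',
                   'application/javascript', 'application/json',
                   'image/png', 'image/jpeg', 'image/gif'];

function rewriteResponseHeaders(headers) {
  var out = {};
  Object.keys(headers).forEach(function (name) {
    var lower = name.toLowerCase();
    if (lower === 'content-disposition') return; // drop it so nothing is treated as a download
    out[lower] = headers[name];
  });
  // Compare only the media type; keep parameters such as charset unchanged.
  var parts = String(out['content-type'] || '').split(';');
  var mediaType = parts[0].trim().toLowerCase();
  if (KNOWN_TYPES.indexOf(mediaType) === -1) {
    parts[0] = 'text/plain';
    out['content-type'] = parts.join(';');
  }
  return out;
}
```

With the headers from the question, text/csv; charset=UTF-8 would become text/plain; charset=UTF-8 and the Content-Disposition header would disappear, so page.open() should display the CSV as ordinary text.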
The first of these options (PhantomJS + upstream proxy) is made easier if the upstream proxy can monitor the Accept header that PhantomJS sends to the remote site. At least in PhantomJS version 2.1.1, main requests have Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, stylesheet requests have Accept: text/css,*/*;q=0.1, and all other requests (images, scripts, XMLHttpRequest) default to Accept: */* although this can be overridden by sites that use XMLHttpRequest.setRequestHeader(). Therefore, if the upstream proxy sees a request with an Accept header containing text/html, and passing this request on to the server results in a CSV file or other non-HTML document, then there's a good chance this is the one to save.
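The Accept-header heuristic above can be sketched as a small predicate the proxy runs on each request/response pair. This is only an illustration: it assumes header names have been lower-cased (as Node's http module does), the function name isLikelyDownload is made up, and treating application/xhtml+xml as "still HTML" is my own assumption.

```javascript
// Sketch: flag a proxied exchange as the probable file download.
// Header names are assumed to be lower-cased; isLikelyDownload is hypothetical.
function isLikelyDownload(requestHeaders, responseHeaders) {
  var accept = String(requestHeaders['accept'] || '').toLowerCase();
  // Per the answer, PhantomJS 2.1.1 sends text/html in Accept for main requests.
  if (accept.indexOf('text/html') === -1) return false;
  var type = String(responseHeaders['content-type'] || '')
    .split(';')[0].trim().toLowerCase();
  // A main request answered with something other than HTML/XHTML is the candidate.
  return type !== '' && type !== 'text/html' && type !== 'application/xhtml+xml';
}
```

A response with no Content-Type at all is not flagged here; whether to save those as well is a judgment call that depends on the site.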