JavaScript: How can I scrape sites that require authentication using node.js?

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/8726079/

Date: 2020-08-24 07:13:01  Source: igfitidea

How can I scrape sites that require authentication using node.js?

Tags: javascript, node.js, authentication, web-scraping

Asked by ekanna

I've come across many tutorials explaining how to scrape public websites that don't require authentication/login, using node.js.

Can somebody explain how to scrape sites that require login using node.js?

Answered by alessioalex

Use Mikeal's Request library. You need to enable cookie support like this:

var request = require('request').defaults({jar: true}); // jar: true keeps cookies between requests

So you should first create an account on that site (manually), then pass the username and password as parameters when making the POST request to its login URL. The server will respond with a cookie, which Request remembers, so you will be able to access the pages that require you to be logged in.

Note: this approach doesn't work if something like reCaptcha is used on the login page.
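
To make the cookie handling concrete, here is a minimal sketch of what a cookie jar does, written with Node's built-in modules only. It is not how Request implements its jar: the helper names (storeCookies, cookieHeader, send) and the example.com URLs are assumptions for illustration, and the jar is deliberately simplified (it ignores Path, Domain and expiry attributes).

```javascript
'use strict';
const https = require('https');
const querystring = require('querystring');

// Copy the name=value part of each Set-Cookie header into the jar object.
// Simplification: attributes such as Path, Domain and Expires are ignored.
function storeCookies(jar, setCookieHeaders) {
  for (const header of setCookieHeaders || []) {
    const pair = header.split(';')[0];
    const eq = pair.indexOf('=');
    if (eq > 0) {
      jar[pair.slice(0, eq).trim()] = pair.slice(eq + 1).trim();
    }
  }
  return jar;
}

// Build the Cookie request header from everything stored in the jar.
function cookieHeader(jar) {
  return Object.keys(jar).map((name) => name + '=' + jar[name]).join('; ');
}

// Send a request that carries the jar's cookies and stores any new ones.
// `form` is an object of form fields for a POST, or null for a GET.
function send(jar, method, url, form, callback) {
  const body = form ? querystring.stringify(form) : '';
  const headers = {};
  const cookies = cookieHeader(jar);
  if (cookies) headers.Cookie = cookies;
  if (form) {
    headers['Content-Type'] = 'application/x-www-form-urlencoded';
    headers['Content-Length'] = Buffer.byteLength(body);
  }
  const req = https.request(url, { method: method, headers: headers }, (res) => {
    storeCookies(jar, res.headers['set-cookie']);
    let data = '';
    res.setEncoding('utf8');
    res.on('data', (chunk) => { data += chunk; });
    res.on('end', () => callback(null, res, data));
  });
  req.on('error', callback);
  req.end(body);
}

// Usage sketch (hypothetical URLs and field names):
// const jar = {};
// send(jar, 'POST', 'https://example.com/login',
//   { username: 'me', password: 'secret' }, (err) => {
//     if (err) throw err;
//     send(jar, 'GET', 'https://example.com/account', null,
//       (err2, res, html) => console.log(html));
//   });
```

Note that https.request does not follow redirects, so a login that answers with a 302 still leaves its session cookie in the jar, and the follow-up GET is sent explicitly.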

Answered by mikemaccana

Or using superagent:

var superagent = require('superagent')
var agent = superagent.agent();

agent is then a persistent browser that handles getting and setting cookies, referers, etc. Just call agent.get() and agent.post() as normal.
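
As a rough sketch of that flow, the function below logs in with one agent and then reuses it to fetch a protected page. The function name, the URLs and the form field names are assumptions; a real site will have its own login endpoint and fields. The agent is passed in as a parameter, so it can be created with require('superagent').agent() as shown above.

```javascript
'use strict';

// `agent` is expected to be a superagent agent (superagent.agent()).
// loginUrl, protectedUrl and the credential field names are placeholders.
async function fetchProtectedPage(agent, loginUrl, credentials, protectedUrl) {
  // Submit the login form; the agent stores the session cookie it receives.
  await agent.post(loginUrl).type('form').send(credentials);
  // The same agent sends that cookie back, so this request is authenticated.
  const res = await agent.get(protectedUrl);
  return res.text;
}

// Usage sketch (hypothetical URLs):
// const agent = require('superagent').agent();
// fetchProtectedPage(agent, 'https://example.com/login',
//   { username: 'me', password: 'secret' }, 'https://example.com/account')
//   .then((html) => console.log(html));
```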

Answered by Fabian

I've been working with NodeJs scrapers for more than 2 years now.

I can tell you that the best choice when dealing with logins and authentication is to NOT use direct requests.

That is because you waste time building requests by hand, and development is way slower.

Instead, use a high-level browser that you control via an API, such as Puppeteer or NightmareJs.

I have a good starter and in-depth guide on How to start scraping with Puppeteer, I'm sure it will help!
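
For illustration, a login with Puppeteer might look roughly like the sketch below. This is not taken from the guide mentioned above: the function name, the URLs and the CSS selectors are all assumptions that depend on the target site's login form. The function takes an already launched browser, for example from await require('puppeteer').launch().

```javascript
'use strict';

// `browser` is a Puppeteer Browser instance.
// Every URL and selector in `options` is a placeholder for the real site.
async function loginAndGetHtml(browser, options) {
  const page = await browser.newPage();
  try {
    await page.goto(options.loginUrl);
    await page.type(options.userSelector, options.username);
    await page.type(options.passSelector, options.password);
    // Start waiting before clicking so the navigation is not missed.
    await Promise.all([
      page.waitForNavigation(),
      page.click(options.submitSelector),
    ]);
    // The browser keeps the session cookies, so this page load is authenticated.
    await page.goto(options.protectedUrl);
    return await page.content();
  } finally {
    await page.close();
  }
}

// Usage sketch (hypothetical values):
// const puppeteer = require('puppeteer');
// (async () => {
//   const browser = await puppeteer.launch();
//   const html = await loginAndGetHtml(browser, {
//     loginUrl: 'https://example.com/login',
//     protectedUrl: 'https://example.com/account',
//     userSelector: '#username', passSelector: '#password',
//     submitSelector: 'button[type=submit]',
//     username: 'me', password: 'secret',
//   });
//   console.log(html);
//   await browser.close();
// })();
```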

Answered by Usman Yousaf

You can scrape data from sites that require authentication measures such as a CSRF token.

Use a cookie jar for each request like this:

var j = request.jar(); // this is to set the jar of request for session and cookie persistence

request = request.defaults({ jar: j }); //here we are setting the default cookies of request

Here is a small code sample that elaborates further:

var express = require('express');
var bodyParser = require('body-parser');
var querystring = require('querystring');
var request = require('request'); //npm request package to send a get and post request to a url
const cheerio = require('cheerio'); //npm package used for scraping content from third party sites
var cookieParser = require('cookie-parser')
var http = require('http');
var app = express();
app.use(cookieParser());

var _csrf; //variable to store the _csrf value to be used later

app.use(bodyParser.json());
var html = '';

var j = request.jar(); // this is to set the jar of request for session and cookie persistence
request = request.defaults({ jar: j }); //here we are setting the default cookies of request


//___________________API CALL TO VERIFY THE GMS NUMBER_______________________
app.get('/check', function(req, response) {

    // Bail out early when no schemeId is supplied; without the return,
    // the code below would try to send a second response.
    var schemeId = req.query.schemeId;
    if (!schemeId) {
        return response.send('false');
    }
    console.log(schemeId);
    getCsrfValue(function(err, res) {
        if (!err) {
            _csrf = res;
            console.log(_csrf);

            request.post({
                headers: {
                    'Authorization': '',
                    'Content-Type': 'application/x-www-form-urlencoded',
                },
                uri: 'https://www.xyz.site',

                // URL-encode both values so characters like + or & survive
                body: querystring.stringify({ schemeId: schemeId, _csrf: _csrf })

            }, function(err, res, body) {
                if (err) {
                    console.log(err);
                } else {
                    console.log("body of post: " + res.body);

                    const $ = cheerio.load(body.toString());
                    var txt = $('.schemeCheckResult').text();

                    console.log(txt);
                    if (txt) {
                        response.send('true');
                    } else {

                        response.send('false');
                    }
                    html += body;
                }
            });

        } else {
            response.status(500).send(String(err));
        }

    })


});

//______________FUNCTION TO SCRAPE THE CSRF TOKEN FROM THE SITE____________
function getCsrfValue(callback) {
    request.get({
        headers: {
            'Authorization': '',
            'Content-Type': 'application/x-www-form-urlencoded',
        },
        uri: 'https://www.xyz.site'

    }, function(err, res, body) {
        if (err) {
            return callback(err);
        } else {
            const $ = cheerio.load(body.toString());
            var txt = $('input[name=_csrf]').val();
            _csrf = txt;

            return callback(null, _csrf);
        }
    });

}

module.exports = app;
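
The two CSRF-related steps in the code above (pulling the token out of the HTML, then posting it back) can also be sketched as small standalone helpers. This sketch swaps cheerio for a regular expression to stay dependency-free, which is more fragile than a real HTML parser; the helper names are assumptions, and the field names _csrf and schemeId are taken from the code above.

```javascript
'use strict';

// Find <input ... name="_csrf" ... value="..."> and return its value, or null.
// A regex is a simplification; cheerio is more robust on real pages.
function extractCsrfToken(html) {
  const tag = html.match(/<input\b[^>]*\bname=["']_csrf["'][^>]*>/i);
  if (!tag) return null;
  const value = tag[0].match(/\bvalue=["']([^"']*)["']/i);
  return value ? value[1] : null;
}

// Build an application/x-www-form-urlencoded body with both fields encoded.
function buildCheckBody(schemeId, csrfToken) {
  return new URLSearchParams({ schemeId: schemeId, _csrf: csrfToken }).toString();
}

// Usage sketch:
// const token = extractCsrfToken(pageHtml);
// const body = buildCheckBody('12345', token);
```

Encoding matters here because CSRF tokens are often base64 strings containing +, / or =, which would be corrupted by plain string concatenation.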