Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/49786980/

Time: 2020-10-29 08:45:40  Source: igfitidea

How to perform unauthenticated Instagram web scraping in response to recent private API changes?

Tags: javascript, web-scraping, instagram, instagram-api

Asked by ReactingToAngularVues

Months ago, Instagram began rendering their public API inoperable by removing most features and refusing to accept new applications for most permission scopes. Further changes were made this week which constrict developer options even more.

Many of us have turned to Instagram's private web API to implement the functionality we previously had. One standout, ping/instagram_private_api, manages to rebuild most of the prior functionality. However, with the publicly announced changes this week, Instagram also made underlying changes to their private API, requiring magic variables, user-agents, and MD5 hashing to make web scraping requests possible. This can be seen by following the recent releases on the previously linked git repository, and the exact changes needed to continue fetching data can be seen here.

These changes include:

  • Persisting the User Agent & CSRF token between requests.
  • Making an initial request to https://instagram.com/ to grab an rhx_gis magic key from the response body.
  • Setting the X-Instagram-GIS header, which is formed by magically concatenating the rhx_gis key and query variables before passing them through an MD5 hash.

Anything less than this will result in a 403 error. These changes have been implemented successfully in the above repository, however, my attempt in JS continues to fail. In the below code, I am attempting to fetch the first 9 posts from a user timeline. The query parameters which determine this are:

  • query_hash of 42323d64886122307be10013ad2dcc44 (fetch media from the user's timeline).
  • variables.id of any user ID as a string (the user to fetch media from).
  • variables.first, the number of posts to fetch, as an integer.

Previously, this request could be made without any of the above changes by simply GETting from https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables=%7B%22id%22%3A%225380311726%22%2C%22first%22%3A1%7D, as the URL was unprotected.
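
As a rough illustration of how the query parameters above turn into that URL, here is a minimal sketch using only the standard URLSearchParams class. The helper name buildQueryUrl is hypothetical and not part of any library; the hash and user ID are the ones quoted in the question.

```javascript
// Build the GraphQL query URL from a query hash and a variables object.
// buildQueryUrl is a hypothetical helper name used only for this sketch.
function buildQueryUrl(queryHash, variables) {
    const params = new URLSearchParams({
        query_hash: queryHash,
        // variables travels as a JSON string; URLSearchParams percent-encodes it.
        variables: JSON.stringify(variables)
    });
    return `https://www.instagram.com/graphql/query/?${params.toString()}`;
}

const url = buildQueryUrl('42323d64886122307be10013ad2dcc44', { id: '5380311726', first: 1 });
console.log(url);
```

Keeping the JSON.stringify output in one place is probably worthwhile, since the same string is what gets hashed for the X-Instagram-GIS header.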

However, my attempt at implementing the functionality that was successfully written in the above repository is not working, and I only receive 403 responses from Instagram. I'm using superagent as my requests library, in a Node environment.

const crypto = require('crypto');
const superagent = require('superagent');

/*
** Retrieve an arbitrary cookie value by a given key.
*/
const getCookieValueFromKey = function(key, cookies) {
    const cookie = cookies.find(c => c.indexOf(key) !== -1);
    if (!cookie) {
        throw new Error('No key found.');
    }
    return (RegExp(key + '=(.*?);', 'g').exec(cookie))[1];
};

/*
** Calculate the value of the X-Instagram-GIS header by md5 hashing together the rhx_gis variable and the query variables for the request.
*/
const generateRequestSignature = function(rhxGis, queryVariables) {
    return crypto.createHash('md5').update(`${rhxGis}:${queryVariables}`, 'utf8').digest("hex");
};

/*
** Begin (the awaits below assume an enclosing async function)
*/
const userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/604.3.5 (KHTML, like Gecko) Version/11.0.1 Safari/604.3.5';

// Make an initial request to get the rhx_gis string
const initResponse = await superagent.get('https://www.instagram.com/');
const rhxGis = (RegExp('"rhx_gis":"([a-f0-9]{32})"', 'g')).exec(initResponse.text)[1];

const csrfTokenCookie = getCookieValueFromKey('csrftoken', initResponse.header['set-cookie']);

const queryVariables = JSON.stringify({
    id: "123456789",
    first: 9
});

const signature = generateRequestSignature(rhxGis, queryVariables);

const res = await superagent.get('https://www.instagram.com/graphql/query/')
    .query({
        query_hash: '42323d64886122307be10013ad2dcc44',
        variables: queryVariables
    })
    .set({
        'User-Agent': userAgent,
        'X-Instagram-GIS': signature,
        'Cookie': `rur=FRC;csrftoken=${csrfTokenCookie};ig_pr=1`
    });

What else should I try? What makes my code fail, and the provided code in the repository above work just fine?

Update (2018-04-17)

For at least the 3rd time in a week, Instagram has again updated their API. The change no longer requires the CSRF Token to form part of the hashed signature.

The question above has been updated to reflect this.

Update (2018-04-14)

Instagram has again updated their private graphql API. As far as anyone can figure out:

  • The User Agent no longer needs to be included in the X-Instagram-GIS MD5 calculation.

The question above has been updated to reflect this.

Accepted answer by Alex

Values to persist

You aren't persisting the User Agent (a requirement) in the first query to Instagram:

const initResponse = await superagent.get('https://www.instagram.com/');

Should be:

const initResponse = await superagent.get('https://www.instagram.com/')
                     .set('User-Agent', userAgent);

This must be persisted in each request, along with the csrftoken cookie.
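
One way to keep these values together is sketched below, using plain Node.js with no request library. The helper names parseSetCookies and sessionHeaders are hypothetical, and the sample Set-Cookie strings are made up for illustration; real responses may carry different cookies and attributes.

```javascript
// Collect name=value pairs from an array of Set-Cookie header strings.
// parseSetCookies is a hypothetical helper name used only for this sketch.
function parseSetCookies(setCookieHeaders) {
    const jar = {};
    for (const header of setCookieHeaders || []) {
        // The first "name=value" segment precedes attributes such as Path or Expires.
        const pair = header.split(';')[0];
        const eq = pair.indexOf('=');
        if (eq > 0) {
            jar[pair.slice(0, eq).trim()] = pair.slice(eq + 1).trim();
        }
    }
    return jar;
}

// Build the headers that should accompany every request in the session.
function sessionHeaders(userAgent, jar) {
    const cookie = Object.entries(jar).map(([k, v]) => `${k}=${v}`).join('; ');
    return { 'User-Agent': userAgent, 'Cookie': cookie };
}

// Made-up sample headers, standing in for initResponse.header['set-cookie'].
const jar = parseSetCookies([
    'csrftoken=abc123; Domain=.instagram.com; Path=/; Secure',
    'rur=FRC; Domain=.instagram.com; Path=/'
]);
const headers = sessionHeaders('ExampleAgent/1.0', jar);
console.log(headers);
```

The resulting object could then be passed to superagent's .set() on each request, so the same User Agent and cookies are sent every time.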

X-Instagram-GIS header generation

As your answer shows, you must generate the X-Instagram-GIS header from two properties: the rhx_gis value, which is found in your initial request, and the query variables in your next request. These must be MD5 hashed, as shown in your function above:

const generateRequestSignature = function(rhxGis, queryVariables) {
    return crypto.createHash('md5').update(`${rhxGis}:${queryVariables}`, 'utf8').digest("hex");
};

Answer by olllejik

So in order to call an Instagram query, you need to generate the x-instagram-gis header.

To generate this header you need to calculate the MD5 hash of the following string: "{rhx_gis}:{path}". The rhx_gis value is stored in the source code of the Instagram page, in the window._sharedData global JS variable.

Example:
If you make a GET request for user info such as https://www.instagram.com/{username}/?__a=1
you need to add the HTTP header x-instagram-gis to the request, with the value
MD5("{rhx_gis}:/{username}/")
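
A minimal sketch of that calculation in Node.js follows. The function names extractRhxGis and profileSignature are hypothetical, the sample HTML is made up, and the regular expression assumes window._sharedData is assigned a single-line JSON object ending in ";</script>", which may not hold for every page version.

```javascript
const crypto = require('crypto');

// Pull rhx_gis out of the window._sharedData assignment in the page source.
// Assumes the JSON object sits on one line and ends with ";</script>".
function extractRhxGis(html) {
    const match = /window\._sharedData\s*=\s*(\{.+?\});<\/script>/.exec(html);
    if (!match) {
        throw new Error('window._sharedData not found');
    }
    const rhxGis = JSON.parse(match[1]).rhx_gis;
    if (!rhxGis) {
        throw new Error('rhx_gis not present in _sharedData');
    }
    return rhxGis;
}

// Compute MD5("{rhx_gis}:/{username}/") as a hex string.
function profileSignature(rhxGis, username) {
    return crypto.createHash('md5').update(`${rhxGis}:/${username}/`, 'utf8').digest('hex');
}

// Made-up page fragment standing in for the real Instagram HTML.
const sampleHtml = '<script type="text/javascript">window._sharedData = ' +
    '{"rhx_gis":"0123456789abcdef0123456789abcdef"};</script>';
const gis = extractRhxGis(sampleHtml);
const sig = profileSignature(gis, 'someuser');
console.log(sig);
```

The value in sig would then be sent as the x-instagram-gis header on the ?__a=1 request for that username.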

This is tested and works 100%, so feel free to ask if something goes wrong.

Answer by Gianluca

Uhm... I don't have Node installed on my machine, so I cannot verify for sure, but it looks to me like you are missing a crucial part of the parameters in the querystring, namely the after field:

const queryVariables = JSON.stringify({
    id: "123456789",
    first: 4,
    after: "YOUR_END_CURSOR"
});

Your MD5 hash depends on those queryVariables, so without the after field it doesn't match the expected one. Try that: I expect it to work.

EDIT:

Reading your code carefully, it unfortunately doesn't make much sense. I infer that you are trying to fetch the full stream of pictures from a user's feed.

Then, what you need to do is not call the Instagram home page as you are doing now (superagent.get('https://www.instagram.com/')), but rather the user's stream (superagent.get('https://www.instagram.com/your_user')).

Beware: you need to hardcode the very same user agent you're going to use below (and it doesn't look like you are...).

Then, you need to extract the query ID (it's not hardcoded; it changes every few hours, sometimes minutes, so hardcoding it is foolish; however, for this POC, you can keep it hardcoded), and the end_cursor. For the end cursor I'd go for something like this:

const endCursor = (RegExp('end_cursor":"([^"]*)"', 'g')).exec(initResponse.text)[1];

Now you have everything you need to make the second request:

const queryVariables = JSON.stringify({
    id: "123456789",
    first: 9,
    after: endCursor
});

const signature = generateRequestSignature(rhxGis, queryVariables);

const res = await superagent.get('https://www.instagram.com/graphql/query/')
    .query({
        query_hash: '42323d64886122307be10013ad2dcc44',
        variables: queryVariables
    })
    .set({
        'User-Agent': userAgent,
        'Accept': '*/*',
        'Accept-Language': 'en-US',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'close',
        'X-Instagram-GIS': signature,
        'Cookie': `rur=${rurCookie};csrftoken=${csrfTokenCookie};mid=${midCookie};ig_pr=1`
    }).send();
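
To continue past the second request, the end cursor would come from each JSON response rather than from the page HTML. The sketch below assumes the response body has the shape data.user.edge_owner_to_timeline_media.page_info with has_next_page and end_cursor fields; that shape is an assumption not stated in this thread, and nextPageVariables is a hypothetical helper name.

```javascript
// Derive the variables string for the next page from the previous response.
// Returns null when no further page is reported.
// The response shape used here is an assumption, not confirmed by the thread.
function nextPageVariables(previous, responseBody) {
    const user = responseBody && responseBody.data && responseBody.data.user;
    const media = user && user.edge_owner_to_timeline_media;
    const pageInfo = media && media.page_info;
    if (!pageInfo || !pageInfo.has_next_page) {
        return null;
    }
    return JSON.stringify({
        id: previous.id,
        first: previous.first,
        after: pageInfo.end_cursor
    });
}

// Made-up response body for illustration.
const mockBody = {
    data: { user: { edge_owner_to_timeline_media: {
        page_info: { has_next_page: true, end_cursor: 'CURSOR_1' }
    } } }
};
const nextVars = nextPageVariables({ id: '123456789', first: 9 }, mockBody);
console.log(nextVars);
```

Because the signature depends on the variables string, each new page would need a freshly computed X-Instagram-GIS value from the returned string.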

Answer by inDream

query_hash is not constant and keeps changing over time.

For example ProfilePage scripts included these scripts:

https://www.instagram.com/static/bundles/base/ConsumerCommons.js/9e645e0f38c3.js
https://www.instagram.com/static/bundles/base/Consumer.js/1c9217689868.js

The hash is located in one of the above scripts, e.g. for edge_followed_by:

const res = await fetch(scriptUrl, { credentials: 'include' });
const rawBody = await res.text();
const body = rawBody.slice(0, rawBody.lastIndexOf('edge_followed_by'));
const hashes = body.match(/"\w{32}"/g);
// hashes[hashes.length - 2]; = edge_followed_by
// hashes[hashes.length - 1]; = edge_follow
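
The same idea can be packaged as a small function that works on a script body already held in memory, which makes it easier to try without a network call. This is only a sketch: extractQueryHashes is a hypothetical name, the sample script text is made up, and the position of the hashes relative to the marker is taken from the snippet above rather than verified against current bundles.

```javascript
// Return all quoted 32-character tokens that appear before the last
// occurrence of `marker` in the script body, with the quotes stripped.
function extractQueryHashes(scriptBody, marker) {
    const end = scriptBody.lastIndexOf(marker);
    if (end === -1) {
        return [];
    }
    // match() returns null when nothing is found, so fall back to an empty array.
    const matches = scriptBody.slice(0, end).match(/"\w{32}"/g) || [];
    return matches.map(m => m.slice(1, -1));
}

// Made-up script fragment containing two 32-character tokens before the marker.
const sample = `f("${'a'.repeat(32)}"),g("${'b'.repeat(32)}");x.edge_followed_by`;
const found = extractQueryHashes(sample, 'edge_followed_by');
console.log(found[found.length - 2], found[found.length - 1]);
```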