Scraping Data from Facebook with Python

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/19041827/

Time: 2020-08-19 12:37:49  Source: igfitidea

Tags: python, facebook, web-scraping, beautifulsoup, mechanize

Asked by cscanlin

I've been trying for several days now (unsuccessfully) to scrape cities from about 500 Facebook URLs. However, Facebook handles its data in a very strange way and I can't figure out what's going on under the hood to understand what I need to do.

Essentially the problem is that Facebook displays very different amounts of data depending on who is logged in and what the privacy settings of the account are. For instance, try opening the following three links, once in a browser where you are logged into Facebook and once in a browser where you are not:

As you can see, Facebook loads the data in both cases for the first link, but only loads data for the second link if you are logged in (to ANY account). The third link displays the city when you are logged in, but only displays other information when you are not.

The reason this is extremely problematic (and related to Python) is that when trying to scrape the page with Beautiful Soup or Mechanize, I cannot figure out how to get the program to "pretend" that I am logged into an account. This means that I can easily grab data from the first type of link (of which there are fewer than 10), but I cannot get the city from the second or third type. So far I've tried a number of solutions with little success.

Here's some sample code that works correctly for the first type, but not for other types:

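import re

import mechanize

fb_url = 'http://www.facebook.com/100004210542493'

# Open the profile page without logging in; robots.txt handling is disabled.
br = mechanize.Browser()
br.set_handle_robots(False)
br.open(fb_url)

# get_data() returns bytes on Python 3, so decode before matching.
all_html = br.response().get_data().decode('utf-8', errors='replace')
print(all_html)

# The city is the link text just before the "aboutSubtitle" div.
match = re.search(
    r'fsl fwb fcb">(.+?)</a></div><div class="aboutSubtitle fsm fwn fcg',
    all_html,
)
# No match means the page did not expose the city (e.g. login required).
city = match.group(1) if match else None

user_info = [fb_url, city]
print(user_info)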
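
Since the goal is about 500 URLs, it may help to separate the parsing step from the fetching step, so that a page without a visible city yields an empty value instead of an exception. The following is a minimal sketch, not the asker's actual code: `extract_city`, `write_rows`, the example URLs, and the sample HTML fragments are hypothetical and made up for illustration. The regular expression is the one from the question, so it only matches Facebook's markup as it was at the time.

```python
import csv
import io
import re

# Pattern copied from the question; it depends on Facebook's markup at the time.
CITY_PATTERN = re.compile(
    r'fsl fwb fcb">(.+?)</a></div><div class="aboutSubtitle fsm fwn fcg'
)


def extract_city(html):
    """Return the city found in the page HTML, or None if it is not exposed."""
    match = CITY_PATTERN.search(html)
    return match.group(1) if match else None


def write_rows(rows, out_file):
    """Write (url, city) pairs as CSV; a missing city becomes an empty cell."""
    writer = csv.writer(out_file)
    writer.writerow(["url", "city"])
    for url, city in rows:
        writer.writerow([url, city or ""])


# Hypothetical HTML fragments standing in for fetched pages.
sample_pages = {
    "http://www.facebook.com/example1": (
        '<a class="fsl fwb fcb">Springfield</a></div>'
        '<div class="aboutSubtitle fsm fwn fcg">Current city</div>'
    ),
    "http://www.facebook.com/example2": "<html>Log in to continue</html>",
}

rows = [(url, extract_city(html)) for url, html in sample_pages.items()]
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

In a real run, the HTML would come from the fetching code and `out_file` would be a file opened with `newline=""`, as the `csv` module documentation recommends.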

I also have a version that uses Beautiful Soup. If anyone has any ideas on how to get around this, I would be extremely grateful. Thank you!

Accepted answer by James Robinson

The right way to do this is to use the Facebook API. For various business, security, and privacy reasons, they go out of their way to make scraping data tricky.
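
As a rough illustration of the API route, the sketch below builds a Graph API request URL and pulls a city name out of a response body. It assumes that the user object exposes a `location` field and that you hold an access token with permission to read it. The helper names, the placeholder token, and the sample response are made up for illustration, and no request is actually sent.

```python
import json
from urllib.parse import urlencode

GRAPH_BASE = "https://graph.facebook.com"


def build_user_url(user_id, access_token, fields=("location",)):
    """Build a Graph API URL that requests selected fields of a user object."""
    query = urlencode({"fields": ",".join(fields), "access_token": access_token})
    return f"{GRAPH_BASE}/{user_id}?{query}"


def city_from_response(body):
    """Return the location name from a JSON response body, or None if absent."""
    data = json.loads(body)
    return (data.get("location") or {}).get("name")


# Placeholder token; a real one would come from a Facebook app (assumption).
url = build_user_url("100004210542493", "YOUR_ACCESS_TOKEN")
print(url)

# Hypothetical response body in the shape assumed above.
sample_body = (
    '{"id": "100004210542493", '
    '"location": {"id": "1", "name": "Springfield"}}'
)
print(city_from_response(sample_body))
```

Whether the field is returned for a given profile depends on the permissions granted to the token and on that user's privacy settings, so a `None` result should be expected for many accounts.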

If you insist on scraping, I would try to log in first, using mechanize to submit the login form. I've never tried to do this with Facebook, but a lot of websites have easier-to-parse versions intended for mobile users at m.site.com.
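
The two suggestions in this answer could be sketched roughly as follows. This is untested against Facebook itself, in keeping with the answer's own caveat: the login URL, the form position (`nr=0`), and the field names `email` and `pass` are assumptions, and `to_mobile_url` and `log_in` are hypothetical helper names. `log_in` expects a mechanize-style browser object and only uses its `open`, `select_form`, item assignment, and `submit` calls.

```python
from urllib.parse import urlsplit, urlunsplit


def to_mobile_url(url, mobile_host="m.facebook.com"):
    """Swap the host of a URL for the mobile host, keeping path and query."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=mobile_host))


def log_in(browser, email, password,
           login_url="https://m.facebook.com/login.php"):
    """Submit the first form on the login page with a mechanize-style browser.

    The login URL and the field names "email" and "pass" are assumptions
    about the login form, not confirmed details.
    """
    browser.open(login_url)
    browser.select_form(nr=0)
    browser["email"] = email
    browser["pass"] = password
    return browser.submit()


print(to_mobile_url("http://www.facebook.com/100004210542493"))
```

With mechanize installed, the intended use would be to create one `mechanize.Browser()`, pass it to `log_in`, and then reuse the same browser for `browser.open(to_mobile_url(fb_url))`, since the browser keeps the session cookies between requests.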

Answer by Rohit

You should look into using facepy by Johannes Gorset. He has done a brilliant job. I used it when I worked on a small Facebook app for a personal project.

Answer by shashivs

You can try using Selenium together with the Facebook API. I also had to scrape similar data from a list of test Facebook accounts, and Selenium WebDriver helped to emulate a real user and scrape the required data.

Answer by TNT

I think scraping data from Facebook is illegal. It is covered in Facebook's terms of use. Every activity is recorded against your login details, even when you use a bot to scrape. If caught, you can be banned from Facebook for life. If you pose a potential threat to any of their assets, they can penalize you further.