How to scrape a website which requires login using python and beautifulsoup?

Disclaimer: This page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/23102833/

How to scrape a website which requires login using python and beautifulsoup?

python, web-scraping, beautifulsoup

Asked by user781486

If I want to scrape a website that requires login with password first, how can I start scraping it with python using beautifulsoup4 library? Below is what I do for websites that do not require login.

from bs4 import BeautifulSoup    
import urllib2 
url = urllib2.urlopen("http://www.python.org")    
content = url.read()    
soup = BeautifulSoup(content)

How should the code be changed to accommodate login? Assume that the website I want to scrape is a forum that requires login. An example is http://forum.arduino.cc/index.php

Accepted answer by 4d4c

You can use mechanize:

import mechanize
import cookielib
from bs4 import BeautifulSoup

# cookie jar so the logged-in session is kept across requests
cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("https://id.arduino.cc/auth/login/")

# fill in and submit the first form on the login page
br.select_form(nr=0)
br.form['username'] = 'username'
br.form['password'] = 'password.'
br.submit()

print br.response().read()

Or urllib - Login to website using urllib2

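For reference, a minimal urllib2 sketch of that approach (the login URL and form field names are assumptions; take them from the target site's login form):

import urllib
import urllib2
import cookielib
from bs4 import BeautifulSoup

# cookie jar so the session survives between the login POST and later requests
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# placeholder URL and field names - use the ones from the site's login form
login_data = urllib.urlencode({'username': 'username', 'password': 'password'})
opener.open('https://example.com/login', login_data)

# requests made through the same opener reuse the session cookies
content = opener.open('https://example.com/protected-page').read()
soup = BeautifulSoup(content, 'html.parser')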

Answered by user8495890

You can use selenium to log in and retrieve the page source, which you can then pass to Beautiful Soup to extract the data you want.

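For example, a minimal sketch of that flow (the login URL and element ids are assumptions):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
# placeholder URL - open the real login page of the site you are scraping
driver.get("http://example.com/login")

# the element ids here are placeholders for the real ones on the login form
driver.find_element_by_id("username").send_keys("YourUsername")
driver.find_element_by_id("password").send_keys("YourPassword")
driver.find_element_by_id("submit_btn").click()

# hand the rendered, logged-in page source over to Beautiful Soup
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title)
driver.quit()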

Answered by Plabon Dutta

If you go for selenium, then you can do something like below:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait

# If you want to open Chrome
driver = webdriver.Chrome()
# If you want to open Firefox instead, use:
# driver = webdriver.Firefox()

# open the login page first (placeholder URL)
driver.get("http://example.com/login")

username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")
username.send_keys("YourUsername")
password.send_keys("YourPassword")
driver.find_element_by_id("submit_btn").click()

However, if you're adamant that you're only going to use BeautifulSoup, you can do that with a library like requests or urllib. Basically, all you have to do is POST the data as a payload with the URL.

import requests
from bs4 import BeautifulSoup

login_url = 'http://example.com/login'
data = {
    'username': 'your_username',
    'password': 'your_password'
}

with requests.Session() as s:
    # POST the credentials through the session so the login cookies are kept
    response = s.post(login_url, data=data)
    print(response.text)
    index_page = s.get('http://example.com')
    soup = BeautifulSoup(index_page.text, 'html.parser')
    print(soup.title)

Answered by Adelin

There is a simpler way, from my point of view, that gets you there without selenium or mechanize, or other 3rd party tools, albeit it is semi-automated.

Basically, when you log in to a site in the normal way, you identify yourself in a unique way using your credentials, and the same identity is used thereafter for every other interaction; it is stored in cookies and headers for a brief period of time.

What you need to do is use the same cookies and headers when you make your HTTP requests, and you'll be in.

To replicate that, follow these steps:

  1. In your browser, open the developer tools
  2. Go to the site, and log in
  3. After the login, go to the network tab, and then refresh the page
    At this point, you should see a list of requests, the top one being the actual site - and that will be our focus, because it contains the data with the identity we can use for Python and BeautifulSoup to scrape it
  4. Right click the site request (the top one), hover over copy, and then copy as cURL
    Like this:

(screenshot: the "Copy as cURL" option in the browser's network tab)

  5. Then go to this site which converts cURL into python requests: https://curl.trillworks.com/
  6. Take the python code and use the generated cookies and headers to proceed with the scraping, as in the sketch below
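
A minimal sketch of what the generated code tends to look like (the cookie and header values below are made-up placeholders; use the ones produced from your own copied cURL command):

import requests
from bs4 import BeautifulSoup

# placeholders - paste the cookies/headers generated from your own cURL copy
cookies = {'session_id': 'abc123'}
headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'http://example.com/'}

response = requests.get('http://example.com/protected-page', headers=headers, cookies=cookies)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)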

Answered by LuxZg

Since the Python version wasn't specified, here is my take on it for Python 3, done without any external libraries (StackOverflow). After login, use BeautifulSoup as usual, or do any other kind of scraping.

Likewise, the script is on my GitHub here

Whole script replicated below, as per StackOverflow guidelines:

# Login to website using just Python 3 Standard Library
import urllib.parse
import urllib.request
import http.cookiejar

def scraper_login():
    ####### change variables here, like URL, action URL, user, pass
    # your base URL here, will be used for headers and such, with and without https://
    base_url = 'www.example.com'
    https_base_url = 'https://' + base_url

    # here goes URL that's found inside form action='.....'
    #   adjust as needed, can be all kinds of weird stuff
    authentication_url = https_base_url + '/login'

    # username and password for login
    username = 'yourusername'
    password = 'SoMePassw0rd!'

    # we will use this string to confirm a login at end
    check_string = 'Logout'

    ####### rest of the script is logic
    # but you will need to tweak a couple of things, maybe regarding the "token" logic
    #   (can be _token or token or _token_ or secret ... etc)

    # big thing! you need a referer for most pages! and correct headers are the key
    headers={"Content-Type":"application/x-www-form-urlencoded",
    "User-agent":"Mozilla/5.0 Chrome/81.0.4044.92",    # Chrome 80+ as per web search
    "Host":base_url,
    "Origin":https_base_url,
    "Referer":https_base_url}

    # initiate the cookie jar (using : http.cookiejar and urllib.request)
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
    urllib.request.install_opener(opener)

    # first a simple request, just to get login page and parse out the token
    #       (using : urllib.request)
    request = urllib.request.Request(https_base_url)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # parse the page, we look for token eg. on my page it was something like this:
    #    <input type="hidden" name="_token" value="random1234567890qwertzstring">
    #       this can probably be done better with regex and similar
    #       but I'm newb, so bear with me
    html = contents.decode("utf-8")
    # text just before start and just after end of your token string
    mark_start = '<input type="hidden" name="_token" value="'
    mark_end = '">'
    # index of those two points
    start_index = html.find(mark_start) + len(mark_start)
    end_index = html.find(mark_end, start_index)
    # and text between them is our token, store it for second step of actual login
    token = html[start_index:end_index]

    # here we craft our payload, it's all the form fields, including HIDDEN fields!
    #   that includes the token we scraped earlier, as that's usually in hidden fields
    #   make sure left side is from "name" attributes of the form,
    #       and right side is what you want to post as "value"
    #   and for hidden fields make sure you replicate the expected answer,
    #       eg. "token" or "yes I agree" checkboxes and such
    payload = {
        '_token':token,
    #    'name':'value',    # make sure this is the format of all additional fields !
        'login':username,
        'password':password
    }

    # now we prepare all we need for login
    #   data - with our payload (user/pass/token) urlencoded and encoded as bytes
    data = urllib.parse.urlencode(payload)
    binary_data = data.encode('UTF-8')
    # and put the URL + encoded data + correct headers into our POST request
    #   btw, despite what I thought it is automatically treated as POST
    #   I guess because of byte encoded data field you don't need to say it like this:
    #       urllib.request.Request(authentication_url, binary_data, headers, method='POST')
    request = urllib.request.Request(authentication_url, binary_data, headers)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # just for kicks, we confirm some element in the page that's secure behind the login
    #   we use a particular string we know only occurs after login,
    #   like "logout" or "welcome" or "member", etc. I found "Logout" is pretty safe so far
    contents = contents.decode("utf-8")
    index = contents.find(check_string)
    # if we find it
    if index != -1:
        print(f"We found '{check_string}' at index position : {index}")
    else:
        print(f"String '{check_string}' was not found! Maybe we did not login ?!")

scraper_login()
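
Since the script installs the cookie-aware opener globally (urllib.request.install_opener), any later urllib.request.urlopen call in the same process reuses the login cookies. A minimal sketch of handing such a page to BeautifulSoup afterwards (the URL is a placeholder, and this assumes it runs after scraper_login()):

import urllib.request
from bs4 import BeautifulSoup

# runs after scraper_login(): the installed opener still carries the session cookies
response = urllib.request.urlopen('https://www.example.com/members-only-page')
soup = BeautifulSoup(response.read().decode('utf-8'), 'html.parser')
print(soup.title)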