Python Scrapy and Proxies

Disclaimer: the content below is taken from a popular StackOverflow question and its answers and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4710483/

Scrapy and proxies

python, scrapy

Asked by no1

How do you utilize proxy support with the python web-scraping framework Scrapy?

Accepted answer by ephemient

From the Scrapy FAQ:

Does Scrapy work with HTTP proxies?

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.

The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell.

C:\>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port

If you want to use an HTTPS proxy and visit HTTPS sites, set the environment variable https_proxy as follows:

C:\>set https_proxy=https://proxy:port
csh% setenv https_proxy https://proxy:port
sh$ export https_proxy=https://proxy:port

Answer by laurent alsina

That would be:

export http_proxy=http://user:password@proxy:port

Answer by Amom

Single Proxy

  1. Enable HttpProxyMiddleware in your settings.py, like this:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1
    }
    
  2. Pass the proxy to the request via request.meta (a combined sketch follows after these steps):

    request = Request(url="http://example.com")
    request.meta['proxy'] = "host:port"
    yield request
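
A minimal end-to-end sketch combining both steps might look like the following; the spider name, proxy address, and start URL are placeholders (note that HttpProxyMiddleware is also enabled by default in recent Scrapy versions):

from scrapy import Spider, Request

class SingleProxySpider(Spider):
    name = "single_proxy_spider"   # hypothetical spider name
    proxy = "http://host:port"     # placeholder proxy address

    def start_requests(self):
        # attach the proxy to each outgoing request via request.meta
        yield Request(url="http://example.com", meta={"proxy": self.proxy})

    def parse(self, response):
        self.logger.info("Fetched %s through the proxy", response.url)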
    
You can also choose a proxy address at random if you have an address pool, like this:

Multiple Proxies

import random

from scrapy import Request
from scrapy.spider import BaseSpider  # legacy import path; newer Scrapy versions use scrapy.Spider

class MySpider(BaseSpider):
    name = "my_spider"

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.proxy_pool = ['proxy_address1', 'proxy_address2', ..., 'proxy_addressN']

    def parse(self, response):
        ...parse code...
        if something:
            yield self.get_request(url)

    def get_request(self, url):
        req = Request(url=url)
        if self.proxy_pool:
            # pick a proxy at random from the pool for this request
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req

Answer by Shahryar Saljoughi

1. Create a new file called "middlewares.py", save it in your Scrapy project, and add the following code to it:

import base64

class ProxyMiddleware(object):
    # overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # set up basic authentication for the proxy
        # (b64encode works on both Python 2 and 3; encodestring was removed in Python 3)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

2. Open your project's configuration file (./project_name/settings.py) and add the following code:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}

Now your requests should be passed through this proxy. Simple, isn't it?

Answer by Andrea Ianni

On Windows I put together a couple of the previous answers and it worked. I simply did:

C:\> set http_proxy=http://username:password@proxy:port

and then I launched my program:

C:/.../RightFolder> scrapy crawl dmoz

where "dmoz" is the spider name (I'm writing it because it's the one you find in the tutorial on the internet, and if you're here you have probably started from that tutorial).

Answer by Andrea Ianni

As I had trouble setting the environment variable in /etc/environment, here is what I put in my spider (Python):

os.environ["http_proxy"] = "http://localhost:12345"
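
For context, here is a minimal sketch of where that line could sit in a spider module; the spider name and start URL are hypothetical, and it assumes the assignment runs before the crawler starts so that HttpProxyMiddleware can read the variable from the environment:

import os

from scrapy import Spider

# set the proxy at module import time, before any requests are scheduled
os.environ["http_proxy"] = "http://localhost:12345"

class EnvProxySpider(Spider):
    name = "env_proxy_spider"            # hypothetical spider name
    start_urls = ["http://example.com"]  # hypothetical start URL

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)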

Answer by Niranjan Sagar

There is a nice middleware written by someone else: https://github.com/aivarsk/scrapy-proxies ("Scrapy proxy middleware").

Answer by Amit

I would recommend you use a middleware such as scrapy-proxies. You can rotate proxies, filter out bad proxies, or use a single proxy for all your requests. Also, using a middleware will save you the trouble of setting up the proxy on every run.

This is directly from the GitHub README.

  • Install the scrapy-proxies library

    pip install scrapy_proxies

  • In your settings.py add the following settings

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'

# Proxy mode
# 0 = Every requests have different proxy
# 1 = Take only one proxy from the list and assign it to every requests
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0

# If proxy mode is 2 uncomment this sentence :
#CUSTOM_PROXY = "http://host1:port"

Here you can change the retry times and set a single or rotating proxy.

  • Then add your proxies to a list.txt file like this:
http://host1:port
http://username:password@host2:port
http://host3:port

After this, all your requests for that project will be sent through a proxy. The proxy is rotated randomly for every request. It will not affect concurrency.

Note: if you do not want to use a proxy, you can simply comment out the scrapy_proxies middleware line.

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
#    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

Happy crawling!!!
