Python Scrapy and Proxies

Disclaimer: the content below is taken from a popular StackOverflow question and its answers and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4710483/

Scrapy and proxies

python, scrapy

Asked by no1

How do you utilize proxy support with the python web-scraping framework Scrapy?

Accepted answer by ephemient

From the Scrapy FAQ:

Does Scrapy work with HTTP proxies?

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.

The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell.

C:\>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port

If you want to use an HTTPS proxy and visit HTTPS sites, set the environment variable https_proxy as follows:

C:\>set https_proxy=https://proxy:port
csh% setenv https_proxy https://proxy:port
sh$ export https_proxy=https://proxy:port

Answer by laurent alsina

That would be:

export http_proxy=http://user:password@proxy:port

Answer by Amom

Single Proxy

  1. Enable HttpProxyMiddleware in your settings.py, like this:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1
    }
    
  2. Pass the proxy to the request via request.meta (a combined sketch follows after these steps):

    request = Request(url="http://example.com")
    request.meta['proxy'] = "host:port"
    yield request
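
A minimal end-to-end sketch combining both steps might look like the following; the spider name, proxy address, and start URL are placeholders (note that HttpProxyMiddleware is also enabled by default in recent Scrapy versions):

from scrapy import Spider, Request

class SingleProxySpider(Spider):
    name = "single_proxy_spider"   # hypothetical spider name
    proxy = "http://host:port"     # placeholder proxy address

    def start_requests(self):
        # attach the proxy to each outgoing request via request.meta
        yield Request(url="http://example.com", meta={"proxy": self.proxy})

    def parse(self, response):
        self.logger.info("Fetched %s through the proxy", response.url)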
    
You can also choose a proxy address at random if you have an address pool, like this:

Multiple Proxies

import random

from scrapy import Request
from scrapy.spider import BaseSpider  # legacy import path; newer Scrapy versions use scrapy.Spider

class MySpider(BaseSpider):
    name = "my_spider"

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.proxy_pool = ['proxy_address1', 'proxy_address2', ..., 'proxy_addressN']

    def parse(self, response):
        ...parse code...
        if something:
            yield self.get_request(url)

    def get_request(self, url):
        req = Request(url=url)
        if self.proxy_pool:
            # pick a proxy at random from the pool for this request
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req

Answer by Shahryar Saljoughi

1. Create a new file called "middlewares.py", save it in your Scrapy project, and add the following code to it:

import base64

class ProxyMiddleware(object):
    # overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # set up basic authentication for the proxy
        # (b64encode works on both Python 2 and 3; encodestring was removed in Python 3)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

2. Open your project's configuration file (./project_name/settings.py) and add the following code:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}

Now your requests should be passed through this proxy. Simple, isn't it?

Answer by Andrea Ianni

On Windows I put together a couple of the previous answers and it worked. I simply did:

C:\> set http_proxy=http://username:password@proxy:port

and then I launched my program:

C:/.../RightFolder> scrapy crawl dmoz

where "dmoz" is the spider name (I'm writing it because it's the one you find in the tutorial on the internet, and if you're here you have probably started from that tutorial).

Answer by Andrea Ianni

As I had trouble setting the environment variable in /etc/environment, here is what I put in my spider (Python):

os.environ["http_proxy"] = "http://localhost:12345"
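
For context, here is a minimal sketch of where that line could sit in a spider module; the spider name and start URL are hypothetical, and it assumes the assignment runs before the crawler starts so that HttpProxyMiddleware can read the variable from the environment:

import os

from scrapy import Spider

# set the proxy at module import time, before any requests are scheduled
os.environ["http_proxy"] = "http://localhost:12345"

class EnvProxySpider(Spider):
    name = "env_proxy_spider"            # hypothetical spider name
    start_urls = ["http://example.com"]  # hypothetical start URL

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)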

Answer by Niranjan Sagar

There is a nice middleware written by someone else: https://github.com/aivarsk/scrapy-proxies ("Scrapy proxy middleware").

Answer by Amit

I would recommend you use a middleware such as scrapy-proxies. You can rotate proxies, filter out bad proxies, or use a single proxy for all your requests. Also, using a middleware will save you the trouble of setting up the proxy on every run.

This is directly from the GitHub README.

  • Install the scrapy-proxies library

    pip install scrapy_proxies

  • In your settings.py add the following settings

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'

# Proxy mode
# 0 = Every requests have different proxy
# 1 = Take only one proxy from the list and assign it to every requests
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0

# If proxy mode is 2 uncomment this sentence :
#CUSTOM_PROXY = "http://host1:port"

Here you can change the retry times and set a single or rotating proxy.

  • Then add your proxies to a list.txt file like this:
http://host1:port
http://username:password@host2:port
http://host3:port

After this, all your requests for that project will be sent through a proxy. The proxy is rotated randomly for every request. It will not affect concurrency.

Note: if you do not want to use a proxy, you can simply comment out the scrapy_proxies middleware line.

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
#    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

Happy crawling!!!
