Original question: http://stackoverflow.com/questions/1934088/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Rotating Proxies for web scraping
Asked by Jacob
I've got a Python web crawler and I want to distribute the download requests among many different proxy servers, probably running Squid (though I'm open to alternatives). For example, it could work in a round-robin fashion, where request1 goes to proxy1, request2 to proxy2, and so on, eventually looping back around. Any idea how to set this up?
To make it harder, I'd also like to be able to dynamically change the list of available proxies, bring some down, and add others.
If it matters, IP addresses are assigned dynamically.
Thanks :)
Accepted answer by Bernd
Give your crawler a list of proxies and have it use the next proxy from the list for each HTTP request, in a round-robin fashion. However, this will prevent you from using HTTP/1.1 persistent connections. Changes to the proxy list take effect as the rotation proceeds: a newly added proxy eventually gets used, and a removed one stops being used.
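A minimal sketch of that first approach, using only the standard library. The class name `ProxyRotator`, the helper `fetch`, and the choice to fall back to a direct connection when the list is empty are assumptions made for illustration, not part of the original answer. Proxy URLs such as `http://10.0.0.1:3128` are placeholders.

```python
import threading
import urllib.request


class ProxyRotator:
    """Round-robin over a proxy list that can change at runtime."""

    def __init__(self, proxies=()):
        self._proxies = list(proxies)
        self._index = 0
        self._lock = threading.Lock()

    def add(self, proxy):
        with self._lock:
            if proxy not in self._proxies:
                self._proxies.append(proxy)

    def remove(self, proxy):
        with self._lock:
            if proxy in self._proxies:
                pos = self._proxies.index(proxy)
                self._proxies.remove(proxy)
                # Keep the cursor pointing at the same "next" proxy.
                if pos < self._index:
                    self._index -= 1

    def next(self):
        with self._lock:
            if not self._proxies:
                return None  # assumption: an empty list means "go direct"
            self._index %= len(self._proxies)
            proxy = self._proxies[self._index]
            self._index += 1
            return proxy


def fetch(url, rotator, timeout=10):
    """Fetch one URL through whichever proxy is next in the rotation."""
    proxy = rotator.next()
    # An empty dict tells ProxyHandler to use no proxy at all.
    handler = urllib.request.ProxyHandler(
        {"http": proxy, "https": proxy} if proxy else {}
    )
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Because a new opener is built per request, each request opens a fresh connection, which is the loss of persistent connections mentioned above. Calling `rotator.add(...)` or `rotator.remove(...)` from another thread changes the rotation for subsequent requests.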
Or have several connections open in parallel, one to each proxy, and distribute your crawling requests across the open connections. Dynamic changes can be implemented by having each connector register itself with the request dispatcher.
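The second approach could be sketched roughly as follows: one worker thread per proxy, each holding a single `http.client.HTTPConnection` to its proxy and pulling URLs from a shared queue. The names `Dispatcher`, `ProxyWorker`, and `fetch_via` are hypothetical, proxies are given as `(host, port)` tuples, and the sketch assumes plain-HTTP targets and a proxy that keeps connections alive. Requests go to whichever worker is free rather than in strict round-robin order.

```python
import http.client
import queue
import threading


def fetch_via(conn, url):
    """GET an absolute http:// URL over an open connection to a proxy."""
    conn.request("GET", url)  # absolute URL in the request line
    response = conn.getresponse()
    return response.status, response.read()


class ProxyWorker(threading.Thread):
    """Owns one connection to one proxy and serves URLs from the queue."""

    def __init__(self, proxy, tasks, fetch):
        super().__init__(daemon=True)
        self.proxy, self.tasks, self.fetch = proxy, tasks, fetch
        self.stop_event = threading.Event()

    def run(self):
        host, port = self.proxy
        # The socket is opened lazily on the first request.
        conn = http.client.HTTPConnection(host, port, timeout=10)
        while not self.stop_event.is_set():
            try:
                url, callback = self.tasks.get(timeout=0.2)
            except queue.Empty:
                continue
            try:
                callback(url, self.proxy, self.fetch(conn, url))
            except Exception as exc:
                conn.close()  # reset; http.client reconnects on next request
                callback(url, self.proxy, exc)
            finally:
                self.tasks.task_done()
        conn.close()


class Dispatcher:
    """Shared queue plus a registry of proxy workers."""

    def __init__(self):
        self.tasks = queue.Queue()
        self._workers = {}
        self._lock = threading.Lock()

    def register(self, proxy, fetch=fetch_via):
        worker = ProxyWorker(proxy, self.tasks, fetch)
        with self._lock:
            self._workers[proxy] = worker
        worker.start()

    def unregister(self, proxy):
        with self._lock:
            worker = self._workers.pop(proxy, None)
        if worker:
            worker.stop_event.set()  # worker exits after its current task

    def submit(self, url, callback):
        self.tasks.put((url, callback))
```

A crawler would call `register((host, port))` when a proxy comes up, `unregister((host, port))` when it goes down, and `submit(url, callback)` for each page. The callback receives the URL, the proxy that served it, and either a `(status, body)` tuple or the exception that was raised.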
Answer by sw.
I've set up rotating proxies using HAProxy + DeleGate + multiple Tor instances. With Tor you don't have good control over bandwidth and latency, but it's useful for web scraping. I've just published an article on the subject: Running Your Own Anonymous Rotating Proxies
Answer by Andrey E
Edit: There is even a Python wrapper for gimmeproxy: https://github.com/ericfourrier/gimmeproxy-api
If you don't mind Node, you can use proxy-lists to collect public proxies and check-proxy to check them. It's exactly how https://gimmeproxy.com works, more info here