Scrapy Python Set up User Agent
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/18920930/
Asked by B.Mr.W.
I tried to override the user-agent of my crawlspider by adding an extra line to the project configuration file. Here is the code:
[settings]
default = myproject.settings
USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
[deploy]
#url = http://localhost:6800/
project = myproject
But when I run the crawler against my own website, I notice that the spider did not pick up my customized user agent but the default one, "Scrapy/0.18.2 (+http://scrapy.org)". Can anyone explain what I have done wrong?
Note:
(1). It worked when I tried to override the user agent globally:
scrapy crawl myproject.com -o output.csv -t csv -s USER_AGENT="Mozilla...."
(2). When I remove the line "default = myproject.settings" from the configuration file and run scrapy crawl myproject.com, it says "cannot find spider...", so the default setting should not be removed in this case.
Thanks a lot in advance for the help.
Accepted answer by paul trmbrth
Move your USER_AGENT line to the settings.py file, not to the scrapy.cfg file. settings.py should be at the same level as items.py if you used the scrapy startproject command; in your case it should be something like myproject/settings.py.
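For concreteness, a minimal myproject/settings.py along these lines might look like the sketch below (BOT_NAME and the spider module paths are assumptions based on the default startproject layout, not taken from the question):

# myproject/settings.py -- a minimal sketch, assuming the default startproject layout
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'
# Scrapy reads USER_AGENT from here, not from scrapy.cfg
USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"

With the line in settings.py, the command-line override (-s USER_AGENT=...) is no longer needed, although it still takes precedence if given.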
Answered by Jéter Silveira
I had the same problem. Try running your spider as superuser. I was running the spider directly with the command "scrapy runspider"; when I tried executing it with "sudo scrapy runspider", it worked.
Answered by Bletch
Just in case anyone lands here who manually controls the scrapy crawl, i.e. you do not use the scrapy crawl process from the shell...
$ scrapy crawl myproject
But instead you use CrawlerProcess() or CrawlerRunner()...
process = CrawlerProcess()
or
process = CrawlerRunner()
then the user agent, along with other settings, can be passed to the crawler in a dictionary of configuration variables.
Like this...
process = CrawlerProcess(
    {
        'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
    }
)
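Putting the pieces together, a self-contained script using this approach might look like the following sketch (MySpider, its name, and the start URL are hypothetical placeholders, not part of the original answer):

from scrapy import Spider
from scrapy.crawler import CrawlerProcess

class MySpider(Spider):
    # Hypothetical spider used only to illustrate the pattern
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Yield the page title so the crawl produces some output
        yield {'title': response.css('title::text').get()}

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
})
process.crawl(MySpider)
process.start()  # blocks until the crawl finishes

Note that a settings dictionary passed this way replaces, rather than merges with, your project's settings.py; to combine both, you could start from scrapy.utils.project.get_project_settings() and set USER_AGENT on the returned settings object.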