Scrapy Python Set up User Agent
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/18920930/
Asked by B.Mr.W.
I tried to override the user-agent of my crawlspider by adding an extra line to the project configuration file. Here is the code:
[settings]
default = myproject.settings
USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
[deploy]
#url = http://localhost:6800/
project = myproject
But when I run the crawler against my own website, I notice that the spider did not pick up my customized user agent but the default one, "Scrapy/0.18.2 (+http://scrapy.org)". Can anyone explain what I have done wrong?
Note:
(1). It worked when I tried to override the user agent globally:
scrapy crawl myproject.com -o output.csv -t csv -s USER_AGENT="Mozilla...."
(2). When I remove the line "default = myproject.settings" from the configuration file and run scrapy crawl myproject.com, it says "cannot find spider...", so the default setting should not be removed in this case.
Thanks a lot in advance for the help.
Accepted answer by paul trmbrth
Move your USER_AGENT line to the settings.py file, not to the scrapy.cfg file. settings.py should be at the same level as items.py if you used the scrapy startproject command; in your case it should be something like myproject/settings.py.
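For concreteness, a minimal myproject/settings.py along these lines might look like the sketch below (BOT_NAME and the spider module paths are assumptions based on the default startproject layout, not taken from the question):

# myproject/settings.py -- a minimal sketch, assuming the default startproject layout
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'
# Scrapy reads USER_AGENT from here, not from scrapy.cfg
USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"

With the line in settings.py, the command-line override (-s USER_AGENT=...) is no longer needed, although it still takes precedence if given.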
Answered by Jéter Silveira
I had the same problem. Try running your spider as superuser. I was running the spider directly with the command "scrapy runspider"; when I tried executing it with "sudo scrapy runspider", it worked.
Answered by Bletch
Just in case anyone lands here who manually controls the scrapy crawl, i.e. you do not use the scrapy crawl process from the shell...
$ scrapy crawl myproject
But instead you use CrawlerProcess() or CrawlerRunner()...
process = CrawlerProcess()
or
process = CrawlerRunner()
then the user agent, along with other settings, can be passed to the crawler in a dictionary of configuration variables.
Like this...
process = CrawlerProcess(
    {
        'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
    }
)
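Putting the pieces together, a self-contained script using this approach might look like the following sketch (MySpider, its name, and the start URL are hypothetical placeholders, not part of the original answer):

from scrapy import Spider
from scrapy.crawler import CrawlerProcess

class MySpider(Spider):
    # Hypothetical spider used only to illustrate the pattern
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Yield the page title so the crawl produces some output
        yield {'title': response.css('title::text').get()}

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
})
process.crawl(MySpider)
process.start()  # blocks until the crawl finishes

Note that a settings dictionary passed this way replaces, rather than merges with, your project's settings.py; to combine both, you could start from scrapy.utils.project.get_project_settings() and set USER_AGENT on the returned settings object.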