Warning: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/21788939/

How to use PyCharm to debug Scrapy projects

Tags: python, debugging, python-2.7, scrapy, pycharm

Asked by William Kinaan

I am working on Scrapy 0.20 with Python 2.7, and I found that PyCharm has a good Python debugger. I want to use it to test my Scrapy spiders. Does anyone know how to do that?

What I have tried

Actually, I tried to run the spider as a script, and I built that script. Then I tried to add my Scrapy project to PyCharm as a module, like this:

File -> Settings -> Project Structure -> Add Content Root

But I don't know what else I have to do.

Accepted answer by Pullie

The scrapy command is a Python script, which means you can start it from inside PyCharm.

When you examine the scrapy binary (which scrapy) you will notice that it is actually a Python script:

#!/usr/bin/python

from scrapy.cmdline import execute
execute()

This means that a command like scrapy crawl IcecatCrawler can also be executed like this: python /Library/Python/2.7/site-packages/scrapy/cmdline.py crawl IcecatCrawler

Try to find the scrapy.cmdline package. In my case it was located at /Library/Python/2.7/site-packages/scrapy/cmdline.py
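
If you are not sure where the package lives, a minimal sketch to find it (assuming you run it with the interpreter of the environment where Scrapy is installed) is to print the module's path:

# print the location of scrapy/cmdline.py in the active environment
import scrapy.cmdline
print(scrapy.cmdline.__file__)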

Create a run/debug configuration inside PyCharm with that script as the script to run. Fill the script parameters with the scrapy command and spider name, in this case crawl IcecatCrawler.

Like this: [screenshot: PyCharm Run/Debug Configuration]

Put your breakpoints anywhere in your crawling code, and it should work.

Answer by warvariuc

I am also using PyCharm, but I am not using its built-in debugging features.

For debugging I am using ipdb. I set up a keyboard shortcut to insert import ipdb; ipdb.set_trace() on any line where I want the breakpoint to be.

Then I can type n to execute the next statement, s to step into a function, type any object name to see its value, alter the execution environment, or type c to continue execution...
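
An illustrative session at such a breakpoint might look like this (the file path, line number, and output are made up for illustration):

ipdb> n
> /path/to/myproject/spiders/example.py(12)parse()
ipdb> response.url
'http://example.com'
ipdb> c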

This is very flexible, and it works in environments other than PyCharm where you don't control the execution environment.

Just run pip install ipdb in your virtual environment and place import ipdb; ipdb.set_trace() on the line where you want execution to pause.
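
For example, here is a minimal sketch of a spider with such a breakpoint in its parse callback (the spider name and URL are hypothetical):

import ipdb
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'  # hypothetical spider name
    start_urls = ['http://example.com']  # hypothetical start URL

    def parse(self, response):
        ipdb.set_trace()  # execution pauses here; inspect 'response' interactively
        yield {'url': response.url}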

Answer by Rodrigo

You just need to do this.

Create a Python file in the crawler folder of your project. I used main.py.

  • Project
    • Crawler
      • Crawler
        • Spiders
        • ...
      • main.py
      • scrapy.cfg

Inside your main.py, put the code below.

from scrapy import cmdline

# replace "spider" with the name of your spider
cmdline.execute("scrapy crawl spider".split())

Then you need to create a "Run Configuration" to run your main.py.

This way, if you put a breakpoint in your code, execution will stop there.

Answer by taylor

To add a bit to the accepted answer, after almost an hour I found I had to select the correct Run Configuration from the dropdown list (near the center of the icon toolbar), then click the Debug button in order to get it to work. Hope this helps!

Answer by rioted

I am running Scrapy in a virtualenv with Python 3.5.0, and setting the "script" parameter to /path_to_project_env/env/bin/scrapy solved the issue for me.
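
Concretely, the run configuration would look something like this (the paths and the spider name are illustrative):

Script:            /path_to_project_env/env/bin/scrapy
Script parameters: crawl myspider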

Answer by LuciferHyman

IntelliJ IDEA also works.

Create main.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy import cmdline


def main(name):
    # run the given command line through Scrapy, e.g. "scrapy crawl stack"
    if name:
        cmdline.execute(name.split())


if __name__ == '__main__':
    print('[*] beginning main thread')
    name = "scrapy crawl stack"
    # name = "scrapy crawl spa"
    main(name)
    print('[*] main thread exited')

Answer by berardino

According to the documentation (https://doc.scrapy.org/en/latest/topics/practices.html):

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
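
A note on usage: you can save the snippet above as an ordinary Python file and point a regular PyCharm run/debug configuration at it; since process.start() blocks until crawling finishes, breakpoints placed inside the spider's callbacks will be hit while it runs. A hypothetical invocation from a terminal, assuming the file is saved as run_spider.py:

python run_spider.py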

Answer by Rutger de Knijf

As of 2018.1 this became a lot easier. You can now select Module name in your project's Run/Debug Configuration. Set this to scrapy.cmdline and the Working directory to the root dir of the Scrapy project (the one with settings.py in it).

Like so:

[screenshot: PyCharm Scrapy debug configuration]

Now you can add breakpoints to debug your code.
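
For reference, PyCharm's Module name option runs the module with python -m, so this configuration is equivalent to running the following from the project root (myspider is a hypothetical spider name):

python -m scrapy.cmdline crawl myspider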

Answer by gangabass

I use this simple script:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

process.crawl('your_spider_name')
process.start()
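
Note that get_project_settings() locates your project by searching for scrapy.cfg starting from the current working directory, so the script should be run from (or have its working directory set to) the project root, for example:

cd /path/to/your/project && python run.py    # run.py is a hypothetical name for this script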

Answer by Muhammad Haseeb

Extending @Rodrigo's version of the answer, I added this script so that I can now set the spider name from the run configuration instead of hard-coding it in the string.

import sys
from scrapy import cmdline

# the spider name is taken from the first command-line argument
cmdline.execute(f"scrapy crawl {sys.argv[1]}".split())
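
To use it, pass the spider name as the first script parameter in the run configuration; from a terminal the equivalent invocation would be (myspider is a hypothetical spider name):

python main.py myspider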