
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/21103533/


Missing scheme in request URL

Tags: python, url, scrapy

Asked by Toby

I've been stuck on this bug for a while; the error message is as follows:

File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\http\request\__init__.py", line 61, in _set_url
            raise ValueError('Missing scheme in request url: %s' % self._url)
            exceptions.ValueError: Missing scheme in request url: h

Scrapy code:


    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector
    from scrapy.http import Request
    from spyder.items import SypderItem

    import sys
    import MySQLdb
    import hashlib
    from scrapy import signals
    from scrapy.xlib.pydispatch import dispatcher

    # _*_ coding: utf-8 _*_

    class some_Spyder(CrawlSpider):
        name = "spyder"

        def __init__(self, *a, **kw):
            # catch the spider stopping
            # dispatcher.connect(self.spider_closed, signals.spider_closed)
            # dispatcher.connect(self.on_engine_stopped, signals.engine_stopped)

            self.allowed_domains = "domainname.com"
            self.start_urls = "http://www.domainname.com/"
            self.xpaths = '''//td[@class="CatBg" and @width="25%" 
                          and @valign="top" and @align="center"]
                          /table[@cellspacing="0"]//tr/td/a/@href'''

            self.rules = (
                Rule(SgmlLinkExtractor(restrict_xpaths=(self.xpaths))),
                Rule(SgmlLinkExtractor(allow=('cart.php?')), callback='parse_items'),
                )

            super(some_Spyder, self).__init__(*a, **kw)

        def parse_items(self, response):
            sel = Selector(response)
            items = []
            listings = sel.xpath('//*[@id="tabContent"]/table/tr')

            item = SypderItem()  # match the item class imported above
            item["header"] = sel.xpath('//td[@valign="center"]/h1/text()')

            items.append(item)
            return items

I'm pretty sure it has something to do with the URLs I'm asking Scrapy to follow via the LinkExtractor. When extracting them in the shell they look something like this:

data=u'cart.php?target=category&category_id=826'

Compared to another URL extracted from a working spider:


data=u'/path/someotherpath/category.php?query=someval'

I've had a look at a few questions on Stack Overflow, such as Downloading pictures with scrapy, but from reading them I think I may have a slightly different problem.

I also took a look at this - http://static.scrapy.org/coverage-report/scrapy_http_request___init__.html


It explains that the error is raised if the request URL is missing a ":". Looking at the start_urls I've defined, I can't quite see why this error would appear, since the scheme is clearly defined.

Accepted answer by Guy Gavriely

Change start_urls to:

self.start_urls = ["http://www.bankofwow.com/"]

Answer by rich tier

Prepend the URL with 'http://' or 'https://'.
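For a relative link like the one extracted in the question, that means building an absolute URL before creating a Request. Below is a minimal sketch using only the standard library; the base URL is the question's placeholder domain, so treat it as an assumption:

    import urlparse  # Python 2, as in the question; on Python 3 use urllib.parse

    # Join the scheme-less link extracted in the shell onto the site's base URL.
    relative = u'cart.php?target=category&category_id=826'
    absolute = urlparse.urljoin('http://www.domainname.com/', relative)
    print(absolute)  # http://www.domainname.com/cart.php?target=category&category_id=826

Inside a spider callback, response.url is the natural base for such a join.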

Answer by paul trmbrth

As @Guy answered earlier, the start_urls attribute must be a list; the exceptions.ValueError: Missing scheme in request url: h message comes from that: the "h" in the error message is the first character of "http://www.bankofwow.com/", interpreted as a list (of characters).
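Below is a minimal sketch of that explanation (an illustration, not Scrapy's actual internals): iterating over a string visits it character by character, and Scrapy iterates over start_urls to build its first requests.

    start_urls = "http://www.bankofwow.com/"  # a string, not a list

    # Iterating over a string yields single characters.
    for url in start_urls:
        print(url)  # 'h' on the first pass - the "h" from the error message
        break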

allowed_domains must also be a list of domains, otherwise you'll get filtered "offsite" requests.
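For the spider in the question that would be (domain taken from the question's code):

    self.allowed_domains = ["domainname.com"]  # a list of domains, not a bare string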

Change restrict_xpaths to:

self.xpaths = """//td[@class="CatBg" and @width="25%" 
                    and @valign="top" and @align="center"]
                   /table[@cellspacing="0"]//tr/td"""

It should represent an area of the document in which to find links; it should not be the link URLs directly.

From http://doc.scrapy.org/en/latest/topics/link-extractors.html#sgmllinkextractor


restrict_xpaths (str or list) – is an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links.

Finally, it's customary to define these as class attributes instead of setting them in __init__:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request
from bow.items import BowItem

import sys
import MySQLdb
import hashlib
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

# _*_ coding: utf-8 _*_

class bankOfWow_spider(CrawlSpider):
    name = "bankofwow"

    allowed_domains = ["bankofwow.com"]
    start_urls = ["http://www.bankofwow.com/"]
    xpaths = '''//td[@class="CatBg" and @width="25%"
                  and @valign="top" and @align="center"]
                  /table[@cellspacing="0"]//tr/td'''

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=(xpaths,))),
        Rule(SgmlLinkExtractor(allow=('cart.php?')), callback='parse_items'),
        )

    def __init__(self, *a, **kw):
        # catch the spider stopping
        # dispatcher.connect(self.spider_closed, signals.spider_closed)
        # dispatcher.connect(self.on_engine_stopped, signals.engine_stopped)
        super(bankOfWow_spider, self).__init__(*a, **kw)

    def parse_items(self, response):
        sel = Selector(response)
        items = []
        listings = sel.xpath('//*[@id="tabContent"]/table/tr')

        item = BowItem()  # match the item class imported above
        item["header"] = sel.xpath('//td[@valign="center"]/h1/text()')

        items.append(item)
        return items

Answer by Snail-Horn

A scheme basically has a syntax like

scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]

Examples of popular schemes include http(s), ftp, mailto, file, data, and irc. There are also terms like about or about:blank that we are somewhat familiar with.

It's clearer in the description on that same definition page:

                    hierarchical part
        ┌───────────────────┴─────────────────────┐
                    authority               path
        ┌───────────────┴───────────────┐┌───┴────┐
  abc://username:password@example.com:123/path/data?key=value&key2=value2#fragid1
  └┬┘   └───────┬───────┘ └────┬────┘ └┬┘           └─────────┬─────────┘ └──┬──┘
scheme  user information     host     port                  query         fragment

  urn:example:mammal:monotreme:echidna
  └┬┘ └──────────────┬───────────────┘
scheme              path

In the case of this Missing scheme error, it appears that the [//[user:password@]host[:port]] part is missing in

data=u'cart.php?target=category&category_id=826'

as mentioned above.

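A minimal sketch that makes the missing parts visible with the standard library (an illustration added here, not part of the original answer):

    from urlparse import urlparse  # Python 2; on Python 3: from urllib.parse import urlparse

    # The extracted link has neither a scheme nor an authority part.
    print(urlparse(u'cart.php?target=category&category_id=826').scheme)  # '' - empty
    # A full URL parses with its scheme intact.
    print(urlparse(u'http://www.bankofwow.com/cart.php?target=category').scheme)  # 'http'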

I had a similar problem where this simple concept was enough to solve it for me!

Hope this helps some.


Answer by liaogx

Change start_urls to:

self.start_urls = ("http://www.domainname.com/",)

It should work.

Answer by Shyam Prakash

The error is because start_urls is in a tuple: start_urls = ('http://quotes.toscrape.com/',)

Change start_urls to a list: start_urls = ['http://quotes.toscrape.com/']

Answer by Shyam Prakash

yield{"Text": text, ^ IndentationError: unindent does not match any outer indentation level

yield{"Text": text, ^ IndentationError: unindent 不匹配任何外部缩进级别

When this error appears while using the Sublime editor, the cause is usually mixed tabs and spaces, which are difficult to spot there. An easy fix is to copy the full code into an ordinary text document.

You can then easily see the indentation differences under the for loop and the statements that follow, correct them in Notepad, and copy the code back into Sublime; the code will run.
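If you would rather not switch editors, here is a minimal sketch for finding the offending lines programmatically (spider.py is a placeholder filename):

    # Report every line whose leading whitespace mixes tabs and spaces.
    with open('spider.py') as f:  # placeholder: use your own script's path
        for number, line in enumerate(f, 1):
            indent = line[:len(line) - len(line.lstrip())]
            if '\t' in indent and ' ' in indent:
                print("line %d: %r" % (number, indent))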