Missing scheme in request URL in Python

Disclaimer: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, link to the original address, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/21103533/
Missing scheme in request URL
Asked by Toby
I've been stuck on this bug for a while; the error message is as follows:
File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\http\request\__init__.py", line 61, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url: h
Scrapy code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request
from spyder.items import SypderItem

import sys
import MySQLdb
import hashlib
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

# _*_ coding: utf-8 _*_

class some_Spyder(CrawlSpider):
    name = "spyder"

    def __init__(self, *a, **kw):
        # catch the spider stopping
        # dispatcher.connect(self.spider_closed, signals.spider_closed)
        # dispatcher.connect(self.on_engine_stopped, signals.engine_stopped)

        self.allowed_domains = "domainname.com"
        self.start_urls = "http://www.domainname.com/"
        self.xpaths = '''//td[@class="CatBg" and @width="25%"
                         and @valign="top" and @align="center"]
                         /table[@cellspacing="0"]//tr/td/a/@href'''

        self.rules = (
            Rule(SgmlLinkExtractor(restrict_xpaths=(self.xpaths))),
            Rule(SgmlLinkExtractor(allow=('cart.php?')), callback='parse_items'),
        )

        super(spyder, self).__init__(*a, **kw)

    def parse_items(self, response):
        sel = Selector(response)
        items = []
        listings = sel.xpath('//*[@id="tabContent"]/table/tr')

        item = IgeItem()
        item["header"] = sel.xpath('//td[@valign="center"]/h1/text()')

        items.append(item)
        return items
I'm pretty sure it's something to do with the URLs I'm asking Scrapy to follow in the LinkExtractor. When extracting them in the shell, they look something like this:
data=u'cart.php?target=category&category_id=826'
Compared to another URL extracted from a working spider:
data=u'/path/someotherpath/category.php?query=someval'
I've had a look at a few questions on Stack Overflow, such as Downloading pictures with scrapy, but from reading it I think I may have a slightly different problem.
I also took a look at this: http://static.scrapy.org/coverage-report/scrapy_http_request___init__.html
It explains that the error is thrown if self.URLs is missing a ":". Looking at the start_urls I've defined, I can't quite see why this error would show up, since the scheme is clearly defined.
Accepted answer by Guy Gavriely
Change start_urls to:
self.start_urls = ["http://www.bankofwow.com/"]
Answered by rich tier
Prepend the URL with 'http' or 'https'.
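For a relative link like the cart.php one in the question, that amounts to joining it against the page it was extracted from. A minimal sketch, assuming Python 2's urlparse module (matching the Python 2.7 / Scrapy 0.20 setup above):

    import urlparse

    base = "http://www.bankofwow.com/"  # the page the link was extracted from
    relative = u'cart.php?target=category&category_id=826'

    absolute = urlparse.urljoin(base, relative)
    # -> u'http://www.bankofwow.com/cart.php?target=category&category_id=826'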
Answered by paul trmbrth
As @Guy answered earlier, the start_urls attribute must be a list. The exceptions.ValueError: Missing scheme in request url: h message comes from that: the "h" in the error message is the first character of "http://www.bankofwow.com/", interpreted as a list (of characters).
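A quick way to see this: a plain Python string is itself an iterable, of single characters, so when the start URLs are looped over to build the initial requests, the first "URL" is 'h'. A minimal sketch in plain Python 2, no Scrapy needed:

    start_urls = "http://www.bankofwow.com/"  # a string, not a list

    # this mimics how the start URLs get iterated over
    for url in start_urls:
        print repr(url)  # 'h' -- the value shown in the error message
        break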
allowed_domains must also be a list of domains, otherwise you'll get filtered "offsite" requests.
Change restrict_xpaths to
self.xpaths = """//td[@class="CatBg" and @width="25%"
and @valign="top" and @align="center"]
/table[@cellspacing="0"]//tr/td"""
It should represent an area in the document where links are to be found; it should not be the link URLs directly.
From http://doc.scrapy.org/en/latest/topics/link-extractors.html#sgmllinkextractor
restrict_xpaths (str or list) – is a XPath (or list of XPath's) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPath will be scanned for links.
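As a rough illustration (the XPath is shortened here for readability, and the URL is assumed), the same extractor can be tried interactively in scrapy shell:

    # inside: scrapy shell http://www.bankofwow.com/
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    # restrict_xpaths points at the region that *contains* the <a> elements;
    # the extractor itself finds the hrefs inside it and returns absolute URLs
    lx = SgmlLinkExtractor(restrict_xpaths='//td[@class="CatBg"]//table[@cellspacing="0"]//tr/td')
    for link in lx.extract_links(response):
        print link.url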
Finally, it's customary to define these as class attributes instead of setting them in __init__:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request
from bow.items import BowItem

import sys
import MySQLdb
import hashlib
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

# _*_ coding: utf-8 _*_

class bankOfWow_spider(CrawlSpider):
    name = "bankofwow"
    allowed_domains = ["bankofwow.com"]
    start_urls = ["http://www.bankofwow.com/"]
    xpaths = '''//td[@class="CatBg" and @width="25%"
                and @valign="top" and @align="center"]
                /table[@cellspacing="0"]//tr/td'''

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=(xpaths,))),
        Rule(SgmlLinkExtractor(allow=('cart.php?')), callback='parse_items'),
    )

    def __init__(self, *a, **kw):
        # catch the spider stopping
        # dispatcher.connect(self.spider_closed, signals.spider_closed)
        # dispatcher.connect(self.on_engine_stopped, signals.engine_stopped)
        super(bankOfWow_spider, self).__init__(*a, **kw)

    def parse_items(self, response):
        sel = Selector(response)
        items = []
        listings = sel.xpath('//*[@id="tabContent"]/table/tr')

        item = BowItem()
        item["header"] = sel.xpath('//td[@valign="center"]/h1/text()')

        items.append(item)
        return items
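With start_urls and allowed_domains defined as lists, Scrapy iterates over whole URL strings instead of single characters, and the offsite middleware can match the allowed domains properly. Assuming the surrounding project is named bow (to match the bow.items import above), the spider would then be run as usual with scrapy crawl bankofwow.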
Answered by Snail-Horn
Scheme basically has a syntax like
scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]
Examples of popular schemes include http(s), ftp, mailto, file, data, and irc. There could also be terms like about or about:blank that we are somewhat familiar with.
It's clearer in the description on that same definition page:
                     hierarchical part
         ┌───────────────────┴─────────────────────┐
                     authority               path
         ┌───────────────┴───────────────┐┌───┴────┐
   abc://username:password@example.com:123/path/data?key=value&key2=value2#fragid1
   └┬┘   └───────┬───────┘ └────┬────┘ └┬┘           └─────────┬─────────┘ └──┬──┘
 scheme  user information     host     port                  query         fragment

   urn:example:mammal:monotreme:echidna
   └┬┘ └──────────────┬───────────────┘
 scheme              path
In the question of Missing schemes, it appears that the [//[user:password@]host[:port]] part is missing in

data=u'cart.php?target=category&category_id=826'

as mentioned above.
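A minimal sketch using Python 2's urlparse module (assumed here, matching the rest of the code on this page) makes the missing parts visible:

    import urlparse

    print urlparse.urlparse(u'cart.php?target=category&category_id=826')
    # ParseResult(scheme='', netloc='', path=u'cart.php', params='',
    #             query=u'target=category&category_id=826', fragment='')
    # scheme and netloc are both empty -- hence "Missing scheme in request url"

    print urlparse.urlparse(u'http://www.bankofwow.com/cart.php?target=category&category_id=826')
    # ParseResult(scheme=u'http', netloc=u'www.bankofwow.com', path=u'/cart.php',
    #             params='', query=u'target=category&category_id=826', fragment='')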
I had a similar problem where this simple concept was enough to get me to the solution!
Hope this helps some.
Answered by liaogx
Change start_urls to:
self.start_urls = ("http://www.domainname.com/",)
It should work.
Answered by Shyam Prakash
The error is because start_urls is in a tuple:

start_urls = ('http://quotes.toscrape.com/',)

Change start_urls to a list:

start_urls = ['http://quotes.toscrape.com/']
Answered by Shyam Prakash
yield{"Text": text, ^ IndentationError: unindent does not match any outer indentation level
yield{"Text": text, ^ IndentationError: unindent 不匹配任何外部缩进级别
when the error comes using the sublime editor this is using mixed space and tabs space it is difficult to find but a easy solution copy the full code into a ordinary text document
当使用 sublime 编辑器出现错误时,这是使用混合空格和制表符空间很难找到,但一个简单的解决方案是将完整代码复制到普通文本文档中
you can easily identify the difference under the for loop and the upcoming statements so you are able to correct it in notepad then copy it in sublime the code will run
您可以轻松识别 for 循环和即将出现的语句下的差异,以便您可以在记事本中更正它,然后将其复制到 sublime 中,代码将运行
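Rather than hunting for the mismatch by eye, Python itself can point at it: running the script under Python 2 with the -tt flag turns inconsistent tab/space indentation into an error with a line number (Python 3 raises TabError for such files by default). The file name below is a placeholder:

    python -tt myspider.py  # fails on the offending line if tabs and spaces are mixed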

