Python Scrapy: only follow internal URLs but extract all links found

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/27964410/

Scrapy, only follow internal URLS but extract all links found

Tags: python, scrapy, web-crawler, scrape, scrapy-spider

Asked by sboss

I want to get all external links from a given website using Scrapy. Using the following code the spider crawls external links as well:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from myproject.items import someItem

class someSpider(CrawlSpider):
  name = 'crawltest'
  allowed_domains = ['someurl.com']
  start_urls = ['http://www.someurl.com/']

  rules = (Rule(LinkExtractor(), callback="parse_obj", follow=True),)

  def parse_obj(self,response):
    item = someItem()
    item['url'] = response.url
    return item

What am I missing? Doesn't "allowed_domains" prevent the external links from being crawled? If I set "allow_domains" for the LinkExtractor it does not extract the external links. Just to clarify: I want to crawl internal links but extract external links. Any help appreciated!

Accepted answer by 12Ryan12

You can also use a link extractor inside the callback to pull all the links while parsing each page.

The link extractor will filter the links for you. In this example it denies links in the allowed domain, so it only returns external links.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from myproject.items import someItem

class someSpider(CrawlSpider):
    name = 'crawltest'
    allowed_domains = ['someurl.com']
    start_urls = ['http://www.someurl.com/']

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        # Run a second link extractor inside the callback; deny=self.allowed_domains
        # filters out links matching the allowed domain, so only external links remain.
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item = someItem()
            item['url'] = link.url
            yield item
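
The answers assume a someItem class with a url field living in myproject/items.py; that file is not shown in the question, so here is a minimal sketch of what it could look like (the module path and class name are taken from the import above, everything else is an assumption):

# myproject/items.py -- minimal sketch of the item assumed by the spiders above;
# only a single url field is needed.
from scrapy.item import Item, Field

class someItem(Item):
    url = Field()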

Answered by aberna

A solution would be to use the process_links argument of the Rule together with SgmlLinkExtractor; see the link extractor documentation here: http://doc.scrapy.org/en/latest/topics/link-extractors.html

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class testSpider(CrawlSpider):
    name = "test"
    bot_name = 'test'
    allowed_domains = ["news.google.com"]
    start_urls = ["https://news.google.com/"]
    rules = (
        Rule(SgmlLinkExtractor(allow_domains=()), callback='parse_items',
             process_links="filter_links", follow=True),
    )

    def filter_links(self, links):
        # process_links receives every batch of links extracted by the rule;
        # print the ones pointing outside the allowed domain, then return the
        # full list unchanged so the crawl itself is not affected.
        for link in links:
            if self.allowed_domains[0] not in link.url:
                print(link.url)

        return links

    def parse_items(self, response):
        ### ...
        pass
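
SgmlLinkExtractor was deprecated and later removed in newer Scrapy releases; below is a minimal sketch of the same process_links idea with the current import paths, assuming a reasonably recent Scrapy (1.0+) where LinkExtractor and the spider's built-in logger are available:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class testSpider(CrawlSpider):
    name = "test"
    allowed_domains = ["news.google.com"]
    start_urls = ["https://news.google.com/"]
    rules = (
        Rule(LinkExtractor(), callback='parse_items',
             process_links='filter_links', follow=True),
    )

    def filter_links(self, links):
        # Log every link that points outside the allowed domain, then return
        # the full list so the crawl itself is unaffected.
        for link in links:
            if self.allowed_domains[0] not in link.url:
                self.logger.info(link.url)
        return links

    def parse_items(self, response):
        pass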

Answered by Ohad Zadok

Here is updated code based on 12Ryan12's answer:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()


class someSpider(CrawlSpider):
    name = 'crawltest'
    allowed_domains = ['someurl.com']
    start_urls = ['http://www.someurl.com/']
    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        # Collect every external link found on the page into a single item.
        item = MyItem()
        item['url'] = []
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item['url'].append(link.url)
        return item
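
To actually run one of these spiders and collect the extracted URLs, the usual route is the scrapy crawl command, but a small driver script also works. Here is a minimal sketch with CrawlerProcess, assuming Scrapy 2.1+ (for the FEEDS setting) and that someSpider from the last answer is importable in the current module:

from scrapy.crawler import CrawlerProcess

# Minimal driver: run the spider and dump the collected items to a JSON file.
process = CrawlerProcess(settings={
    "FEEDS": {"external_links.json": {"format": "json"}},
})
process.crawl(someSpider)
process.start()  # blocks here until the crawl is finished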