Python Scrapy: only follow internal URLs but extract all links found

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/27964410/

Scrapy, only follow internal URLS but extract all links found

Tags: python, scrapy, web-crawler, scrape, scrapy-spider

Asked by sboss

I want to get all external links from a given website using Scrapy. Using the following code the spider crawls external links as well:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from myproject.items import someItem

class someSpider(CrawlSpider):
  name = 'crawltest'
  allowed_domains = ['someurl.com']
  start_urls = ['http://www.someurl.com/']

  rules = (Rule(LinkExtractor(), callback="parse_obj", follow=True),)

  def parse_obj(self,response):
    item = someItem()
    item['url'] = response.url
    return item

What am I missing? Doesn't "allowed_domains" prevent the external links from being crawled? If I set "allow_domains" for the LinkExtractor it does not extract the external links. Just to clarify: I want to crawl internal links but extract external links. Any help appreciated!

Accepted answer by 12Ryan12

You can also use a link extractor inside the callback to pull all the links while parsing each page.

The link extractor will filter the links for you. In this example it denies links in the allowed domain, so it only returns external links.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from myproject.items import someItem

class someSpider(CrawlSpider):
    name = 'crawltest'
    allowed_domains = ['someurl.com']
    start_urls = ['http://www.someurl.com/']

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        # Run a second link extractor inside the callback; deny=self.allowed_domains
        # filters out links matching the allowed domain, so only external links remain.
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item = someItem()
            item['url'] = link.url
            yield item
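
The answers assume a someItem class with a url field living in myproject/items.py; that file is not shown in the question, so here is a minimal sketch of what it could look like (the module path and class name are taken from the import above, everything else is an assumption):

# myproject/items.py -- minimal sketch of the item assumed by the spiders above;
# only a single url field is needed.
from scrapy.item import Item, Field

class someItem(Item):
    url = Field()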

Answered by aberna

A solution would be to use the process_links argument of the Rule together with SgmlLinkExtractor; see the link extractor documentation here: http://doc.scrapy.org/en/latest/topics/link-extractors.html

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class testSpider(CrawlSpider):
    name = "test"
    bot_name = 'test'
    allowed_domains = ["news.google.com"]
    start_urls = ["https://news.google.com/"]
    rules = (
        Rule(SgmlLinkExtractor(allow_domains=()), callback='parse_items',
             process_links="filter_links", follow=True),
    )

    def filter_links(self, links):
        # process_links receives every batch of links extracted by the rule;
        # print the ones pointing outside the allowed domain, then return the
        # full list unchanged so the crawl itself is not affected.
        for link in links:
            if self.allowed_domains[0] not in link.url:
                print(link.url)

        return links

    def parse_items(self, response):
        ### ...
        pass
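
SgmlLinkExtractor was deprecated and later removed in newer Scrapy releases; below is a minimal sketch of the same process_links idea with the current import paths, assuming a reasonably recent Scrapy (1.0+) where LinkExtractor and the spider's built-in logger are available:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class testSpider(CrawlSpider):
    name = "test"
    allowed_domains = ["news.google.com"]
    start_urls = ["https://news.google.com/"]
    rules = (
        Rule(LinkExtractor(), callback='parse_items',
             process_links='filter_links', follow=True),
    )

    def filter_links(self, links):
        # Log every link that points outside the allowed domain, then return
        # the full list so the crawl itself is unaffected.
        for link in links:
            if self.allowed_domains[0] not in link.url:
                self.logger.info(link.url)
        return links

    def parse_items(self, response):
        pass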

Answered by Ohad Zadok

Here is updated code based on 12Ryan12's answer:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()


class someSpider(CrawlSpider):
    name = 'crawltest'
    allowed_domains = ['someurl.com']
    start_urls = ['http://www.someurl.com/']
    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        # Collect every external link found on the page into a single item.
        item = MyItem()
        item['url'] = []
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item['url'].append(link.url)
        return item
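
To actually run one of these spiders and collect the extracted URLs, the usual route is the scrapy crawl command, but a small driver script also works. Here is a minimal sketch with CrawlerProcess, assuming Scrapy 2.1+ (for the FEEDS setting) and that someSpider from the last answer is importable in the current module:

from scrapy.crawler import CrawlerProcess

# Minimal driver: run the spider and dump the collected items to a JSON file.
process = CrawlerProcess(settings={
    "FEEDS": {"external_links.json": {"format": "json"}},
})
process.crawl(someSpider)
process.start()  # blocks here until the crawl is finished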