Python Scrapy: only follow internal URLs, but extract all links found
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/27964410/
Scrapy, only follow internal URLs but extract all links found
Asked by sboss
I want to get all external links from a given website using Scrapy. Using the following code the spider crawls external links as well:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from myproject.items import someItem

class someSpider(CrawlSpider):
    name = 'crawltest'
    allowed_domains = ['someurl.com']
    start_urls = ['http://www.someurl.com/']

    rules = (Rule(LinkExtractor(), callback="parse_obj", follow=True),)

    def parse_obj(self, response):
        item = someItem()
        item['url'] = response.url
        return item
What am I missing? Doesn't "allowed_domains" prevent the external links from being crawled? If I set "allow_domains" for the LinkExtractor it does not extract the external links. Just to clarify: I want to crawl internal links but extract external links. Any help appreciated!
Accepted answer by 12Ryan12
You can also use the link extractor to pull all the links while you are parsing each page.
The link extractor will filter the links for you. In this example the link extractor denies links in the allowed domain, so it only picks up external links.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LxmlLinkExtractor
from myproject.items import someItem

class someSpider(CrawlSpider):
    name = 'crawltest'
    allowed_domains = ['someurl.com']
    start_urls = ['http://www.someurl.com/']

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        # extract every link on the page, excluding anything in the allowed (internal) domain
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item = someItem()
            item['url'] = link.url
            yield item
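Note that the deny argument treats the strings in allowed_domains as regular expressions matched against the full URL. If you would rather filter by domain name explicitly, LinkExtractor also accepts a deny_domains argument; a minimal sketch of the same parse_obj using it (assuming the same spider attributes and someItem as above):

    def parse_obj(self, response):
        # deny_domains excludes links by domain name rather than by URL regex
        for link in LxmlLinkExtractor(allow=(), deny_domains=self.allowed_domains).extract_links(response):
            item = someItem()
            item['url'] = link.url
            yield item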
Answer by aberna
A solution would be to make use of a process_links function on the Rule, as shown below; see the SgmlLinkExtractor documentation here: http://doc.scrapy.org/en/latest/topics/link-extractors.html
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class testSpider(CrawlSpider):
    name = "test"
    bot_name = 'test'
    allowed_domains = ["news.google.com"]
    start_urls = ["https://news.google.com/"]

    rules = (
        Rule(SgmlLinkExtractor(allow_domains=()), callback='parse_items', process_links="filter_links", follow=True),
    )

    def filter_links(self, links):
        # print every link pointing outside the allowed domain, then hand the list back to the Rule
        for link in links:
            if self.allowed_domains[0] not in link.url:
                print(link.url)
        return links

    def parse_items(self, response):
        # ...
        pass
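If you want to keep those external URLs rather than just print them, one hypothetical variant of filter_links (a sketch only; the external_links attribute is introduced here purely for illustration) could collect them on the spider and still return the full list, since allowed_domains already keeps the crawl itself internal:

    def filter_links(self, links):
        # stash external links on the spider; OffsiteMiddleware still drops
        # off-domain requests because of allowed_domains
        self.external_links = getattr(self, 'external_links', set())
        for link in links:
            if self.allowed_domains[0] not in link.url:
                self.external_links.add(link.url)
        return links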
Answer by Ohad Zadok
Updated code based on 12Ryan12's answer:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = 'crawltest'
    allowed_domains = ['someurl.com']
    start_urls = ['http://www.someurl.com/']

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        item = MyItem()
        item['url'] = []
        # collect every external link found on this page into a single item
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item['url'].append(link.url)
        return item
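To actually run the spider above and save the extracted URLs, you can use the command line (for example scrapy crawl crawltest -o external_links.json) or drive it from a small script. A sketch of the script approach with CrawlerProcess, assuming a recent Scrapy version where the FEEDS setting is available and that someSpider is importable from the script:

from scrapy.crawler import CrawlerProcess

# export every scraped item to a JSON file (the FEEDS setting requires Scrapy >= 2.1)
process = CrawlerProcess(settings={
    "FEEDS": {"external_links.json": {"format": "json"}},
})
process.crawl(someSpider)
process.start()  # blocks until the crawl is finished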