Request filtering in the Scrapy framework

Author: Nise9s | Published: 2018-03-05 14:59

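I have been struggling with Scrapy's dont_filter lately, because my spiders often stop partway through a crawl when their requests get filtered out.
I suspect the root cause is that I still do not understand how Scrapy works internally.
Consider the following code: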

from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy import Request
from example.items import xxxxItem
import re

class xxxxSpider(Spider):
    name = "example"
    allowed_domains = ["xxxx.com.cn"]
    pat = 'http://finance.xxxx.com.cn/.*[0-9]{4}-[0-9]{2}-[0-9]{2}/[a-z]*-[a-z0-9]*.*'
    def start_requests(self):
        yield Request(url="http://finance.xxxx.com.cn/", callback=self.parse)
    def parse(self, response):
        if response.status == 200:
            # Extract every link on the page.
            links = LinkExtractor(allow=()).extract_links(response)
            for link in links:
                # Only crawl URLs that match the article pattern.
                if re.search(self.pat, link.url):
                    yield Request(url=link.url, callback=self.parse_content)
    def parse_content(self, response):
        if response.status == 200:
            content = Selector(response)
            text = content.xpath("/html/body//div[@id='artibody']//p/descendant::text()").extract()
            if text:
                item = xxxxItem()
                # Join the extracted text nodes into a single string.
                item["text"] = ''.join(text)
                yield item
            # Request the same page again, this time with parse() as the callback,
            # so that the links on the article page are followed as well.
            yield Request(url=response.url, callback=self.parse, dont_filter=True)

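Setting dont_filter to True on the request in the last line keeps the spider from stopping partway through. Scrapy's scheduler normally drops any request for a URL it has already seen, and response.url has just been downloaded, so without this flag the request would be filtered out and parse would never run on the article page. With dont_filter=True the request is not filtered, the page is fetched again, and crawling continues from the links it contains.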
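
To make the filtering behaviour concrete, here is a minimal standard-library sketch of the idea. It is a simplified model and not Scrapy's actual code: the names SeenFilter and should_schedule are hypothetical, the article URL is made up to match the pattern above, and the fingerprint here hashes only the URL.

```python
import hashlib


class SeenFilter:
    """Toy stand-in for a duplicate filter: remembers one fingerprint per URL."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, url):
        # Hash the URL only; this is a simplification for illustration.
        fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False


def should_schedule(dupefilter, url, dont_filter=False):
    """Return True if the request would be queued for download."""
    # With dont_filter set, the duplicate check is skipped entirely.
    if dont_filter:
        return True
    return not dupefilter.request_seen(url)


dupefilter = SeenFilter()
# Hypothetical article URL, invented for this example.
article = "http://finance.xxxx.com.cn/2018-03-05/doc-abc123.shtml"

first = should_schedule(dupefilter, article)                     # first visit: queued
repeat = should_schedule(dupefilter, article)                    # same URL again: dropped
forced = should_schedule(dupefilter, article, dont_filter=True)  # bypasses the filter
print(first, repeat, forced)
```

In the spider above, the request yielded at the end of parse_content corresponds to the third call: the URL has already been seen, so only dont_filter=True lets it through.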