Installation and Environment Requirements
First, on a Mac with a working Python environment:
pip install Scrapy
Some dependencies may need to be installed along the way.
Scrapy is written in pure Python and depends on a few key Python packages (among others):
- lxml, an efficient XML and HTML parser
- parsel, an HTML/XML data extraction library written on top of lxml
- w3lib, a multi-purpose helper for dealing with URLs and web page encodings
- twisted, an asynchronous networking framework
- cryptography and pyOpenSSL, to deal with various network-level security needs
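Once the install completes, a quick sanity check (a minimal sketch) confirms Scrapy is importable and shows which version was installed:

# Sanity check: import Scrapy and print the installed version
import scrapy
print(scrapy.__version__)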
Basic Testing and Usage
- Create a project: scrapy startproject tutorial
- List the spiders in the project directory: scrapy list
- Run a spider: scrapy crawl quotes
- Extract data interactively from the command line: scrapy shell 'http://quotes.toscrape.com/page/1/'
Finding nodes and extracting text with CSS selectors
response.css('title')
response.css('title::text').extract()
response.css('title').extract()
response.css('title::text').extract_first()
response.css('title::text')[0].extract()
response.css('title::text').re(r'Quotes.*')
response.css('title::text').re(r'Q\w+')
response.css('title::text').re(r'(\w+) to (\w+)')
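For reference, assuming the page title is "Quotes to Scrape" (as on quotes.toscrape.com), the last two calls should behave roughly like this in the shell:

>>> response.css('title::text').extract_first()
'Quotes to Scrape'
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']

Note that .re() with capture groups returns the flattened group matches rather than the whole match.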
Extracting with XPath
response.xpath('//title')
response.xpath('//title/text()').extract_first()
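These XPath queries return the same data as the CSS ones above; under the same assumption about the page title, a rough shell transcript:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'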
- Command-line extraction: scrapy shell 'http://quotes.toscrape.com'
Finding node details such as the author:
response.css("div.quote")
quote = response.css("div.quote")[0]
title = quote.css("span.text::text").extract_first()
author = quote.css("small.author::text").extract_first()
tags = quote.css("div.tags a.tag::text").extract()
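Assuming the first quote on the page is the Albert Einstein one (as in the official Scrapy tutorial), the variables should come out roughly as:

>>> author
'Albert Einstein'
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']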
Extraction demo in spider code
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # Each div.quote block holds one quote; yield it as a dict item
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
Saving extraction results
scrapy crawl quotes -o quotes.json
Saving as JSON Lines
scrapy crawl quotes -o quotes.jl
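JSON Lines stores one JSON object per line, so the output can be read back incrementally; a minimal sketch for consuming quotes.jl:

import json

# Each line of a .jl file is a standalone JSON object
with open('quotes.jl') as f:
    for line in f:
        item = json.loads(line)
        print(item['author'], item['tags'])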
Extracting links
response.css('li.next a').extract_first()
response.css('li.next a::attr(href)').extract_first()
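The first call returns the serialized <a> element; adding ::attr(href) extracts just the link. On page 1 the second call should return roughly:

>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'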
Demo: following pagination links
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
        # Follow the "next" link recursively until there is no next page
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
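Since Scrapy 1.4, response.follow accepts relative URLs directly, so the urljoin step can be dropped; an equivalent pagination tail:

# Equivalent: response.follow resolves the relative URL itself
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)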
Extraction command with a spider argument (-a)
scrapy crawl quotes -o quotes-humor.json -a tag=humor
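The -a flag only matters if the spider reads the argument: values passed with -a become attributes on the spider instance. A sketch of start_requests following the official tutorial's pattern:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        # -a tag=humor shows up as self.tag; fall back to None if absent
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)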
Formatted JSON output via runspider (runs a standalone spider file without a project)
scrapy runspider quotes_spider.py -o quotes.json
Command Line
Generating a new spider
- Create a new spider: scrapy genspider mydomain mydomain.com
- List the available spider templates: scrapy genspider -l
- Generate a spider from the default basic template: scrapy genspider example example.com
- Generate a spider from the crawl template: scrapy genspider -t crawl scrapyorg scrapy.org
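For reference, the basic template should produce a skeleton roughly like this (the exact boilerplate varies by Scrapy version):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass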