Scrapy Notes: Common Commands

Author: hhuua | Published 2018-06-21 13:34


Common commands

Creating a project

Set up a new Scrapy project.

scrapy startproject projectname

Running a spider

scrapy crawl spidername

Testing data extraction

scrapy shell 'http://www.xxx.com'

CSS selectors

In the shell, you can try selecting elements with CSS on the response object:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

To extract the text from the title above:

>>> response.css('title::text').extract()
['Quotes to Scrape']

We added ::text to the CSS query, which means we only want the text nodes directly inside the <title> element. Without ::text, we get the full title element, including its tags:

>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']

The re method extracts data with a regular expression:

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
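
To see why the three patterns above return what they do, here is a rough sketch using only Python's standard re module. The helper re_like is hypothetical and written for this note; it is not part of Scrapy. It assumes that .re() returns whole matches when the pattern has no capture groups and a flat list of group values otherwise, which agrees with the shell output shown above.

```python
import re


def re_like(pattern, text):
    """Rough stand-in for Selector.re(): return matches as a flat list of strings."""
    flat = []
    for match in re.findall(pattern, text):
        if isinstance(match, tuple):
            # Several capture groups: findall yields one tuple per match, so flatten it.
            flat.extend(match)
        else:
            # No group (or a single group): findall yields a plain string.
            flat.append(match)
    return flat


title = 'Quotes to Scrape'
print(re_like(r'Quotes.*', title))
print(re_like(r'Q\w+', title))
print(re_like(r'(\w+) to (\w+)', title))
```

The last pattern is the interesting case: with two capture groups, the result holds the group values rather than the full matched string.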

XPath

Scrapy selectors also support XPath expressions:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'

Storing data

Feed exports

The simplest way to store scraped data is to use Feed exports:

scrapy crawl spidername -o xxxx.json

This generates an xxxx.json file containing all scraped items, serialized as JSON.

Other formats are available, such as JSON Lines:

scrapy crawl spidername -o xxxx.jl

Because each record is on its own line, you can process large files without loading everything into memory.
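
As a minimal sketch of that line-by-line processing, the generator below reads a JSON Lines file one record at a time using only the standard library. The name iter_json_lines and the two sample records are assumptions made for this illustration; a real export will contain whatever fields your spider yields.

```python
import json
import os
import tempfile


def iter_json_lines(path):
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo on a throwaway file; the two records are made-up sample data.
sample = [
    {'text': 'first quote', 'tag': 'humor'},
    {'text': 'second quote', 'tag': 'life'},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'xxxx.jl')
    with open(path, 'w', encoding='utf-8') as handle:
        for record in sample:
            handle.write(json.dumps(record) + '\n')
    # list() is used only to check the demo; iterate directly for large files.
    loaded = list(iter_json_lines(path))

print(loaded == sample)
```

For a real export you would loop over iter_json_lines('xxxx.jl') directly, so that only one record is held in memory at a time.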

Spider arguments

When running a spider, you can pass it command-line arguments with the -a option:

scrapy crawl spidername -o xxxx-humor.json -a tag=xxx

These arguments are passed to the Spider's __init__ method and become spider attributes by default.
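
The idea can be sketched with a plain Python class. SpiderLike below is a simplified stand-in written for this note, not Scrapy's actual base class; it only illustrates how keyword arguments can end up as instance attributes. Values passed with -a arrive as strings.

```python
class SpiderLike:
    """Simplified stand-in for a spider base class (not Scrapy's real code)."""

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        # Every extra keyword argument becomes an instance attribute.
        self.__dict__.update(kwargs)


# Roughly what `-a tag=xxx` amounts to: the value arrives as a string keyword argument.
spider = SpiderLike(name='spidername', tag='xxx')

print(spider.tag)
# An argument that was never passed is simply absent, hence getattr with a default.
print(getattr(spider, 'category', None))
```

This is why reading an argument with getattr(self, 'tag', None) is a safe pattern: the attribute exists only if the argument was supplied on the command line.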

You can use this to have your spider build its URL from the argument, so that it only scrapes data with a specific tag:

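# Method of your spider class (a scrapy.Spider subclass); requires `import scrapy`.
def start_requests(self):
    url = 'http://quotes.toscrape.com/'
    # A value passed with `-a tag=xxx` is available as self.tag.
    tag = getattr(self, 'tag', None)
    if tag is not None:
        url = url + 'tag/' + tag
    yield scrapy.Request(url, self.parse)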

Article link: https://www.haomeiwen.com/subject/rsvqyftx.html