Using Selectors in the Scrapy Framework

Author: 小董不太懂 | Published 2019-07-23 20:22

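Scrapy has its own mechanism for extracting data, called selectors, which pick out parts of an HTML document using XPath or CSS expressions.
XPath is a language for selecting nodes in XML documents; it also works on HTML.
CSS is a language for styling HTML documents; it defines selectors that associate styles with specific HTML elements.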

XPath selectors

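Some commonly used path expressions are listed here. XPath is very powerful: XPath 2.0 defines more than 100 built-in functions, although Scrapy's selectors are built on lxml, which implements XPath 1.0 and its smaller function set.
The most common constructs are: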

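Expression	Meaning
/	Child one level down (direct child)
//	Descendant at any depth (skips levels)
contains()	Commonly used function; tests whether a string, such as an attribute value, contains a given substring, which helps locate elements quickly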

For examples, see another post of mine: https://www.jianshu.com/p/63e6b6f36bf5

CSS selectors

CSS (Cascading Style Sheets) syntax has two main parts: a selector, and one or more declarations:
selector {declaration1; declaration2; ...}
The most commonly used selectors are:

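Selector	Example	Meaning
.class	.color	All elements with class="color"
#id	#info	The element with id="info"
*	*	All elements
element	p	All p elements
element,element	div,p	All div elements and all p elements
element element	div p	All p elements inside div elements
[attribute]	[target]	All elements with a target attribute
[attribute=value]	[target=_blank]	All elements with target="_blank"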

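Seeing once beats hearing a hundred times, and what you copy and paste never really sinks in: you have to try it yourself.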

Selector usage examples

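The two tables above list the common forms of both selector types. The demonstrations below use a sample page provided by the Scrapy documentation.
URL: http://doc.scrapy.org/en/latest/_static/selectors-sample1.html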

The page source appears in the response.body output of the shell session below.

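Getting the title
xpath() returns a list of selectors, so extract() returns a list of strings, while extract_first() returns only the first string; here that is the text of the title tag. extract_first() also takes a default argument: for example, extract_first(default="") returns an empty string when nothing matches, instead of None.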


C:\Users\董贺贺\example>scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html --nolog
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x000001CC166E1B38>
[s]   item       {}
[s]   request    <GET http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   response   <200 http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   settings   <scrapy.settings.Settings object at 0x000001CC166E19E8>
[s]   spider     <DefaultSpider 'default' at 0x1cc169ded68>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: response.body
Out[1]: b"<html>\n <head>\n  <base href='http://example.com/' />\n  <title>Example website</title>\n </head>\n <body>\n  <div id='images'>\n   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n  </div>\n </body>\n</html>\n\n"

In [2]: response.xpath('//title/text()')
Out[2]: [<Selector xpath='//title/text()' data='Example website'>]

In [3]: response.xpath('//title/text()').extract()
Out[3]: ['Example website']

In [4]: response.xpath('//title/text()').extract_first()
Out[4]: 'Example website'

In [6]: response.xpath('//title/a/ul/text()').extract_first(default='not found')
Out[6]: 'not found'

Now the same with a CSS selector:

In [7]: response.css('title::text')
Out[7]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]

In [8]: response.css('title::text').extract()
Out[8]: ['Example website']

In [9]: response.css('title::text').extract_first()
Out[9]: 'Example website'

Finding image information (getting the src attribute of the images)
XPath:

In [10]: response.xpath('//div/a/img/@src')
Out[10]:
[<Selector xpath='//div/a/img/@src' data='image1_thumb.jpg'>,
 <Selector xpath='//div/a/img/@src' data='image2_thumb.jpg'>,
 <Selector xpath='//div/a/img/@src' data='image3_thumb.jpg'>,
 <Selector xpath='//div/a/img/@src' data='image4_thumb.jpg'>,
 <Selector xpath='//div/a/img/@src' data='image5_thumb.jpg'>]

In [11]: response.xpath('//div/a/img/@src').extract()
Out[11]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

In [12]: response.xpath('//div/a/img/@src').extract_first()
Out[12]: 'image1_thumb.jpg'

This also shows more clearly how extract() and extract_first() differ.
CSS:

In [13]: response.css('#images a img::attr(src)')
Out[13]:
[<Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/descendant-or-self::*/img/@src" data='image1_thumb.jpg'>,
 <Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/descendant-or-self::*/img/@src" data='image2_thumb.jpg'>,
 <Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/descendant-or-self::*/img/@src" data='image3_thumb.jpg'>,
 <Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/descendant-or-self::*/img/@src" data='image4_thumb.jpg'>,
 <Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/descendant-or-self::*/img/@src" data='image5_thumb.jpg'>]

In [14]: response.css('#images a img::attr(src)').extract()
Out[14]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

In [15]: response.css('#images a img::attr(src)').extract_first()
Out[15]: 'image1_thumb.jpg'

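CSS and XPath selectors can also be chained. Note that an XPath starting with // is evaluated against the whole document even when chained after css('#images'); write .//img/@src to search only inside the selected element. On this page both forms return the same result.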

In [16]: response.css('#images').xpath('//img/@src')
Out[16]:
[<Selector xpath='//img/@src' data='image1_thumb.jpg'>,
 <Selector xpath='//img/@src' data='image2_thumb.jpg'>,
 <Selector xpath='//img/@src' data='image3_thumb.jpg'>,
 <Selector xpath='//img/@src' data='image4_thumb.jpg'>,
 <Selector xpath='//img/@src' data='image5_thumb.jpg'>]

In [17]: response.css('#images').xpath('//img/@src').extract()
Out[17]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

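Finding a tag information
Here XPath and CSS selectors are used to get the href of the a tags; the text can be extracted the same way with text() or ::text. CSS reads an attribute with ::attr(name), while XPath uses @name.
XPath: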

In [18]: response.xpath('//div/a/@href')
Out[18]:
[<Selector xpath='//div/a/@href' data='image1.html'>,
 <Selector xpath='//div/a/@href' data='image2.html'>,
 <Selector xpath='//div/a/@href' data='image3.html'>,
 <Selector xpath='//div/a/@href' data='image4.html'>,
 <Selector xpath='//div/a/@href' data='image5.html'>]

In [19]: response.xpath('//div/a/@href').extract()
Out[19]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

CSS:

In [22]: response.css('#images a::attr(href)')
Out[22]:
[<Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/@href" data='image1.html'>,
 <Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/@href" data='image2.html'>,
 <Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/@href" data='image3.html'>,
 <Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/@href" data='image4.html'>,
 <Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/@href" data='image5.html'>]

In [23]: response.css('#images a::attr(href)').extract()
Out[23]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

Advanced usage

Find all links whose href attribute value contains "image". CSS uses the *= operator; XPath uses contains():

In [31]: response.css('a[href*=image]::attr(href)')
Out[31]:
[<Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image1.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image2.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image3.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image4.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image5.html'>]

In [32]: response.css('a[href*=image]::attr(href)').extract()
Out[32]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [34]: response.xpath('//a[contains(@href,"image")]/@href').extract()
Out[34]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

Find the img src attributes that contain "image"

In [35]: response.xpath('//a/img[contains(@src,"image")]/@src')
Out[35]:
[<Selector xpath='//a/img[contains(@src,"image")]/@src' data='image1_thumb.jpg'>,
 <Selector xpath='//a/img[contains(@src,"image")]/@src' data='image2_thumb.jpg'>,
 <Selector xpath='//a/img[contains(@src,"image")]/@src' data='image3_thumb.jpg'>,
 <Selector xpath='//a/img[contains(@src,"image")]/@src' data='image4_thumb.jpg'>,
 <Selector xpath='//a/img[contains(@src,"image")]/@src' data='image5_thumb.jpg'>]

In [36]: response.xpath('//a/img[contains(@src,"image")]/@src').extract()
Out[36]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

In [37]: response.css('a img[src*=image]::attr(src)')
Out[37]:
[<Selector xpath="descendant-or-self::a/descendant-or-self::*/img[@src and contains(@src, 'image')]/@src" data='image1_thumb.jpg'>,
 <Selector xpath="descendant-or-self::a/descendant-or-self::*/img[@src and contains(@src, 'image')]/@src" data='image2_thumb.jpg'>,
 <Selector xpath="descendant-or-self::a/descendant-or-self::*/img[@src and contains(@src, 'image')]/@src" data='image3_thumb.jpg'>,
 <Selector xpath="descendant-or-self::a/descendant-or-self::*/img[@src and contains(@src, 'image')]/@src" data='image4_thumb.jpg'>,
 <Selector xpath="descendant-or-self::a/descendant-or-self::*/img[@src and contains(@src, 'image')]/@src" data='image5_thumb.jpg'>]

In [39]: response.css('a img[src*=image]::attr(src)').extract()
Out[39]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

Extract the content after "Name:" in the text of the a tags, using the selector's regular-expression methods re() and re_first():


In [54]: response.xpath('//div/a/text()').re('Name:(.*)')
Out[54]:
[' My image 1 ',
 ' My image 2 ',
 ' My image 3 ',
 ' My image 4 ',
 ' My image 5 ']

In [55]: response.xpath('//div/a/text()').re_first('Name:(.*)')
Out[55]: ' My image 1 '

In [56]: response.xpath('//div/a/text()').re_first('Name:(.*)').rstrip()
Out[56]: ' My image 1'

The re() and re_first() methods, which build on Python's re module, are very handy!
