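Scrapy has its own mechanism for extracting data, called selectors, which pick out parts of an HTML document using XPath or CSS expressions.
XPath is a language for selecting nodes in XML documents, and it can also be used on HTML.
CSS is a language for styling HTML documents; it defines selectors that associate styles with specific HTML elements.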
-
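XPath selectors
Only a few commonly used path expressions are listed here. XPath is very powerful: XPath 2.0 defines more than 100 built-in functions, although Scrapy's selectors are built on lxml, which implements XPath 1.0 and its smaller function set.
The most common pieces are:
| Expression | Meaning |
|---|---|
| / | Selects direct children (the next level down) |
| // | Selects descendants at any depth (skips levels) |
| contains | A frequently used function that tests whether a string, such as an attribute value, contains a given substring; handy for locating elements quickly |
For examples, see my other post: https://www.jianshu.com/p/63e6b6f36bf5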
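To make the difference between / and // concrete, here is a minimal sketch that uses Python's standard xml.etree.ElementTree instead of Scrapy, so it runs without installing anything. ElementTree implements only a small subset of XPath (for example, it has no contains() and no text()), and the markup is a trimmed, hand-copied version of the sample page used later in this post, so treat it purely as an illustration of the path semantics.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample page (assumption: only two links kept).
html = """<html>
 <head><title>Example website</title></head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
  </div>
 </body>
</html>"""

root = ET.fromstring(html)  # root is the <html> element

# "/" walks one level at a time: html -> body -> div -> a
step_by_step = root.findall("./body/div/a")

# "//" skips any number of levels: every <img> anywhere below the root
anywhere = root.findall(".//img")

# A single "/" step from the root cannot reach the nested <img> elements
direct_only = root.findall("./img")

hrefs = [a.get("href") for a in step_by_step]
srcs = [img.get("src") for img in anywhere]

print(hrefs)
print(srcs)
print(direct_only)
```

The last search comes back empty because img is three levels below the root, which is exactly the case // is meant for.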
-
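CSS selectors
CSS stands for Cascading Style Sheets. A CSS rule has two main parts: a selector and one or more declarations.
Selector {declaration1;declaration2;……}
The most commonly used selectors are:
| Selector | Example | Meaning |
|---|---|---|
| .class | .color | Selects all elements with class="color" |
| #id | #info | Selects the element with id="info" |
| * | * | Selects all elements |
| element | p | Selects all p elements |
| element,element | div,p | Selects all div elements and all p elements (use "div p", with a space, for p elements inside a div) |
| [attribute] | [target] | Selects all elements that have a target attribute |
| [attribute=value] | [target=_blank] | Selects all elements with target="_blank" |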
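Scrapy translates CSS selectors into XPath internally, which is why the Selector objects printed in the shell sessions later in this post show a generated xpath=... string. As a rough illustration of what a few rows of the table mean, the sketch below evaluates hand-written XPath approximations with the standard xml.etree.ElementTree module. Both the markup and the approximations are my own assumptions, not what Scrapy generates, and the .class approximation only matches an exact class attribute value.

```python
import xml.etree.ElementTree as ET

# Made-up markup for the demonstration (not from the sample page).
doc = ET.fromstring("""<body>
  <div id='info' class='color'><p>inside div</p></div>
  <p>outside div</p>
  <a href='x.html' target='_blank'>new tab</a>
  <a href='y.html' target='_self'>same tab</a>
  <a href='z.html'>no target</a>
</body>""")

by_id = doc.findall(".//*[@id='info']")          # roughly #info
by_class = doc.findall(".//*[@class='color']")   # roughly .color
has_target = doc.findall(".//*[@target]")        # roughly [target]
blank = doc.findall(".//*[@target='_blank']")    # roughly [target=_blank]

# "div p" (descendant) versus "div,p" (either one) are easy to confuse.
p_in_div = doc.findall(".//div//p")
div_or_p = doc.findall(".//div") + doc.findall(".//p")

print([e.tag for e in by_id], [e.tag for e in by_class])
print([a.get("href") for a in has_target])
print([a.get("href") for a in blank])
print([p.text for p in p_in_div])
print([e.tag for e in div_or_p])
```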
Seeing is believing: what you copy stays shallow, and you only really understand something once you try it yourself.
-
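Selector examples
Having listed the common syntax for both kinds of selectors, let's try them on a sample page from the Scrapy documentation.
URL: http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
The page source at this URL is the markup printed by response.body in the shell session below.
- Getting the title
extract_first() returns the text of the title tag directly. response.xpath() returns a list of selectors, so extract() on it also returns a list, while extract_first() returns only the first value. extract_first() accepts a default parameter: for example, extract_first(default="") returns an empty string when nothing matches: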
C:\Users\董贺贺\example>scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html --nolog
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x000001CC166E1B38>
[s] item {}
[s] request <GET http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s] response <200 http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s] settings <scrapy.settings.Settings object at 0x000001CC166E19E8>
[s] spider <DefaultSpider 'default' at 0x1cc169ded68>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]: response.body
Out[1]: b"<html>\n <head>\n <base href='http://example.com/' />\n <title>Example website</title>\n </head>\n <body>\n <div id='images'>\n <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n </div>\n </body>\n</html>\n\n"
In [2]: response.xpath('//title/text()')
Out[2]: [<Selector xpath='//title/text()' data='Example website'>]
In [3]: response.xpath('//title/text()').extract()
Out[3]: ['Example website']
In [4]: response.xpath('//title/text()').extract_first()
Out[4]: 'Example website'
In [6]: response.xpath('//title/a/ul/text()').extract_first(default='no match')
Out[6]: 'no match'
Now let's get the same thing with a CSS selector:
In [7]: response.css('title::text')
Out[7]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]
In [8]: response.css('title::text').extract()
Out[8]: ['Example website']
In [9]: response.css('title::text').extract_first()
Out[9]: 'Example website'
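The relationship between extract() and extract_first() can be sketched in plain Python. first_or_default below is a hypothetical helper written for this post, not Scrapy's implementation; it only mirrors the behaviour described above: take the first item of the extracted list, or fall back to a default (None unless one is given) when nothing matched.

```python
def first_or_default(values, default=None):
    """Return the first extracted value, or `default` if the list is empty."""
    for value in values:
        return value
    return default


matched = ['Example website']  # what extract() returned for //title/text()
unmatched = []                 # extract() gives an empty list when nothing matches

print(first_or_default(matched))
print(first_or_default(unmatched))
print(repr(first_or_default(unmatched, default='')))

# Indexing the list directly fails on an empty result instead of falling back.
try:
    unmatched[0]
except IndexError:
    print('indexing an empty result raises IndexError')
```

This is the practical reason to prefer extract_first() over extract()[0] in a spider: a missing element yields the default instead of an exception.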
- Finding image information (getting the src attribute of the images)
xpath:
In [10]: response.xpath('//div/a/img/@src')
Out[10]:
[<Selector xpath='//div/a/img/@src' data='image1_thumb.jpg'>,
<Selector xpath='//div/a/img/@src' data='image2_thumb.jpg'>,
<Selector xpath='//div/a/img/@src' data='image3_thumb.jpg'>,
<Selector xpath='//div/a/img/@src' data='image4_thumb.jpg'>,
<Selector xpath='//div/a/img/@src' data='image5_thumb.jpg'>]
In [11]: response.xpath('//div/a/img/@src').extract()
Out[11]:
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
In [12]: response.xpath('//div/a/img/@src').extract_first()
Out[12]: 'image1_thumb.jpg'
This also gives a clearer picture of how extract() and extract_first() differ.
css:
In [13]: response.css('#images a img::attr(src)')
Out[13]:
[<Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/descendant-or-self::*/img/@src" data='image1_thumb.jpg'>,
<Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/descendant-or-self::*/img/@src" data='image2_thumb.jpg'>,
<Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/descendant-or-self::*/img/@src" data='image3_thumb.jpg'>,
<Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/descendant-or-self::*/img/@src" data='image4_thumb.jpg'>,
<Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/descendant-or-self::*/img/@src" data='image5_thumb.jpg'>]
In [14]: response.css('#images a img::attr(src)').extract()
Out[14]:
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
In [15]: response.css('#images a img::attr(src)').extract_first()
Out[15]: 'image1_thumb.jpg'
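CSS and XPath selectors can also be chained together. Note that a chained XPath starting with // still searches the whole document; write .//img/@src to search only inside the element selected by css().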
In [16]: response.css('#images').xpath('//img/@src')
Out[16]:
[<Selector xpath='//img/@src' data='image1_thumb.jpg'>,
<Selector xpath='//img/@src' data='image2_thumb.jpg'>,
<Selector xpath='//img/@src' data='image3_thumb.jpg'>,
<Selector xpath='//img/@src' data='image4_thumb.jpg'>,
<Selector xpath='//img/@src' data='image5_thumb.jpg'>]
In [17]: response.css('#images').xpath('//img/@src').extract()
Out[17]:
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
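In the sample page every img sits inside #images, so the session above cannot show whether the XPath step was really limited to the element chosen by css(). The sketch below illustrates the scoping idea with the standard xml.etree.ElementTree module on a made-up page. The logo.jpg image outside the div is an assumption added for the demonstration, and ElementTree is only a stand-in for Scrapy's selectors: searching from the container mirrors a relative .// path, while searching from the page root mirrors what a leading // does in Scrapy.

```python
import xml.etree.ElementTree as ET

# Made-up page: one image outside the #images container, two inside it.
page = ET.fromstring("""<html>
 <body>
  <img src='logo.jpg' />
  <div id='images'>
   <a href='image1.html'><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'><img src='image2_thumb.jpg' /></a>
  </div>
 </body>
</html>""")

container = page.find(".//div[@id='images']")

# Relative search: only images below the selected container.
scoped = [img.get("src") for img in container.findall(".//img")]

# Document-wide search: also picks up the logo outside the container.
whole_page = [img.get("src") for img in page.findall(".//img")]

print(scoped)
print(whole_page)
```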
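- Finding a tag information
Here we get the href value of the a tags with both XPath and CSS selectors; the text can be fetched the same way with text() or ::text. CSS reads an attribute with ::attr(name), while XPath uses @name.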
xpath:
In [18]: response.xpath('//div/a/@href')
Out[18]:
[<Selector xpath='//div/a/@href' data='image1.html'>,
<Selector xpath='//div/a/@href' data='image2.html'>,
<Selector xpath='//div/a/@href' data='image3.html'>,
<Selector xpath='//div/a/@href' data='image4.html'>,
<Selector xpath='//div/a/@href' data='image5.html'>]
In [19]: response.xpath('//div/a/@href').extract()
Out[19]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
css:
In [22]: response.css('#images a::attr(href)')
Out[22]:
[<Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/@href" data='image1.html'>,
<Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/@href" data='image2.html'>,
<Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/@href" data='image3.html'>,
<Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/@href" data='image4.html'>,
<Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/@href" data='image5.html'>]
In [23]: response.css('#images a::attr(href)').extract()
Out[23]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
- Advanced usage
- Find all links whose href value contains "image", using contains in XPath (written as *= in CSS)
In [31]: response.css('a[href*=image]::attr(href)')
Out[31]:
[<Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image1.html'>,
<Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image2.html'>,
<Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image3.html'>,
<Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image4.html'>,
<Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image5.html'>]
In [32]: response.css('a[href*=image]::attr(href)').extract()
Out[32]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
In [34]: response.xpath('//a[contains(@href,"image")]/@href').extract()
Out[34]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
- Find the src attribute of img tags whose src contains "image"
In [35]: response.xpath('//a/img[contains(@src,"image")]/@src')
Out[35]:
[<Selector xpath='//a/img[contains(@src,"image")]/@src' data='image1_thumb.jpg'>,
<Selector xpath='//a/img[contains(@src,"image")]/@src' data='image2_thumb.jpg'>,
<Selector xpath='//a/img[contains(@src,"image")]/@src' data='image3_thumb.jpg'>,
<Selector xpath='//a/img[contains(@src,"image")]/@src' data='image4_thumb.jpg'>,
<Selector xpath='//a/img[contains(@src,"image")]/@src' data='image5_thumb.jpg'>]
In [36]: response.xpath('//a/img[contains(@src,"image")]/@src').extract()
Out[36]:
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
In [37]: response.css('a img[src*=image]::attr(src)')
Out[37]:
[<Selector xpath="descendant-or-self::a/descendant-or-self::*/img[@src and contains(@src, 'image')]/@src" data='image1_thumb.jpg'>,
<Selector xpath="descendant-or-self::a/descendant-or-self::*/img[@src and contains(@src, 'image')]/@src" data='image2_thumb.jpg'>,
<Selector xpath="descendant-or-self::a/descendant-or-self::*/img[@src and contains(@src, 'image')]/@src" data='image3_thumb.jpg'>,
<Selector xpath="descendant-or-self::a/descendant-or-self::*/img[@src and contains(@src, 'image')]/@src" data='image4_thumb.jpg'>,
<Selector xpath="descendant-or-self::a/descendant-or-self::*/img[@src and contains(@src, 'image')]/@src" data='image5_thumb.jpg'>]
In [39]: response.css('a img[src*=image]::attr(src)').extract()
Out[39]:
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
- Extract what follows "Name:" in the text of the a tags, using the regex methods re and re_first
In [54]: response.xpath('//div/a/text()').re('Name:(.*)')
Out[54]:
[' My image 1 ',
' My image 2 ',
' My image 3 ',
' My image 4 ',
' My image 5 ']
In [55]: response.xpath('//div/a/text()').re_first('Name:(.*)')
Out[55]: ' My image 1 '
In [56]: response.xpath('//div/a/text()').re_first('Name:(.*)').rstrip()
Out[56]: ' My image 1'
These regex methods are really handy!
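The same extraction can be sketched with Python's standard re module, which helps show what the pattern captures. The strings below are hand-copied text nodes from the sample page (only the first three), and the list comprehension is just an approximation of what .re() does, not Scrapy's code. It also shows why the rstrip() call above still leaves a leading space: rstrip() trims only the right side, while strip() trims both.

```python
import re

# Text nodes of the <a> tags, as extracted from the sample page (first three).
texts = ["Name: My image 1 ", "Name: My image 2 ", "Name: My image 3 "]

pattern = re.compile(r"Name:(.*)")

# Collect the captured group from every string that matches.
raw = [m.group(1) for m in map(pattern.search, texts) if m]

# rstrip() removes trailing whitespace only; strip() removes both sides.
right_trimmed = raw[0].rstrip()
trimmed = [name.strip() for name in raw]

print(raw)
print(repr(right_trimmed))
print(trimmed)
```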








