Using Selectors in the Scrapy Framework


Author: 小董不太懂 | Published: 2019-07-23 20:22

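Scrapy has its own mechanism for extracting data, called selectors, which pick out parts of an HTML document using XPath or CSS expressions.
XPath is a language for selecting nodes in XML documents, and it works on HTML as well.
CSS is a language for styling HTML documents; it defines selectors that associate styles with specific HTML elements.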

  • XPath selectors

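Only a few commonly used path expressions are listed here. XPath itself is very powerful: XPath 2.0 defines more than 100 built-in functions, although Scrapy's selectors are built on lxml, which implements XPath 1.0 and its smaller core function library.
The most common expressions are: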

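Expression    Meaning
/             Steps one level down (direct child)
//            Skips levels (descendants at any depth)
contains()    A commonly used function that tests whether a value contains a given substring, handy for locating elements quickly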

For examples, see my other blog post: https://www.jianshu.com/p/63e6b6f36bf5
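To see the difference between / and // without starting a crawl, here is a minimal sketch using Python's standard-library xml.etree.ElementTree, which supports only a small subset of XPath (Scrapy's selectors are built on lxml and support far more). The markup and file names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, made up for this illustration.
doc = ET.fromstring(
    "<div>"
    "<a href='top.html'>top</a>"
    "<p><a href='nested.html'>nested</a></p>"
    "</div>"
)

# "./a": a single slash steps exactly one level down (direct children only).
children = [a.get("href") for a in doc.findall("./a")]

# ".//a": a double slash skips levels and matches descendants at any depth.
descendants = [a.get("href") for a in doc.findall(".//a")]

print(children)     # ['top.html']
print(descendants)  # ['top.html', 'nested.html']
```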

  • CSS selectors

In CSS (Cascading Style Sheets), a rule has two main parts: a selector and one or more declarations.
selector {declaration1; declaration2; ...}
Commonly used selectors are:

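Selector             Example            Meaning
.class               .color             Selects all elements with class="color"
#id                  #info              Selects the element with id="info"
*                    *                  Selects all elements
element              p                  Selects all p elements
element,element      div,p              Selects all div elements and all p elements (use "div p", separated by a space, for p elements inside a div)
[attribute]          [target]           Selects all elements that have a target attribute
[attribute=value]    [target=_blank]    Selects all elements with target="_blank"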
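Scrapy translates each CSS selector into XPath before running it, as the xpath= field in the shell output further down shows. As a rough sketch of what the selectors in the table mean, the following uses the standard-library xml.etree.ElementTree with approximate XPath equivalents. The markup is hypothetical, and the class test here is an exact string comparison, whereas a real .class selector also matches elements that carry several classes.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, made up for this illustration.
root = ET.fromstring(
    "<body>"
    "<div id='info'><p class='color'>one</p></div>"
    "<p>two</p>"
    "<a target='_blank'>three</a>"
    "<a target='_self'>four</a>"
    "</body>"
)

def tags(path):
    """Return the tag names of all elements matched by an ElementTree path."""
    return [element.tag for element in root.findall(path)]

print(tags(".//*[@class='color']"))      # .color          -> ['p']
print(tags(".//*[@id='info']"))          # #info           -> ['div']
print(tags(".//p"))                      # p               -> ['p', 'p']
print(tags(".//div") + tags(".//p"))     # div,p           -> ['div', 'p', 'p']
print(tags(".//div//p"))                 # div p           -> ['p']
print(tags(".//*[@target]"))             # [target]        -> ['a', 'a']
print(tags(".//*[@target='_blank']"))    # [target=_blank] -> ['a']
```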

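Seeing something once beats hearing about it a hundred times, and what you copy and paste stays shallow: to really understand something, you have to do it yourself.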

  • Selector examples

Having listed the common forms of both kinds of selector, let's try them on a sample page provided by the Scrapy documentation.
URL: http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

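The HTML source of this page is:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>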

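  • Getting the title
    xpath() returns a list of Selector objects, so calling extract() on it also returns a list, this time of strings. extract_first() returns just the first value, which here is the text of the title tag. extract_first() also accepts a default parameter: for example, extract_first(default="") returns an empty string when nothing matches: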

C:\Users\董贺贺\example>scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html --nolog
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x000001CC166E1B38>
[s]   item       {}
[s]   request    <GET http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   response   <200 http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   settings   <scrapy.settings.Settings object at 0x000001CC166E19E8>
[s]   spider     <DefaultSpider 'default' at 0x1cc169ded68>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: response.body
Out[1]: b"<html>\n <head>\n  <base href='http://example.com/' />\n  <title>Example website</title>\n </head>\n <body>\n  <div id='images'>\n   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n  </div>\n </body>\n</html>\n\n"

In [2]: response.xpath('//title/text()')
Out[2]: [<Selector xpath='//title/text()' data='Example website'>]

In [3]: response.xpath('//title/text()').extract()
Out[3]: ['Example website']

In [4]: response.xpath('//title/text()').extract_first()
Out[4]: 'Example website'

In [6]: response.xpath('//title/a/ul/text()').extract_first(default='not found')
Out[6]: 'not found'

Now the same thing with a CSS selector:

In [7]: response.css('title::text')
Out[7]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]

In [8]: response.css('title::text').extract()
Out[8]: ['Example website']

In [9]: response.css('title::text').extract_first()
Out[9]: 'Example website'
  • Finding image information (getting the src attribute of each image)
    XPath:
In [10]: response.xpath('//div/a/img/@src')
Out[10]:
[<Selector xpath='//div/a/img/@src' data='image1_thumb.jpg'>,
 <Selector xpath='//div/a/img/@src' data='image2_thumb.jpg'>,
 <Selector xpath='//div/a/img/@src' data='image3_thumb.jpg'>,
 <Selector xpath='//div/a/img/@src' data='image4_thumb.jpg'>,
 <Selector xpath='//div/a/img/@src' data='image5_thumb.jpg'>]

In [11]: response.xpath('//div/a/img/@src').extract()
Out[11]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

In [12]: response.xpath('//div/a/img/@src').extract_first()
Out[12]: 'image1_thumb.jpg'

These results also make the difference between extract() and extract_first() clearer.
CSS:

In [13]: response.css('#images a img::attr(src)')
Out[13]:
[<Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/descendant-or-self::*/img/@src" data='image1_thumb.jpg'>,
 <Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/descendant-or-self::*/img/@src" data='image2_thumb.jpg'>,
 <Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/descendant-or-self::*/img/@src" data='image3_thumb.jpg'>,
 <Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/descendant-or-self::*/img/@src" data='image4_thumb.jpg'>,
 <Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/descendant-or-self::*/img/@src" data='image5_thumb.jpg'>]

In [14]: response.css('#images a img::attr(src)').extract()
Out[14]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

In [15]: response.css('#images a img::attr(src)').extract_first()
Out[15]: 'image1_thumb.jpg'

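CSS and XPath selectors can also be chained. Note that an XPath beginning with // is absolute and searches the whole document even when it is chained after css('#images'); use './/img/@src' to search only inside the matched element. On this page every img sits inside #images, so both forms return the same result.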

In [16]: response.css('#images').xpath('//img/@src')
Out[16]:
[<Selector xpath='//img/@src' data='image1_thumb.jpg'>,
 <Selector xpath='//img/@src' data='image2_thumb.jpg'>,
 <Selector xpath='//img/@src' data='image3_thumb.jpg'>,
 <Selector xpath='//img/@src' data='image4_thumb.jpg'>,
 <Selector xpath='//img/@src' data='image5_thumb.jpg'>]

In [17]: response.css('#images').xpath('//img/@src').extract()
Out[17]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']
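  • Finding a-tag information
    Here the href of each a tag is extracted with both XPath and CSS selectors. CSS reads an attribute with ::attr(name), while XPath uses @name. The link text can be selected the same way with ::text in CSS or text() in XPath.
    XPath: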
In [18]: response.xpath('//div/a/@href')
Out[18]:
[<Selector xpath='//div/a/@href' data='image1.html'>,
 <Selector xpath='//div/a/@href' data='image2.html'>,
 <Selector xpath='//div/a/@href' data='image3.html'>,
 <Selector xpath='//div/a/@href' data='image4.html'>,
 <Selector xpath='//div/a/@href' data='image5.html'>]

In [19]: response.xpath('//div/a/@href').extract()
Out[19]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

CSS:

In [22]: response.css('#images a::attr(href)')
Out[22]:
[<Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/@href" data='image1.html'>,
 <Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/@href" data='image2.html'>,
 <Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/@href" data='image3.html'>,
 <Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/@href" data='image4.html'>,
 <Selector xpath="descendant-or-self::*[@id = 'images']/descendant-or-self::*/a/@href" data='image5.html'>]

In [23]: response.css('#images a::attr(href)').extract()
Out[23]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
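The extracted href values are relative paths. The sample page declares <base href='http://example.com/' />, so a minimal sketch of turning them into absolute URLs, assuming that base and using only the standard library, could look like this:

```python
from urllib.parse import urljoin

# Base URL taken from the <base> tag of the sample page.
base_url = "http://example.com/"

# Relative links as returned by extract() above (first two shown).
hrefs = ["image1.html", "image2.html"]

# urljoin() resolves each relative path against the base URL.
absolute_urls = [urljoin(base_url, href) for href in hrefs]

print(absolute_urls)
# ['http://example.com/image1.html', 'http://example.com/image2.html']
```

Inside a spider, Scrapy's response.urljoin() serves the same purpose.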
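  • Advanced usage
  1. Find all links whose href value contains "image". XPath does this with contains(); the CSS equivalent is the *= attribute operator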
In [31]: response.css('a[href*=image]::attr(href)')
Out[31]:
[<Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image1.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image2.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image3.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image4.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image5.html'>]

In [32]: response.css('a[href*=image]::attr(href)').extract()
Out[32]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
In [34]: response.xpath('//a[contains(@href,"image")]/@href').extract()
Out[34]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
  2. Get the src attribute of every img whose src value contains "image"
In [35]: response.xpath('//a/img[contains(@src,"image")]/@src')
Out[35]:
[<Selector xpath='//a/img[contains(@src,"image")]/@src' data='image1_thumb.jpg'>,
 <Selector xpath='//a/img[contains(@src,"image")]/@src' data='image2_thumb.jpg'>,
 <Selector xpath='//a/img[contains(@src,"image")]/@src' data='image3_thumb.jpg'>,
 <Selector xpath='//a/img[contains(@src,"image")]/@src' data='image4_thumb.jpg'>,
 <Selector xpath='//a/img[contains(@src,"image")]/@src' data='image5_thumb.jpg'>]

In [36]: response.xpath('//a/img[contains(@src,"image")]/@src').extract()
Out[36]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']
In [37]: response.css('a img[src*=image]::attr(src)')
Out[37]:
[<Selector xpath="descendant-or-self::a/descendant-or-self::*/img[@src and contains(@src, 'image')]/@src" data='image1_thumb.jpg'>,
 <Selector xpath="descendant-or-self::a/descendant-or-self::*/img[@src and contains(@src, 'image')]/@src" data='image2_thumb.jpg'>,
 <Selector xpath="descendant-or-self::a/descendant-or-self::*/img[@src and contains(@src, 'image')]/@src" data='image3_thumb.jpg'>,
 <Selector xpath="descendant-or-self::a/descendant-or-self::*/img[@src and contains(@src, 'image')]/@src" data='image4_thumb.jpg'>,
 <Selector xpath="descendant-or-self::a/descendant-or-self::*/img[@src and contains(@src, 'image')]/@src" data='image5_thumb.jpg'>]

In [39]: response.css('a img[src*=image]::attr(src)').extract()
Out[39]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']
  3. Extract the part of each a tag's text that follows "Name:", using the regular-expression methods re() and re_first()

In [54]: response.xpath('//div/a/text()').re('Name:(.*)')
Out[54]:
[' My image 1 ',
 ' My image 2 ',
 ' My image 3 ',
 ' My image 4 ',
 ' My image 5 ']

In [55]: response.xpath('//div/a/text()').re_first('Name:(.*)')
Out[55]: ' My image 1 '

In [56]: response.xpath('//div/a/text()').re_first('Name:(.*)').rstrip()
Out[56]: ' My image 1'

The selectors' re() and re_first() methods are really handy!
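The pattern passed to re() is an ordinary Python regular expression, so it can be tried out with the standard re module first. The sketch below shows why 'Name:(.*)' keeps the surrounding spaces and one possible pattern that trims them inside the regex, so no rstrip() is needed. The first_group helper is hypothetical, written only for this illustration; the same pattern should presumably work with re_first() as well.

```python
import re

# Text nodes as returned by //div/a/text() on the sample page (first two shown).
texts = ["Name: My image 1 ", "Name: My image 2 "]

def first_group(pattern, text):
    """Return the first capture group of the first match, or None if no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Same pattern as in the shell session: (.*) also captures the spaces.
raw = [first_group(r"Name:(.*)", text) for text in texts]

# \s* skips the leading space; (.*\S) forces the capture to end on a non-space.
clean = [first_group(r"Name:\s*(.*\S)", text) for text in texts]

print(raw)    # [' My image 1 ', ' My image 2 ']
print(clean)  # ['My image 1', 'My image 2']
```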
