LinkExtractor is used to extract link URLs from a response (it can extract the link text as well).
Let's start with a concrete example of how it is used.
First create two HTML files (in the same directory as the .py file):
<!--example1.html-->
<html>
<body>
<div id="top">
<p>Below are some internal links</p>
<a class="internal" href="intro/install.html">Installation guide</a>
<a class="internal" href="intro/tutorial.html">Tutorial</a>
<a class="internal" href="../examples.html">Examples</a>
</div>
<div id="bottom">
<p>Below are some external links</p>
<a href="http://stackoverflow.com/tags/scrapy/info">StackOverflow</a>
<a href="https://github.com/scrapy/scrapy">Fork on Github</a>
</div>
</body>
</html>
The second file:
<!--example2.html-->
<html>
<head>
<script type='text/javascript' src='/js/app1.js'>
<script type='text/javascript' src='/js/app2.js'>
</head>
<body>
<a href="/home.html">主页</a>
<a href="javascript:goToPage('/doc.html');return false">文档</a>
<a href="javascript:goToPage('/example.html');return false">案例</a>
</body>
</html>
Next, the Python code, run in an interactive session:
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
# The way the book opens the file raised an encoding error for me:
# html1=open('example1.html').read()
# Either of the following two ways works without errors:
>>>html1=open('example1.html','r',encoding='utf-8').read()
>>>html1=open('example1.html','rb').read()
>>>html2=open('example2.html','rb').read()
>>>response1=HtmlResponse(url='http://example1.com',body=html1,encoding='utf-8')
>>>response2=HtmlResponse(url='http://example2.com',body=html2,encoding='utf-8')
# Create a LinkExtractor object; with no arguments it extracts all links by default
>>>le=LinkExtractor()
# extract_links() extracts the links from response1
links=le.extract_links(response1)
print(links)
>>>[Link(url='http://example1.com/intro/install.html', text='Installation guide', fragment='', nofollow=False), Link(url='http://example1.com/intro/tutorial.html', text='Tutorial', fragment='', nofollow=False), Link(url='http://example1.com/examples.html', text='Examples', fragment='', nofollow=False), Link(url='http://stackoverflow.com/tags/scrapy/info', text='StackOverflow', fragment='', nofollow=False), Link(url='https://github.com/scrapy/scrapy', text='Fork on Github', fragment='', nofollow=False)]
As you can see, the result is a list of Link objects, and each Link has four attributes (url, text, fragment, nofollow); usually link.url is all we need.
For example:
for link in links:
    print(link.url)
>>>http://example1.com/intro/install.html
http://example1.com/intro/tutorial.html
http://example1.com/examples.html
http://stackoverflow.com/tags/scrapy/info
https://github.com/scrapy/scrapy
The link.url attribute behaves like the response.urljoin() method we saw earlier: relative URLs are resolved into complete absolute URLs, while URLs that are already complete are left unchanged.
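A quick way to see this, reusing the response1 and links objects from above:
>>>response1.urljoin('intro/install.html')
'http://example1.com/intro/install.html'
>>>links[0].url
'http://example1.com/intro/install.html'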
for link in links:
    print(link.text)
Installation guide
Tutorial
Examples
StackOverflow
Fork on Github
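Each Link object carries all four attributes; a quick look at the first one from the list above:
>>>links[0].url, links[0].text, links[0].fragment, links[0].nofollow
('http://example1.com/intro/install.html', 'Installation guide', '', False)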
Next, the parameters of LinkExtractor in detail:
1. allow: a regular expression (or a list of regexes); only links matching it are extracted (a list example follows after the output below).
For example, extract every link whose path starts with /intro:
# '\.' escapes the dot; the unescaped '/intro/.+.html$' happens to work too,
# but there '.' matches any character rather than a literal dot
pattern=r'/intro/.+\.html$'
le01=LinkExtractor(allow=pattern)
links01=le01.extract_links(response1)
for link01 in links01:
    print(link01.url)
>>>http://example1.com/intro/install.html
http://example1.com/intro/tutorial.html
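And the list form mentioned above (variable names here are mine); a link is kept if it matches any pattern in the list:
patterns=[r'/intro/.+\.html$',r'/examples\.html$']
le_list=LinkExtractor(allow=patterns)
for link in le_list.extract_links(response1):
    print(link.url)
# expected: the two intro pages plus http://example1.com/examples.html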
2. deny: the opposite of allow: every link except those matching the regex is extracted:
from urllib import parse
# The book splits response1.url with urlparse and reassembles it,
# which seems a bit redundant to me:
# pattern='^'+parse.urlparse(response1.url).geturl()
pattern='^'+response1.url
print(pattern)
# ^http://example1.com
le02=LinkExtractor(deny=pattern)
links02=le02.extract_links(response1)
for link02 in links02:
    print(link02.url)
>>>http://stackoverflow.com/tags/scrapy/info
https://github.com/scrapy/scrapy
3. allow_domains: only links in the given domain(s) are extracted; for example, pass a list of domains:
allow_domains=['github.com','stackoverflow.com']
le03=LinkExtractor(allow_domains=allow_domains)
links03=le03.extract_links(response1)
for link03 in links03:
    print(link03.url)
>>>http://stackoverflow.com/tags/scrapy/info
https://github.com/scrapy/scrapy
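These filters can also be combined; a link must pass all of them. A small sketch (names are mine) that keeps only links on stackoverflow.com whose URL also matches 'scrapy':
le_combo=LinkExtractor(allow='scrapy',allow_domains=['stackoverflow.com'])
for link in le_combo.extract_links(response1):
    print(link.url)
# expected: only http://stackoverflow.com/tags/scrapy/info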
4. deny_domains: the opposite of allow_domains.
The following excludes the 'github.com' domain:
deny_domains='github.com'
le04=LinkExtractor(deny_domains=deny_domains)
links04=le04.extract_links(response1)
for link04 in links04:
    print(link04.url)
>>>http://example1.com/intro/install.html
http://example1.com/intro/tutorial.html
http://example1.com/examples.html
http://stackoverflow.com/tags/scrapy/info
5. restrict_xpaths: an XPath expression (or list of them); only links inside the matched elements are extracted. The following extracts all links under the div with id="top":
le05=LinkExtractor(restrict_xpaths='//div[@id="top"]')
links05=le05.extract_links(response1)
for link05 in links05:
    print(link05.url)
>>>http://example1.com/intro/install.html
http://example1.com/intro/tutorial.html
http://example1.com/examples.html
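LinkExtractor also has a restrict_css parameter that does the same thing with a CSS selector; the following should be equivalent to the XPath version above:
le_css=LinkExtractor(restrict_css='div#top')
for link in le_css.extract_links(response1):
    print(link.url)
# expected: the same three links under div#top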
6. tags and attrs: these are usually used together; for example, pulling the src attribute out of script tags:
le06=LinkExtractor(tags='script',attrs='src')
links06=le06.extract_links(response2)
for link06 in links06:
    print(link06.url)
# I only get one line back here, while the book shows two.
# My guess: the <script> tags in example2.html are never closed, so the HTML
# parser treats the second <script> as text inside the first one and only the
# first src attribute is seen. Closing the tags should give both lines.
>>>http://example2.com/js/app1.js
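One way to check that guess is to build a response from the same markup but with the script tags closed, and run the same extractor (the HTML string here is mine, not the file on disk):
html_fixed=b"""<html><head>
<script type='text/javascript' src='/js/app1.js'></script>
<script type='text/javascript' src='/js/app2.js'></script>
</head><body></body></html>"""
response_fixed=HtmlResponse(url='http://example2.com',body=html_fixed,encoding='utf-8')
for link in le06.extract_links(response_fixed):
    print(link.url)
# expected: both /js/app1.js and /js/app2.js, resolved against http://example2.com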
There is also a process_value parameter; I'll take a closer look at it some other time.
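Briefly, process_value takes a function that is called with each raw attribute value and returns the URL to use (or None to drop the link). A rough sketch against response2, pulling the real paths out of the javascript:goToPage(...) hrefs (the regex and names are mine):
import re

def process_value(value):
    # turn "javascript:goToPage('/doc.html');return false" into "/doc.html"
    m=re.search(r"goToPage\('(.*?)'",value)
    if m:
        return m.group(1)
    return value    # keep ordinary hrefs such as /home.html unchanged

le07=LinkExtractor(process_value=process_value)
for link in le07.extract_links(response2):
    print(link.url)
# expected: home.html, doc.html and example.html resolved against http://example2.com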
----------------------------review version----------------------------
A worked example:
# -*- coding: utf-8 -*-
import scrapy
from books.items import BooksItem
from scrapy.linkextractors import LinkExtractor


class BookInfoSpider(scrapy.Spider):
    name = 'book_info'
    allowed_domains = ['books.toscrape.com']
    #start_urls = ['http://books.toscrape.com']

    def start_requests(self):
        url = 'http://books.toscrape.com'
        headers = {'User-Agent': 'Mozilla/5.0'}
        yield scrapy.Request(url, callback=self.parse_book, headers=headers)

    def parse_book(self, response):
        paths = response.xpath('//li[@class="col-xs-6 col-sm-4 col-md-3 col-lg-3"]/article')
        for path in paths:
            book = BooksItem()
            book['name'] = path.xpath('./h3/a/text()').extract_first()
            book['price'] = path.xpath('./div[2]/p[1]/text()').extract_first()
            yield book
        # Follow the "next" pagination link with a LinkExtractor
        le = LinkExtractor(restrict_xpaths='//li[@class="next"]')
        links = le.extract_links(response=response)
        for link in links:
            next_page = link.url
            if next_page:
                yield scrapy.Request(next_page, callback=self.parse_book)
        # Equivalent approach without LinkExtractor:
        # next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        # if next_page:
        #     next_page = response.urljoin(next_page)
        #     yield scrapy.Request(next_page, callback=self.parse_book)
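The spider above assumes the books project defines BooksItem with the two fields used in parse_book; a minimal items.py sketch:
# books/items.py (minimal version assumed above)
import scrapy

class BooksItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
Run it with scrapy crawl book_info -o books.csv to write the scraped items to a CSV file.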






