Using Regular Expressions with BeautifulSoup in Python Web Scraping

Author: 鸡仔说 | Published: 2016-12-09 08:55 | Views: 2557

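In BeautifulSoup, you can use the name and attrs arguments to match a tag's name and attributes and so locate a specific piece of HTML. Better still, attrs supports regular expressions.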

For example:

<div class="cool">
    <h1 class="abc">design</h1>
</div>

To find the h1 line, you can write:

abcSoup = soup.find(name="h1", attrs={"class":"abc"})
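
For anyone who wants to run the lookup above end to end, here is a minimal sketch. It assumes the bs4 package is installed and uses Python's built-in "html.parser"; the variable names html, abc_soup and heading_text are only illustrative and do not come from the original snippet.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the example above.
html = """
<div class="cool">
    <h1 class="abc">design</h1>
</div>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
abc_soup = soup.find(name="h1", attrs={"class": "abc"})

heading_text = abc_soup.get_text()
print(heading_text)
```

Because find() returns None when there is no match, real scraping code would normally check the result before calling get_text().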

But suppose the markup becomes:

<div class="cool">
    <h1 class="abc">design</h1>
    <h1 class="abc test1">design photo</h1>
    <h1 class="abc test2">design product</h1>
</div>

Now, to find all three h1 elements in one call, one option is a regular expression.

import re

abcSouplist = soup.find_all(name="h1", attrs={"class": re.compile(r"abc(\s\w+)?")})

This finds:

<h1 class="abc">design</h1>
<h1 class="abc test1">design photo</h1>
<h1 class="abc test2">design product</h1>
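
The regex search above can be tried with the following sketch, again assuming bs4 and the built-in "html.parser". It also compares the regex with a plain string: BeautifulSoup treats class as a multi-valued attribute and checks each class name separately, so in this particular markup the string "abc" appears to match the same three tags. The last query, using the hypothetical pattern ^test\d+$, is an illustration of a case where a regex is genuinely needed, namely matching class names that follow a pattern.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the example above.
html = """
<div class="cool">
    <h1 class="abc">design</h1>
    <h1 class="abc test1">design photo</h1>
    <h1 class="abc test2">design product</h1>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The pattern from the article; it is applied to the class names of each tag.
by_regex = soup.find_all(name="h1", attrs={"class": re.compile(r"abc(\s\w+)?")})

# A plain string is compared against each individual class name.
by_string = soup.find_all(name="h1", attrs={"class": "abc"})

# Illustrative pattern (not from the article): class names like test1, test2.
by_token = soup.find_all(name="h1", attrs={"class": re.compile(r"^test\d+$")})

regex_texts = [h.get_text() for h in by_regex]
string_texts = [h.get_text() for h in by_string]
token_texts = [h.get_text() for h in by_token]
print(regex_texts, string_texts, token_texts)
```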

Another case is finding elements based on whether an attribute exists at all. You can filter on this with True and False.

For example:

<div class="cool">
    <h1 class="abc" id="test">design</h1>
    <h1 class="abc test1">design photo</h1>
    <h1 class="abc test2">design product</h1>
</div>

To select every h1 that has no id attribute, you can filter with the following expression.

soup.find_all("h1", attrs={"id": False})

That picks out the two h1 lines below ⬇️

<h1 class="abc test1">design photo</h1>
<h1 class="abc test2">design product</h1>
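
The presence filter can be checked with a short sketch as well, under the same assumptions (bs4 installed, built-in "html.parser", illustrative variable names). It runs both directions: False for tags without an id, True for tags that have one.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the example above.
html = """
<div class="cool">
    <h1 class="abc" id="test">design</h1>
    <h1 class="abc test1">design photo</h1>
    <h1 class="abc test2">design product</h1>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# False keeps only tags where the id attribute is absent.
without_id = soup.find_all("h1", attrs={"id": False})

# True keeps only tags where the id attribute is present, whatever its value.
with_id = soup.find_all("h1", attrs={"id": True})

without_id_texts = [h.get_text() for h in without_id]
with_id_texts = [h.get_text() for h in with_id]
print(without_id_texts, with_id_texts)
```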

Article link: https://www.haomeiwen.com/subject/hfremttx.html
