Scheduled scraping of the articles published today in PRA

Author: DUTZCS | Published 2020-03-16 10:19


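Previously I scraped the titles and links of articles on arXiv. This time we do the same for the articles in Physical Review A (PRA), published by APS.
I think the most important step in scraping is studying the website first, then working out the characteristics of the data you want, and only then doing the actual scraping. Studying the site also reveals a lot about the site itself, and even about the details of the journal.
From this inspection we find the following features:
The site is divided into a number of sub-pages, including: authors (submission-related), referees (for reviewers), highlights (notable articles), accepted, press, and recent (recently published articles, which is the part we care about).
The recent page itself has two main parts.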

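On the left, from top to bottom:
1. Volume and page number (roughly speaking). On closer inspection there is one volume (vol) every half year, and the first half of 2020 is volume 101.
2. Category: Editors' Suggestion, Featured in Physics, and Open Access.
3. Article type: editorials, Rapid Communications (RC), regular articles, reviews, comments, replies, and so on.
4. The subject areas PRA accepts. Editorials come first, followed by the sections: fundamental concepts; quantum information; atomic and molecular structure and dynamics; high-precision measurement; atomic and molecular collisions and interactions; atomic and molecular processes in external fields, including strong fields and short pulses; matter waves and collective motion of cold atoms; quantum optics, laser physics, nonlinear optics, and classical optics.
On the right are the most recently published articles. Updates appear on some weekdays (Monday to Friday), but each day's batch is small, roughly 10 articles, including that day's Editors' Suggestions. Each page shows 25 entries.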

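This tells us several things about PRA: the subject areas it accepts, how to quickly find the Editors' Suggestions, and what to pay attention to when publishing an article there.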
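The half-yearly volume numbering noted above can be turned into a small calculation. The sketch below is only an extrapolation from the single data point in this post (volume 101 in the first half of 2020) and assumes exactly two volumes per year with no irregularities. `pra_volume` is a hypothetical helper, not something provided by APS.

```python
import datetime


def pra_volume(day: datetime.date) -> int:
    """Estimate the PRA volume number for a given date.

    Assumption: two volumes per year (January-June and July-December),
    anchored at volume 101 for the first half of 2020.
    """
    half_years_since_anchor = 2 * (day.year - 2020) + (1 if day.month > 6 else 0)
    return 101 + half_years_since_anchor


if __name__ == "__main__":
    print(pra_volume(datetime.date(2020, 3, 16)))
```

If the estimate ever disagrees with the volume shown on the recent page, the page is the authority and the assumption above no longer holds.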

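The recent page is not updated every day, so fetching today's articles requires first checking whether there are any. Since PRA is something I basically need to read every day, my plan is a scheduled task that scrapes every evening at 22:30. That is close to 10:30 in the morning in New York, by which time the day's update is usually complete.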
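The timing argument can be checked with a little date arithmetic. This sketch assumes the scraper runs on a machine set to UTC+8 (the post only says that 22:30 is close to 10:30 in New York, which implies a gap of about 12 hours). It uses fixed UTC offsets so that it needs no time zone database: New York is at UTC-4 during daylight saving time and UTC-5 otherwise. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

LOCAL_TZ = timezone(timedelta(hours=8))    # assumed local zone (UTC+8)
NY_SUMMER = timezone(timedelta(hours=-4))  # New York, daylight saving time
NY_WINTER = timezone(timedelta(hours=-5))  # New York, standard time


def new_york_time(local_naive: datetime, ny_tz: timezone) -> datetime:
    """Convert a naive local (UTC+8) time to New York time at a fixed offset."""
    return local_naive.replace(tzinfo=LOCAL_TZ).astimezone(ny_tz)


if __name__ == "__main__":
    run_at = datetime(2020, 3, 16, 22, 30)
    print(new_york_time(run_at, NY_SUMMER))
    print(new_york_time(run_at, NY_WINTER))
```

In both cases the job runs during the New York morning of the same calendar day, so the day of the month on the local machine matches the day printed on the page.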
The full script is as follows:

#%%
import os
import requests
import datetime
from bs4 import BeautifulSoup

from pylatex import Document, Command
from pylatex.utils import bold, NoEscape

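# Day of the month on the local machine, compared with the day shown on the page
today = datetime.datetime.now().day

# Working directory where the PDF is written (the author's local path)
os.chdir('C:\\Users\\zcs\\Desktop\\python\\pra')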

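url = 'https://journals.aps.org/pra/recent'

# Fetch the recent-articles page and parse it; fail early on HTTP errors
r1 = requests.get(url, timeout=30)
r1.raise_for_status()
soup = BeautifulSoup(r1.content, 'lxml')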

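# Publication line of the first (newest) entry on the page.
# After split(), tokens 8, 9 and 10 are used below as day, month and year.
date = soup.select(
    'div#search-result-list > div:nth-child(1) > div > div.large-9.columns > h6.pub-info'
)
text = date[0].get_text()
text_f = text.split()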

if int(text_f[8]) == today:

    print('Today is', text_f[8], text_f[9], text_f[10])

    datum = soup.select(
        'div#search-result-list > div > div > div.large-9.columns > h5 > a')
    datum_doi = soup.select(
        'div#search-result-list > div > div > div.large-9.columns > h6.pub-info'
    )

    total = 0
    title_f = []
    link_f = []
    doi_f = []
    # Walk the title links and their publication lines in parallel
    for ii, info in zip(datum, datum_doi):
        title = ii.get_text()
        link = ii.get('href')
        doi_fir = info.get_text().split()
        # First five tokens of the publication line (the journal reference)
        doi = ' '.join(doi_fir[:5])
        # Keep only the entries whose day of publication matches today
        if int(doi_fir[8]) == today:
            title_f.append(title)
            link_f.append('https://journals.aps.org' + link)
            doi_f.append(doi)
            total += 1

    # Build a LaTeX document with a title block
    doc = Document()
    doc.preamble.append(Command('title', 'Phys. Rev. A'))
    doc.preamble.append(Command('author', 'Doctor Zhao'))
    doc.preamble.append(Command('date', NoEscape(r'\today')))
    doc.append(NoEscape(r'\maketitle'))
    doc.append(
        'There are %s papers. And these are titles and links of papers you want to browse:\n \n'
        % str(total))
    for jj in range(0, total):
        doc.append([jj + 1])
        title_str = title_f[jj] + '\n'
        doc.append(bold('title: ' + title_str))
        doc.append('link: ' + link_f[jj] + '\n')
        doc.append('DOI:' + doi_f[jj] + '\n \n')

    # Compile to PDF, named after the publication date (needs a LaTeX compiler installed)
    doc.generate_pdf('PRA_' + text_f[8] + text_f[9] + text_f[10])
    print('complete')
else:
    print('No update')
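
The script above locates the date by token position, which breaks if the wording of the publication line changes, and it compares only the day of the month. A possible alternative, sketched here, extracts the full date with a regular expression. The sample string is an assumption about what the `h6.pub-info` text looks like, inferred from the token indices used in the script (the article number in it is made up), and `parse_published_date` is a hypothetical helper.

```python
import datetime
import re
from typing import Optional

# English month names mapped to month numbers, independent of the system locale
_MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

_PUBLISHED_RE = re.compile(r"Published\s+(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})")


def parse_published_date(pub_info: str) -> Optional[datetime.date]:
    """Return the publication date found in a pub-info line, or None."""
    match = _PUBLISHED_RE.search(pub_info)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = _MONTHS.get(month_name)
    if month is None:
        return None
    return datetime.date(int(year), month, int(day))


if __name__ == "__main__":
    # Assumed layout of the publication line
    sample = "Phys. Rev. A 101, 032312 (2020) - Published 16 March 2020"
    print(parse_published_date(sample) == datetime.date.today())
```

Comparing against `datetime.date.today()` instead of the bare day number avoids a false match when the newest entry is from the same day of an earlier month.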

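After that, set up a scheduled task on Windows to run the script. A PDF listing the articles published that day is then generated automatically in the folder, ready to be read comfortably the next morning.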
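
On Windows, one way to create such a daily job is the built-in `schtasks` command. The sketch below only assembles the command line. The task name, interpreter path, and script path are hypothetical placeholders, and 22:30 is the run time chosen earlier in this post.

```python
import subprocess
from typing import List


def build_schtasks_command(task_name: str, python_exe: str,
                           script_path: str, start_time: str = "22:30") -> List[str]:
    """Build a `schtasks /Create` command that runs a script once a day."""
    # /TR takes one string; quote both paths so that spaces survive
    action = f'"{python_exe}" "{script_path}"'
    return ["schtasks", "/Create", "/SC", "DAILY",
            "/TN", task_name, "/TR", action, "/ST", start_time]


if __name__ == "__main__":
    # Placeholder paths: replace with the real interpreter and script locations
    cmd = build_schtasks_command("PRA_daily",
                                 r"C:\Python38\python.exe",
                                 r"C:\path\to\pra_recent.py")
    print(cmd)
    # To actually register the task on Windows, uncomment the next line:
    # subprocess.run(cmd, check=True)
```

The same task can also be created by hand in the Task Scheduler GUI; the command form is just easier to reproduce on another machine.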


Article link: https://www.haomeiwen.com/subject/vulpdhtx.html
