Fetching and Printing Information from a Given Drug Web Page with Python


Author: FANHONGZENG | Published 2023-11-23 11:25

Target page: https://www.fda.gov/drugs/resources-information-approved-drugs/atezolizumab-urothelial-carcinoma

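In the Chrome browser, right-click the data you need and choose Inspect to see its HTML tags.

Be sure to use Google Chrome: the Copy full XPath menu item used in the steps below is found in Chrome's developer tools.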
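To make the role of the copied path concrete, here is a minimal sketch of what an XPath returns with and without a trailing text() step. The HTML snippet and its structure are made up for illustration and are not the FDA page.

```python
from lxml import etree

# Hypothetical markup, only for illustration (not the FDA page)
SAMPLE_HTML = """
<html><body>
  <div><main><article>
    <header><h1>Example Drug Title</h1></header>
    <div><p>First paragraph.</p></div>
  </article></main></div>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# Without text(): the path selects element objects
elements = tree.xpath('/html/body/div/main/article/header/h1')

# With text(): the path selects the text nodes inside the element, as a list of strings
texts = tree.xpath('/html/body/div/main/article/header/h1/text()')

print(elements[0].tag)
print(texts)
```

Because the result is always a list, an empty list means the path did not match anything, which is the first thing to check when a copied path stops working.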

Method 1


from requests_html import HTMLSession 

from lxml import etree 

session = HTMLSession() 

r =session.get('https://www.fda.gov/drugs/resources-information-approved-drugs/atezolizumab-urothelial-carcinoma')

## Print the full HTML of the page

print(r.html.html)

html=etree.HTML(r.html.html)

#### Extract the title

titles=html.xpath('/html/body/div[2]/div[1]/div/main/article/header/section/div/h1/text()')

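# Use the full XPath

# In Chrome, right-click the data you need and choose Inspect to find its HTML tag

# Find the HTML for the title, right-click it and choose Copy > Copy full XPath, then append /text() to the copied path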



print(titles)

#['Atezolizumab for Urothelial Carcinoma']

#### Extract the first paragraph

first_paragraph=html.xpath('/html/body/div[2]/div[1]/div/main/article/div/div[1]/text()')  

print(first_paragraph)

#['On May 18, 2016, the U. S. Food and Drug Administration gave accelerated approval to atezolizumab

# injection (Tecentriq, Genentech, Inc.) for the treatment of patients with locally advanced or metastatic 

#urothelial carcinoma who have disease progression during or following platinum-containing chemotherapy

# or have disease progression within 12 months of neoadjuvant or adjuvant treatment with platinum-containing 

#chemotherapy. \xa0\xa0Atezolizumab is a programmed death-ligand 1 (PD-L1) blocking antibody.']

#### Extract the date

date=html.xpath('/html/body/div[2]/div[1]/div/main/article/aside[1]/section/div/aside/ul/div/li/div/p/time/text()')

print(date)

#['05/19/2016']
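The first-paragraph output above contains \xa0 characters (non-breaking spaces) and comes back as a list of text nodes. A minimal sketch of tidying such a result into one plain string follows; `clean_text` is a hypothetical helper name, not part of lxml or requests_html.

```python
def clean_text(nodes):
    """Join text nodes into one string and normalise whitespace.

    str.split() with no arguments splits on any Unicode whitespace,
    which includes the non-breaking space \xa0.
    """
    return " ".join(" ".join(nodes).split())


# Shortened sample in the same shape as the xpath() result above
sample = ['chemotherapy. \xa0\xa0Atezolizumab is a programmed death-ligand 1 (PD-L1) blocking antibody.']

cleaned = clean_text(sample)
print(cleaned)
```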

Method 2


import requests

import lxml.html

#### Extract the title

html = requests.get('https://www.fda.gov/drugs/resources-information-approved-drugs/atezolizumab-urothelial-carcinoma')

doc = lxml.html.fromstring(html.content)

new_releases = doc.xpath('//section[@id="block-entityviewcontent-2"]')[0]

titles = new_releases.xpath('.//h1[@class="content-title text-center"]/text()')

print(titles)

#['Atezolizumab for Urothelial Carcinoma']

#### Extract the first paragraph

new_releases2=doc.xpath('//div[@class="col-md-8 col-md-push-2"]')[0]

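## The first paragraph has no identifying attribute in the HTML, so the full XPath is used here

## A path that starts with / is evaluated from the document root, whichever element .xpath() is called on

first_paragraph=new_releases2.xpath('/html/body/div[2]/div[1]/div/main/article/div/div[1]/text()')

print(first_paragraph)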

#['On May 18, 2016, the U. S. Food and Drug Administration gave accelerated approval to 

#atezolizumab injection (Tecentriq, Genentech, Inc.) for the treatment of patients with locally

#advanced or metastatic urothelial carcinoma who have disease progression during or following

#platinum-containing chemotherapy or have disease progression within 12 months of neoadjuvant 

#or adjuvant treatment with platinum-containing chemotherapy. \xa0\xa0Atezolizumab is a programmed

# death-ligand 1 (PD-L1) blocking antibody.']

#### Extract the date

new_releases3=doc.xpath('//div[@class="node-current-date"]')[0]

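## .// searches only inside new_releases3; a bare string predicate such as time["..."] is always true in XPath and filters nothing, so none is used

date=new_releases3.xpath('.//time/text()')

print(date)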

#['05/19/2016']
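A note on selecting the date element: in XPath 1.0 a predicate that is only a non-empty string literal always evaluates to true, and a path beginning with // searches the whole document even when called on a sub-element. The sketch below shows both effects on a made-up snippet; the markup, including the `datetime` attribute and the second `time` element, is an assumption for illustration and not taken from the FDA page.

```python
import lxml.html

# Hypothetical markup with two <time> elements, only for illustration
SAMPLE_HTML = """
<html><body>
  <div class="node-current-date"><time datetime="2016-05-19T03:36:00Z">05/19/2016</time></div>
  <div class="other"><time datetime="2020-01-01T00:00:00Z">01/01/2020</time></div>
</body></html>
"""

doc = lxml.html.fromstring(SAMPLE_HTML)
block = doc.xpath('//div[@class="node-current-date"]')[0]

# Bare string predicate: always true, and // starts at the document root,
# so every <time> in the document is matched
all_times = block.xpath('//time["2016-05-19T03:36:00Z"]/text()')

# Attribute test: only the <time> whose datetime attribute equals the value
by_attr = block.xpath('//time[@datetime="2016-05-19T03:36:00Z"]/text()')

# Relative path: only <time> elements inside the selected div
relative = block.xpath('.//time/text()')

print(all_times)
print(by_attr)
print(relative)
```

On a page with a single `time` element all three forms return the same list, which is why the difference is easy to miss.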

References:

https://timber.io/blog/an-intro-to-web-scraping-with-lxml-and-python/

https://www.w3school.com.cn/xpath/xpath_syntax.asp
