Capturing trademark search data from the trademark website with mitmproxy

Author: 雨夜剪魂 | Published 2019-07-29 14:56

First, start mitmproxy with the following command:

mitmweb -s 抓取.py

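We need to filter on the request URL: when the URL contains the string ajax (the script checks for txnRead02.ajax), we take the response body, parse the XML with bs4, and extract the fields we need. The code is as follows: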

from mitmproxy import ctx
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver


# def request(flow):
#     flow.request.headers['user-agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'


def run_selenium():
    # Open the search page in a headless browser.
    # PhantomJS support was removed in Selenium 4, so this call needs an older Selenium release.
    driver = webdriver.PhantomJS()
    url = 'http://wsjs.saic.gov.cn/txnRead01.do?SVVVdE0o=KalIkqkedI6edI6edpSi_r6ZKYhRAJQahSFFMpYtTEaqqH0'
    driver.get(url)


def response(flow):
    # mitmproxy calls this hook for every response that passes through the proxy.
    ctx.log.info('Captured URL: ' + flow.request.url)
    if 'txnRead02.ajax' in flow.request.url:
        # The 'xml' parser of bs4 requires lxml to be installed.
        soup = BeautifulSoup(flow.response.text, 'xml')
        for record in soup.find_all('record'):
            item = {}
            item['index'] = record.find('index').get_text()
            item['注册号'] = record.find('sn').get_text()      # registration number
            item['中文名称'] = record.find('hnc').get_text()   # Chinese name
            item['注册时间'] = record.find('mno').get_text()   # registration date
            item['英文名称'] = record.find('hne').get_text()   # English name
            item['国际分类'] = record.find('nc').get_text()    # international class
            ctx.log.warn(str(item))
            df = pd.DataFrame(item, index=['0'])
            # get_text() returns a string, so compare with '1' rather than 1;
            # with the integer comparison the header row is never written.
            header = item['index'] == '1'
            df.to_csv('/爬虫例子/商标.csv', mode='a', encoding='utf_8_sig', index=False, header=header)
        # [ctx.log.warn(a.get('href')) for a in soup.find_all('a')]


if __name__ == "__main__":
    # Runs only when the file is executed directly;
    # it does not run when mitmweb loads the script with -s.
    run_selenium()
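
The parsing and CSV-append steps of the script above can also be tried in isolation, without mitmproxy running. The following is only a minimal sketch under stated assumptions: it swaps in the standard library (xml.etree.ElementTree and csv) for bs4 and pandas so that it has no third-party dependencies, the sample XML is a made-up payload whose layout is inferred from the tag names used in the script (record, index, sn, hnc, mno, hne, nc), and the English column names and the helper names parse_records and append_rows are hypothetical. It also decides whether to write the header by checking whether the output file is empty, which is one possible alternative to relying on the index value of the first record.

```python
import csv
import os
import xml.etree.ElementTree as ET

# XML tag -> CSV column. The tags come from the script above;
# the English column names are illustrative assumptions.
FIELDS = {
    "index": "index",
    "sn": "registration_no",
    "hnc": "chinese_name",
    "mno": "registration_date",
    "hne": "english_name",
    "nc": "international_class",
}

# Hypothetical sample payload; the real response layout is assumed, not verified.
SAMPLE_XML = (
    "<result>"
    "<record><index>1</index><sn>100001</sn><hnc>示例</hnc>"
    "<mno>2019-01-01</mno><hne>EXAMPLE</hne><nc>9</nc></record>"
    "<record><index>2</index><sn>100002</sn><hnc>样本</hnc>"
    "<mno>2019-02-01</mno><hne></hne><nc>35</nc></record>"
    "</result>"
)


def parse_records(xml_text):
    """Return one dict per <record> element, keyed by CSV column name."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter("record"):
        # findtext falls back to the default when a tag is missing,
        # so a sparse record yields empty strings instead of raising.
        rows.append({
            column: (record.findtext(tag, default="") or "").strip()
            for tag, column in FIELDS.items()
        })
    return rows


def append_rows(rows, path):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # utf-8-sig mirrors the encoding used in the script above.
    with open(path, "a", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=list(FIELDS.values()))
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    for row in parse_records(SAMPLE_XML):
        print(row)
```

Inside a mitmproxy response hook, the same two helpers could be called with flow.response.text and the target CSV path; deciding on the header from the file state means a capture that starts on a later results page still produces a header row.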