XIN71 - Week 51 - Web Scraping (Simple Data Collection)

Author: XIN71 | Published 2018-12-21 22:33 | Read 4 times

import urllib.parse
import urllib.request

url = 'http://********/Country_IN/Russia_IN/SearchRUS.aspx'

# Request headers copied from the browser's XHR request
headers = {
    'Host': '********',
    'Connection': 'keep-alive',
    #'Content-Length': '399',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Origin': '********',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Referer': '********',
    #'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cookie': '********'
}

# Form fields sent in the POST body
formdata = {
    'action': 'GetList',
    'TmpID': '********',
    'Starttime': '2018-06',
    'Endtime': '2018-06',
    'page': '1',
    'rows': '2'
}

# URL-encode the form and convert it to bytes; passing data makes this a POST
data = urllib.parse.urlencode(formdata).encode("utf-8")
request = urllib.request.Request(url, data=data, headers=headers)
response = urllib.request.urlopen(request).read().decode("utf-8")
print(response)
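
To see what the request step sends without contacting any server, here is a minimal sketch. The helper name `build_post_request`, the `example.com` URL, and the shortened form are assumptions for illustration only and are not part of the original script.

```python
import urllib.parse
import urllib.request


def build_post_request(url, form, headers=None):
    """Build (but do not send) a form-encoded POST request."""
    body = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers or {})


# Hypothetical endpoint and fields, for illustration only.
req = build_post_request(
    "http://example.com/search",
    {"action": "GetList", "page": "1", "rows": "2"},
)

print(req.get_method())  # the method is POST because data is not None
print(req.data)          # the URL-encoded body as bytes
```

Building the request separately from sending it makes it easy to check the encoded body before any network traffic happens.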

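3. Data cleaning

import json

# Parse the response string into a dict.
# json.loads replaces eval here: it assumes the endpoint returns standard JSON
# and avoids executing arbitrary text received from the server.
response_dict = json.loads(response)

constant = response_dict['rows']

for record in constant:
    print(record)
    print('-----------------------------\n')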
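
A minimal offline sketch of the parsing step, assuming the endpoint returns standard JSON. `json.loads` is used because it understands JSON literals such as `true` and `null`, which Python's `eval` does not. The sample payload, the `checked` field, and the helper `extract_rows` are made up for illustration; only the `rows` key and `CUSTOMS_REGISTRATION_DATE` come from the script above.

```python
import json

# Made-up sample payload shaped like the response used above.
sample = (
    '{"total": 2, "rows": ['
    '{"CUSTOMS_REGISTRATION_DATE": "2018-06-01", "checked": true}, '
    '{"CUSTOMS_REGISTRATION_DATE": "2018-06-02", "checked": null}]}'
)


def extract_rows(text):
    """Parse a JSON response and return its list of row dicts."""
    payload = json.loads(text)
    # Fall back to an empty list if the "rows" key is missing.
    return payload.get("rows", [])


rows = extract_rows(sample)
for record in rows:
    print(record["CUSTOMS_REGISTRATION_DATE"])
```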

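4. Writing to the database

import pymssql

server = "********"
user = "********"
password = "********"

conn = pymssql.connect(server, user, password, database="ODS")
cursor = conn.cursor()

# Read and print the existing rows (the query is redacted in the source)
cursor.execute("SELECT ******** FROM ******** ")
client_list = cursor.fetchall()
for row in client_list:
    print(row)

# Insert one row per scraped record. The value is passed as a query parameter,
# so the driver handles quoting and no manual escaping is needed.
sql = "insert into customs.****(shipping_date) values(%s)"
for record in constant:
    cursor.execute(sql, (record['CUSTOMS_REGISTRATION_DATE'],))

conn.commit()
conn.close()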
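
The insert step can be tried without a SQL Server instance. The sketch below swaps in the standard-library `sqlite3` module as a stand-in for `pymssql`, so the placeholder is `?` rather than pymssql's `%s`. The table name `shipments` and the sample records are assumptions for illustration.

```python
import sqlite3

# Made-up records shaped like the scraped rows; the second one contains quotes.
records = [
    {"CUSTOMS_REGISTRATION_DATE": "2018-06-01"},
    {"CUSTOMS_REGISTRATION_DATE": "2018-06-02 'quoted'"},
]

demo_conn = sqlite3.connect(":memory:")  # throwaway in-memory database
demo_conn.execute("CREATE TABLE shipments (shipping_date TEXT)")

# Values are passed as parameters, so quotes in the data need no escaping.
demo_conn.executemany(
    "INSERT INTO shipments (shipping_date) VALUES (?)",
    [(r["CUSTOMS_REGISTRATION_DATE"],) for r in records],
)
demo_conn.commit()

stored = [
    row[0]
    for row in demo_conn.execute(
        "SELECT shipping_date FROM shipments ORDER BY rowid"
    )
]
print(stored)
demo_conn.close()
```

Passing values as parameters instead of formatting them into the SQL string keeps quoting correct and guards against SQL injection.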