
Week2 Homework: 100K data from Ganji

Author: 快要没时间了 | Published 2016-05-28 23:44
  • Since I don't have a Weibo account, the homework will be recorded here.

Target

Grab the data for every category listed at the bottom of the page.
Detailed instructions are shown below.

[Image: Categories (20 total)]
[Image: Elements on the detail page]

**Tips**
1. You do not need to crawl the irregular pages.
2. Using a multi-process approach is strongly recommended.

Here we go

Step 1: Get all the categories' Links

from bs4 import BeautifulSoup
import requests


start_url = 'http://bj.ganji.com/wu/'
url_host = 'http://bj.ganji.com'

def get_index_url(url):
    wb_data = requests.get(url)
    # Ganji pages are not served as UTF-8 by default, so fix the encoding first
    wb_data.encoding = wb_data.apparent_encoding
    soup = BeautifulSoup(wb_data.text, 'lxml')
    links = soup.select('.fenlei > dt > a')  # the category entries at the bottom of the page
    for link in links:
        page_url = url_host + link.get('href')
        print(page_url)

The code is easy to implement, but you need to pay attention to encoding problems, so I checked the Requests docs and added the wb_data.encoding line.
Then run the code as **get_index_url(start_url)** and you will get all the category links shown below:

[Image: Results 1 — the printed category links]

Remember to save all the links in *channel_list* (a multi-line string that gets split into a list later).

channel_list = '''
http://bj.ganji.com/jiaju/
http://bj.ganji.com/rirongbaihuo/
http://bj.ganji.com/shouji/
http://bj.ganji.com/shoujihaoma/
http://bj.ganji.com/bangong/
http://bj.ganji.com/nongyongpin/
http://bj.ganji.com/jiadian/
http://bj.ganji.com/ershoubijibendiannao/
http://bj.ganji.com/ruanjiantushu/
http://bj.ganji.com/yingyouyunfu/
http://bj.ganji.com/diannao/
http://bj.ganji.com/xianzhilipin/
http://bj.ganji.com/fushixiaobaxuemao/
http://bj.ganji.com/meironghuazhuang/
http://bj.ganji.com/shuma/
http://bj.ganji.com/laonianyongpin/
http://bj.ganji.com/xuniwupin/
http://bj.ganji.com/qitawupin/
http://bj.ganji.com/ershoufree/
http://bj.ganji.com/wupinjiaohuan/
'''
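
If you would rather not paste the printed output by hand, a small variant of get_index_url can collect the same 20 links into a Python list (a sketch, not the original homework code; the name collect_channel_links is mine):

def collect_channel_links(index_url):
    wb_data = requests.get(index_url)
    wb_data.encoding = wb_data.apparent_encoding
    soup = BeautifulSoup(wb_data.text, 'lxml')
    # same selector as get_index_url above, but return the URLs instead of printing them
    return [url_host + link.get('href') for link in soup.select('.fenlei > dt > a')]

# channels = collect_channel_links(start_url)  # roughly equivalent to channel_list.split()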

Step 2: Get Links in each category

Now we are going to crawl each category's listing pages, which look like the one below:

[Image: an example listing page]

For each item in the list, we can pick out the item link with BeautifulSoup.

import requests
import pymongo
from bs4 import BeautifulSoup

def fromChannelGetLink(channel, page):
    client = pymongo.MongoClient('localhost', 27017)
    db = client['ganjiDB']
    urlListSheet = db['urlList']
    # e.g. http://gz.ganji.com/jiaju/o3/
    url = channel + 'o' + str(page) + '/'  # build the URL of the given page in this channel
    print(url)
    try:
        response = requests.get(url, headers=headers, timeout=5.0)  # headers is the dict shown in "To Increase the Speed?"
        soup = BeautifulSoup(response.text, 'lxml')
        links = soup.select('a.ft-tit')
        if soup.find('ul', 'pageLink'):  # the pager only exists on a valid listing page
            for link in links:
                link = link.get('href')
                if link.find('zhuanzhuan') > 0 or link.find('click') > 0:  # skip useless redirect links
                    continue
                urlListSheet.insert_one({'url': link})
        else:
            print(url + " is probably past the last page of this channel")
            return 1
        return 0
    except Exception:
        return 2

There is a problem: how do we make sure a page is the last one of its channel?
At first I tried to judge it by the number of items on the page: if there were fewer than 5, it was probably the last page. But that approach always ended the crawl too early, and I still haven't figured out why.
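
For reference, that first attempt was roughly an item-count check like the sketch below (my reconstruction from the description, not the original code):

def looks_like_last_page(soup, min_items=5):
    # rejected heuristic: if a listing page shows fewer than min_items entries,
    # assume we have run past the last page of the channel
    return len(soup.select('a.ft-tit')) < min_items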

[Image: final page judgement]

So I use another check, the one in the code above: I look for the pager (ul.pageLink) at the bottom of the page, which is more obvious and accurate.
All of this work is then spread across multiple processes, which is not difficult in Python.

from tools import *  # fromChannelGetLink and channel_list live in tools.py
from multiprocessing import Pool
import pymongo
import time

def getLinksFromChannel(channel):
    pageNumber = 1
    while True:
        ret = fromChannelGetLink(channel, pageNumber)
        if ret == 1:  # no pager found, assume this channel is finished
            break
        time.sleep(0.5)
        pageNumber += 1
    print(channel, " finished crawling this channel, last page:", pageNumber)

if __name__ == '__main__':
    pool = Pool()
    pool.map(getLinksFromChannel, channel_list.split())
    pool.close()
    pool.join()

But something is still wrong: the crawler still can't reliably reach the final page. I guess it's because the network in my house is weak, which makes some responses get lost. So I strengthened the exit condition by using the function's return value.

[Image: the enhanced quit judgement]
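
The strengthened loop itself only lives in the screenshot, but it is roughly the following (a sketch; the counter name endCount is mine):

def getLinksFromChannel(channel):
    endCount = 0  # consecutive "this looks finished" signals
    pageNumber = 1
    while True:
        ret = fromChannelGetLink(channel, pageNumber)
        if ret == 0:  # page crawled fine: reset the counter and move on
            endCount = 0
            pageNumber += 1
        else:  # ret == 1 (no pager found) or ret == 2 (request failed)
            endCount += 1
            if endCount >= 5:  # only give up after five confirmations in a row
                break
        time.sleep(0.5)
    print(channel, " finished crawling this channel, last page:", pageNumber)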

In this way, the process has to confirm the end condition five times in a row before it stops crawling a channel. In the end I got 48,419 links across all the channels. (Maybe bj.ganji.com does have more than 50k items, but whatever. Q.Q)

[Image: all the links stored in MongoDB]

Step 3: Get all Items' Info

import requests
import pymongo
from bs4 import BeautifulSoup

def getItemInfo(url):
    url = url.strip()
    client = pymongo.MongoClient('localhost', 27017)
    db = client['ganjiDB']
    itemListSheet = db['itemList']
    print(url)
    try:
        wb_data = requests.get(url, headers=headers)
        wb_data.encoding = wb_data.apparent_encoding
        if wb_data.status_code == 404:
            print('Item is sold')
        else:
            soup = BeautifulSoup(wb_data.text, 'lxml')
            title = soup.head.title.string
            data = soup.select('i.pr-5')[0].string.strip()[:-3]  # publish time, with the trailing label trimmed
            mtype = soup.select('ul.det-infor > li > span > a')[0].string
            price = soup.select('i.f22.fc-orange')[0].string
            location = soup.select('div.det-laybox > ul.det-infor > li')[2].get_text().replace(' ', '').replace('\n', '')
            otherInfo = soup.select('ul.second-det-infor')
            if otherInfo:
                otherInfo = otherInfo[0].get_text().replace(' ', '').replace('\n', '')
            print(title, data, mtype, price, location, otherInfo, url)
            itemListSheet.insert_one({
                'title': title,
                'data': data,
                'type': mtype,
                'price': price,
                'location': location,
                'otherInfo': otherInfo
            })
    except IndexError as elist:
        # some pages are missing one of the elements selected above
        print(url, " list IndexError ", elist)
    except Exception as e:
        print(url, " has Exception ", e)

Two kinds of exceptions are handled in this code. I still can't fully understand why they happen; maybe it's caused by the poor internet connection in this building?
Here is the main code:

from multiprocessing import Pool
import pymongo

if __name__ == '__main__':
    client = pymongo.MongoClient('localhost', 27017)
    db = client['ganjiDB']
    urlListSheet = db['urlList']
    itemListSheet = db['itemList']
    itemList = [each['url'] for each in urlListSheet.find()]  # every link saved in Step 2
    print(len(itemList))
    pool = Pool()
    pool.map(getItemInfo, itemList)  # getItemInfo is the function above
    pool.close()
    pool.join()

First I read all the links (48K) out of MongoDB and put them in a list, then I dispatch that list to *getItemInfo* through the process pool.

To Increase the Speed?
Try these out:

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36',
    'Connection': 'keep-alive'
}

Free proxy IPs can be found at http://cn-proxy.com/:

import random

proxy_list = ['http://117.177.250.151:8081', 'http://111.85.219.250:3129', 'http://122.70.183.138:8118']
proxy_ip = random.choice(proxy_list)  # pick a proxy at random
ipproxies = {'http': proxy_ip}
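
The snippets above don't show how they plug into the actual request; a minimal sketch (the helper name get_with_proxy is mine, not part of the original code) could be:

def get_with_proxy(url):
    # proxy_list and headers are the ones defined above
    proxy_ip = random.choice(proxy_list)
    return requests.get(url, headers=headers, proxies={'http': proxy_ip}, timeout=5.0)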


Finally

As we were taught in the video, people are curious about how the program is running, so I made a counter that runs in a separate terminal while the spiders are working.

import time
import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client['ganjiDB']
urlListSheet = db['urlList']
itemListSheet = db['itemList']

# print the size of both collections every 5 seconds
while True:
    print('urlList Size is:', urlListSheet.find().count())
    print('itemList Size is:', itemListSheet.find().count())
    print('---------------------------')
    time.sleep(5)

Appendix
[MongoDB_Tutorial ( cn_Zh )](http://www.runoob.com/mongodb/mongodb-tutorial.html)
[MongoDB_CheatSheet.pdf (en)](https://blog.codecentric.de/files/2012/12/MongoDB-CheatSheet-v1_0.pdf)
