Scraping in Practice, Part 1: What the Data Reveals About How Onlookers View the 罗志祥 Incident


Author: 有趣的数据 | Published 2020-04-25 01:36

Hours after the story broke, rumors of every variety were already circulating. How did the onlookers actually see the incident? Below we look at it from a data perspective.

Site the data was scraped from:

https://m.weibo.cn

Code:

```python
import csv

# Main crawling function
def get_comment1(list_id):
    with open('weibo_comment_zhouyangqing1.csv', 'a', newline='') as csvfile:
        writer = csv.writer(csvfile, dialect='excel')
        writer.writerow(['微博id', '用户id', '用户名称', '性别', '身份认证',
                         '描述', '评论', '回复', '点赞', '认证'])
        for id in list_id:
            max_id = ""
            Data_all = []
            while True:
                # first page has no cursor; later pages pass max_id
                if max_id == "":
                    p_url = ("https://m.weibo.cn/comments/hotflow?id=" + str(id) +
                             "&mid=" + str(id) + "&max_id_type=0")
                else:
                    p_url = ("https://m.weibo.cn/comments/hotflow?id=" + str(id) +
                             "&mid=" + str(id) + "&max_id=" + str(max_id) +
                             "&max_id_type=1")
                print(p_url)
                dic_data = get_text_one(p_url)  # fetch one page of comment JSON
                try:
                    if dic_data is not None:
                        max_id = dic_data["data"]["max_id"]  # cursor for the next page
                        datalist = dic_data['data']['data']
                        for d in datalist:
                            userid = d['user']['id']
                            username = d['user']['screen_name']
                            gender = d['user']['gender']
                            verified = d['user']['verified']   # identity-verification flag
                            comment = d['text']                # comment text
                            total_number = d['total_number']   # reply count
                            like_count = d['like_count']       # like count
                            try:
                                verified_reason = d['user']['verified_reason']
                                description = d['user']['description']
                            except KeyError:
                                verified_reason = 0
                                description = 0
                            Data_all.append([id, userid, username, gender, verified,
                                             description, comment, total_number,
                                             like_count, verified_reason])
                        csv_w('weibo_comment_zhouyangqing1.csv', Data_all)
                    else:
                        break
                except KeyError:
                    break  # stop paging when the response has no more data
```
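The helpers `get_text_one` and `csv_w` are called above but never shown. A minimal, stdlib-only sketch of what they might look like follows; everything in it is a reconstruction, and the header values (especially the Cookie, which m.weibo.cn generally requires for more than the first page) are assumptions you would need to fill in yourself:

```python
import csv
import json
from urllib.request import Request, urlopen
from urllib.error import URLError

# Hypothetical reconstruction of the two helpers used by get_comment1.
HEADERS = {
    # A mobile User-Agent; m.weibo.cn usually also needs a real logged-in Cookie.
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_2 like Mac OS X)",
    "Cookie": "PASTE_YOUR_COOKIE_HERE",
}

def get_text_one(url):
    """Fetch one page of comments and return the parsed JSON, or None on failure."""
    try:
        with urlopen(Request(url, headers=HEADERS), timeout=10) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (URLError, ValueError):
        return None

def csv_w(filename, rows):
    """Append a batch of comment rows to the CSV file."""
    with open(filename, "a", newline="", encoding="utf-8-sig") as f:
        csv.writer(f, dialect="excel").writerows(rows)
```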

Results:

Gender split among the commenters: 34% male vs. 66% female.
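A minimal sketch of how such a split can be computed from the scraped CSV. The inline DataFrame below is a toy stand-in for `pd.read_csv('weibo_comment_zhouyangqing1.csv')`; the Weibo API reports gender as `'m'`/`'f'` in the user object:

```python
import pandas as pd

# Toy sample standing in for the scraped comment file; in practice:
# df = pd.read_csv('weibo_comment_zhouyangqing1.csv')
df = pd.DataFrame({"性别": ["f", "f", "m", "f", "m", "f"]})

# Share of each gender among commenters
ratio = df["性别"].value_counts(normalize=True).round(2)
print(ratio)  # f: 0.67, m: 0.33 on this toy sample
```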

Hot-word distribution:

Comment word-frequency distribution:
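The frequency counts behind these charts can be sketched with `collections.Counter`. The tokens below are made-up stand-ins for the output of jieba segmentation after stop-word removal:

```python
from collections import Counter

# Made-up tokens standing in for jieba-segmented, stop-word-filtered comments
tokens = ["时间管理", "分手", "时间管理", "周扬青", "时间管理", "分手"]

freq = Counter(tokens)
print(freq.most_common(2))  # [('时间管理', 3), ('分手', 2)]
```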

Overall comment distribution

Comment distribution for the male lead

Comment distribution for the female lead

```python
import pandas as pd
import jieba
import time
import csv
import re
from wordcloud import WordCloud
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv('./text3.csv', encoding='gb18030')
items = data['评论'].astype(str).tolist()
print(len(data))

# Build the stop-word list
def stopwordslist():
    with open('./stop_word.txt', 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

# Strip digits, dots and Latin letters
def remove_sub(input_str):
    punc = u'123456789.a-zA-Z'
    return re.sub(r'[{}]+'.format(punc), '', input_str)

# Mask image that defines the word-cloud shape
alice_mask = np.array(Image.open('./b2.png'))

cloud = WordCloud(
    font_path="./ziti.ttf",    # set a font, or Chinese renders as boxes
    background_color='white',  # background color
    mask=alice_mask,           # word-cloud shape
    max_words=200,             # maximum number of words
    max_font_size=200,         # largest font size
    random_state=1,
    width=400,
    height=800,
)

# Segment every comment and drop stop words
stopwords = stopwordslist()  # load once instead of once per comment
outstr = ''
for item in items:
    for j in jieba.cut(item, cut_all=False):
        if j not in stopwords:
            if not remove_sub(j):
                continue
            if j != '\t':
                outstr += j + " "
print(len(outstr))
with open('./text.txt', 'a') as f:
    f.write(outstr)
    f.write('\n')

# Render and save the word cloud
cloud.generate(outstr)
cloud.to_file('./pic6.png')
```

Comment timestamps were not scraped in this pass. With them you could look at when the onlookers comment, and whether they are just as energetic around the clock. Try it if you're interested.
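If timestamps were scraped, a per-hour distribution could be sketched as follows. The `created_at` strings here are made-up samples in the format the Weibo API returns, and storing them in a plain Series is an assumption; in practice you would read the column from the CSV:

```python
import pandas as pd

# Made-up created_at samples in the Weibo API's timestamp format
times = pd.Series([
    "Sat Apr 25 01:10:05 +0800 2020",
    "Sat Apr 25 01:45:30 +0800 2020",
    "Sat Apr 25 02:05:12 +0800 2020",
])

# Parse the timestamps and bucket comments by hour of day
hours = pd.to_datetime(times, format="%a %b %d %H:%M:%S %z %Y").dt.hour
print(hours.value_counts().sort_index())  # hour 1 -> 2 comments, hour 2 -> 1
```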



Original link: https://www.haomeiwen.com/subject/uvwcwhtx.html