The story broke many hours ago, and all kinds of rumors are already circulating. How do the melon-eating onlookers see this event? Below we analyze it from a data angle.
Data source:
https://m.weibo.cn
The code:
# Main scraping function
import csv

def get_comment1(list_id):
    # Write the CSV header once
    with open('weibo_comment_zhouyangqing1.csv', 'a', newline='') as csvfile:
        writer = csv.writer(csvfile, dialect='excel')
        writer.writerow(['微博id', '用户id', '用户名称', '性别', '身份认证',
                         '描述', '评论', '回复', '点赞', '认证'])
    for weibo_id in list_id:
        max_id = ""
        while True:
            # The first page carries no max_id; later pages page forward
            # with the max_id returned by the previous response
            if max_id == "":
                p_url = ("https://m.weibo.cn/comments/hotflow?id=" + str(weibo_id) +
                         "&mid=" + str(weibo_id) + "&max_id_type=0")
            else:
                p_url = ("https://m.weibo.cn/comments/hotflow?id=" + str(weibo_id) +
                         "&mid=" + str(weibo_id) + "&max_id=" + str(max_id) +
                         "&max_id_type=1")
            print(p_url)
            dic_data = get_text_one(p_url)
            if dic_data is None:
                break
            try:
                max_id = dic_data["data"]["max_id"]
                datalist = dic_data['data']['data']
            except KeyError:
                # Malformed page: stop instead of retrying the same URL forever
                break
            page_data = []
            for d in datalist:
                userid = d['user']['id']
                print(userid)
                username = d['user']['screen_name']
                gender = d['user']['gender']
                # Whether the account is verified
                verified = d['user']['verified']
                # Comment text, reply count, like count
                comment = d['text']
                total_number = d['total_number']
                like_count = d['like_count']
                try:
                    verified_reason = d['user']['verified_reason']
                    description = d['user']['description']
                except KeyError:
                    verified_reason = 0
                    description = 0
                page_data.append([weibo_id, userid, username, gender, verified,
                                  description, comment, total_number, like_count,
                                  verified_reason])
            # Write only the current page, so earlier rows are not duplicated
            csv_w('weibo_comment_zhouyangqing1.csv', page_data)
            # A max_id of 0 marks the last page
            if max_id == 0:
                break

# Usage, with the post id hard-coded in the original:
# get_comment1(['4496797961825543'])
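get_comment1 calls two helpers the post doesn't show, get_text_one and csv_w. A minimal sketch of what they might look like; the headers are assumptions, and the hotflow endpoint generally needs a logged-in Cookie copied from your own browser session:

import csv
import requests

def get_text_one(url):
    # Fetch one page of the hotflow API and return the parsed JSON dict,
    # or None on any failure
    headers = {
        'User-Agent': 'Mozilla/5.0',
        'Cookie': 'PASTE_YOUR_WEIBO_COOKIE_HERE',  # placeholder, not a real value
    }
    try:
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 200:
            return resp.json()
    except (requests.RequestException, ValueError):
        pass
    return None

def csv_w(filename, rows):
    # Append a batch of rows to the CSV
    with open(filename, 'a', newline='') as f:
        csv.writer(f).writerows(rows)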
The data:
Gender split among the commenting onlookers: male vs. female, 34% vs. 66%.
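That split can be re-derived straight from the scraped CSV; a minimal sketch with pandas, assuming the same gb18030 encoding the word-frequency script below uses:

import pandas as pd

# Tally the m/f values the scraper wrote to the 性别 column
data = pd.read_csv('weibo_comment_zhouyangqing1.csv', encoding='gb18030')
print(data['性别'].value_counts(normalize=True))  # e.g. f 0.66, m 0.34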
Hot-word distribution:
Comment word-frequency distribution:
Distribution across all comments
Distribution for comments about the male lead
Distribution for comments about the female lead
import re

import jieba
import numpy as np
import pandas as pd
from PIL import Image
from wordcloud import WordCloud

data = pd.read_csv('./text3.csv', encoding='gb18030')
items = data['评论'].astype(str).tolist()
print(len(data))

# Load the stop-word list
def stopwordslist():
    with open('./stop_word.txt', 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

# Strip digits, dots and Latin letters; a token made up only of
# these characters reduces to '' and gets skipped below
def remove_sub(input_str):
    punc = u'123456789.a-zA-Z'
    return re.sub(r'[{}]+'.format(punc), '', input_str)

alice_mask = np.array(Image.open('./b2.png'))
cloud = WordCloud(
    # Set a CJK font, otherwise Chinese renders as boxes
    font_path="./ziti.ttf",
    # Background color
    background_color='white',
    # Cloud shape taken from the mask image
    mask=alice_mask,
    # Maximum number of words
    max_words=200,
    # Largest font size
    max_font_size=200,
    random_state=1,
    width=400,
    height=800,
)

# Build the stop-word list once, not once per comment
stopwords = stopwordslist()
outstr = ''
for item in items:
    for j in jieba.cut(item, cut_all=False):
        if j in stopwords:
            continue
        if not remove_sub(j):  # token was only digits/letters/dots
            continue
        if j != '\t':
            outstr += j + " "
print(len(outstr))

# Keep a copy of the tokenized text for later reuse
with open('./text.txt', 'a') as f:
    f.write(outstr)
    f.write('\n')

cloud.generate(outstr)
cloud.to_file('./pic6.png')
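The word-frequency figures above can be reproduced from the same token stream; a minimal sketch with collections.Counter, reusing the text.txt file the script just wrote:

from collections import Counter

# Rank the space-separated tokens saved by the script above
with open('./text.txt') as f:
    tokens = f.read().split()
for word, count in Counter(tokens).most_common(20):
    print(word, count)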
One thing not scraped here is the comment timestamp. It would be interesting to look at when the onlookers post and whether they really are that tireless; give it a try if you're interested.
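Each comment dict in the hotflow response also carries a created_at field, so the scraper above only needs one more column. A sketch of an hour-of-day tally; note the timestamp format here is an assumption, and m.weibo.cn sometimes returns relative strings (e.g. "昨天 12:30") that would need extra handling:

from collections import Counter
from datetime import datetime

def hour_distribution(comments):
    # comments is dic_data['data']['data'] from the scraper above;
    # the 'Mon Apr 27 10:30:00 +0800 2020' format is an assumption
    hours = []
    for d in comments:
        ts = datetime.strptime(d['created_at'], '%a %b %d %H:%M:%S %z %Y')
        hours.append(ts.hour)
    return Counter(hours)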