Analyzing August's Popular Videos in Bilibili's Life/Funny Section

Author: 南岛鹋
Published: September 16, 2020
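Categories: Projects, Data Mining and Analysis

This post analyzes the popular videos in Bilibili's Life/Funny section (生活搞笑区): the month's hot words, and how the "triple" (三连: like, coin, favorite) and related metrics affect a video's play count.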

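Data Scraping

Choosing the Target

I wanted a large dataset, so I ruled out the site-wide hot ranking: across all sections it lists only about a thousand videos. Scraping every video on the site was beyond my skills, so I settled on the per-section ranking.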

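[Figure: the section ranking page]

This ranking lets you pick a date range, so each month's popularity ranking can be used to study that month's trends, which videos take off more easily, and how various factors affect play counts. It is only the monthly ranking of one small section and does not cover every video, yet the dataset is still very large: as the figure below shows, it holds close to 230,000 records.

[Figure: total record count]

Analyzing the Site

There is one difficulty here. The page source viewed in a browser contains the video information, but the source returned by a plain requests call does not, presumably because the list is filled in by JavaScript after the page loads. In the first two versions I therefore used the selenium library, which has two drawbacks: it is slow (it has to load the whole page the way a browser does) and it yields little information (only the title, author, video description, and the URLs of the video page and the uploader's profile). That was a hassle, so this time I switched to calling the API directly.

[Figure: page analysis]

Search the captured requests for a specific number shown on the page, and an endpoint named search turns up.

[Figure: page analysis]

Open it, and result holds exactly 20 entries, matching the 20 videos on each page.

[Figure: response data]

Each entry includes the author, title, tags, play count, and a range of other fields.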

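The endpoint is https://s.search.bilibili.com/cate/search?main_ver=v3&search_type=video&view_type=hot_rank&order=click&copy_right=-1&cate_id=138&page=1&pagesize=20&jsonp=jsonp&time_from=20200801&time_to=20200831 . Here view_type is the ranking type, page is the page number, pagesize is the number of videos per page (the upper limit seems to be 100), and time_from and time_to at the end set the date range.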
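As a minimal sketch of how these query parameters fit together, the URL can be assembled from a dictionary instead of being edited by hand. The helper name build_search_url is hypothetical; the parameter names and values are copied from the scraper's URL template in this post, and whether the endpoint still accepts them today is not something the post verifies.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SEARCH_ENDPOINT = "https://s.search.bilibili.com/cate/search"


def build_search_url(page, pagesize=20, time_from="20200801", time_to="20200831"):
    """Assemble the section-ranking URL for one page (hypothetical helper)."""
    params = {
        "main_ver": "v3",
        "search_type": "video",
        "view_type": "hot_rank",  # ranking type
        "order": "click",
        "copy_right": -1,
        "cate_id": 138,           # value copied from the post's URL
        "page": page,             # page number
        "pagesize": pagesize,     # videos per page
        "jsonp": "jsonp",
        "time_from": time_from,   # start of the date range, YYYYMMDD
        "time_to": time_to,       # end of the date range, YYYYMMDD
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = build_search_url(page=3)
query = parse_qs(urlsplit(url).query)  # parse it back to check the round trip
print(query["page"], query["time_from"])
```

Building the query from a dictionary makes it harder for a stray character to slip into a long hand-written URL.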

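I also needed the triple (三连) stats and each uploader's follower count. The same kind of inspection yields the following.

The stats endpoint for the triple: https://api.bilibili.com/x/web-interface/archive/stat?aid=371876135

Here aid is converted from the video's BV ID.

The follower-count endpoint: https://api.bilibili.com/x/relation/stat?vmid=32172331

The mid value it needs comes from the first endpoint's response.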
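The BV-to-aid conversion can be tried on its own. The sketch below is a standalone version of the dec method from the scraper in this post, using the same alphabet, character positions, and constants. The names bv_to_av, POSITIONS, OFFSET, and XOR_MASK are made up for illustration, and the post does not confirm that Bilibili still uses this mapping.

```python
ALPHABET = "fZodR9XQDSUm21yCkr6zBqiveYah8bt4xsWpHnJE7jL5VG3guMTKNPAwcF"
POSITIONS = [11, 10, 3, 8, 4, 6]  # BV characters that carry digits, least significant first
OFFSET = 0x2_0840_07C0
XOR_MASK = 0x0A93_B324


def bv_to_av(bvid):
    """Decode a 12-character BV ID into the numeric aid."""
    value = 0
    for power, pos in enumerate(POSITIONS):
        # Each selected character is one base-58 digit
        value += ALPHABET.index(bvid[pos]) * 58 ** power
    return (value - OFFSET) ^ XOR_MASK


print(bv_to_av("BV17x411w7KC"))  # 170001
```

Only six of the twelve characters take part in the decoding; the rest are fixed filler in this scheme.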

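IP Pool

At this point scraping could begin, but once the volume grows a little or the requests become a bit too frequent, the IP gets blocked.

That is where proxy IPs come in. Free proxies exist, and there are GitHub projects dedicated to building proxy pools, but free ones are a hassle in the end, so I rented a dedicated IP by the day from http://www.xdaili.cn/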
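The retry order the scraper follows (current proxy, then a fresh proxy, then a direct request) can be sketched without any network access. This is an illustration under stated assumptions, not the post's code: fetch_with_fallback, fetch, and new_proxy are hypothetical names, and the transport is injected so the logic can be exercised with a fake. With the requests library, requests.exceptions.RequestException is a subclass of OSError, so catching OSError would cover it.

```python
def fetch_with_fallback(url, proxy, fetch, new_proxy):
    """Try the current proxy, then a fresh one, then no proxy at all.

    fetch(url, proxy) and new_proxy() are caller-supplied callables
    (hypothetical names); fetch is expected to raise OSError on failure.
    Returns (response, proxy_to_use_next).
    """
    try:
        return fetch(url, proxy), proxy
    except OSError:
        proxy = new_proxy()  # the current proxy is dead or blocked: rotate
    try:
        return fetch(url, proxy), proxy
    except OSError:
        return fetch(url, None), proxy  # last resort: direct request


# Demo with a fake transport in which only the proxy "b:2" works.
def fake_fetch(url, proxy):
    if proxy != "b:2":
        raise OSError("proxy %r refused" % (proxy,))
    return "ok via " + proxy


response, proxy = fetch_with_fallback("https://example.com", "a:1", fake_fetch, lambda: "b:2")
print(response, proxy)
```

Returning the proxy together with the response lets the caller keep using the replacement for later requests.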

Code

# coding: utf-8
# Author：南岛鹋 
# Blog: www.ndmiao.cn
# Date ：2020/8/25 10:29
# Tool ：PyCharm

import ast
import csv
import json
import random
import time

import requests


class video_data:
    def __init__(self):
        self.url = 'https://s.search.bilibili.com/cate/search?main_ver=v3&search_type=video&view_type=hot_rank&order=click&copy_right=-1&cate_id=138&page={}&pagesize=20&jsonp=jsonp&time_from=20200801&time_to=20200831'
        self.page = 11507  # number of ranking pages (about 230,000 videos at 20 per page)
        self.alphabet = 'fZodR9XQDSUm21yCkr6zBqiveYah8bt4xsWpHnJE7jL5VG3guMTKNPAwcF'

    def dec(self, x):  # convert a BV ID to an AV number (aid)
        r = 0
        for i, v in enumerate([11, 10, 3, 8, 4, 6]):
            r += self.alphabet.find(x[v]) * 58 ** i
        return (r - 0x2_0840_07c0) ^ 0x0a93_b324

    def random_headers(self, path):  # pick one set of request headers at random
        # Each line of the file is expected to hold one dict literal
        with open(path, 'r') as f:
            data = f.readlines()

        # literal_eval parses each dict without executing arbitrary code
        reg = [ast.literal_eval(line) for line in data if line.strip()]
        return random.choice(reg)

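    def get_ip(self):  # fetch a proxy IP
        print('Switching IP...')
        url = 'proxy-IP API address'  # placeholder: fill in your provider's extraction URL
        ip = requests.get(url).text.strip()
        # Error bodies returned by the provider; they are matched verbatim, so left untranslated
        errors = ['{"ERRORCODE":"10055","RESULT":"提取太频繁,请按规定频率提取!"}',
                  '{"ERRORCODE":"10098","RESULT":"可用机器数量不足"}']
        if ip in errors:  # rate-limited or no machine available: sleep 14 seconds, then retry once
            time.sleep(14)
            ip = requests.get(url).text.strip()
        print(ip)
        proxies = {
            'https': 'http://' + ip,
            'http': 'http://' + ip
        }  # keep whichever of https/http you need
        return proxies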

    def get_requests(self, url, proxy):  # send one request
        headers = self.random_headers('headers.txt')
        # Send the headers and proxy; the try blocks keep a single failure from stopping the run
        try:
            response = requests.get(url, timeout=3, headers=headers, proxies=proxy)
        except requests.exceptions.RequestException as e:
            print(e)
            proxy = self.get_ip()  # the proxy failed, so switch to a new one
            try:
                response = requests.get(url, timeout=3, headers=headers, proxies=proxy)
            except requests.exceptions.RequestException as e:
                print(e)
                print('Falling back to the original IP')
                response = requests.get(url, timeout=3, headers=headers)
        return response, proxy

    def get_follower(self, mid, proxy):  # fetch the uploader's follower count
        url = 'https://api.bilibili.com/x/relation/stat?vmid=' + str(mid)
        r, proxy = self.get_requests(url, proxy)
        # A follower count always exists, so retry until the response contains one.
        # Note: this recursion is unbounded if the API keeps failing.
        try:
            follower = json.loads(r.text)['data']['follower']  # parse the JSON body
        except (ValueError, KeyError, TypeError):
            follower, proxy = self.get_follower(mid, proxy)
        return follower, proxy

    def get_view(self, BV, proxy):  # fetch the triple stats and play count
        aid = self.dec(BV)
        url = 'https://api.bilibili.com/x/web-interface/archive/stat?aid=' + str(aid)
        r, proxy = self.get_requests(url, proxy)
        # Output key -> key in the API response
        fields = {'view': 'view', 'danmu': 'danmaku', 'reply': 'reply', 'like': 'like',
                  'coin': 'coin', 'favorite': 'favorite', 'share': 'share', 'rank': 'his_rank'}
        try:
            data = json.loads(r.text)['data']
            view = {key: data[src] for key, src in fields.items()}
        except (ValueError, KeyError, TypeError):
            # A ranked video may have been deleted since; record 'None' for every field
            view = {key: 'None' for key in fields}
        return view, proxy

    def get_parse(self, result, proxy):  # assemble the rows for one page
        content = []
        items = result['result']
        for item in items:
            pubdate = item['pubdate']
            title = item['title']
            author = item['author']
            bvid = item['bvid']
            mid = item['mid']
            follower, proxy = self.get_follower(mid, proxy)
            video_view, proxy = self.get_view(bvid, proxy)
            view = video_view['view']
            danmu = video_view['danmu']
            reply = video_view['reply']
            like = video_view['like']
            coin = video_view['coin']
            favorite = video_view['favorite']
            share = video_view['share']
            rank = video_view['rank']
            tag = item['tag']
            con = [pubdate, title, author, bvid, mid, follower,view,danmu,reply,like,coin,favorite,share,rank, tag]
            content.append(con)
            print(con)
        print(content)
        self.save(content)
        return proxy

    def write_header(self):
        header = ['日期', '标题', '作者', 'BV', 'mid', '粉丝', '播放', '弹幕', '评论', '点赞','硬币','收藏','转发','排名','标签']
        with open('fun_video.csv', 'a', encoding='gb18030', newline='')as f:
            write = csv.writer(f)
            write.writerow(header)

    def save(self, content):  # append rows to the CSV
        with open('fun_video.csv', 'a', encoding='gb18030', newline='')as file:
            write = csv.writer(file)
            write.writerows(content)

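    def run(self):
        # self.write_header()  # uncomment on the first run to write the header row once
        proxy = self.get_ip()
        # 168 appears to be a resume point; a fresh run would start at page 1.
        # range() stops before self.page, so that last page is not requested.
        for i in range(168, self.page):
            url = self.url.format(i)
            response, proxy = self.get_requests(url, proxy)
            result = json.loads(response.text)
            proxy = self.get_parse(result, proxy)
            print('Finished page {}'.format(i))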


if __name__ == '__main__':
    video = video_data()
    video.run()

Data Analysis

The code below was run in a Notebook.
First, import the libraries we need.

import numpy as np  # multidimensional arrays
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS  # word cloud generation
from PIL import Image  # PIL (Python Imaging Library) for image handling
import jieba  # Chinese word segmentation

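Data Preprocessing

Read the data

# Note: the scraper above writes fun_video.csv (gb18030) with the uploader column named 作者,
# while this notebook reads a UTF-8 file whose uploader column is UP; the file was presumably
# renamed and re-saved in between.
video_data = pd.read_csv(r'D:\video_data.csv', encoding='utf-8')

View the first five rows

video_data.head()  # first five rows

[Figure: first five rows]

Get an overview of the data

video_data.info()  # column types and non-null counts

[Figure: data overview]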

Preprocess the data: replace the None values with 0 and convert the numeric columns to integers.

# Columns: 播放 plays, 弹幕 danmaku, 评论 comments, 点赞 likes, 硬币 coins, 收藏 favorites, 转发 shares, 排名 rank
count_cols = ['播放', '弹幕', '评论', '点赞', '硬币', '收藏', '转发', '排名']
for col in count_cols:
    # The scraper stored missing stats as the string 'None'; replace them with 0, then cast to int.
    # Assigning the result back avoids relying on inplace edits of a column selection.
    video_data[col] = video_data[col].replace('None', 0).astype('int64')

Check the column types after preprocessing

video_data.info()

[Figure: column types after conversion]

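Preprocess the titles, keeping only Chinese characters

# regex=True is required: recent pandas versions treat the pattern as a literal string by default
video_data['标题'] = video_data['标题'].str.replace(r'[^\u4e00-\u9fa5]', '', regex=True)  # keep only Chinese characters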
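To see what that character class does, here is a small sketch applying the same pattern to a single string with the standard re module. The sample title is made up, and keep_chinese is a hypothetical helper name. The range \u4e00-\u9fa5 covers the common CJK ideographs only, so Latin letters, digits, spaces, and both ASCII and full-width punctuation all drop out.

```python
import re

NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]")  # anything outside the common CJK range


def keep_chinese(text):
    """Delete every character that is not a common Chinese ideograph."""
    return NON_CHINESE.sub("", text)


print(keep_chinese("【歪嘴】龙王 Return! 2020"))  # 歪嘴龙王
```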

Split the titles into individual words

video_data['标题'] = video_data['标题'].fillna(' ')  # replace missing titles with a space
video_data['标题'] = video_data['标题'].apply(lambda x: ' '.join(jieba.cut(x)))  # segment each title into words
video_data['标题'].head()

[Figure: titles after segmentation]

Process the tags the same way

# Process the tags (标签) the same way
video_data['标签'] = video_data['标签'].str.replace(',', '')
video_data['标签'] = video_data['标签'].fillna(' ')
video_data['标签'] = video_data['标签'].apply(lambda x: ' '.join(jieba.cut(x)))
video_data['标签'].head()

[Figure: tags after segmentation]

Convert the publish time to a standard datetime

video_data.日期 = pd.to_datetime(video_data.日期.str.findall(r'\d{4}.+').str.get(0))  # extract the timestamp and parse it
video_data['weekday'] = video_data.日期.dt.weekday  # day of the week (Monday is 0)
video_data['hour'] = video_data.日期.dt.hour  # hour of the day
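The same extraction can be followed on a single value with the standard library. This sketch assumes the raw field looks like 'YYYY-MM-DD HH:MM:SS', which the post does not show explicitly. The sample timestamp is made up, and parse_pubdate is a hypothetical name.

```python
import re
from datetime import datetime


def parse_pubdate(raw):
    """Pull the timestamp out of a raw field and return (weekday, hour)."""
    stamp = re.search(r"\d{4}.+", raw).group(0)  # same pattern as in the pandas code
    moment = datetime.fromisoformat(stamp)       # assumes 'YYYY-MM-DD HH:MM:SS'
    return moment.weekday(), moment.hour         # Monday is 0, Sunday is 6


print(parse_pubdate("2020-08-01 18:30:12"))  # (5, 18)
```

August 1, 2020 was a Saturday, hence weekday 5.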

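Define a rounding helper

# Used for the triple, danmaku, and comment rates.
# For a plain float it rounds half up by nudging a trailing 5 to 6 before rounding.
# Note: given a pandas Series (as in the calls below), the isinstance check is False
# and the call falls through to the built-in round().
def new_round(_float, _len):
    if isinstance(_float, float):
        if str(_float)[::-1].find('.') <= _len:  # already has no more than _len decimals
            return(_float)
        if str(_float)[-1] == '5':
            return(round(float(str(_float)[:-1]+'6'), _len))
        else:
            return(round(_float, _len))
    else:
        return(round(_float, _len))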
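The helper above works around the fact that Python's built-in round() sends ties to the nearest even digit. A different route to the same goal, shown here only as an alternative sketch and not as the author's method, is the standard decimal module with ROUND_HALF_UP. The name round_half_up is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, ndigits=0):
    """Round ties away from zero at ndigits decimals and return a float."""
    quantum = Decimal(1).scaleb(-ndigits)  # 1, 0.1, 0.01, ...
    # str(value) keeps the digits as printed, avoiding binary-float noise
    return float(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))


print(round(2.5), round_half_up(2.5))            # 2 3.0
print(round(0.125, 2), round_half_up(0.125, 2))  # 0.12 0.13
```

This handles one number at a time; applying it to a whole column would need an element-wise call.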

Compute the triple and other rates (each as a percentage of plays)

video_data['点赞率']=new_round(video_data.点赞/video_data.播放*100,0)
video_data['硬币率']=new_round(video_data.硬币/video_data.播放*100,0)
video_data['收藏率']=new_round(video_data.收藏/video_data.播放*100,0)
video_data['转发率']=new_round(video_data.转发/video_data.播放*100,0)
video_data['弹幕率']=new_round(video_data.弹幕/video_data.播放*100,0)
video_data['评论率']=new_round(video_data.评论/video_data.播放*100,1)

View the processed data

video_data.head()

[Figure: the processed data]

Data Analysis

Count the uploaders (UP主)

print('There are {} uploaders in total:'.format(len(video_data['UP'].unique())))  # unique() drops duplicates
video_data['UP'].unique()

[Figure: number of uploaders]

Count the videos in each play-count range

# Count the videos in each play-count range
data_length = len(video_data)
data_length_rate0 = len(video_data[video_data['播放']<10000])
data_length_rate1 = len(video_data[(video_data['播放']>=10000) & (video_data['播放']<100000)])
data_length_rate2 = len(video_data[(video_data['播放']>=100000) & (video_data['播放']<500000)])
data_length_rate3 = len(video_data[(video_data['播放']>=500000) & (video_data['播放']<1000000)])
data_length_rate4 = len(video_data[video_data['播放']>=1000000])  # >= so that exactly 1,000,000 plays is counted

Show the result

video_rate = [data_length_rate0,data_length_rate1,data_length_rate2,data_length_rate3,data_length_rate4]
data_view = ['[0,9999]','[10000,99999]','[100000,499999]','[500000,999999]','[1000000,....]']
video_rate
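The five filters amount to dropping each play count into one of five half-open intervals. As a sketch of the same idea with the standard library, bisect can find the interval for each value in one pass. The sample play counts are made up, and bucket_counts, EDGES, and LABELS are hypothetical names.

```python
from bisect import bisect_right
from collections import Counter

EDGES = [10_000, 100_000, 500_000, 1_000_000]  # lower bounds of intervals 1 to 4
LABELS = ['[0,9999]', '[10000,99999]', '[100000,499999]',
          '[500000,999999]', '[1000000,....]']


def bucket_counts(plays):
    """Count how many play counts fall in each interval, in label order."""
    counts = Counter(LABELS[bisect_right(EDGES, p)] for p in plays)
    return [counts[label] for label in LABELS]


sample = [0, 9_999, 10_000, 250_000, 999_999, 1_000_000, 3_000_000]  # made-up values
print(bucket_counts(sample))  # [2, 1, 1, 1, 2]
```

Because every value lands in exactly one interval, the counts always add up to the number of videos.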

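[Figure: counts per range]

Plot it

# Draw a pie chart
plt.rcParams['font.sans-serif']=['SimHei'] # a font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart of the counts; percentages shown to two decimals
plt.legend(
           data_view,
           fontsize=12,
           title="Range",
           loc="center left",
           bbox_to_anchor=(1, 0.9))
plt.title("Share of videos by play count")
plt.show()

[Figure: pie chart of all ranges]

Show only the videos with at least 10,000 plays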

video_rate = [data_length_rate1,data_length_rate2,data_length_rate3,data_length_rate4]
data_view = ['[10000,99999]','[100000,499999]','[500000,999999]','[1000000,....]']
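fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart of the counts; percentages shown to two decimals
plt.legend(
           data_view,
           fontsize=12,
           title="Range",
           loc="center left",
           bbox_to_anchor=(1, 0.9))
plt.title("Share of videos by play count")
plt.show()

[Figure: pie chart for videos with at least 10,000 plays]

Count the uploaders behind the 20 most-played videos

# Uploaders of the 20 most-played videos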
top_20=video_data.sort_values(by=['播放'],ascending=False)[:20]
top_20['UP'].value_counts()

[Figure: uploader counts among the top 20]

Details of the top 20

# Details of the top 20: uploader, plays, followers
top_20[['UP','播放','粉丝']]

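[Figure: basic data for the top 20 videos]

Group by uploader and rank each uploader's total plays for August

# Group by uploader and rank each uploader's total plays for August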
print(video_data.groupby('UP')['播放'].sum().sort_values(ascending=False)[:20])
top_1 = video_data[video_data['UP']=='大霓奈']
print(top_1['UP'].value_counts())
top_1 = video_data[video_data['UP']=='陈师姬']
print(top_1['UP'].value_counts())
top_1 = video_data[video_data['UP']=='17岁反派里的持枪Boy']
print(top_1['UP'].value_counts())

[Figure: total plays per uploader]

Rank uploaders by total danmaku count

# Rank uploaders by total danmaku count
video_data.groupby('UP')['弹幕'].sum().sort_values(ascending=False)[:20]

[Figure: danmaku totals]

Comment totals

# Rank uploaders by total comment count
video_data.groupby('UP')['评论'].sum().sort_values(ascending=False)[:20]

[Figure: comment totals]

Videos per uploader

# Count the videos each uploader posted in August
video_data['UP'].value_counts()[:20]

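[Figure: videos per uploader]

Count the videos published in each time slot across the week

# Count the videos published in each hour, split by weekday
fig1, ax1=plt.subplots(figsize=(14,4))
df=video_data.groupby(['hour', 'weekday']).count()['mid'].unstack()
df.plot(ax=ax1, style='-.')
plt.show()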
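The groupby followed by unstack produces a grid with one row per hour and one column per weekday. A stdlib sketch of the same tally, using made-up publish times and a hypothetical function name, looks like this:

```python
from collections import Counter
from datetime import datetime


def hour_weekday_table(timestamps):
    """Return a 24x7 grid where table[hour][weekday] is the number of videos."""
    counts = Counter((t.hour, t.weekday()) for t in timestamps)
    return [[counts[(hour, day)] for day in range(7)] for hour in range(24)]


sample = [datetime(2020, 8, 1, 18, 5),   # Saturday, 18:05
          datetime(2020, 8, 1, 18, 40),  # Saturday, 18:40
          datetime(2020, 8, 3, 12, 0)]   # Monday, 12:00
table = hour_weekday_table(sample)
print(table[18][5], table[12][0])  # 2 1
```

Each column of the grid corresponds to one line in the plot above.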

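[Figure: number of videos by hour and weekday]

# Total plays by publish hour and weekday
fig2,ax2=plt.subplots(figsize=(14,4))
df=video_data.groupby(['hour','weekday'])['播放'].sum().unstack()  # select the column first so only plays are summed
df.plot(ax=ax2,style='-.')
plt.show()

[Figure: total plays by hour and weekday]

# Number of videos above 10,000 plays, by publish hour and weekday
view_1 = video_data[video_data['播放']>10000]
fig2,ax2=plt.subplots(figsize=(14,4))
df=view_1.groupby(['hour','weekday'])['mid'].count().unstack()  # count videos; summing mid would add up user IDs
df.plot(ax=ax2,style='-.')
plt.show()

[Figure: number of videos above 10,000 plays by hour and weekday]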

view_2 = video_data[video_data['播放']>100000]
view_3 = video_data[video_data['播放']>1000000]

Show the hot words as word clouds

matplotlib.rcParams['font.sans-serif'] = ['KaiTi']  # font for Chinese text in plots
matplotlib.rcParams['font.serif'] = ['KaiTi']
with open("D:/stopwords.txt", encoding='utf-8') as infile:
    stopwords = {line.strip() for line in infile}  # one stopword per line, whitespace stripped

def ciyun(texts, mid='all'):  # optionally restrict to one uploader by mid
    if mid == 'all':
        text = ' '.join(texts)
    else:
        # Look up mid by the rows actually present in texts, so filtered subsets stay aligned
        text = ' '.join(texts[video_data.loc[texts.index, 'mid'] == mid])

    wc = WordCloud(font_path="msyh.ttc", background_color='white', max_words=100, stopwords=stopwords, max_font_size=80, random_state=42, margin=3)  # word cloud settings
    wc.generate(text)  # build the word cloud
    plt.imshow(wc, interpolation="bilinear")  # draw it
    plt.axis("off")  # hide the axes
ciyun(video_data['标题'])
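A word cloud is a picture of word frequencies with the stopwords left out. The sketch below shows roughly the tally behind it, using only the standard library. It is not how the wordcloud package is implemented internally; the titles and stopwords are made up, and top_words is a hypothetical name.

```python
from collections import Counter


def top_words(segmented_titles, stopwords, n=3):
    """Count space-separated tokens, skip stopwords, and return the n most common."""
    counts = Counter(
        word
        for title in segmented_titles
        for word in title.split()
        if word not in stopwords
    )
    return counts.most_common(n)


titles = ["歪嘴 龙王 的 逆袭", "龙王 回归", "我 的 女朋友"]  # made-up, already segmented
result = top_words(titles, stopwords={"的", "我"}, n=2)
print(result[0])  # ('龙王', 2)
```

Printing the counts next to the image is a quick way to check which words dominate.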

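Hot words for videos above 10,000 plays

ciyun(view_1['标题'])

Hot words for videos above 100,000 plays

ciyun(view_2['标题'])

Hot words for videos above 1,000,000 plays

ciyun(view_3['标题'])

Hot words in the tags, found the same way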

ciyun(video_data['标签'])

ciyun(view_1['标签'])

ciyun(view_2['标签'])

ciyun(view_3['标签'])

Count the videos whose titles contain 老师 ("teacher") and their total plays

# Count the videos whose titles contain 老师 and sum their plays
video_teacher = video_data[video_data['标题'].str.contains("老师")]
teacher = [len(video_teacher),video_teacher['播放'].sum()]
teacher

[3022, 16610756]

video_bro = video_data[video_data['标题'].str.contains("兄弟")]
brother = [len(video_bro),video_bro['播放'].sum()]
brother

[1897, 25270292]

video_girlfriend = video_data[video_data['标题'].str.contains("女朋友")]
girlfriend = [len(video_girlfriend),video_girlfriend['播放'].sum()]
girlfriend

[830, 28265224]

# Videos whose titles contain both 女朋友 and 兄弟
fun = video_girlfriend[video_girlfriend['标题'].str.contains("兄弟")].drop_duplicates()
print(len(fun))
print(fun)

[Figure: videos whose titles contain both 女朋友 and 兄弟]

video_yidan = video_data[video_data['标题'].str.contains("一旦")]
yidan = [len(video_yidan),video_yidan['播放'].sum()]
yidan

[89, 28302099]

video_wubei = video_data[video_data['标题'].str.contains("吾辈")]
wubei = [len(video_wubei),video_wubei['播放'].sum()]
wubei

[318, 35563900]

video_waizui = video_data[video_data['标题'].str.contains("歪嘴")]
waizui = [len(video_waizui),video_waizui['播放'].sum()]
waizui

[1810, 70787655]
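Each of these cells computes the same pair: how many titles contain the keyword, and the sum of their plays. Here is a stdlib sketch of that pair, with made-up rows and a hypothetical function name. One caveat worth keeping in mind: the titles were segmented and rejoined with spaces earlier, so a keyword that jieba happened to split into pieces would no longer match as a substring.

```python
def keyword_stats(rows, keyword):
    """rows is an iterable of (title, plays); return [video_count, total_plays]."""
    matched = [plays for title, plays in rows if keyword in title]
    return [len(matched), sum(matched)]


rows = [("歪嘴 龙王 回归", 1200), ("我 的 女朋友", 300), ("歪嘴 战神", 800)]  # made-up
print(keyword_stats(rows, "歪嘴"))  # [2, 2000]
```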

Pie chart of the number of videos whose titles contain each hot word

video_rate = [teacher[0],brother[0],girlfriend[0],yidan[0],wubei[0],waizui[0]]
data_view = ['老师','兄弟','女朋友','一旦','吾辈','歪嘴']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart of video counts; percentages shown to two decimals
plt.legend(
           data_view,
           fontsize=12,
           title="Hot word",
           loc="center left",
           bbox_to_anchor=(1, 0.9))
plt.title("Share of video count")
plt.show()

Pie chart of total plays for videos whose titles contain each hot word

video_rate = [teacher[1],brother[1],girlfriend[1],yidan[1],wubei[1],waizui[1]]
data_view = ['老师','兄弟','女朋友','一旦','吾辈','歪嘴']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart of total plays; percentages shown to two decimals
plt.legend(
           data_view,
           fontsize=12,
           title="Hot word",
           loc="center left",
           bbox_to_anchor=(1, 0.9))
plt.title("Share of total plays")
plt.show()

The same for the tags

video_bijian = video_data[video_data['标签'].str.contains("必剪")]
bijian = [len(video_bijian),video_bijian['播放'].sum()]
video_fun = video_data[video_data['标签'].str.contains("恶作剧")]
fun = [len(video_fun),video_fun['播放'].sum()]
video_tc = video_data[video_data['标签'].str.contains("吐槽")]
tc = [len(video_tc),video_tc['播放'].sum()]
video_beauty = video_data[video_data['标签'].str.contains("美女")]
beauty = [len(video_beauty),video_beauty['播放'].sum()]
video_wezy = video_data[video_data['标签'].str.contains("万恶之源")]
wezy = [len(video_wezy),video_wezy['播放'].sum()]
video_show = video_data[video_data['标签'].str.contains("表演")]
show = [len(video_show),video_show['播放'].sum()]
video_tuwei = video_data[video_data['标签'].str.contains("土味")]
tuwei = [len(video_tuwei),video_tuwei['播放'].sum()]


video_rate = [bijian[0],fun[0],tc[0],beauty[0],wezy[0],show[0],tuwei[0]]
data_view = ['必剪','恶作剧','吐槽','美女','万恶之源','表演','土味']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart of video counts; percentages shown to two decimals
plt.legend(
           data_view,
           fontsize=12,
           title="Tag",
           loc="center left",
           bbox_to_anchor=(1, 0.9))
plt.title("Share of video count")
plt.show()

video_rate = [bijian[1],fun[1],tc[1],beauty[1],wezy[1],show[1],tuwei[1]]
data_view = ['必剪','恶作剧','吐槽','美女','万恶之源','表演','土味']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart of total plays; percentages shown to two decimals
plt.legend(
           data_view,
           fontsize=12,
           title="Tag",
           loc="center left",
           bbox_to_anchor=(1, 0.9))
plt.title("Share of total plays")
plt.show()

Rank the videos above 10,000 plays by their triple and other rates

# Rank the videos above 10,000 plays by each rate
video_rate = video_data[video_data['播放']>10000]
like_20=video_rate.sort_values(by=['点赞率'],ascending=False)[:20]
coin_20=video_rate.sort_values(by=['硬币率'],ascending=False)[:20]
sc_20=video_rate.sort_values(by=['收藏率'],ascending=False)[:20]
share_20=video_rate.sort_values(by=['转发率'],ascending=False)[:20]
danmu_20=video_rate.sort_values(by=['弹幕率'],ascending=False)[:20]
command_20=video_rate.sort_values(by=['评论率'],ascending=False)[:20]


like_20[['标题','播放','UP','点赞','点赞率']]

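[Figure: top 20 by like rate]

coin_20[['标题','播放','UP','硬币','硬币率']]

[Figure: top 20 by coin rate]

sc_20[['标题','播放','UP','收藏','收藏率']]

[Figure: top 20 by favorite rate]

share_20[['标题','播放','UP','转发','转发率']]

[Figure: top 20 by share rate]

danmu_20[['标题','播放','UP','弹幕','弹幕率']]

[Figure: top 20 by danmaku rate]

command_20[['标题','播放','UP','评论','评论率']]

[Figure: top 20 by comment rate]

Conclusions

In the Life/Funny section, most videos stay below 10,000 plays: 93.86% of them.
There are three routes to high play counts: follower count, video quality, and video quantity.
Racking up plays by uploading in bulk every month is entirely feasible. Of the two uploaders with the highest total plays, one posted 154 videos and the other 528.
Danmaku and comment totals favor uploaders with many followers, whose fans are more engaged.
The uploader who posted the most in August was 老年人诱捕大队队长, with 6,932 videos.
Videos are mostly published between 10:00 and 24:00, and that window also has the highest total plays.
August's hot words were mainly 龙王 ("Dragon King"), holiday-related terms, Bilibili campaigns, and the uploaders associated with them.
Videos carrying Bilibili campaign hot words generally have low play counts; hot words tied to uploaders and to the month's memes pay off best.
The triple, danmaku, share, and comment rates have little effect on a video's play count.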

Last modification：September 16, 2020

如果觉得我的文章对你有用，请随意赞赏

哔哩哔哩八月生活搞笑区热度视频数据分析

南岛鹋 • 2020 年 09 月 16 日

<blockquote>分析哔哩哔哩生活搞笑区的热度视频信息，分析月度视频的热词，三联等数据对视频播放量的影响。</blockquote><h2>数据爬取</h2><h3>确定目标</h3>因为想要一个量大的数据集，因此没有考虑热榜排名，因为所有区加起来也才一千左右。全部视频信息的话技术不行，然后就盯上了分区榜。<img src="https://www.ndmiao.cn/usr/uploads/2020/09/784618101.png" alt="分区榜" title="分区榜" style="">从这个榜单可以选择时间段，可以根据每个月的视频热度排名等信息，来分析月度热点，哪些视频更加容易火，以及各种因素对视频播放量的影响。虽然只是一个小分区月度热度排名，并不包含全部视频，但是数据量也是极大的。下图可以看到接近有23万条数据。<img src="https://www.ndmiao.cn/usr/uploads/2020/09/2927177774.png" alt="数据量" title="数据量" style=""><h3>网站分析</h3>这里存在一个难点，就是虽然浏览器上是可以查看网页源码，并且包含了视频的相关信息，但是用requests请求之后的网页源码却并没有相关的信息。因此前两个版本，我采用了selenium库的方法来获取信息，但是这个方法有一个缺点，速度慢（因为要跟浏览器一样加载整个页面信息）、信息少（只有标题、作者、视频简介、以及视频页和个人主页网址），很麻烦。于是这次我换成了API调用的方法。<img src="https://www.ndmiao.cn/usr/uploads/2020/09/122847043.png" alt="页面分析" title="页面分析" style="">我们选择一个具体的数字来查找，可以发现搜索出来一个search的接口。<img src="https://www.ndmiao.cn/usr/uploads/2020/09/2667775604.png" alt="页面分析" title="页面分析" style="">点进去之后，可以发现里面的result共有20条数据，刚好对应着每页20个视频。<img src="https://www.ndmiao.cn/usr/uploads/2020/09/788882558.png" alt="数据分析" title="数据分析" style="">可以看到里面包含了作者、标题、标签、播放等一系列数据。接口为<code><a class="no-external-link" href="https://s.search.bilibili.com/cate/search?callback=main_ver=v3&search_type=video&view_type=hot_rank&order=click&copy_right=-1&cate_id=138&page=1&pagesize=20&jsonp=jsonp&time_from=20200801&time_to=20200831" target="_blank">https://s.search.bilibili.com/cate/search?callback=main_ver=v3&search_type=video&view_type=hot_rank&order=click&copy_right=-1&cate_id=138&page=1&pagesize=20&jsonp=jsonp&time_from=20200801&time_to=20200831</a> </code>,view_type为排行类型，page为页面数，pagesize为页面最大的视频数，上限好像是100。最后面就是时间了。但是我还需要三连数据以及UP主的粉丝量。同理分析得到三连的API接口：<code><a class="no-external-link" href="https://api.bilibili.com/x/web-interface/archive/stat?aid=371876135" target="_blank">https://api.bilibili.com/x/web-interface/archive/stat?aid=371876135</a> </code>其中aid由BV转换。粉丝数为<code><a class="no-external-link" href="https://api.bilibili.com/x/relation/stat?vmid=32172331" 
target="_blank">https://api.bilibili.com/x/relation/stat?vmid=32172331</a> </code>mid可以在第一个接口处获取。<h3>IP池</h3>这时候虽然已经可以开始爬取了，但是如果数据量稍微有一点大，访问稍微有点频繁，就会导致IP被屏蔽。这时候我们就需要用到代理IP，免费的代理IP虽然也有，而且GITHUB上也有专门的项目来建立代理IP池。但是免费的终究很麻烦，于是我选择了日租独享的IP。<code><a class="no-external-link" href="http://www.xdaili.cn/" target="_blank">http://www.xdaili.cn/</a> </code><h3>代码</h3><pre><code># coding: utf-8
# Author：南岛鹋 
# Blog: www.ndmiao.cn
# Date ：2020/8/25 10:29
# Tool ：PyCharm

import requests
import csv
import json
import random
import time

class video_data:
    def __init__(self):
        self.url = 'https://s.search.bilibili.com/cate/search?main_ver=v3&amp;search_type=video&amp;view_type=hot_rank&amp;order=click&amp;copy_right=-1&amp;cate_id=138&amp;page={}&amp;pagesize=20&amp;jsonp=jsonp&amp;time_from=20200801&amp;time_to=20200831'
        self.page = 11507
        self.alphabet = 'fZodR9XQDSUm21yCkr6zBqiveYah8bt4xsWpHnJE7jL5VG3guMTKNPAwcF'

def dec(self, x):  # BV号转换成AV号
        r = 0
        for i, v in enumerate([11, 10, 3, 8, 4, 6]):
            r += self.alphabet.find(x[v]) * 58 ** i
        return (r - 0x2_0840_07c0) ^ 0x0a93_b324

def random_headers(self, path): # 随机读取一个头信息
        with open(path, 'r') as f:
            data = f.readlines()
            f.close()

reg = []
        for i in data:
            k = eval(i)  # 将字符串转化为字典形式
            reg.append(k)
        header = random.choice(reg)
        return header

def get_ip(self): # 代理IP获取
        print('切换IP中.......')
        url = '代理IP的地址'
        ip = requests.get(url).text
        if ip in ['{&quot;ERRORCODE&quot;:&quot;10055&quot;,&quot;RESULT&quot;:&quot;提取太频繁,请按规定频率提取!&quot;}', '{&quot;ERRORCODE&quot;:&quot;10098&quot;,&quot;RESULT&quot;:&quot;可用机器数量不足&quot;}']: # 出现频繁或者机器不足，睡眠14秒
            time.sleep(14)
            ip = requests.get(url).text
            print(ip)
        else:
            print(ip)
        proxies = {
            'https': 'http://' + ip,
            'http': 'http://' + ip
        } # 设置https和http可以按需选择
        return proxies

def get_requests(self, url, proxy): # 请求的函数
        headers = self.random_headers('headers.txt')
        # 将头信息和IP写入，用try来减少意外对程序的影响
        try: 
            response = requests.get(url, timeout=3, headers=headers, proxies=proxy)
        except requests.exceptions.RequestException as e:
            print(e)
            proxy = self.get_ip()
            try:
                response = requests.get(url, timeout=3, headers=headers, proxies=proxy)
            except requests.exceptions.RequestException as e:
                print(e)
                print('原始IP')
                response = requests.get(url, timeout=3, headers=headers)
        return response, proxy

def get_follower(self, mid, proxy): # 获取粉丝数
        url = 'https://api.bilibili.com/x/relation/stat?vmid=' + str(mid)
        r, proxy = self.get_requests(url, proxy)
        result = json.loads(r.text) # 用json来解析文本
        # 按照需求获取需要的数据，因为粉丝数是必定存在的，所以失败了需要多次尝试获取。
        try:
            follower = result['data']['follower']
        except:
            follower,proxy = self.get_follower(mid, proxy)
        return follower, proxy

def get_view(self, BV, proxy): # 获取三连和播放
        aid = self.dec(BV)
        url = 'https://api.bilibili.com/x/web-interface/archive/stat?aid=' + str(aid)
        r, proxy = self.get_requests(url, proxy)
        result = json.loads(r.text)
        view = {}# 因为视频虽然在排行榜，但是很可能已经删除，所以没有数据为None
        try:
            view['view'] = result['data']['view']
            view['danmu'] = result['data']['danmaku']
            view['reply'] = result['data']['reply']
            view['like'] = result['data']['like']
            view['coin'] = result['data']['coin']
            view['favorite'] = result['data']['favorite']
            view['share'] = result['data']['share']
            view['rank'] = result['data']['his_rank']
        except:
            view['view'] = 'None'
            view['danmu'] = 'None'
            view['reply'] = 'None'
            view['like'] = 'None'
            view['coin'] = 'None'
            view['favorite'] = 'None'
            view['share'] = 'None'
            view['rank'] = 'None'
        return view, proxy

def get_parse(self, result, proxy): # 整合数据
        content = []
        items = result['result']
        for item in items:
            pubdate = item['pubdate']
            title = item['title']
            author = item['author']
            bvid = item['bvid']
            mid = item['mid']
            follower, proxy = self.get_follower(mid, proxy)
            video_view, proxy = self.get_view(bvid, proxy)
            view = video_view['view']
            danmu = video_view['danmu']
            reply = video_view['reply']
            like = video_view['like']
            coin = video_view['coin']
            favorite = video_view['favorite']
            share = video_view['share']
            rank = video_view['rank']
            tag = item['tag']
            con = [pubdate, title, author, bvid, mid, follower,view,danmu,reply,like,coin,favorite,share,rank, tag]
            content.append(con)
            print(con)
        print(content)
        self.save(content)
        return proxy

def write_header(self):
        header = ['日期', '标题', '作者', 'BV', 'mid', '粉丝', '播放', '弹幕', '评论', '点赞','硬币','收藏','转发','排名','标签']
        with open('fun_video.csv', 'a', encoding='gb18030', newline='')as f:
            write = csv.writer(f)
            write.writerow(header)

def save(self,content):# 存入csv
        with open('fun_video.csv', 'a', encoding='gb18030', newline='')as file:
            write = csv.writer(file)
            write.writerows(content)

def run(self):
        #self.write_header()
        proxy = self.get_ip()
        for i in range(168, self.page):
            url = self.url.format(i)
            response, proxy = self.get_requests(url, proxy)
            result = json.loads(response.text)
            proxy = self.get_parse(result, proxy)
            print('第{}页爬取完毕'.format(i))

if __name__ == '__main__':
 video = video_data()
 video.run()
</code></pre><h2>数据分析</h2>以下代码在Notebook上运行 首先我们需要导入自己需要用到的库<pre><code>import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS #导入模块worldcloud
from PIL import Image #导入模块PIL(Python Imaging Library)图像处理库
import numpy as np #导入模块numpy，多维数组
import matplotlib
import jieba
</code></pre><h3>数据预处理</h3>读取数据<pre><code>data=open(r'D:\video_data.csv',encoding='utf-8')
video_data = pd.read_csv(data)</code></pre>查看数据前五行<pre><code>video_data.head()#查看前五行</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1794480476.png" alt="前五行数据" title="前五行数据" style="">浏览数据的大概信息<pre><code>video_data.info()#视频数据的信息
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/316159067.png" alt="数据信息" title="数据信息" style="">对数据进行预处理，将None值换成0，将数字数据的格式换成int<pre><code>video_data['播放'].replace('None', 0,inplace = True)#将数据中标记为None的数据替换成0，方便数据处理
video_data['弹幕'].replace('None', 0,inplace = True)
video_data['评论'].replace('None', 0,inplace = True)
video_data['点赞'].replace('None', 0,inplace = True)
video_data['硬币'].replace('None', 0,inplace = True)
video_data['收藏'].replace('None', 0,inplace = True)
video_data['转发'].replace('None', 0,inplace = True)
video_data['排名'].replace('None', 0,inplace = True)
video_data['播放'] = video_data['播放'].astype(&quot;int64&quot;)#将数字的格式转换成int格式，用于数据处理
video_data['弹幕'] = video_data['弹幕'].astype(&quot;int&quot;)
video_data['评论'] = video_data['评论'].astype(&quot;int&quot;)
video_data['点赞'] = video_data['点赞'].astype(&quot;int&quot;)
video_data['硬币'] = video_data['硬币'].astype(&quot;int&quot;)
video_data['收藏'] = video_data['收藏'].astype(&quot;int&quot;)
video_data['转发'] = video_data['转发'].astype(&quot;int&quot;)
video_data['排名'] = video_data['排名'].astype(&quot;int&quot;)</code></pre>查看预处理之后的数据格式<pre><code>video_data.info()</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/2331605893.png" alt="转换后" title="转换后" style="">对标题进行预处理，只保留中文字符<pre><code>video_data['标题'] = video_data['标题'].str.replace(r'[^\u4e00-\u9fa5]','')#只保留中文
</code></pre>将标题分割成一个个短词<pre><code>video_data['标题'].fillna(' ',inplace=True) #将空值替换成空格
video_data['标题'] = video_data['标题'].apply(lambda x:' '.join(jieba.cut(x)))#将句子分割成一个个词语
video_data['标题'].head()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/114151688.png" alt="处理后的结果" title="处理后的结果" style="">同理处理标签<pre><code>#同理处理标签
video_data['标签'] = video_data['标签'].str.replace(',','')
video_data['标签'].fillna(' ',inplace=True)
video_data['标签'] = video_data['标签'].apply(lambda x:' '.join(jieba.cut(x)))
video_data['标签'].head()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/492638963.png" alt="标签取词结果" title="标签取词结果" style="">将时间信息转化成标准格式<pre><code>video_data.日期 = pd.to_datetime(video_data.日期.str.findall(r'\d{4}.+').str.get(0)) #将时间进行解析，转化为标准格式
video_data['weekday'] = video_data.日期.dt.weekday #获取星期几
video_data['hour'] = video_data.日期.dt.hour #获取小时
</code></pre>设置一个四舍五入代码<pre><code>#用于计算三连、弹幕、评论率
def new_round(_float, _len):
 if isinstance(_float, float):
 if str(_float)[::-1].find('.') &lt;= _len:
 return(_float)
 if str(_float)[-1] == '5':
 return(round(float(str(_float)[:-1]+'6'), _len))
 else:
 return(round(_float, _len))
 else:
 return(round(_float, _len))
</code></pre>计算三连等比率<pre><code>video_data['点赞率']=new_round(video_data.点赞/video_data.播放*100,0)
video_data['硬币率']=new_round(video_data.硬币/video_data.播放*100,0)
video_data['收藏率']=new_round(video_data.收藏/video_data.播放*100,0)
video_data['转发率']=new_round(video_data.转发/video_data.播放*100,0)
video_data['弹幕率']=new_round(video_data.弹幕/video_data.播放*100,0)
video_data['评论率']=new_round(video_data.评论/video_data.播放*100,1)
</code></pre>查看处理后的数据<pre><code>video_data.head()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/595134780.png" alt="处理后的数据" title="处理后的数据" style=""><h3>数据分析</h3>查看一共有几位UP主<pre><code>print('共有{}位UP，分别是'.format(len(video_data['UP'].unique())))#unique是将重复的去除
video_data['UP'].unique()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/4264157629.png" alt="UP主的数量" title="UP主的数量" style="">统计每个播放量区间的视频数量<pre><code># 计算每个播放量区间的视频数量
data_length = len(video_data)
data_length_rate0 = len(video_data[video_data['播放']&lt;10000])
data_length_rate1 = len(video_data[(video_data['播放']&gt;=10000) &amp; (video_data['播放']&lt;100000)])
data_length_rate2 = len(video_data[(video_data['播放']&gt;=100000) &amp; (video_data['播放']&lt;500000)])
data_length_rate3 = len(video_data[(video_data['播放']&gt;=500000) &amp; (video_data['播放']&lt;1000000)])
data_length_rate4 = len(video_data[video_data['播放']&gt;1000000])</code></pre>结果展示<pre><code>video_rate = [data_length_rate0,data_length_rate1,data_length_rate2,data_length_rate3,data_length_rate4]
data_view = ['[0,9999]','[10000,99999]','[100000,499999]','[500000,999999]','[1000000,....]']
video_rate
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/252521241.png" alt="结果查看" title="结果查看" style="">画图展示<pre><code># 画出饼图
plt.rcParams['font.sans-serif']=['SimHei'] # 中文不乱码
plt.rcParams['axes.unicode_minus'] = False
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') #画饼图（数据，数据对应的标签，百分数保留两位小数点）
plt.legend(
 data_view,
 fontsize=12,
 title=&quot;区间&quot;,
 loc=&quot;center left&quot;,
 bbox_to_anchor=(1, 0.9))
plt.title(&quot;播放量占比&quot;)
plt.show() 
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1750410179.png" alt="扇形图展示" title="扇形图展示" style="">Show only the videos with at least 10,000 plays<pre><code>video_rate = [data_length_rate1,data_length_rate2,data_length_rate3,data_length_rate4]
data_view = ['[10000,99999]','[100000,499999]','[500000,999999]','[1000000,....]']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart of the counts, percentages shown to two decimals
plt.legend(
 data_view,
 fontsize=12,
 title=&quot;区间&quot;,
 loc=&quot;center left&quot;,
 bbox_to_anchor=(1, 0.9))
plt.title(&quot;播放量占比&quot;)
plt.show() 
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/3312130753.png" alt="一万以上的展示图" title="一万以上的展示图" style="">Count how often each uploader appears among the 20 most-played videos<pre><code># Count uploader appearances among the 20 most-played videos
top_20=video_data.sort_values(by=['播放'],ascending=False)[:20]
top_20['UP'].value_counts()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/2744268262.png" alt="数量展示" title="数量展示" style="">Details of the top 20<pre><code># Details of the 20 most-played videos
top_20[['UP','播放','粉丝']]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/4235206928.png" alt="前20的视频基本数据" title="前20的视频基本数据" style="">Group by uploader and rank each uploader's total plays in August<pre><code># Group by uploader and rank by total plays in August
print(video_data.groupby('UP')['播放'].sum().sort_values(ascending=False)[:20])
top_1 = video_data[video_data['UP']=='大霓奈']
print(top_1['UP'].value_counts())
top_1 = video_data[video_data['UP']=='陈师姬']
print(top_1['UP'].value_counts())
top_1 = video_data[video_data['UP']=='17岁反派里的持枪Boy']
print(top_1['UP'].value_counts())
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/405809024.png" alt="播放量和展示" title="播放量和展示" style="">Rank uploaders by total danmaku count<pre><code># Rank uploaders by total danmaku count
video_data.groupby('UP')['弹幕'].sum().sort_values(ascending=False)[:20]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/4235494730.png" alt="弹幕数展示" title="弹幕数展示" style="">Rank uploaders by total comment count<pre><code># Rank uploaders by total comment count
video_data.groupby('UP')['评论'].sum().sort_values(ascending=False)[:20]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/2808414200.png" alt="评论数展示" title="评论数展示" style="">Number of videos per uploader<pre><code># Count the videos each uploader posted in August
video_data['UP'].value_counts()[:20]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/3380581576.png" alt="视频数量展示" title="视频数量展示" style="">Count the videos posted in each hour, by day of the week<pre><code># Count the videos posted in each hour, one line per weekday
fig1, ax1=plt.subplots(figsize=(14,4))
df=video_data.groupby(['hour', 'weekday']).count()['mid'].unstack()
df.plot(ax=ax1, style='-.')
plt.show()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/4179627992.png" alt="视频数量随时间分布图" title="视频数量随时间分布图" style=""><pre><code># Total plays of the videos posted in each hour, one line per weekday
fig2,ax2=plt.subplots(figsize=(14,4))
df=video_data.groupby(['hour','weekday'])['播放'].sum().unstack()  # select the column first so only plays are summed
df.plot(ax=ax2,style='-.')
plt.show()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/3903313407.png" alt="视频播放量总和随时间分布" title="视频播放量总和随时间分布" style=""><pre><code># Count the videos with more than 10,000 plays posted in each hour, one line per weekday
view_1 = video_data[video_data['播放']&gt;10000]
fig2,ax2=plt.subplots(figsize=(14,4))
df=view_1.groupby(['hour','weekday'])['mid'].count().unstack()  # count rows; summing mid would add up user IDs
df.plot(ax=ax2,style='-.')
plt.show()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1458496303.png" alt="大于一万播放量的视频数量随时间分布图" title="大于一万播放量的视频数量随时间分布图" style=""><pre><code>view_2 = video_data[video_data['播放']&gt;100000]
view_3 = video_data[video_data['播放']&gt;1000000]
</code></pre>Show the hot words as a word cloud<pre><code>matplotlib.rcParams['font.sans-serif'] = ['KaiTi']  # font for Chinese text in plots
matplotlib.rcParams['font.serif'] = ['KaiTi']  # font for Chinese text in plots
with open(&quot;D:/stopwords.txt&quot;,encoding='utf-8') as infile:
    stopwords_lst = infile.readlines()
STOPWORDS = [x.strip() for x in stopwords_lst]  # strip surrounding whitespace
stopwords = set(STOPWORDS)  # stop-word set

def ciyun(texts,mid='all'):  # optionally restrict to one uploader by mid
    if mid == 'all':
        text = ' '.join(texts)
    else:
        text = ' '.join(texts[video_data['mid']==mid])

    wc = WordCloud(font_path=&quot;msyh.ttc&quot;, background_color='white', max_words=100, stopwords=stopwords, max_font_size=80, random_state=42, margin=3)  # word cloud settings
    wc.generate(text)  # build the word cloud
    plt.imshow(wc,interpolation=&quot;bilinear&quot;)  # draw it
    plt.axis(&quot;off&quot;)  # hide the axes

ciyun(video_data['标题'])</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1822230669.png" alt="热词" title="热词" style="">Hot words in the titles of videos with more than 10,000 plays<pre><code>ciyun(view_1['标题'])
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1267505390.png" alt="热词" title="热词" style="">Hot words in the titles of videos with more than 100,000 plays<pre><code>ciyun(view_2['标题'])
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1035080606.png" alt="热词" title="热词" style="">Hot words in the titles of videos with more than 1,000,000 plays<pre><code>ciyun(view_3['标题'])
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1430339521.png" alt="热词" title="热词" style="">The same for the tags<pre><code>ciyun(video_data['标签'])
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/3222928523.png" alt="热词" title="热词" style=""><pre><code>ciyun(view_1['标签'])
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/31297922.png" alt="热词" title="热词" style=""><pre><code>ciyun(view_2['标签'])
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/4169206333.png" alt="热词" title="热词" style=""><pre><code>ciyun(view_3['标签'])
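</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1576138282.png" alt="热词" title="热词" style="">Count the videos whose title contains 老师 ("teacher") and sum their plays<pre><code># Videos with 老师 (&quot;teacher&quot;) in the title: count and total plays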
video_teacher = video_data[video_data['标题'].str.contains(&quot;老师&quot;)]
teacher = [len(video_teacher),video_teacher['播放'].sum()]
teacher
</code></pre>[3022, 16610756]<pre><code>video_bro = video_data[video_data['标题'].str.contains(&quot;兄弟&quot;)]
brother = [len(video_bro),video_bro['播放'].sum()]
brother
</code></pre>[1897, 25270292]<pre><code>video_girlfriend = video_data[video_data['标题'].str.contains(&quot;女朋友&quot;)]
girlfriend = [len(video_girlfriend),video_girlfriend['播放'].sum()]
girlfriend
</code></pre>[830, 28265224]<pre><code># Videos whose title contains both 女朋友 (&quot;girlfriend&quot;) and 兄弟 (&quot;brother&quot;)
fun = video_girlfriend[video_girlfriend['标题'].str.contains(&quot;兄弟&quot;)].drop_duplicates()
print(len(fun))
print(fun)
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/812799961.png" alt="女朋友的标题中包含兄弟的视频信息" title="女朋友的标题中包含兄弟的视频信息" style=""><pre><code>video_yidan = video_data[video_data['标题'].str.contains(&quot;一旦&quot;)]
yidan = [len(video_yidan),video_yidan['播放'].sum()]
yidan
</code></pre>[89, 28302099]<pre><code>video_wubei = video_data[video_data['标题'].str.contains(&quot;吾辈&quot;)]
wubei = [len(video_wubei),video_wubei['播放'].sum()]
wubei
</code></pre>[318, 35563900]<pre><code>video_waizui = video_data[video_data['标题'].str.contains(&quot;歪嘴&quot;)]
waizui = [len(video_waizui),video_waizui['播放'].sum()]
waizui
</code></pre>[1810, 70787655] Pie chart of the number of videos whose title contains each of these hot words<pre><code>video_rate = [teacher[0],brother[0],girlfriend[0],yidan[0],wubei[0],waizui[0]]
data_view = ['老师','兄弟','女朋友','一旦','吾辈','歪嘴']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart of the counts, percentages shown to two decimals
plt.legend(
 data_view,
 fontsize=12,
 title=&quot;热词&quot;,
 loc=&quot;center left&quot;,
 bbox_to_anchor=(1, 0.9))
plt.title(&quot;数量占比&quot;)
plt.show() 
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/103150957.png" alt="饼图" title="饼图" style="">Pie chart of the total plays of videos whose title contains each hot word<pre><code>video_rate = [teacher[1],brother[1],girlfriend[1],yidan[1],wubei[1],waizui[1]]
data_view = ['老师','兄弟','女朋友','一旦','吾辈','歪嘴']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart of the play totals, percentages shown to two decimals
plt.legend(
 data_view,
 fontsize=12,
 title=&quot;热词&quot;,
 loc=&quot;center left&quot;,
 bbox_to_anchor=(1, 0.9))
plt.title(&quot;播放量占比&quot;)
plt.show()
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/2312876709.png" alt="饼图" title="饼图" style="">Apply the same analysis to the tags<pre><code>video_bijian = video_data[video_data['标签'].str.contains(&quot;必剪&quot;)]
bijian = [len(video_bijian),video_bijian['播放'].sum()]
video_fun = video_data[video_data['标签'].str.contains(&quot;恶作剧&quot;)]
fun = [len(video_fun),video_fun['播放'].sum()]
video_tc = video_data[video_data['标签'].str.contains(&quot;吐槽&quot;)]
tc = [len(video_tc),video_tc['播放'].sum()]
video_beauty = video_data[video_data['标签'].str.contains(&quot;美女&quot;)]
beauty = [len(video_beauty),video_beauty['播放'].sum()]
video_wezy = video_data[video_data['标签'].str.contains(&quot;万恶之源&quot;)]
wezy = [len(video_wezy),video_wezy['播放'].sum()]
video_show = video_data[video_data['标签'].str.contains(&quot;表演&quot;)]
show = [len(video_show),video_show['播放'].sum()]
video_tuwei = video_data[video_data['标签'].str.contains(&quot;土味&quot;)]
tuwei = [len(video_tuwei),video_tuwei['播放'].sum()]

video_rate = [bijian[0],fun[0],tc[0],beauty[0],wezy[0],show[0],tuwei[0]]
data_view = ['必剪','恶作剧','吐槽','美女','万恶之源','表演','土味']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart of the counts, percentages shown to two decimals
plt.legend(
 data_view,
 fontsize=12,
 title=&quot;标签&quot;,
 loc=&quot;center left&quot;,
 bbox_to_anchor=(1, 0.9))
plt.title(&quot;数量占比&quot;)
plt.show() 
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/3914545782.png" alt="饼图" title="饼图" style=""><pre><code>video_rate = [bijian[1],fun[1],tc[1],beauty[1],wezy[1],show[1],tuwei[1]]
data_view = ['必剪','恶作剧','吐槽','美女','万恶之源','表演','土味']
fig = plt.figure(figsize=(10,15))
plt.pie(video_rate,autopct='%1.2f%%') # pie chart of the play totals, percentages shown to two decimals
plt.legend(
 data_view,
 fontsize=12,
 title=&quot;标签&quot;,
 loc=&quot;center left&quot;,
 bbox_to_anchor=(1, 0.9))
plt.title(&quot;播放量占比&quot;)
plt.show() 
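</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1246096207.png" alt="饼图" title="饼图" style="">Rank the videos with more than 10,000 plays by each engagement rate<pre><code># Rank videos with more than 10,000 plays by like, coin, favorite, share, danmaku, and comment rate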
video_rate = video_data[video_data['播放']&gt;10000]
like_20=video_rate.sort_values(by=['点赞率'],ascending=False)[:20]
coin_20=video_rate.sort_values(by=['硬币率'],ascending=False)[:20]
sc_20=video_rate.sort_values(by=['收藏率'],ascending=False)[:20]
share_20=video_rate.sort_values(by=['转发率'],ascending=False)[:20]
danmu_20=video_rate.sort_values(by=['弹幕率'],ascending=False)[:20]
command_20=video_rate.sort_values(by=['评论率'],ascending=False)[:20]

like_20[['标题','播放','UP','点赞','点赞率']]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/3390342330.png" alt="点赞率前20" title="点赞率前20" style=""><pre><code>coin_20[['标题','播放','UP','硬币','硬币率']]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/2783574875.png" alt="硬币率前20" title="硬币率前20" style=""><pre><code>sc_20[['标题','播放','UP','收藏','收藏率']]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/4152588626.png" alt="收藏率前20" title="收藏率前20" style=""><pre><code>share_20[['标题','播放','UP','转发','转发率']]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/2861268934.png" alt="分享率前20" title="分享率前20" style=""><pre><code>danmu_20[['标题','播放','UP','弹幕','弹幕率']]
</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/1514254705.png" alt="弹幕率前20" title="弹幕率前20" style=""><pre><code>command_20[['标题','播放','UP','评论','评论率']]
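</code></pre><img src="https://www.ndmiao.cn/usr/uploads/2020/09/2018548542.png" alt="评论率前20" title="评论率前20" style=""><h2>Conclusions</h2><ul><li>In the 生活搞笑 (life/comedy) section, most videos have fewer than 10,000 plays: 93.86% of the total.</li><li>There are three routes to a high play count: follower count, video quality, and video volume.</li><li>Uploading in bulk every month can indeed yield high total plays. Of the two uploaders with the highest total plays, one posted 154 videos and the other 528.</li><li>Danmaku and comment totals favor uploaders with many followers, whose audiences are more loyal.</li><li>The uploader who posted the most videos in August was 老年人诱捕大队队长, with 6932 videos.</li><li>Videos are mostly posted between 10:00 and 24:00, and total plays are also highest in that window.</li><li>August's hot words were mainly 龙王 ("Dragon King"), holiday-related terms, bilibili campaigns, and names tied to particular uploaders.</li><li>Videos carrying bilibili campaign hot words generally have low play counts; hot words tied to uploaders and to the month's memes bring the best returns in plays.</li><li>The triple-action rates (like, coin, favorite), danmaku rate, share rate, and comment rate have little effect on a video's play count.</li></ul><h2>Resources</h2><button class=" btn m-b-xs btn-success btn-addon" onclick="window.open('https://github.com/ndmiao/bilibili-data/','_blank')">Code and data</button>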
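
The last conclusion, that the engagement rates have little effect on play count, is read off the top-20 tables rather than measured. One possible way to check it is a Spearman rank correlation between each rate column and 播放. The sketch below is not part of the original analysis: the helper name `rate_play_rank_corr` and the tiny `demo` frame are hypothetical, and only the column names come from the article's `video_data`. Spearman is computed here as the Pearson correlation of average ranks, which avoids a SciPy dependency.

```python
import pandas as pd

# Column names follow the article's DataFrame; the helper itself is hypothetical.
RATE_COLS = ["点赞率", "硬币率", "收藏率", "转发率", "弹幕率", "评论率"]


def rate_play_rank_corr(df, play_col="播放", rate_cols=RATE_COLS):
    """Spearman rank correlation of each rate column with play count.

    Ranks use the default 'average' tie handling, and the Pearson
    correlation of those ranks equals the Spearman coefficient.
    """
    play_rank = df[play_col].rank()
    return pd.Series(
        {col: df[col].rank().corr(play_rank) for col in rate_cols},
        name="spearman_vs_play",
    )


# Tiny synthetic frame, only to show the call shape (not real data).
demo = pd.DataFrame({
    "播放": [100, 500, 2000, 90000],
    "点赞率": [1, 2, 3, 4],  # rises with play count
    "硬币率": [4, 3, 2, 1],  # falls with play count
})
demo_corr = rate_play_rank_corr(demo, rate_cols=["点赞率", "硬币率"])
print(demo_corr)
```

On the real data the call would be `rate_play_rank_corr(video_data)`. Coefficients close to 0 would support the conclusion; the result was not computed for this article.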

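The play-count intervals that the analysis counts one filter at a time can also be produced in a single step with `pandas.cut`. This is a sketch under the assumption that every interval is closed on the left, matching labels such as `[10000,99999]`; the function name `play_bucket_counts` and the demo values are made up for illustration.

```python
import pandas as pd


def play_bucket_counts(plays):
    """Count videos per play-count interval; each interval is closed on the left."""
    edges = [0, 10_000, 100_000, 500_000, 1_000_000, float("inf")]
    labels = ["[0,9999]", "[10000,99999]", "[100000,499999]",
              "[500000,999999]", "[1000000,...]"]
    # right=False makes each bin [low, high), so a video with exactly
    # 1,000,000 plays lands in the last bucket instead of being dropped.
    buckets = pd.cut(plays, bins=edges, labels=labels, right=False)
    return buckets.value_counts(sort=False)


# Synthetic values chosen to sit on the interval boundaries (not real data).
demo_counts = play_bucket_counts(pd.Series([0, 9_999, 10_000, 1_000_000, 2_500_000]))
print(demo_counts)
```

With the article's data the call would be `play_bucket_counts(video_data['播放'])`, and the resulting counts could be passed straight to `plt.pie`.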