Python Web Scraping for Beginners: Bilibili Video Data

2023-03-22


This post uses the Python requests and Beautiful Soup libraries to scrape video data from the front page of Bilibili's technology section.

import requests, time
from bs4 import BeautifulSoup      # Beautiful Soup parses the HTML we download

headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.44'
}
response = requests.get("https://www.bilibili.com/v/tech/", headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

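First we import the libraries and define headers so the request looks like it comes from a browser. requests fetches the page, and BS (BeautifulSoup) parses it with html.parser.
Make good use of the browser's element inspector: right-click the page and choose Inspect to open the developer tools, click the element-picker button (the arrow icon in the developer tools toolbar), then click a video title. The HTML pane jumps to the matching node. Every title is wrapped in an h3 whose class is bili-video-card__info--tit, so we look them up with the findAll method: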
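The request above has no timeout and does not check the HTTP status, so a failed or blocked request would be parsed as if it were the real page. A minimal sketch of a slightly more defensive fetch follows. The helper name fetch_html, the shortened User-Agent, and the 10-second timeout are all assumptions made for this example, not part of the original script:

```python
import requests
from bs4 import BeautifulSoup

# Shortened User-Agent for the example; any browser-like value can be used.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def fetch_html(url, session=None, timeout=10):
    """Download url and return the parsed soup, raising on HTTP errors."""
    # Accept an optional requests.Session (or any object with a get method).
    client = session if session is not None else requests
    response = client.get(url, headers=HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

It would be called as fetch_html("https://www.bilibili.com/v/tech/"). Passing a requests.Session as session reuses one connection across the many per-video requests made later.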

allmovie = soup.findAll("h3", attrs={"class": "bili-video-card__info--tit"})

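We loop over allmovie, and for each movie (video) we look up its a element. The result, video, is a list with exactly one element.
So we index it with 0 to get the tag, read its text (the title) with .string, and read the tag's attribute (the name=value pairs inside the angle brackets) with .get('href'). The code for this part: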

for movie in allmovie:
    video = movie.findAll("a", attrs={"target": "_blank"})
    title = video[0].string       # title
    href = video[0].get('href')     # link
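To see what these two lookups return without hitting the site, the same steps can be run on a hand-written fragment. The HTML below is a made-up sample that only mimics the class name seen in the inspector, and extract_cards is a hypothetical helper, so treat this as a sketch rather than a description of the real page. It uses find_all, the current Beautiful Soup name for findAll, and skips any h3 that has no matching link instead of failing on video[0]:

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the real page may be structured differently.
SAMPLE = """
<div>
  <h3 class="bili-video-card__info--tit" title="Demo A">
    <a href="//www.bilibili.com/video/BV1demoA" target="_blank">Demo A</a>
  </h3>
  <h3 class="bili-video-card__info--tit" title="Demo B">
    <a href="//www.bilibili.com/video/BV1demoB" target="_blank">Demo B</a>
  </h3>
  <h3 class="other">Not a video title</h3>
</div>
"""


def extract_cards(html):
    """Return a list of (title, href) pairs for every video title card."""
    soup = BeautifulSoup(html, "html.parser")
    cards = []
    for h3 in soup.find_all("h3", attrs={"class": "bili-video-card__info--tit"}):
        link = h3.find("a", attrs={"target": "_blank"})
        if link is None:
            # No link inside this card, so there is nothing to follow.
            continue
        cards.append((link.get_text(strip=True), link.get("href")))
    return cards


for title, href in extract_cards(SAMPLE):
    print(title, href)
```

If the real page returns an empty list, the first thing to check is whether response.text actually contains the h3 elements seen in the inspector.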

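With the href link in hand we can open each video's own page, again by requesting it with requests. The href is written without a scheme, which is why the code prefixes it with https: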

    res = requests.get(f"https:{href}", headers=headers)
    sp = BeautifulSoup(res.text, "html.parser")
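Prefixing https: works as long as every href starts with //. If a link were already absolute, or relative to the site root, the prefix would produce a broken URL. One alternative, sketched here under the assumption that links should resolve against the listing page's URL, is urljoin from the standard library. The helper name to_absolute is made up for this example:

```python
from urllib.parse import urljoin

# The listing page the links were found on.
BASE_URL = "https://www.bilibili.com/v/tech/"


def to_absolute(href, base=BASE_URL):
    """Resolve a scheme-less, relative, or absolute href against base."""
    return urljoin(base, href)


print(to_absolute("//www.bilibili.com/video/BV1demo"))
print(to_absolute("/video/BV1demo"))
```

Both calls print the same https URL, and an href that is already absolute passes through unchanged.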

We use the element inspector again to locate the data.
The numbers sit in span elements with class info-text, so findAll fetches them:

    datas = sp.findAll("span", attrs={"class": "info-text"})  # video stats
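One caveat with .string: Beautiful Soup returns None from it when a tag has more than one child, for example an icon element next to the number. Whether Bilibili's spans ever look like that is not something this post checks, so the fragment below is a made-up sample. It only shows that get_text(strip=True) still returns the text in that case:

```python
from bs4 import BeautifulSoup

# Made-up sample: the second span holds an icon tag plus the number.
SAMPLE = (
    '<span class="info-text">1024</span>'
    '<span class="info-text"><i class="icon"></i> 256 </span>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")
spans = soup.find_all("span", attrs={"class": "info-text"})

# .string only works when the tag has exactly one child.
plain = [span.string for span in spans]
# get_text gathers all nested text and strips the surrounding whitespace.
robust = [span.get_text(strip=True) for span in spans]

print(plain)
print(robust)
```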

Printing datas (for a single video) shows a list.
It holds, in order, the like, coin, favorite, and share counts. From here it is simple; read them one by one:

    like = datas[0].string       # likes
    coin = datas[1].string       # coins
    collect = datas[2].string    # favorites
    transmit = datas[3].string   # shares
    print(f"Video: {title}  Likes: {like}  Coins: {coin}  Favorites: {collect}  Shares: {transmit}  Link: https:{href}")
    time.sleep(0.2)
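The four values are still strings. To sort or sum them they have to become numbers, and Bilibili abbreviates large counts with the suffixes 万 (ten thousand) and 亿 (hundred million). A minimal sketch of a converter, assuming that format. parse_count is a hypothetical helper, and it returns None for anything it cannot read, such as a missing value or a text label:

```python
from decimal import Decimal, InvalidOperation

# Suffix multipliers assumed for abbreviated counts.
UNITS = {"万": 10_000, "亿": 100_000_000}


def parse_count(text):
    """Convert a count such as '345' or '1.2万' to an int, or None."""
    if text is None:
        return None
    text = str(text).strip()
    multiplier = 1
    if text and text[-1] in UNITS:
        multiplier = UNITS[text[-1]]
        text = text[:-1]
    try:
        # Decimal avoids float rounding on values such as 1.2 * 10000.
        return int(Decimal(text) * multiplier)
    except (InvalidOperation, ValueError, OverflowError):
        return None


print(parse_count("1.2万"))
print(parse_count("345"))
```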

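Each result is printed in a fixed format, with a short pause between requests so the site is not hit too often, and the lines appear one after another.
Full code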

import requests, time
from bs4 import BeautifulSoup        # Beautiful Soup parses the HTML we download

headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.44'
}
response = requests.get("https://www.bilibili.com/v/tech/", headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
allmovie = soup.findAll("h3", attrs={"class": "bili-video-card__info--tit"})
for movie in allmovie:
    video = movie.findAll("a", attrs={"target": "_blank"})
    title = video[0].string       # title
    href = video[0].get('href')     # link
    res = requests.get(f"https:{href}", headers=headers)
    sp = BeautifulSoup(res.text, "html.parser")
    datas = sp.findAll("span", attrs={"class": "info-text"})            # video stats
    like = datas[0].string       # likes
    coin = datas[1].string       # coins
    collect = datas[2].string    # favorites
    transmit = datas[3].string   # shares
    print(f"Video: {title}  Likes: {like}  Coins: {coin}  Favorites: {collect}  Shares: {transmit}  Link: https:{href}")
    time.sleep(0.2)