Lesson 2: First Steps with requests

requests is a Python library for making HTTP requests, and one of the most commonly used tools in web scraping.

(1) A basic GET request

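import requests

response = requests.get('http://www.douban.com/')
# Equivalent call through the generic request() function:
# response = requests.request('get', 'http://www.douban.com/')
print(response)               # e.g. <Response [418]> when the site rejects the default User-Agent
# Besides the Response object itself, these attributes expose more details of the HTTP response:
print(response.status_code)   # HTTP status code, e.g. 200 for success, 404 for page not found
print(response.text)          # body decoded to a str (Unicode); typically used for a page's HTML
print(response.content)       # body as raw bytes; use it for images, files and other binary data
print(response.url)           # final URL of the request, which can differ from the original after redirects
print(response.cookies)       # cookies set by the response
print(response.headers)       # response headers
print(response.request)       # the request object that produced this response
print(response.elapsed)       # time between sending the request and receiving the response; useful for performance checks
print(response.history)       # list of the earlier responses if the request was redirected
print(response.encoding)      # encoding requests uses to decode response.text, taken from the response headers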

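.text and .content are used differently.

response.content is bytes. Even when the body is HTML, response.content still returns bytes, and printing bytes directly shows the b prefix, with non-ASCII characters shown as escape sequences. For example:

b'<!DOCTYPE html>\n<html>\n<head>\n<title>Example</title>\n</head>\n<body>\nHello!</body>\n</html>'

Calling .decode() turns the bytes into a str, which prints as readable text.

Decoding depends on a character encoding (utf-8 by default for .decode()), which tells the program how to map the raw bytes to characters.
If you know the response is text (HTML, JSON and so on), you can use response.text directly; it decodes the body for you using response.encoding.

This is simpler than calling .content.decode() by hand, and gives the same result when the page is UTF-8 and requests has detected it as such.

Tip

If the server returns content in an encoding other than UTF-8, a bare .decode() is not enough; pass the encoding explicitly (or set response.encoding before reading response.text).

For example, the server might return ISO-8859-1 encoded content.

Decode it manually with the matching encoding: print(response.content.decode('ISO-8859-1'))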
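The difference between bytes and decoded text can be tried without any network access. The following is a minimal sketch in which utf8_bytes and gbk_bytes are made-up stand-ins for response.content from a UTF-8 page and a GBK page; they are assumptions for illustration, not real server responses.

```python
# Hypothetical page content, encoded two ways to imitate response.content.
page = '<title>你好</title>'
utf8_bytes = page.encode('utf-8')
gbk_bytes = page.encode('gbk')

print(utf8_bytes)                # b'<title>\xe4\xbd\xa0\xe5\xa5\xbd</title>'
print(utf8_bytes.decode())       # .decode() defaults to utf-8, so this prints the readable text
print(gbk_bytes.decode('gbk'))   # non-UTF-8 bytes need their encoding named explicitly

# Decoding GBK bytes with the default utf-8 codec fails instead of guessing.
try:
    gbk_bytes.decode()
    default_decode_failed = False
except UnicodeDecodeError:
    default_decode_failed = True
print(default_decode_failed)
```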

(2) Disguising the request with headers

Pass the headers argument to change the request headers sent with the request.

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0'}
response = requests.get('http://douban.com/', headers=headers)
print(response)
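To see what the headers argument actually changes, the request can be prepared without being sent. This is a rough sketch that assumes the requests library is installed; browser_ua is just the User-Agent string from the example above, and nothing here touches the network.

```python
import requests

browser_ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0')

# The User-Agent that requests sends when no headers are given.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # starts with "python-requests/"

# Prepare (but do not send) a request through a Session, which merges its
# default headers with the custom ones.
session = requests.Session()
request = requests.Request('GET', 'http://douban.com/', headers={'User-Agent': browser_ua})
prepared = session.prepare_request(request)
print(prepared.headers['User-Agent'])  # the browser string replaces the default
```

A site that filters on User-Agent can tell the default value apart from a browser, which is one plausible reason for a rejected request when no headers are set.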

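(3) The params argument and fixing garbled text

import requests

headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0'}

# The parameter name 'wd' is specific to Baidu, whose search URL looks like
# https://www.baidu.com/s?wd=chen; on Douban the parameter is q=chen instead.
kw = {'wd': "陈"}
response = requests.get('http://www.baidu.com/s', params=kw, headers=headers)
# print(response.text)
# The text can contain garbled characters such as ä½¿ç¨ç¾åº¦åå¿è¯, because the page's encoding differs from the one requests used to decode it
print(response.encoding)   # ISO-8859-1: the encoding requests used for response.text
# The page source contains <meta http-equiv="content-type" content="text/html;charset=utf-8">, so the page is UTF-8.
# Set the encoding on the response by hand:
response.encoding = "utf-8"
print(response.text)    # the garbled characters are gone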
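Both ideas in this section can be reproduced offline with the standard library. The sketch below first shows the kind of percent-encoded query string that ends up in the URL when a params dict is passed (urlencode is used here as a stand-in; response.url shows what requests really sent), then recreates the garbled text by decoding UTF-8 bytes as ISO-8859-1. The sample string is an assumption chosen for illustration.

```python
from urllib.parse import urlencode

# Part 1: a params dict becomes a percent-encoded query string.
query = urlencode({'wd': '陈'})
full_url = 'http://www.baidu.com/s?' + query
print(full_url)  # http://www.baidu.com/s?wd=%E9%99%88

# Part 2: garbled text appears when UTF-8 bytes are decoded with the wrong codec.
original = '使用百度前必读'            # hypothetical page text
raw = original.encode('utf-8')       # what the server sends
garbled = raw.decode('ISO-8859-1')   # what the wrong encoding produces
print(len(original), len(garbled))   # 7 21: each 3-byte character turns into 3 wrong ones

# Re-encoding with the wrong codec and decoding as UTF-8 recovers the text,
# which is in effect what setting response.encoding = "utf-8" achieves.
repaired = garbled.encode('ISO-8859-1').decode('utf-8')
print(repaired == original)          # True
```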

(4) A basic POST request

import requests

# httpbin.org is a site built for testing HTTP requests
# https://wordpress-edu-3autumn.localprod.oc.forchange.cn/ is another practice site
url = 'http://httpbin.org/post'  # note: the path must be /post; httpbin.org or http://httpbin.org/ alone will not work here
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0'}
data = {
    'log': '1234',
    'pwd': '1234',
    'wp-submit': '登录',
    'redirect_to': 'https://wordpress-edu-3autumn.localprod.oc.forchange.cn',
    'testcookie': '1'
}  # this form data is copied from the "Payload" tab of the request in the browser's Network panel
response = requests.post(url, headers=headers, data=data)
print(response.status_code)
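What the data argument does to the request can be inspected without sending anything. A minimal sketch, assuming the requests library is installed; form is a shortened, hypothetical version of the login form above.

```python
import requests

# Hypothetical two-field form, shortened from the login form above.
form = {'log': '1234', 'pwd': '1234'}

# Build the request without sending it, so no network access is needed.
prepared = requests.Request('POST', 'http://httpbin.org/post', data=form).prepare()

print(prepared.method)                   # POST
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded
print(prepared.body)                     # log=1234&pwd=1234
```

A dict passed as data is sent as a form-encoded body, the same shape a browser produces when an HTML form is submitted.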

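(5) Serialization

Servers often return JSON, which has to be converted into a Python dict before use. Strictly speaking this direction is deserialization; turning a dict into JSON is serialization.

In the Flask framework, converting a Python dict into a JSON response is done with Flask's built-in jsonify function.

Method 1: the json library

Part of the Python standard library and usable in any context.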

import json
json_string = '{"name": "John", "age": 30}'
python_dict = json.loads(json_string)   # json.loads parses the JSON string json_string into the Python dict python_dict
print(type(python_dict))                # <class 'dict'>
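The opposite direction, from a Python dict to a JSON string, uses json.dumps. The following is a small sketch with a made-up dict; ensure_ascii=False is worth knowing when the data contains Chinese text, since the default output escapes every non-ASCII character.

```python
import json

# Hypothetical data containing a non-ASCII value.
python_dict = {"name": "陈", "age": 30}

escaped = json.dumps(python_dict)                       # default: non-ASCII is escaped
readable = json.dumps(python_dict, ensure_ascii=False)  # keep the characters as they are

print(escaped)    # {"name": "\u9648", "age": 30}
print(readable)   # {"name": "陈", "age": 30}

# Both strings parse back to the same dict.
print(json.loads(escaped) == json.loads(readable) == python_dict)  # True
```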

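Method 2: the .json() method

This method belongs to the Response object of the requests library and is used for handling HTTP responses.

.json() parses the JSON body of the server's response into a Python dict (or another structure such as a list, depending on the shape of the JSON).

import requests
response = requests.get('https://api.example.com/data')  # placeholder URL; replace it with a real JSON endpoint
data = response.json()
print(data)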
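Parsing fails with an exception when the body is not valid JSON, for example when a site answers with an HTML error page. The sketch below shows one defensive pattern using only the standard json module; parse_json_or_none is a hypothetical helper written for this illustration, and the same try/except idea can be wrapped around response.json().

```python
import json

def parse_json_or_none(body_text):
    """Return the parsed JSON value, or None if body_text is not valid JSON."""
    try:
        return json.loads(body_text)
    except json.JSONDecodeError:
        return None

as_dict = parse_json_or_none('{"name": "John", "age": 30}')
as_list = parse_json_or_none('[1, 2, 3]')
not_json = parse_json_or_none('<html><body>Error</body></html>')

print(type(as_dict))   # <class 'dict'>
print(type(as_list))   # <class 'list'>
print(not_json)        # None
```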

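(6) Calling an API

We have already covered several scraping approaches: for Baidu Netdisk we analyzed the HTML source directly, and the approach used for Kugou is known as JS reverse engineering.

In both cases the HTML is converted to text first and can then be parsed either with XPath or with re regular expressions.

Here we introduce a third scraping approach: API outer links.

Many large sites expose URL-based interfaces such as these:

NetEase: http://music.163.com/song/media/outer/url?id=1340439829

QQ: http://y.qq.com/n/yqq/song/002B2EAA3brD5b.html

Kugou: http://www.kugou.com/song/#hash=08228af3cb404e8a4e7e9871bf543ff6

Kuwo: http://www.kuwo.cn/yinyue/382425/

Xiami: http://www.xiami.com/song/2113248

Baidu: http://music.baidu.com/song/266069

Migu: http://music.migu.cn/v2/music/song/477803

Ximalaya: http://www.ximalaya.com/51701370/sound/24755731

163 Music: http://music.163.com/song/media/outer/url?id=210049

The following uses NetEase Cloud Music as an example.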

import os
import requests
from lxml import etree


file_path = r"C:\Users\28121\Desktop\coding learning\爬虫学习\爬取资源"  # local download folder; change it to a folder that exists on your machine
headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 Edg/118.0.2088.46",
    'cookie':'kg_mid=2632d7fd1ac9cee3f1396792e241152a; kg_dfid=3x7flr2ALA4A2mFgvv3Lch0V; kg_dfid_collect=d41d8cd98f00b204e9800998ecf8427e; Hm_lvt_aedee6983d4cfc62f509129360d6bb3d=1705216019,1705216125,1705219847; kg_mid_temp=2632d7fd1ac9cee3f1396792e241152a; Hm_lpvt_aedee6983d4cfc62f509129360d6bb3d=1705220424'
}

# Fetch the playlist page
url = 'https://music.163.com/playlist?id=934870683'  # playlist URL (remember to remove "/#", otherwise the request does not return the real page)
response = requests.get(url, headers=headers).text


# Parse the HTML
html = etree.HTML(response)
music_label_list = html.xpath('//a[contains(@href,"/song?")]')  # select the <a> elements whose href contains "/song?"
print(len(music_label_list))


# Some of the matched <a> elements are not real songs (they may exist only for page layout), so filter them
music_list = []
for music_label in music_label_list:
    href = music_label.xpath('./@href')[0]  # xpath always returns a list, so [0] is needed even for a single match
    music_ID = href.split('=')[1]
    # keep the link only if the ID consists of digits
    if music_ID.isdigit():
        music_list.append(music_label)

print(len(music_list))


# Display the songs
for order, music_label in enumerate(music_list):
    href = music_label.xpath('./@href')[0]
    music_ID = href.split('=')[1]

    if music_label.xpath('./text()'):
        music_name = music_label.xpath('./text()')[0]
    else:
        music_name = 'untitled'
    print(f'{order+1}---{music_name}-----{music_ID}')


# Ask which song to download, by its number in the list above
order = int(input('Enter the number of the song to download: ')) - 1
music_ID = music_list[order].xpath('./@href')[0].split('=')[1]
music_url = 'http://music.163.com/song/media/outer/url?id=' + music_ID
texts = music_list[order].xpath('./text()')
music_name = f"{texts[0] if texts else 'untitled'}.mp3"  # same fallback as above, avoids an IndexError for links without text


# Save the song
content = requests.get(music_url, headers=headers).content
final_path = os.path.join(file_path, music_name)
# print(final_path)
with open(final_path, 'wb') as file:
    file.write(content)
print()
print(f'Downloaded song {music_name}')
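Two steps of the script are fragile: splitting the href on '=' and using the song title directly as a file name. The sketch below shows one possible hardening using only the standard library. extract_song_id and safe_filename are hypothetical helpers, the sample hrefs are made up, and the set of replaced characters assumes Windows file-name rules.

```python
import re
from urllib.parse import urlparse, parse_qs

def extract_song_id(href):
    """Return the numeric id from an href like '/song?id=123', or None."""
    ids = parse_qs(urlparse(href).query).get('id', [])
    if ids and ids[0].isdigit():
        return ids[0]
    return None

def safe_filename(name, fallback='untitled'):
    """Replace characters that Windows forbids in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', name).strip()
    return cleaned or fallback

print(extract_song_id('/song?id=1340439829'))   # 1340439829
print(extract_song_id('/song?id=${x.id}'))      # None (a template placeholder, not a song)
print(safe_filename('AC/DC: Live?'))            # AC_DC_ Live_
print(safe_filename('   '))                     # untitled
```

Parsing the query string instead of splitting on '=' keeps working if the link ever gains extra parameters, and cleaning the title avoids an error from open() when a song name contains characters such as / or ?.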
