第6章异步爬虫

1. 协程的基本 原理

1.1 案例 引入

https://www.http bin.org/delay/5
服务器强制等待5秒才能返回响应

import logging
import time
import requests

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s: %(message)s')

URL = "https://www.httpbin.org/delay/5"
TOTAL_NUMBER = 10


start_time = time.time()

for _ in range(1, TOTAL_NUMBER):
    logging.info(f"scraping {URL}")
    response = requests.get(URL)

end_time = time.time()
logging.info(f"total time {end_time - start_time}")

1.2 基础知识

阻塞

阻塞状态：程序未得到所需计算资源时被挂起的状态
程序在操作上是阻塞的：程序在等待某个操作完成期间，自身无法继续干别的事情

非阻塞

程序在操作上非是阻塞的：程序在等待某操作的过程中，自身不被阻塞，可以继续干别的事情
仅当程序封装的级别可以囊括独立的子程序单元时，程序才可能存在非阻塞状态
因阻塞的存在而存在

同步

程序单元同步执行：不同程序单元为了共同完成某个任务，在执行过程中需要靠某种通信方式保持协调一致

异步

为了完成某种任务，有时不同程序单元之间无需通信协调也能完成任务

多进程

利用CPU的多核优势，在同i一时间并行执行多个任务，可以大大提高执行效率

协程

一种运行在用户态的轻量级线程
拥有自己的寄存器上下文和栈
调度切换
- 将寄存器上下文和栈保存到其他地方，等切回来时，再恢复先前保存的寄存器上下文和栈
能保留上一次调用时的状态，所有局部状态的一个特定组合，每次过程重入，就相当于进入上一次调用的状态
本质上是个单线程
实现异步操作

1.3 协程的用法

asyncio库
- event_loop：
  - 事件循环
  - 把函数注册到这个事件循环上，当满足发生条件时，就调用对应的处理方法
- coroutine：
  - 协程
  - 代指协程对象类型
  - 可以将协程对象注册到时间循环中，它会被事件循环调用
- task：
  - 任务
  - 对协程对象的进一步封装
    - 包含协程对象的各个状态
- future：
  - 将来执行或者没有执行的任务的结果
  - 和task 没有本质区别
- async关键字：
  - 定义一个方法（协程）
    - 这个方法在调用时不会立即被执行，而是会返回一个协程对象
- await关键字：
  - 挂起阻塞方法的执行

1.4 定义协程

import asyncio


async def execute(x):
    print(f"Number: {x}")

coroutine = execute(1)
print(f"Coroutine: {coroutine}")
print("After calling execute")

loop = asyncio.get_event_loop()
loop.run_until_complete(coroutine)
print("After calling loop")

把协程对象coroutine传递给run_until_complete方法的时候，实际上进行了一个操作
- 将corout ine封装成task 对象
显示声明 task

import asyncio


async def execute(x):
    print(f"Number: {x}")

coroutine = execute(1)
print(f"Coroutine: {coroutine}")
print("After calling execute")

loop = asyncio.get_event_loop()
task = loop.create_task(coroutine)
print(f"Task: {task}")

loop.run_until_complete(task)
print(f"Task: {task}")
print("After calling loop")

使用ensu re_future 定义tas k 对象

import asyncio


async def execute(x):
    print(f"Number: {x}")

coroutine = execute(1)
print(f"Coroutine: {coroutine}")
print("After calling execute")

task = asyncio.ensure_future(coroutine)
print(f"Task: {task}")

loop = asyncio.get_event_loop()
loop.run_until_complete(task)
print(f"Task: {task}")
print("After calling loop")

1.5 绑定 回调

为task对象绑定一个回调方法

import asyncio
import requests


async def request():
    url = "https://www.baidu.com/"
    status = requests.get(url)
    return status


def callback(task):
    print(f"Status: {task.result()}")


coroutine = request()
task = asyncio.ensure_future(coroutine)
task.add_done_callback(callback)
print(f"Task: {task}")

loop = asyncio.get_event_loop()
loop.run_until_complete(task)
print(f"Task: {task}")

等效于

import asyncio
import requests


async def request():
    url = "https://www.baidu.com/"
    status = requests.get(url)
    return status


coroutine = request()
task = asyncio.ensure_future(coroutine)
print(f"Task: {task}")

loop = asyncio.get_event_loop()
loop.run_until_complete(task)
print(f"Task: {task}")
print(f"Status: {task.result()}")

1.6 多任务协程

定义一个task列表，然后使用 wait方法执行

import asyncio
import requests


async def request():
    url = "https://www.baidu.com/"
    status = requests.get(url)
    return status


tasks = [asyncio.ensure_future(request()) for _ in range(5)]
print(f"Tasks: {tasks}")

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

for task in tasks:
    print(f"Task result: {task.result()}")

1.7 协程实现

单纯使用上述方法

import asyncio
import time
import requests


async def request():
    url = "https://www.httpbin.org/delay/5"
    print(f"Waiting for {url}")
    response = requests.get(url)
    print(f"Response: {response} from {url}")

start = time.time()

tasks = [asyncio.ensure_future(request()) for _ in range(5)]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print(f"Cost time: {end - start}")

没有任何变化
要实现异步处理，需要先执行挂起操作
- 将耗时等待的操作挂起，让出控制权

import asyncio
import time
import requests


async def request():
    url = "https://www.httpbin.org/delay/5"
    print(f"Waiting for {url}")
    response = await requests.get(url)
    print(f"Response: {response} from {url}")

start = time.time()

tasks = [asyncio.ensure_future(request()) for _ in range(5)]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print(f"Cost time: {end - start}")

报错
- req uests返回的Response对象不能和await一起使用
- await后面的对象必须是如下格式之一：
  - 一个原生协程对象
  - 一个由types.coroutine修饰的生成器
    - 这个生成器可以返回协程对象
  - 由一个包含__ await __方法的对象返回的一个迭代器
将请求页面的方法独立出来

import asyncio
import time
import requests


async def get(url):
    return requests.get(url)


async def request():
    url = "https://www.httpbin.org/delay/5"
    print(f"Waiting for {url}")
    response = await get(url)
    print(f"Response: {response} from {url}")

start = time.time()

tasks = [asyncio.ensure_future(request()) for _ in range(5)]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print(f"Cost time: {end - start}")

依然没有任何变化

1.8 使用aiohttp

安装

pip3 install aiohttp

使用

import asyncio
import time
import aiohttp


async def get(url):
    session = aiohttp.ClientSession()
    response = await session.get(url)
    await response.text()
    await session.close()
    return response


async def request():
    url = "https://www.httpbin.org/delay/5"
    print(f"Waiting for {url}")
    response = await get(url)
    print(f"Response: {response} from {url}")

start = time.time()

tasks = [asyncio.ensure_future(request()) for _ in range(5)]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print(f"Cost time: {end - start}")

2. aiohttp的使用

2.1 基本 介绍

asyncio：实现对TCP、UDP、SSL协议的异步操作
aiohttp：实现对HTTP请求的异步操作
aiohttp是基于asyncio的异步HTTP网络模块
- 既提供了服务端，又提供了客户端
  - 服务端
    - 可以搭建一个支持异步处理的服务器
      - 用于处理请求并返回响应
    - 类似于Django、Flask、Tornado等一些Web 服务器
  - 客户端
    - 发起请求
    - 类似于使用request发起一个HTTP请求然后获得响应
      - request发起的是同步网络请求
      - aiohttp发起的是异步网络请求

2.2 基本 实例

import aiohttp
import asyncio


async def fetch(session, url):
    # 上下文管理器，自动分配和释放资源
    async with session.get(url) as response:
        return await response.json(), response.status


async def main():
    # 上下文管理器，自动分配和释放资源
    async with aiohttp.ClientSession() as session:
        url = "https://www.httpbin.org/delay/5"
        html, status = await fetch(session, url)
        print(f"html: {html}")
        print(f"status: {status}")

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    # 较高版本的python可以不显示声明事件循环
    asyncio.run(main())

2.3 URL参数 设置

import aiohttp
import asyncio


async def main():
    params = {"name": "abc", "age": 10}
    async with aiohttp.ClientSession() as session:
        async with session.get("https://www.httpbin.org/get", params=params) as response:
            print(await response.text())

if __name__ == "__main__":
    asyncio.run(main())

2.4 其他请求类型

session.post("https://www.httpbin.org/post", data=b"data")
session.put("https://www.httpbin.org/put", data=b"data")
session.delete("https://www.httpbin.org/delete")
session.head("https://www.httpbin.org/get")
session.options("https://www.httpbin.org/get")
session.patch("https://www.httpbin.org/patch", data=b"data")

2.5 POST请求

表单 提交

import aiohttp
import asyncio


async def main():
    data = {"name": "abc", "age": 10}
    async with aiohttp.ClientSession() as session:
        async with session.post("https://www.httpbin.org/post", data=data) as response:
            print(await response.text())

if __name__ == "__main__":
    asyncio.run(main())

JSON数据 提交

import aiohttp
import asyncio


async def main():
    data = {"name": "abc", "age": 10}
    async with aiohttp.ClientSession() as session:
        async with session.post("https://www.httpbin.org/post", json=data) as response:
            print(await response.text())

if __name__ == "__main__":
    asyncio.run(main())

2.6 响应

import aiohttp
import asyncio


async def main():
    data = {"name": "abc", "age": 10}
    async with aiohttp.ClientSession() as session:
        async with session.post("https://www.httpbin.org/post", data=data) as response:
            print(f"status: {response.status}")
            print(f"headers: {response.headers}")
            print(f"body: {await response.text()}")
            print(f"bytes: {await response.read()}")
            print(f"json: {await response.json()}")


if __name__ == "__main__":
    asyncio.run(main())

2.7 超时 设置

使用ClientTimeout对象设置超时

import aiohttp
import asyncio


async def main():
    # 设置2秒的超时时间
    timeout = aiohttp.ClientTimeout(total=2)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get("https://www.httpbin.org/get") as response:
            print(f"status: {response.status}")


if __name__ == "__main__":
    asyncio.run(main())

2.8 并发 限制

使用Semeaphore来控制并发量
- 防止网站在短时间内无法响应
- 避免将目标网站爬挂掉的风险

import aiohttp
import asyncio

# 爬取的最大并发量
CONCURRENCY = 5
URL = "https://www.baidu.com/"

# 创建一个信号量对象
semaphore = asyncio.Semaphore(CONCURRENCY)
session = None


async def scrape_api():
    # 信号量可以控制进入爬取的最大协程数量
    async with semaphore:
        print(f"Scraping {URL}")
        async with session.get(URL) as response:
            await asyncio.sleep(1)
            return await response.text()


async def main():
    global session
    session = aiohttp.ClientSession()
    scrape_index_tasks = [
        asyncio.ensure_future(
            scrape_api()) for _ in range(1000)]
    await asyncio.gather(*scrape_index_tasks)


if __name__ == "__main__":
    asyncio.run(main())

3. aiohttp异步 爬取 实战

3.1 案例 介绍

目标网站：Scrape | Book
- 网站数据由JavaScript渲染获得
- 数据可以通过Ajax 接口获取
- 接口没有设置任何反爬措施和加密参数
目标：
- 使用aiohttp爬取全站的数据
- 将数据通过异步的方式保存到 MongoDB 中

3.2 准备 工作

MongoDB数据库

3.3 页面分析

列表页的Ajax请求接口的格式：https://spa5.scrape.center/api/book/?limit=18&offset={offset}
- limit：每一页包含多少书
- offest：每一页的偏移量（
Ajax接口返回的数据中有每本书的id，利用id可以进入该书的详情页
详情页的Ajax请求接口的格式：https://spa5.scrape.center/detail/{id}
- id：书本身的ID

3.4 实现思路

爬取的两个阶段：
- 异步爬取所有列表页
  - 将所有列表页的爬取任务集合在一起，并将其声明为由task组成的列表，进行异步爬取
- 拿到上一步列表页的所有内容并解析
  - 将所有图书的id信息组合为所有详情页的爬取任务集合，并将其声明为task组成的列表，进行异步爬取，同时爬取结果也以异步方式存储到MongoDB中

3.5 基本 配置

import logging


logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s: %(message)s')


INDEX_URL = "https://spa5.scrape.center/api/book/?limit=18&amp;offset={offset}"
DETAIL_URL = "https://spa5.scrape.center/detail/{id}"
PAGE_SIZE = 18
PAGE_NUMBER = 100
CONCURRENCY = 5

3.6 爬取列表页

实现

通用的爬取方法

import asyncio
import aiohttp


async def scrape_api(url):
    async with semaphore:
        try:
            logging.info(f"Scraping: {url}")
            # verify_ssl: 是否开启SSL认证
            async with session.get(url, verify_ssl=False) as response:
                return await response.json()
        except aiohttp.ClientError:
            logging.info(f"Error: {url}", exc_info=True)

爬取列表页

async def scrape_index(page):
    url = INDEX_URL.format(offset=PAGE_SIZE * (page - 1))
    return await scrape_api(url)

串联并用

import json


async def main():
    global session
    session = aiohttp.ClientSession()
    scrape_index_tasks = [
        asyncio.ensure_future(
            scrape_index(page)) for page in range(
            1, PAGE_NUMBER + 1)]
    results = await asyncio.gather(*scrape_index_tasks)
    logging.info(
        f"Results: {json.dumps(results, ensure_ascii=False, indent=2)}")

if __name__ == "__main__":
    asyncio.run(main())

合并

import json
import asyncio
import logging
import aiohttp

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s: %(message)s')


INDEX_URL = "https://spa5.scrape.center/api/book/?limit=18&amp;offset={offset}"
DETAIL_URL = "https://spa5.scrape.center/detail/{id}"
PAGE_SIZE = 18
PAGE_NUMBER = 100
CONCURRENCY = 5

semaphore = asyncio.Semaphore(CONCURRENCY)
session = None


async def scrape_api(url):
    async with semaphore:
        try:
            logging.info(f"Scraping: {url}")
            async with session.get(url, verify_ssl=False) as response:
                return await response.json()
        except aiohttp.ClientError:
            logging.info(f"Error: {url}", exc_info=True)


async def scrape_index(page):
    url = INDEX_URL.format(offset=PAGE_SIZE * (page - 1))
    return await scrape_api(url)


async def main():
    global session
    session = aiohttp.ClientSession()
    scrape_index_tasks = [
        asyncio.ensure_future(
            scrape_index(page)) for page in range(
            1, PAGE_NUMBER + 1)]
    results = await asyncio.gather(*scrape_index_tasks)
    logging.info(
        f"Results: {json.dumps(results, ensure_ascii=False, indent=2)}")


if __name__ == "__main__":
    asyncio.run(main())

3.7 爬取详情页

实现

在main方法中将详情页的ID获取出来

ids = []
for index_data in results:
    if not index_data:
        continue
    for item in index_data.get("results"):
        ids.append(item.get("id"))

爬取详情页

async def scrape_detail(id):
    url = DETAIL_URL.format(id=id)
    data = await scrape_api(url)
    logging.info(f"Saving: {data}")

main方法增加对scrape_detail方法的调用

scrape_detail_tasks = [
        asyncio.ensure_future(
            scrape_detail(id)) for id in ids]
    await asyncio.wait(scrape_detail_tasks)
    await session.close()

合并

import json
import asyncio
import logging
import aiohttp

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s: %(message)s')


INDEX_URL = "https://spa5.scrape.center/api/book/?limit=18&amp;offset={offset}"
DETAIL_URL = "https://spa5.scrape.center/api/book/{id}"
PAGE_SIZE = 18
PAGE_NUMBER = 100
CONCURRENCY = 5

semaphore = asyncio.Semaphore(CONCURRENCY)
session = None


async def scrape_api(url):
    async with semaphore:
        try:
            logging.info(f"Scraping: {url}")
            async with session.get(url, verify_ssl=False) as response:
                return await response.json()
        except aiohttp.ClientError:
            logging.info(f"Error: {url}", exc_info=True)


async def scrape_index(page):
    url = INDEX_URL.format(offset=PAGE_SIZE * (page - 1))
    return await scrape_api(url)


async def scrape_detail(id):
    url = DETAIL_URL.format(id=id)
    data = await scrape_api(url)
    logging.info(f"Saving {url}: {data}")


async def main():
    global session
    session = aiohttp.ClientSession()
    scrape_index_tasks = [
        asyncio.ensure_future(
            scrape_index(page)) for page in range(
            1, PAGE_NUMBER + 1)]
    results = await asyncio.gather(*scrape_index_tasks)
    logging.info(
        f"Results: {json.dumps(results, ensure_ascii=False, indent=2)}")

    ids = []
    for index_data in results:
        if not index_data:
            continue
        for item in index_data.get("results"):
            ids.append(item.get("id"))

    scrape_detail_tasks = [
        asyncio.ensure_future(
            scrape_detail(id)) for id in ids]
    await asyncio.wait(scrape_detail_tasks)
    await session.close()


if __name__ == "__main__":
    asyncio.run(main())

原文地址:https://blog.csdn.net/BlackOrnate/article/details/134735439

本文来自互联网用户投稿，该文观点仅代表作者本人，不代表本站立场。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。

如若转载，请注明出处：http://www.7code.cn/show_28676.html

如若内容造成侵权/违法违规/事实不符，请联系代码007邮箱：suwngjj01@126.com进行投诉反馈，一经查实，立即删除！

声明：本站所有文章，如无特殊说明或标注，均为本站原创发布。任何个人或组织，在未征得本站同意时，禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益，可联系我们进行处理。

协程异步爬取

1. 协程的基本原理

1.1 案例引入

1.2 基础知识

非阻塞

多进程

1.3 协程的用法

1.4 定义协程

1.5 绑定回调

1.6 多任务协程

1.7 协程实现

1.8 使用aiohttp

2. aiohttp的使用

2.1 基本介绍

2.2 基本实例

2.3 URL参数设置

2.4 其他请求类型

2.5 POST请求

JSON数据提交

2.6 响应

2.7 超时设置

2.8 并发限制

3. aiohttp异步爬取实战

3.1 案例介绍

3.2 准备工作