[Python] Speeding Up Web Scraping in Python: asyncio and aiohttp in Practice

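Introduction
The problem with ordinary scraping
Concurrent processing with asyncio and aiohttp
Practical example: fetching a news site concurrently
Going further: data analysis and business automation
Conclusion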

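Introduction

Web scraping is a basic technique for collecting data, but handling requests one at a time tends to mean long waits. With tens or hundreds of target pages, many people find it too slow to be practical.

This article shows how to fetch multiple pages concurrently using Python's **asyncio and aiohttp**. Because the waits for the individual responses overlap, the total run time can drop to a fraction of the sequential version. The actual speedup depends on network latency and on how many simultaneous requests the target server tolerates.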

The problem with ordinary scraping

First, let's look at conventional synchronous scraping code.

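import requests

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

for url in urls:
    # Blocks here until the response arrives; the next request cannot start earlier.
    # requests has no default timeout, so set one to avoid hanging indefinitely.
    r = requests.get(url, timeout=10)
    print(url, len(r.text))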

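With this approach, each request blocks until its response arrives, so the total time is roughly the sum of the individual response times and grows with the number of pages.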

Concurrent processing with asyncio and aiohttp

Next, here is how to do the same work asynchronously with asyncio and aiohttp.

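import asyncio
import aiohttp

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

async def fetch(session, url):
    # While this coroutine waits on the network, the event loop runs the others.
    async with session.get(url) as response:
        text = await response.text()
        print(url, len(text))

async def main():
    # Share one ClientSession so that connections are pooled and reused.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        # gather schedules every coroutine and waits until all have finished.
        await asyncio.gather(*tasks)

asyncio.run(main())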

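This code starts every request without waiting for the earlier ones to finish and handles the responses concurrently, on a single thread, as they arrive. As a result, the total run time approaches that of the slowest single request rather than the sum of all of them.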
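To make the difference measurable without touching a real server, here is a minimal sketch that swaps the network call for `asyncio.sleep`. The names `fake_fetch`, `run_sequential`, `run_concurrent`, and the `DELAY` value are assumptions made for illustration; they are not part of aiohttp.

```python
import asyncio
import time

DELAY = 0.2  # assumed per-request latency in seconds (hypothetical value)

async def fake_fetch(url):
    # Stand-in for session.get(): sleeping yields control to the event loop,
    # much as waiting for a real network response would.
    await asyncio.sleep(DELAY)
    return url

async def run_sequential(urls):
    # Await each request before starting the next one.
    start = time.perf_counter()
    for url in urls:
        await fake_fetch(url)
    return time.perf_counter() - start

async def run_concurrent(urls):
    # Schedule all requests together and wait for them as a group.
    start = time.perf_counter()
    await asyncio.gather(*(fake_fetch(url) for url in urls))
    return time.perf_counter() - start

async def main():
    urls = [f"https://example.com/page{i}" for i in range(1, 4)]
    seq = await run_sequential(urls)
    con = await run_concurrent(urls)
    print(f"sequential: {seq:.2f}s, concurrent: {con:.2f}s")
    return seq, con

if __name__ == "__main__":
    asyncio.run(main())
```

Under these assumptions the sequential run takes about three times `DELAY`, while the concurrent run takes about one `DELAY`. Real numbers will vary with the network and the server.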

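Practical example: fetching a news site concurrently

For example, if you want to fetch several of a news site's latest articles at once, you can build the URL list like this.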

urls = [f"https://news.example.com/article/{i}" for i in range(1, 21)]

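Fetching 20 articles with conventional synchronous code can take tens of seconds, whereas concurrent fetching can finish in a few seconds, depending on how quickly the site responds.

You can also combine this with BeautifulSoup to extract only the titles.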
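Sending 20 requests to one site at the same moment can put noticeable load on it, so it is common to cap how many are in flight. The following is a rough sketch of one way to do that with `asyncio.Semaphore`. The limit of 5, the helper names `fake_fetch` and `crawl`, and the simulated latency are all assumptions; in real code the `asyncio.sleep` call would be replaced by the aiohttp request.

```python
import asyncio

MAX_CONCURRENCY = 5  # hypothetical limit; tune it to what the site tolerates

async def fake_fetch(url, semaphore, stats):
    # Only MAX_CONCURRENCY coroutines can be inside this block at once.
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.05)  # simulated network latency
        stats["active"] -= 1
    return url

async def crawl(urls, limit=MAX_CONCURRENCY):
    # Create the semaphore inside the running event loop.
    semaphore = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(fake_fetch(url, semaphore, stats) for url in urls)
    )
    return results, stats["peak"]

if __name__ == "__main__":
    urls = [f"https://news.example.com/article/{i}" for i in range(1, 21)]
    results, peak = asyncio.run(crawl(urls))
    print(len(results), "pages fetched, peak concurrency:", peak)
```

The `stats` dictionary exists only to show that the cap holds; it can be dropped once the pattern is understood.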

from bs4 import BeautifulSoup

async def fetch_title(session, url):
    async with session.get(url) as response:
        text = await response.text()
        soup = BeautifulSoup(text, "html.parser")
        # soup.title is None when the page has no <title> tag, so guard against it.
        title = soup.title.get_text(strip=True) if soup.title else "(no title)"
        print(url, "->", title)
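Printing inside the coroutine is fine for a demo, but for later processing it is often handier to return each title and collect them in one place. The sketch below shows one possible shape for that, using `asyncio.gather` with `return_exceptions=True` so that a single failed page does not discard the rest. `fake_fetch_title` and `collect_titles` are hypothetical names, and the failure on the URL ending in `/3` is simulated purely for illustration.

```python
import asyncio

async def fake_fetch_title(url):
    # Stand-in for the aiohttp + BeautifulSoup step.
    await asyncio.sleep(0.01)
    if url.endswith("/3"):
        # Simulated network error for one page.
        raise ConnectionError(f"simulated failure for {url}")
    return f"Title of {url}"

async def collect_titles(urls):
    # gather returns results in the same order as the inputs;
    # with return_exceptions=True, errors come back as values instead of raising.
    results = await asyncio.gather(
        *(fake_fetch_title(url) for url in urls),
        return_exceptions=True,
    )
    titles, failures = {}, {}
    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            failures[url] = result
        else:
            titles[url] = result
    return titles, failures

if __name__ == "__main__":
    urls = [f"https://news.example.com/article/{i}" for i in range(1, 6)]
    titles, failures = asyncio.run(collect_titles(urls))
    print(len(titles), "titles,", len(failures), "failures")
```

Separating successes from failures this way makes it easy to retry only the URLs that failed.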