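Batch Scraping

Batch scrape multiple URLs

You can scrape multiple URLs at the same time in a single batch job. The method takes the starting URLs and optional parameters as arguments. The params argument lets you specify additional options for the batch scrape job, such as the output formats.

How it works

It works much like the /crawl endpoint. You can either start the batch and wait for it to finish, or start it and handle completion yourself.

batchScrape (JS) / batch_scrape (Python): starts a batch job, waits for it to complete, and returns the results.
startBatchScrape (JS) / start_batch_scrape (Python): starts a batch job and returns a job ID that you can use for polling or webhooks.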
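
The difference between the two methods can be pictured as a polling loop: the waiting variant behaves roughly like starting a job and then checking its status until it finishes. The sketch below is illustrative only and is not the SDK's actual implementation. `wait_for_batch` and the `get_status` callable are hypothetical names, and the code assumes status objects are dicts whose `status` field eventually becomes `completed` or `failed`.

```python
import time
from typing import Any, Callable, Dict


def wait_for_batch(
    get_status: Callable[[str], Dict[str, Any]],
    job_id: str,
    poll_interval: float = 2.0,
    wait_timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll get_status(job_id) until the job reaches a terminal state.

    get_status is a hypothetical stand-in for a status lookup, for example
    a thin wrapper around get_batch_scrape_status.
    """
    deadline = time.monotonic() + wait_timeout
    while True:
        status = get_status(job_id)
        # Assumed terminal states; adjust to the values your client reports.
        if status.get("status") in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"batch {job_id} not finished after {wait_timeout}s")
        sleep(poll_interval)


# Demo with a fake status source that completes on the third poll.
_responses = iter([
    {"status": "scraping", "completed": 0, "total": 2},
    {"status": "scraping", "completed": 1, "total": 2},
    {"status": "completed", "completed": 2, "total": 2},
])
result = wait_for_batch(lambda _id: next(_responses), "123-456-789", sleep=lambda s: None)
print(result["status"], result["completed"], result["total"])
```

Passing the status lookup and the sleep function as arguments keeps the loop independent of any particular client and easy to exercise without network access.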

Usage

from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

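# Option 1: start the job without waiting for it to finish
start = firecrawl.start_batch_scrape([
    "https://firecrawl.dev",
    "https://docs.firecrawl.dev",
], formats=["markdown"])  # returns an ID

# Option 2: start the job and block until it completes or wait_timeout is reached
job = firecrawl.batch_scrape([
    "https://firecrawl.dev",
    "https://docs.firecrawl.dev",
], formats=["markdown"], poll_interval=2, wait_timeout=120)

# Summary of the finished job: status, pages completed, total pages
print(job.status, job.completed, job.total)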

Response

Calling batchScrape/batch_scrape returns the full results once the batch completes.

Completed

{
  "status": "completed",
  "total": 36,
  "completed": 36,
  "creditsUsed": 36,
  "expiresAt": "2024-00-00T00:00:00.000Z",
  "next": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789?skip=26",
  "data": [
    {
      "markdown": "[Firecrawl Docs home page![light logo](https://mintlify.s3-us-west-1.amazonaws.com/firecrawl/logo/light.svg)!...",
      "html": "<!DOCTYPE html><html lang=\"en\" class=\"js-focus-visible lg:[--scroll-mt:9.5rem]\" data-js-focus-visible=\"\">...",
      "metadata": {
        "title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
        "language": "en",
        "sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3",
        "description": "Learn how to use Firecrawl, Groq Llama 3, and LangChain to build a 'Chat with your website' bot.",
        "ogLocaleAlternate": [],
        "statusCode": 200
      }
    },
    ...
  ]
}
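
The sample response includes a `next` URL. Assuming it points to the next chunk of results and is missing or empty on the final chunk (an assumption; the text here does not spell this out), a client could gather every item along the lines of the sketch below. `collect_all` and `fetch_page` are hypothetical names, `fetch_page` stands in for an authenticated GET that returns the parsed JSON body, and the demo URL is made up.

```python
from typing import Any, Callable, Dict, List


def collect_all(
    first_page: Dict[str, Any],
    fetch_page: Callable[[str], Dict[str, Any]],
    max_pages: int = 1000,
) -> List[Dict[str, Any]]:
    """Follow `next` links and return the concatenated `data` items."""
    items = list(first_page.get("data", []))
    next_url = first_page.get("next")
    pages = 1
    # max_pages is a safety cap so a malformed response cannot loop forever.
    while next_url and pages < max_pages:
        page = fetch_page(next_url)
        items.extend(page.get("data", []))
        next_url = page.get("next")
        pages += 1
    return items


# Demo with in-memory pages instead of real HTTP calls (hypothetical URL).
second_url = "https://api.example.test/batch/123?skip=2"
pages = {second_url: {"data": [{"markdown": "c"}], "next": None}}
first = {
    "status": "completed",
    "data": [{"markdown": "a"}, {"markdown": "b"}],
    "next": second_url,
}
all_items = collect_all(first, pages.__getitem__)
print(len(all_items))
```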

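Calling startBatchScrape/start_batch_scrape returns a job ID that you can track with getBatchScrapeStatus/get_batch_scrape_status, the API endpoint /batch/scrape/{id}, or webhooks. Job results remain available through the API for 24 hours after completion. After that period, you can still view your batch scrape history and results in the activity logs.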

{
  "success": true,
  "id": "123-456-789",
  "url": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789"
}
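
The 24-hour retrieval window can also be checked on the client before requesting results. The sketch below assumes that the `expiresAt` field seen in the completed response marks the end of that window, which the text does not state explicitly. `results_available` is a hypothetical helper and the timestamp is made up for illustration.

```python
from datetime import datetime, timezone
from typing import Optional


def results_available(expires_at: str, now: Optional[datetime] = None) -> bool:
    """Return True while `now` is earlier than the ISO 8601 expiry timestamp."""
    # Replace a trailing "Z" so fromisoformat also accepts it on older Pythons.
    expiry = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
    current = now or datetime.now(timezone.utc)
    return current < expiry


sample_expiry = "2024-05-02T12:00:00.000Z"  # hypothetical value
before = datetime(2024, 5, 2, 11, 0, tzinfo=timezone.utc)
after = datetime(2024, 5, 2, 13, 0, tzinfo=timezone.utc)
print(results_available(sample_expiry, now=before))
print(results_available(sample_expiry, now=after))
```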

Batch scrape with structured extraction

You can also use the batch scrape endpoint to extract structured data from pages. This is useful when you want the same structured data from several URLs.

from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

# Scrape multiple pages and wait for the results:
batch_scrape_result = firecrawl.batch_scrape(
    ['https://docs.firecrawl.dev', 'https://docs.firecrawl.dev/sdks/overview'],
    formats=[{
        'type': 'json',
        'prompt': 'Extract the title and description from the page.',
        'schema': {
            'type': 'object',
            'properties': {
                'title': {'type': 'string'},
                'description': {'type': 'string'}
            },
            'required': ['title', 'description']
        }
    }]
)
print(batch_scrape_result)

# Alternatively, use the start method, which returns without waiting:
batch_scrape_job = firecrawl.start_batch_scrape(
    ['https://docs.firecrawl.dev', 'https://docs.firecrawl.dev/sdks/overview'],
    formats=[{
        'type': 'json',
        'prompt': 'Extract the title and description from the page.',
        'schema': {
            'type': 'object',
            'properties': {
                'title': {'type': 'string'},
                'description': {'type': 'string'}
            },
            'required': ['title', 'description']
        }
    }]
)
print(batch_scrape_job)

# Then check the batch scrape status with the job ID:
batch_scrape_status = firecrawl.get_batch_scrape_status(batch_scrape_job.id)
print(batch_scrape_status)

Response

batchScrape/batch_scrape returns the full results:

Completed

{
  "status": "completed",
  "total": 36,
  "completed": 36,
  "creditsUsed": 36,
  "expiresAt": "2024-00-00T00:00:00.000Z",
  "next": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789?skip=26",
  "data": [
    {
      "json": {
        "title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
        "description": "Learn how to use Firecrawl, Groq Llama 3, and LangChain to build a 'Chat with your website' bot."
      }
    },
    ...
  ]
}
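
Each item's `json` field holds the object extracted for one URL. The sketch below shows one possible post-processing step, assuming the response is available as a plain dict shaped like the sample above (SDK return objects may differ). `split_by_required` is a hypothetical helper, and the required keys mirror the schema from the example.

```python
from typing import Any, Dict, List, Sequence, Tuple

REQUIRED_KEYS = ("title", "description")  # mirrors 'required' in the example schema


def split_by_required(
    response: Dict[str, Any],
    required: Sequence[str] = REQUIRED_KEYS,
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Separate items whose extraction has every required non-empty string."""
    complete: List[Dict[str, Any]] = []
    incomplete: List[Dict[str, Any]] = []
    for item in response.get("data", []):
        extracted = item.get("json") or {}
        if all(isinstance(extracted.get(k), str) and extracted[k] for k in required):
            complete.append(extracted)
        else:
            # Keep the whole item so the caller can inspect or retry it.
            incomplete.append(item)
    return complete, incomplete


sample = {
    "status": "completed",
    "data": [
        {"json": {"title": "Docs", "description": "Overview page"}},
        {"json": {"title": "SDKs"}},  # description missing
    ],
}
complete, incomplete = split_by_required(sample)
print(len(complete), len(incomplete))
```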

startBatchScrape/start_batch_scrape returns a job ID:

{
  "success": true,
  "id": "123-456-789",
  "url": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789"
}

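Batch scrape with webhooks

You can configure a webhook to receive a real-time notification as each URL in the batch is scraped. This lets you process results immediately instead of waiting for the whole batch to complete.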

cURL

curl -X POST https://api.firecrawl.dev/v2/batch/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "urls": [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
      ],
      "webhook": {
        "url": "https://your-domain.com/webhook",
        "metadata": {
          "any_key": "any_value"
        },
        "events": ["started", "page", "completed"]
      }
    }'
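
The same request can be assembled in Python. The sketch below uses only the standard library and builds the request without sending it, so it can be inspected safely. `build_batch_request` is a hypothetical helper, and the URLs and API key are the placeholders from the cURL example.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional

API_URL = "https://api.firecrawl.dev/v2/batch/scrape"


def build_batch_request(
    api_key: str,
    urls: List[str],
    webhook_url: str,
    metadata: Optional[Dict[str, Any]] = None,
    events: Optional[List[str]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the cURL example."""
    body = {
        "urls": list(urls),
        "webhook": {
            "url": webhook_url,
            "metadata": metadata or {},
            "events": events or ["started", "page", "completed"],
        },
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_batch_request(
    "YOUR_API_KEY",
    ["https://example.com/page1", "https://example.com/page2"],
    "https://your-domain.com/webhook",
    metadata={"any_key": "any_value"},
)
print(req.get_method(), req.full_url)
print(json.loads(req.data)["webhook"]["events"])
# To actually send it (requires a valid API key): urllib.request.urlopen(req)
```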

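For comprehensive webhook documentation, including event types, payload structure, and implementation examples, see the Webhooks documentation.

Quick reference

Event types:

batch_scrape.started - the batch scrape has started
batch_scrape.page - a URL was scraped successfully
batch_scrape.completed - all URLs have been processed
batch_scrape.failed - the batch scrape encountered an error

Basic payload: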

{
  "success": true,
  "type": "batch_scrape.page",
  "id": "batch-job-id",
  "data": [...], // page data for 'page' events
  "metadata": {}, // custom metadata
  "error": null
}
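
A receiver typically branches on the `type` field. The sketch below is a framework-agnostic handler that operates on an already-parsed payload. `handle_event` and the in-memory `state` dict are hypothetical, and the code assumes payloads follow the basic shape shown above.

```python
from typing import Any, Dict


def handle_event(payload: Dict[str, Any], state: Dict[str, Dict[str, Any]]) -> str:
    """Update per-job state from one webhook payload and return the action taken."""
    event = payload.get("type", "")
    job_id = str(payload.get("id"))
    job = state.setdefault(job_id, {"pages": [], "done": False, "error": None})

    if event == "batch_scrape.started":
        return "started"
    if event == "batch_scrape.page":
        # 'data' carries the page data for page events.
        job["pages"].extend(payload.get("data") or [])
        return "page"
    if event == "batch_scrape.completed":
        job["done"] = True
        return "completed"
    if event == "batch_scrape.failed":
        job["error"] = payload.get("error")
        job["done"] = True
        return "failed"
    return "ignored"  # unknown event types are skipped rather than raising


state: Dict[str, Dict[str, Any]] = {}
events = [
    {"success": True, "type": "batch_scrape.started", "id": "batch-job-id", "data": []},
    {"success": True, "type": "batch_scrape.page", "id": "batch-job-id",
     "data": [{"markdown": "# Page 1"}], "metadata": {}, "error": None},
    {"success": True, "type": "batch_scrape.completed", "id": "batch-job-id", "data": []},
]
actions = [handle_event(e, state) for e in events]
print(actions, len(state["batch-job-id"]["pages"]))
```

In a real deployment this function would sit behind an HTTP endpoint, and per-job state would usually live in a database rather than a dict.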

For detailed webhook configuration, security best practices, and troubleshooting, see the Webhooks documentation.
