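Firecrawl: An API Service That Turns Websites into LLM-Ready Data

Technical Background

High-quality, clean data is essential when building AI applications. Firecrawl is an API service designed to retrieve clean data from any website, offering scraping, crawling, and data extraction capabilities to support AI applications.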

Implementation Steps

Register and get an API key

To use the Firecrawl API, sign up on Firecrawl and obtain an API key.
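
One way to keep the key out of source code is to read it from an environment variable and build the request headers from it. The sketch below assumes a variable named FIRECRAWL_API_KEY and a helper named build_auth_headers; both names are hypothetical and chosen for illustration. The header layout mirrors the curl examples in this article.

```python
import os


def build_auth_headers(env=None, var_name="FIRECRAWL_API_KEY"):
    """Return HTTP headers that carry the API key as a Bearer token.

    `var_name` is an assumed environment variable name, not one the
    article prescribes.
    """
    env = os.environ if env is None else env
    api_key = env.get(var_name, "").strip()
    if not api_key:
        raise RuntimeError(f"Set {var_name} before calling the API")
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
```

Passing a plain dict as env makes the helper easy to exercise without touching the real environment.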

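Choose how to use it

Hosted version: an easy-to-use API; details are available in the playground and the documentation.
Self-hosted backend: you can also host the backend yourself, though the repository does not yet fully support self-hosted deployment. Resources for getting started include the API documentation, the SDKs (Python, Node, Go, Rust, and others), LLM framework integrations, and low-code framework integrations.

Run locally

To run Firecrawl locally, follow the corresponding guide.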

Core Code

Calling the API with curl

Crawl a URL

curl -X POST https://api.firecrawl.dev/v1/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer fc-YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "limit": 10,
      "scrapeOptions": {
        "formats": ["markdown", "html"]
      }
    }'
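
For readers who prefer Python over curl, the same request can be assembled with the standard library alone. This is a minimal sketch: the endpoint, headers, and JSON body are copied from the curl call above, while the function name build_crawl_request and its defaults are hypothetical. It only builds the request object and does not send it.

```python
import json
import urllib.request

API_BASE = "https://api.firecrawl.dev/v1"  # base URL taken from the curl examples


def build_crawl_request(api_key, url, limit=10, formats=("markdown", "html")):
    """Build (but do not send) the POST request equivalent to the curl call."""
    payload = {
        "url": url,
        "limit": limit,
        "scrapeOptions": {"formats": list(formats)},
    }
    return urllib.request.Request(
        f"{API_BASE}/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Sending it is then a matter of calling urllib.request.urlopen on the returned object, which requires a valid key and network access.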

Scrape a single URL

curl -X POST https://api.firecrawl.dev/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer fc-YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "formats": ["markdown", "html"]
    }'

Map a URL

curl -X POST https://api.firecrawl.dev/v1/map \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer fc-YOUR_API_KEY' \
    -d '{
      "url": "https://firecrawl.dev"
    }'

Search the web

curl -X POST https://api.firecrawl.dev/v1/search \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -d '{
    "query": "what is firecrawl?",
    "limit": 5
  }'

Extract structured data

curl -X POST https://api.firecrawl.dev/v1/extract \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer fc-YOUR_API_KEY' \
    -d '{
      "urls": [
        "https://firecrawl.dev/*", 
        "https://docs.firecrawl.dev/", 
        "https://www.ycombinator.com/companies"
      ],
      "prompt": "Extract the company mission, whether it is open source, and whether it is in Y Combinator from the page.",
      "schema": {
        "type": "object",
        "properties": {
          "company_mission": {
            "type": "string"
          },
          "is_open_source": {
            "type": "boolean"
          },
          "is_in_yc": {
            "type": "boolean"
          }
        },
        "required": [
          "company_mission",
          "is_open_source",
          "is_in_yc"
        ]
      }
    }'
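
Once an extraction returns, it can be worth confirming that the object actually matches the schema that was sent. The sketch below assumes the extracted fields have already been parsed into a flat Python dict; where that dict sits in the API response is not covered by this article, so it is left out. The helper name check_against_schema is hypothetical, and the check handles only flat objects with simple types, not the full JSON Schema specification.

```python
# Maps JSON Schema type names to the Python types that json.loads produces.
JSON_TYPES = {
    "string": str,
    "boolean": bool,
    "number": (int, float),
    "integer": int,
    "object": dict,
    "array": list,
}


def check_against_schema(data, schema):
    """Return a list of problems; an empty list means the flat object matches."""
    problems = []
    for name in schema.get("required", []):
        if name not in data:
            problems.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        expected = JSON_TYPES.get(spec.get("type"))
        if name not in data or expected is None:
            continue
        value = data[name]
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if is_stray_bool or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {spec['type']}, got {type(value).__name__}"
            )
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log everything that is wrong with a single extraction.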

Python SDK Example

Install the Python SDK

`pip install firecrawl-py`

Crawl and scrape a website

from firecrawl.firecrawl import FirecrawlApp
from firecrawl.firecrawl import ScrapeOptions

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Scrape a single page
scrape_status = app.scrape_url(
    'https://firecrawl.dev',
    formats=["markdown", "html"]
)
print(scrape_status)

# Crawl the site (up to 100 pages), checking the job status every 30 seconds
crawl_status = app.crawl_url(
    'https://firecrawl.dev',
    limit=100,
    scrape_options=ScrapeOptions(formats=["markdown", "html"]),
    poll_interval=30
)
print(crawl_status)

Node SDK Example

Install the Node SDK

`npm install @mendable/firecrawl-js`

Use the Node SDK

import FirecrawlApp, { CrawlParams, CrawlStatusResponse } from '@mendable/firecrawl-js';

const app = new FirecrawlApp({apiKey: "fc-YOUR_API_KEY"});

// Scrape a single page
const scrapeResponse = await app.scrapeUrl('https://firecrawl.dev', {
  formats: ['markdown', 'html'],
});

if (scrapeResponse) {
  console.log(scrapeResponse)
}

// Crawl the site
const crawlResponse = await app.crawlUrl('https://firecrawl.dev', {
  limit: 100,
  scrapeOptions: {
    formats: ['markdown', 'html'],
  }
} satisfies CrawlParams, true, 30) satisfies CrawlStatusResponse;

if (crawlResponse) {
  console.log(crawlResponse)
}

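Best Practices

Use a suitable format: choose the output format that fits your needs, such as markdown, structured data, screenshots, or HTML.
Use the SDKs: the Python, Node, and other SDKs make development more convenient.
Follow website rules: when scraping, searching, and crawling, respect each website's policies and comply with the applicable privacy policies and terms of use.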
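
One concrete way to respect a site's rules is to consult its robots.txt before requesting a URL. The following is a minimal sketch using Python's standard urllib.robotparser; the sample rules, the example.com URLs, and the helper name is_allowed are made up for illustration. It does not cover terms of use or privacy policies, which still need to be reviewed by hand.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this illustration.
SAMPLE_ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for candidate in ("https://example.com/docs/intro",
                      "https://example.com/private/report"):
        print(candidate, is_allowed(SAMPLE_ROBOTS_TXT, candidate))
```

In practice the robots.txt text would be downloaded from the target site first; RobotFileParser also offers set_url and read for that purpose.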

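Common Issues

Self-hosted deployment: the repository is still under development and does not yet fully support self-hosted deployment; watch for future updates.
API key: using the API requires signing up on Firecrawl and obtaining an API key; make sure the key is correct and still valid.
Data extraction: when extracting data, make sure the prompt and schema you provide are sensible so that the structured output is accurate.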
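
A quick local sanity check on the schema can catch simple mistakes before a request is sent. The sketch below is an assumption-laden illustration: the checks reflect the shape of the schema in this article's extract example (an object with typed properties and a required list), not a documented Firecrawl requirement, and the function name find_schema_issues is hypothetical.

```python
def find_schema_issues(schema):
    """Return a list of likely mistakes in a flat, object-style schema."""
    issues = []
    if schema.get("type") != "object":
        issues.append("top-level type should be 'object'")
    properties = schema.get("properties", {})
    if not properties:
        issues.append("no properties defined")
    # Every required name should refer to a declared property.
    for name in schema.get("required", []):
        if name not in properties:
            issues.append(f"required field not in properties: {name}")
    # Every property should state the type it expects.
    for name, spec in properties.items():
        if "type" not in spec:
            issues.append(f"property has no type: {name}")
    return issues
```

An empty result does not guarantee good extractions; it only rules out the structural slips listed in the function.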


Published on July 25, 2025
