Overview

Crawl systematically visits and extracts content from an entire website. Give it a starting URL, and it automatically discovers pages, follows links, and extracts clean structured data from every page it visits. Think of it as a smart robot that explores a website for you - reading every page and organizing all the content.

How it works

1. You provide a starting URL

Give Crawl the website or page URL where you want to start.

2. Crawl discovers all pages

  • Reads sitemap.xml for URL lists
  • Follows internal links automatically
  • Discovers pages across the entire site
  • Respects the depth and scope limits you set

3. Crawl visits and extracts from each page

  • Systematically visits every discovered page
  • Parses content from each page
  • Structures the data consistently
  • Processes pages in parallel for speed

4. Crawl delivers results as they're ready

Get organized data via webhook in real time, or poll for completed results.
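Step 1 boils down to a single request whose body carries the starting URL plus optional limits. A minimal Python sketch of assembling that body (field names and defaults mirror the crawl_options shown in the example later on this page; the helper itself is illustrative, not part of the API):

```python
def build_crawl_request(url, limit=10, max_discovery_depth=5,
                        include_paths=None, exclude_paths=None):
    """Assemble a crawl request body.

    Field names mirror the crawl_options in the example response on
    this page; the defaults here are illustrative, not official.
    """
    return {
        "url": url,                                # starting URL for discovery
        "limit": limit,                            # max pages to crawl
        "max_discovery_depth": max_discovery_depth,
        "include_paths": include_paths or [],
        "exclude_paths": exclude_paths or [],
    }

body = build_crawl_request("https://example-store.com/products",
                           include_paths=["/products/*"])
```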

When to use Crawl

Full Site Data Collection

Extract data from hundreds or thousands of pages across an entire website

Product Catalog Scraping

Gather all product information from e-commerce sites automatically

Content Archiving

Create complete snapshots of websites for analysis or backup

Price Monitoring

Track pricing across entire catalogs over time

Common use cases

  • E-commerce data collection - Scrape complete product catalogs, including prices, descriptions, images, and specifications.
  • Content migration - Move content from old platforms to new systems by crawling and extracting all pages.
  • Competitive analysis - Monitor competitor websites for changes in products, pricing, or content strategy.
  • SEO audits - Analyze entire websites for content quality, structure, and optimization opportunities.

Crawl vs. other tools

What you need | Use this
Data from an entire website | Crawl
Data from popular sites (Amazon, Google, etc.) | Public Agent - maintained by Nimble
Data from sites not in the gallery | Custom Agent - create with natural language
Data from specific URLs (expert users) | Extract
Search the web + extract content from results | Search
URLs with context for AI planning | Map

How Crawl discovers pages

Crawl intelligently finds pages using multiple methods:
  • Sitemap analysis - Reads sitemap.xml for structured URL lists
  • Link following - Discovers pages by following internal navigation
  • Depth control - Set how deep to explore from the starting URL
  • Smart filtering - Include or exclude specific paths and patterns
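One way to picture how these controls interact: every discovered URL passes through depth and path filters before being queued. A simplified Python sketch of that decision (illustrative logic only; glob-style fnmatch patterns stand in for whatever pattern syntax the service actually uses):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def should_crawl(url, depth, *, max_depth=5,
                 include_paths=(), exclude_paths=()):
    """Decide whether a discovered URL gets queued.

    Illustrative only: real matching rules may differ. Exclusions win
    over inclusions; an empty include list means "include everything".
    """
    if depth > max_depth:                      # too far from the start URL
        return False
    path = urlparse(url).path
    if any(fnmatch(path, pat) for pat in exclude_paths):
        return False                           # explicitly excluded
    if include_paths and not any(fnmatch(path, pat) for pat in include_paths):
        return False                           # not in the allowed sections
    return True
```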

Why use Crawl

  • Comprehensive - Get data from entire websites, not just single pages
  • Automated - No manual URL lists needed; Crawl discovers every page itself
  • Efficient - Processes thousands of pages with optimal resource usage
  • Flexible - Control depth, scope, and which pages to include or exclude
  • Webhook support - Receive data in real-time as pages are processed

Example

Input: a starting URL plus crawl options (here, a retailer search page)
{
    "url": "https://www.bestbuy.com/site/searchpage.jsp?id=pcat17071&st=laptops",
    "sitemap": "include",
    "country": "US",
    "limit": 10
}
Output: Immediate response with crawl job details
{
    "id": "123e4567-e89b-12d3-a456-426614174000",
    "name": "string",
    "url": "https://www.bestbuy.com/site/searchpage.jsp?id=pcat17071&st=laptops",
    "status": "queued",
    "account_name": "your-account",
    "total": 1,
    "pending": 1,
    "completed": 0,
    "failed": 0,
    "created_at": "2026-02-09T10:25:59.512Z",
    "updated_at": "2026-02-09T10:25:59.512Z",
    "completed_at": null,
    "crawl_options": {
        "sitemap": "include",
        "crawl_entire_domain": false,
        "limit": 10,
        "max_discovery_depth": 5,
        "exclude_paths": [],
        "include_paths": [],
        "ignore_query_parameters": false,
        "allow_external_links": false,
        "allow_subdomains": false,
        "callback": null
    },
    "extract_options": null,
    "tasks": [
        {
            "webit_task_id": "0be6081e-90e3-458a-1234-36f5a51f0156",
            "crawl_id": "123e4567-e89b-12d3-a456-426614174000",
            "status": "pending",
            "url": "https://www.bestbuy.com/site/searchpage.jsp?id=pcat17071&st=laptops",
            "created_at": "2026-02-09T10:25:59.512Z",
            "updated_at": "2026-02-09T10:25:59.512Z"
        }
    ]
}
As pages are crawled, you receive crawl tasks either via webhook or by polling the status endpoint:
  • Use GET https://sdk.nimbleway.com/v1/crawl/{crawl_id}
{
    "crawl": {
        "id": "123e4567-e89b-12d3-a456-426614174000",
        "name": "string",
        "url": "https://www.bestbuy.com/site/searchpage.jsp?id=pcat17071&st=laptops",
        "status": "succeeded",
        "account_name": "your-account",
        "total": 10,
        "pending": 0,
        "completed": 10,
        "failed": 0,
        "created_at": "2026-02-09T10:25:59.512Z",
        "updated_at": "2026-02-09T10:26:12.164Z",
        "completed_at": "2026-02-09T10:30:00.000Z",
        "crawl_options": {
            "sitemap": "include",
            "crawl_entire_domain": false,
            "limit": 10,
            "max_discovery_depth": 5,
            "exclude_paths": [],
            "include_paths": [],
            "ignore_query_parameters": false,
            "allow_external_links": false,
            "allow_subdomains": false,
            "callback": null
        },
        "extract_options": null,
        "tasks": [
            {
                "task_id": "0be6081e-90e3-458a-1234-36f5a51f0156",
                "crawl_id": "123e4567-e89b-12d3-a456-426614174000",
                "status": "completed",
                "url": "https://www.bestbuy.com/site/searchpage.jsp?id=pcat17071&st=laptops",
                "created_at": "2026-02-09T10:25:59.512Z",
                "updated_at": "2026-02-09T10:26:12.164Z"
            },
            {
                "task_id": "0be6081e-90e3-1234-1234-36f5a51f0156",
                "crawl_id": "123e4567-e89b-12d3-a456-426614174000",
                "status": "completed",
                "url": "https://www.bestbuy.com/site/searchpage.jsp?af=false&id=pcat17071&st=laptops",
                "created_at": "2026-02-09T10:25:59.512Z",
                "updated_at": "2026-02-09T10:26:12.164Z"
            }
        ]
    }
}
The create crawl response returns webit_task_id in tasks, while the status endpoint returns task_id. Both refer to the same task identifier used to fetch results.
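If you poll instead of using a webhook, the loop is just repeated calls to the status endpoint until the crawl leaves its active state. A sketch with the HTTP call injected as a function, so the pattern can be shown without credentials (the terminal-status check is an assumption; this page's examples show "queued" and "succeeded"):

```python
import time

def wait_for_crawl(fetch_status, crawl_id, *, interval=5.0, sleep=time.sleep):
    """Poll the crawl status endpoint until a terminal status is reached.

    fetch_status(crawl_id) should return the status-endpoint payload,
    i.e. {"crawl": {...}}. Treating anything other than "queued" or
    "running" as terminal is an assumption of this sketch.
    """
    while True:
        crawl = fetch_status(crawl_id)["crawl"]
        if crawl["status"] not in ("queued", "running"):
            return crawl
        sleep(interval)

# Demo with a canned status sequence instead of real HTTP:
states = iter(["queued", "queued", "succeeded"])
fake = lambda cid: {"crawl": {"id": cid, "status": next(states)}}
final = wait_for_crawl(fake, "123e4567", sleep=lambda s: None)
```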
Then, for each crawled page (task), you fetch the task results:
  • Use GET https://sdk.nimbleway.com/v1/tasks/{task_id}/results
{
    "url": "https://www.bestbuy.com/site/searchpage.jsp?af=false&id=pcat17071&st=laptops",
    "task_id": "e8ed8ef6-2657-43ba-98d5-a5c79ea7b551",
    "status": "success",
    "task": {
        "id": "e8ed8ef6-2657-43ba-98d5-a5c79ea7b551",
        "state": "success",
        "created_at": "2026-02-09T10:26:05.817Z",
        "modified_at": "2026-02-09T10:26:05.817Z",
        "account_name": "your-account",
        "input": { ... }
    },
    "data": {
        "html": "...",
        "headers": { ... },
        "parsing": { ... }
    },
    "metadata": {
        "query_time": "2026-02-09T10:26:05.817Z",
        "query_duration": 1877,
        "response_parameters": {
            "input_url": "https://www.bestbuy.com/site/searchpage.jsp?af=false&id=pcat17071&st=laptops"
        },
        "driver": "vx6"
    },
    "status_code": 200
}
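Each task result bundles the raw html, response headers, and any parsed fields under data. A small helper that pulls out the pieces most pipelines need from one result payload (field names taken from the example above; the helper is illustrative):

```python
def summarize_result(result):
    """Collapse a task result payload into a few commonly used fields."""
    data = result.get("data", {})
    return {
        "url": result["url"],
        "ok": result.get("status") == "success" and result.get("status_code") == 200,
        "html": data.get("html"),       # raw page HTML
        "parsed": data.get("parsing"),  # structured parsing output, if any
    }

sample = {
    "url": "https://example.com/p",
    "status": "success",
    "status_code": 200,
    "data": {"html": "<html>...</html>", "parsing": {"title": "x"}},
}
summary = summarize_result(sample)
```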

Key features

  • Async operation - Crawl jobs run asynchronously: start a crawl and receive results via webhook, or check its status later.
  • Scale control - Set limits on pages, depth, and scope to match your needs and budget.
  • Pattern matching - Use include/exclude patterns to target specific sections or content types.
  • Real-time results - Get data as it's extracted via webhooks, or poll for completed results.

SDK methods

Method | Description
nimble.crawl({...}) | Create a new crawl job
nimble.crawl.list() | List all crawls with pagination
nimble.crawl.status(crawl_id) | Get crawl status and task list
nimble.crawl.terminate(crawl_id) | Stop a running crawl
nimble.tasks.results(task_id) | Get extracted content for a page
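These SDK methods map onto the REST endpoints shown earlier. A minimal Python sketch of such a wrapper, with the HTTP transport injected so it can be exercised without credentials (hypothetical: the status and results paths come from this page, but the create path and the real SDK's internals are assumptions):

```python
from dataclasses import dataclass
from typing import Callable, Optional

BASE = "https://sdk.nimbleway.com/v1"

@dataclass
class CrawlClient:
    """Thin sketch of a crawl client over the REST endpoints.

    transport(method, url, body) -> dict is injected so the sketch can
    be tested with a fake instead of a live API.
    """
    transport: Callable[[str, str, Optional[dict]], dict]

    def create(self, options: dict) -> dict:
        # POST path is an assumption of this sketch
        return self.transport("POST", f"{BASE}/crawl", options)

    def status(self, crawl_id: str) -> dict:
        return self.transport("GET", f"{BASE}/crawl/{crawl_id}", None)

    def results(self, task_id: str) -> dict:
        return self.transport("GET", f"{BASE}/tasks/{task_id}/results", None)

# Exercise the wrapper with a fake transport that records calls:
calls = []
def fake(method, url, body):
    calls.append((method, url))
    return {"ok": True}

client = CrawlClient(fake)
client.status("abc")
client.results("t1")
```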

Next steps

Crawl Usage

See all parameters, code examples, and advanced features