Nimble Crawl (async) systematically visits and extracts content from entire websites. Give it a starting URL, and it automatically discovers pages through sitemaps and internal links, then extracts clean structured data from every page it visits.

Quick Start

Example Request

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.crawl({
    "url": "https://www.example.com",
    "limit": 100
})

crawl_id = result["crawl_id"]
print(f"Crawl started with ID: {crawl_id}")

Example Response

{
    "crawl_id": "e3ca2ff1-b82a-472b-b1a9-ef4d29cc549f",
    "name": null,
    "url": "https://www.example.com",
    "status": "queued",
    "account_name": "your-accound",
    "created_at": "2026-02-09T23:15:40.785Z",
    "updated_at": "2026-02-09T23:15:40.785Z",
    "completed_at": null,
    "crawl_options": {
        "sitemap": "include",
        "crawl_entire_domain": false,
        "limit": 10,
        "max_discovery_depth": 5,
        "ignore_query_parameters": false,
        "allow_external_links": false,
        "allow_subdomains": false
    },
    "extract_options": null
}

How it works

1. You submit a crawl request

Provide a starting URL and configure crawl options (limits, filters, extraction settings).

2. An async crawl job is created

  • Returns immediately with a crawl_id to track progress
  • The crawl runs in the background on Nimble’s infrastructure
  • Optional: Configure webhooks to receive real-time notifications

3. Crawl discovers and processes pages

  • Reads sitemaps and follows internal links
  • Creates individual tasks for each discovered URL
  • Extracts content from pages as they’re visited
  • Status updates live: track pending, completed, and failed counts

4. Retrieve results anytime (a minimal end-to-end sketch follows these steps)

  • Poll crawl status to monitor progress
  • Fetch extracted content for completed tasks using task_id
  • Results persist after the crawl completes for later retrieval
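
A minimal end-to-end sketch of this flow using the SDK calls documented on this page; the polling interval is arbitrary, and the terminal statuses are the ones listed under Response Fields:

import time
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

# 1. Submit the crawl request
crawl = nimble.crawl({"url": "https://www.example.com", "limit": 100})
crawl_id = crawl["crawl_id"]

# 2-3. Poll the async job until it reaches a terminal state
while True:
    status = nimble.crawl.status(crawl_id=crawl_id)["crawl"]
    print(f"{status['status']}: {status['completed']}/{status['total']} pages")
    if status["status"] in ("succeeded", "failed", "canceled"):
        break
    time.sleep(10)  # illustrative polling interval

# 4. Fetch extracted content for every completed task
for task in status["tasks"]:
    if task["status"] == "completed":
        result = nimble.tasks.results(task_id=task["task_id"])
        print(result["url"], result["status_code"])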

Parameters

Supported input parameters (a sketch combining several of them follows this list):
url
string
required
The starting point for your crawl. The crawler will begin here and discover other pages from this URL. Example: https://www.example.com
name
string
Give your crawl a memorable name. This helps you identify it later when you have multiple crawls running. Example: my-zillow-crawl
limit
integer
default:"5000"
Stop the crawl after finding this many pages.
  • Min: 1
  • Max: 10000
  • Default: 5000
sitemap
string
default:"include"
Decide how to use the website’s sitemap for discovering pages. Options:
  • include (default) - Use both the sitemap and discovered links
  • only - Just use the sitemap (fastest)
  • skip - Ignore the sitemap and only follow links
crawl_entire_domain
boolean
default:"false"
Let the crawler explore the entire domain, not just pages “under” your starting URL. For example, if you start at /blog, enabling this lets it also crawl /about and /contact.
allow_subdomains
boolean
default:"false"
Allow the crawler to follow links to subdomains. For example, from www.example.com to blog.example.com or shop.example.com.
include_paths
array
Only crawl pages whose URLs match these regex patterns. Example: ["/blog/.*", "/articles/.*"]
exclude_paths
array
Skip pages whose URLs match these regex patterns. Example: [".*/tag/.*", ".*/page/[0-9]+"]
max_discovery_depth
integer
default:"5"
Control how many “clicks away” from the starting page the crawler can go.
  • Min: 1
  • Max: 20
  • Default: 5
ignore_query_parameters
boolean
default:"false"
Treat URLs with different query parameters as the same page, preventing duplicate crawls.
callback
object
Get notified when your crawl completes or as pages are discovered. Configuration:
  • url (required) - String | Webhook URL to receive notifications
  • headers - Object | Custom headers for authentication
  • metadata - Object | Extra data to include in payloads
  • events - Array | Which events trigger notifications: started, page, completed, failed
Example:
{
    "callback": {
        "url": "https://your-server.com/webhook",
        "headers": {},
        "metadata": {},
        "events": ["started", "completed"]
    }
}
extract_options
object
Automatically extract content from each page as it is crawled. Accepts all Extract API options. Example:
{
    "extract_options": {
        "driver": "vx10",
        "parse": true,
        "formats": ["html", "markdown"]
    }
}
country
string
default:"ALL"
Crawl the site as if you’re browsing from a specific country. Use ISO Alpha-2 country codes like US, GB, FR, DE, CA, JP, etc. Use ALL for random country selection.
locale
string
Set the language preference for crawling. Use the LCID standard. Locale examples:
  • en-US - English (United States)
  • en-GB - English (United Kingdom)
  • fr-FR - French (France)
  • de-DE - German (Germany)
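
As a combined illustration of the options above, a sketch that restricts discovery to the sitemap, limits depth, collapses query-parameter variants, and crawls from a US vantage point (the values are illustrative, not recommendations):

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.crawl({
    "url": "https://www.example.com",
    "sitemap": "only",                # discover pages from the sitemap only
    "max_discovery_depth": 3,         # stay within 3 clicks of the start URL
    "ignore_query_parameters": True,  # treat ?utm=... variants as one page
    "country": "US",                  # browse as if from the United States
    "locale": "en-US",                # prefer US English content
    "limit": 200
})

print(result["crawl_id"])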

Usage

Basic crawl

Crawl a website using default settings:
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.crawl({
    "url": "https://www.example.com"
})

crawl_id = result["id"]
print(f"Crawl started with ID: {crawl_id}")

Filter with URL patterns

Use include and exclude patterns to control which URLs are crawled:
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.crawl({
    "url": "https://www.example.com",
    "include_paths": ["/blog/.*", "/articles/.*"],
    "exclude_paths": [".*/tag/.*", ".*/page/[0-9]+"],
    "limit": 500
})

print(result)

Crawl entire domain

Allow crawler to follow all internal links beyond the starting path:
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.crawl({
    "url": "https://www.example.com/blog",
    "crawl_entire_domain": True,
    "limit": 2000
})

print(result)

Crawl with extraction

Extract structured data from each page during the crawl:
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.crawl({
    "url": "https://www.example.com",
    "limit": 500,
    "extract_options": {
        "driver": "vx10",
        "parse": True,
        "formats": ["html", "markdown"]
    }
})

print(result)

Check crawl status

Get the current status and progress of a crawl (the same crawl.status call is covered in more detail under Managing Crawls):
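from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

# Look up progress by the crawl_id returned when the crawl was created
response = nimble.crawl.status(crawl_id="your-crawl-id")
crawl = response["crawl"]

print(f"Status: {crawl['status']}")
print(f"Progress: {crawl['completed']}/{crawl['total']} pages")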

Get task results

Fetch extracted content for a specific crawled page:
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

# Get crawl status first to find task IDs
response = nimble.crawl.status(crawl_id="your-crawl-id")
crawl = response["crawl"]

# Fetch results for each completed task
for task in crawl["tasks"]:
    if task["status"] == "completed":
        result = nimble.tasks.results(task_id=task["task_id"])
        print(f"URL: {result['url']}")

Combined parameters

Crawl with multiple parameters for precise control:
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.crawl({
    "url": "https://www.example.com",
    "name": "Product Catalog Crawl",
    "sitemap": "include",
    "allow_subdomains": True,
    "include_paths": ["/products/.*"],
    "limit": 1000,
    "callback": {
        "url": "https://your-server.com/webhook",
        "events": ["completed"]
    }
})

print(f"Crawl started: {result['id']}")

Managing Crawls

1. List crawls

Get all your crawls filtered by status:
  • Available status filters: pending, in_progress, completed, failed, canceled
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

# List crawls by status
response = nimble.crawl.list(status="completed")

for crawl in response["data"]:
    print(f"Crawl {crawl['crawl_id']}: {crawl['name']} - {crawl['status']}")

2. Get crawl status (by crawl_id)

Check progress and get the list of task IDs for a specific crawl:
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

response = nimble.crawl.status(crawl_id="e3ca2ff1-b82a-472b-b1a9-ef4d29cc549f")
crawl = response["crawl"]

print(f"Status: {crawl['status']}")
print(f"Progress: {crawl['completed']}/{crawl['total']} pages")
print(f"Tasks: {len(crawl['tasks'])}")
3. Get task results

Use the task_id from the crawl status response to fetch extracted content for each page:
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

# Get crawl status to find task IDs
response = nimble.crawl.status(crawl_id="e3ca2ff1-b82a-472b-b1a9-ef4d29cc549f")
crawl = response["crawl"]

# Fetch results for each completed task
for task in crawl["tasks"]:
    if task["status"] == "completed":
        result = nimble.tasks.results(task_id=task["task_id"])
        print(f"URL: {result['url']}")
        print(f"Status Code: {result['status_code']}")
        # Access extracted data
        if "data" in result:
            print(f"HTML length: {len(result['data'].get('html', ''))}")
Example task result response:
{
    "url": "https://www.example.com/page",
    "task_id": "ec89b1f7-1cf2-40eb-91b4-78716093f9ed",
    "status": "success",
    "task": {
        "id": "ec89b1f7-1cf2-40eb-91b4-78716093f9ed",
        "state": "success",
        "created_at": "2026-02-09T23:15:43.549Z",
        "modified_at": "2026-02-09T23:16:39.094Z",
        "account_name": "your-account"
    },
    "data": {
        "html": "<!DOCTYPE html>...",
        "markdown": "# Page Title\n\nContent...",
        "headers": { ... }
    },
    "metadata": {
        "query_time": "2026-02-09T23:15:43.549Z",
        "query_duration": 1877,
        "response_parameters": {
            "input_url": "https://www.example.com/page"
        }
    },
    "status_code": 200
}
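
Building on the loop above, a sketch that writes each completed page’s Markdown to disk; the output directory and file naming are illustrative, and pages without a markdown field are skipped:

import pathlib
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

out_dir = pathlib.Path("crawl_export")  # illustrative output directory
out_dir.mkdir(exist_ok=True)

crawl = nimble.crawl.status(crawl_id="e3ca2ff1-b82a-472b-b1a9-ef4d29cc549f")["crawl"]

for task in crawl["tasks"]:
    if task["status"] != "completed":
        continue
    result = nimble.tasks.results(task_id=task["task_id"])
    markdown = result.get("data", {}).get("markdown")
    if markdown:
        # One file per task, named by its task_id
        (out_dir / f"{task['task_id']}.md").write_text(markdown, encoding="utf-8")
        print(f"Saved {result['url']}")
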
4. Cancel crawl

Stop a running crawl. Completed tasks remain available for result retrieval:
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.crawl.cancel(crawl_id="e3ca2ff1-b82a-472b-b1a9-ef4d29cc549f")

print(result)  # {"status": "canceled"}

Response Fields

When you use Crawl, you receive:
  • Async operation - Crawl jobs run in the background; check status on demand or receive webhooks
  • Progress tracking - Monitor total, pending, completed, and failed counts
  • Task-based results - Each page becomes a task with extractable content
  • Webhook support - Get notified in real time as pages are processed (a minimal receiver sketch follows this list)
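
If the callback parameter is configured, the crawl POSTs notifications to your endpoint for the subscribed events. A minimal receiver sketch using Flask (the framework choice is ours, and the payload is simply logged because its exact fields are not documented here):

from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def crawl_webhook():
    # Log whatever payload Nimble sends; inspect it to see the actual fields
    payload = request.get_json(silent=True) or {}
    print("Received crawl event:", payload)
    return "", 200

if __name__ == "__main__":
    app.run(port=8000)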

Create Crawl Response

Returns an immediate response with the crawl job details:
{
    "crawl_id": "e3ca2ff1-b82a-472b-b1a9-ef4d29cc549f",
    "name": null,
    "url": "https://www.example.com",
    "status": "queued",
    "account_name": "your-accound",
    "created_at": "2026-02-09T23:15:40.785Z",
    "updated_at": "2026-02-09T23:15:40.785Z",
    "completed_at": null,
    "crawl_options": {
        "sitemap": "include",
        "crawl_entire_domain": false,
        "limit": 10,
        "max_discovery_depth": 5,
        "ignore_query_parameters": false,
        "allow_external_links": false,
        "allow_subdomains": false
    },
    "extract_options": null
}
Field | Type | Description
crawl_id | string | Unique identifier for the crawl job
name | string | Optional name you assigned to the crawl
url | string | Starting URL for the crawl
status | string | One of: queued, running, succeeded, failed, canceled
account_name | string | Your account identifier
created_at | string | Timestamp when the crawl was created
updated_at | string | Timestamp of the last status update
completed_at | string | Timestamp when the crawl completed (null if in progress)
crawl_options | object | Configuration settings applied to this crawl
extract_options | object | Extraction settings (null if not configured)

Get Crawl Status by ID Response

Returns the crawl object wrapped in a crawl key, with progress counters and task list:
{
    "crawl": {
        "crawl_id": "e3ca2ff1-b82a-472b-b1a9-ef4d29cc549f",
        "name": null,
        "url": "https://www.example.com",
        "status": "succeeded",
        "account_name": "your-account",
        "total": 10,
        "pending": 0,
        "completed": 10,
        "failed": 0,
        "created_at": "2026-02-09T23:15:40.785Z",
        "updated_at": "2026-02-09T23:17:08.083Z",
        "completed_at": "2026-02-09T23:17:08.079Z",
        "crawl_options": {
            "sitemap": "include",
            "crawl_entire_domain": false,
            "limit": 10,
            "max_discovery_depth": 5
        },
        "extract_options": null,
        "tasks": [
            {
                "task_id": "ec89b1f7-1cf2-40eb-91b4-78716093f9ed",
                "status": "completed",
                "updated_at": "2026-02-09T23:16:39.094Z",
                "created_at": "2026-02-09T23:15:43.549Z"
            },
            {
                "task_id": "3f6c136c-4bb5-44af-a21b-c8f1db708c2f",
                "status": "completed",
                "updated_at": "2026-02-09T23:16:45.033Z",
                "created_at": "2026-02-09T23:15:42.966Z"
            }
        ]
    }
}
Field | Type | Description
crawl.crawl_id | string | Unique identifier for the crawl job
crawl.status | string | Current crawl status
crawl.total | integer | Total URLs discovered
crawl.pending | integer | URLs waiting to be processed
crawl.completed | integer | Successfully processed URLs
crawl.failed | integer | Failed URL extractions
crawl.tasks | array | List of individual page tasks
crawl.tasks[].task_id | string | Task ID to use with GET /v1/tasks/{id}/results
crawl.tasks[].status | string | One of: pending, processing, completed, failed

SDK methods

Method | Description
nimble.crawl({...}) | Create a new crawl job
nimble.crawl.list() | List all crawls with pagination
nimble.crawl.status(crawl_id) | Get crawl status and task list
nimble.crawl.cancel(crawl_id) | Stop a running crawl
nimble.tasks.results(task_id) | Get extracted content for a page

Use cases

Full Site Data Collection

Extract data from hundreds or thousands of pages across an entire website

Product Catalog Scraping

Gather all product information from e-commerce sites automatically

Content Archiving

Create complete snapshots of websites for analysis or backup

Price Monitoring

Track pricing across entire catalogs over time

Real-world examples

Scenario: You need to gather all product information from a competitor’s online store.
How Crawl helps:
  • Discovers all product pages through sitemaps and navigation
  • Extracts product details, prices, and descriptions from each page
  • Handles pagination and category structures automatically
  • Filters out cart, checkout, and account pages
Result: Complete product catalog data without manual URL collection.
Scenario: You’re migrating a blog to a new platform and need all content.
How Crawl helps:
  • Finds all blog posts through sitemap and internal links
  • Extracts post content, metadata, and images
  • Excludes tag pages, author archives, and pagination
  • Preserves URL structure for redirects
Result: Complete content export ready for migration.
Scenario: You want to create an offline backup of documentation.
How Crawl helps:
  • Maps entire documentation structure
  • Extracts content from all pages
  • Maintains hierarchy and navigation structure
  • Captures code examples and technical content
Result: Complete documentation archive for offline access.
Scenario: You need to track competitor pricing across their entire catalog.
How Crawl helps:
  • Discovers all product pages automatically
  • Extracts pricing information from each page
  • Runs on schedule to track changes over time
  • Handles dynamic pricing and regional variations
Result: Comprehensive price intelligence data.
Scenario: You’re auditing a website’s content for SEO optimization.
How Crawl helps:
  • Discovers all indexable pages
  • Extracts titles, meta descriptions, and headings
  • Identifies orphaned pages and broken links
  • Maps internal linking structure
Result: Complete SEO audit data for analysis.

Crawl vs Map

Need | Use
Extract content from pages | Crawl
Deep link following | Crawl
Complex filtering patterns | Crawl
Webhook notifications | Crawl
Quick URL discovery only | Map - completes in seconds
URL list with titles/descriptions | Map
Tip: Use Map first - discover URLs quickly with Map, then use Crawl to extract content from the pages you need.

Next steps