Nimble Crawl (async) systematically visits and extracts content from entire websites. Give it a starting URL, and it automatically discovers pages through sitemaps and internal links, then extracts clean structured data from every page it visits.

Quick Start

Example Request

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.crawl.run(
    url="https://www.nimbleway.com",
    limit=10
)

print(f"Crawl started with ID: {result.crawl_id}")

Example Response

{
  "crawl_id": "e3ca2ff1-b82a-472b-b1a9-ef4d29cc549f",
  "name": null,
  "url": "https://www.nimbleway.com",
  "status": "queued",
  "account_name": "your-account",
  "created_at": "2026-02-09T23:15:40.785Z",
  "updated_at": "2026-02-09T23:15:40.785Z",
  "completed_at": null,
  "crawl_options": {
    "sitemap": "include",
    "crawl_entire_domain": false,
    "limit": 10,
    "max_discovery_depth": 5,
    "ignore_query_parameters": false,
    "allow_external_links": false,
    "allow_subdomains": false
  },
  "extract_options": null
}

How it works

1. You submit a crawl request

Provide a starting URL and configure crawl options (limits, filters, extraction settings).

2. An async crawl job is created

  • Returns immediately with a crawl_id to track progress
  • The crawl runs in the background on Nimble’s infrastructure
  • Optional: configure webhooks to receive real-time notifications

3. Crawl discovers and processes pages

  • Reads sitemaps and follows internal links
  • Creates individual tasks for each discovered URL
  • Extracts content from pages as they’re visited
  • Status updates live: track pending, completed, and failed counts

4. Retrieve results anytime (see the end-to-end sketch below)

  • Poll crawl status to monitor progress
  • Fetch extracted content for completed tasks using task_id
  • Results persist after the crawl completes for later retrieval
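Putting these steps together, here is a compact end-to-end sketch using the SDK-style calls shown later on this page (nimble.crawl.run, nimble.crawl.status, nimble.tasks.get); note that, per the SDK note near the end of this page, the management calls may need to go through the REST API instead:
import time
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

# Steps 1-2: submit the crawl; returns immediately with a crawl_id.
result = nimble.crawl.run(url="https://www.nimbleway.com", limit=10)

# Step 3: poll until the crawl reaches a terminal status.
while True:
    crawl = nimble.crawl.status(result.crawl_id)
    if crawl.status in ("succeeded", "failed", "canceled"):
        break
    time.sleep(10)  # polling interval is arbitrary

# Step 4: fetch extracted content for each completed page task.
for task in crawl.tasks:
    if task.status == "completed":
        print(nimble.tasks.get(task.task_id)["url"])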

Parameters

Supported input parameters:
url
string
required
The starting point for your crawl. The crawler will begin here and discover other pages from this URL. Example: https://www.nimbleway.com

name
string
Give your crawl a memorable name. This helps you identify it later when you have multiple crawls running. Example: my-zillow-crawl

limit
integer
default: 5000
Stop the crawl after finding this many pages.
  • Min: 1
  • Max: 10000
  • Default: 5000

extract_options
object
Automatically extract content from each page as you crawl it. Accepts all Extract API options. Example:
{
	"extract_options": {
		"driver": "vx10",
		"parse": true,
		"formats": ["html", "markdown"]
	}
}

sitemap
string
default: "include"
Decide how to use the website’s sitemap for discovering pages. Options:
  • include (default) - Use both the sitemap and discovered links
  • only - Just use the sitemap (fastest)
  • skip - Ignore the sitemap and only follow links

crawl_entire_domain
boolean
default: false
Let the crawler explore the entire domain, not just pages “under” your starting URL. For example, if you start at /blog, enabling this lets it also crawl /about and /contact.

allow_subdomains
boolean
default: false
Allow the crawler to follow links to subdomains. For example, from www.example.com to blog.example.com or shop.example.com.

include_paths
array
Only crawl pages whose URLs match these regex patterns. Example: ["/blog/.*", "/articles/.*"]

exclude_paths
array
Skip pages whose URLs match these regex patterns. Example: [".*/tag/.*", ".*/page/[0-9]+"]
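Since these path filters are regular expressions (not glob patterns), it can help to sanity-check them locally before launching a large crawl. A minimal sketch using Python’s re module; the sample URLs are illustrative, and whether Nimble matches against the full URL or only the path is not specified here, so treat this as a rough local check:
import re

# Hypothetical patterns and URLs for a local dry run.
include_paths = ["/blog/.*", "/articles/.*"]
exclude_paths = [".*/tag/.*", ".*/page/[0-9]+"]
urls = [
    "https://www.nimbleway.com/blog/post-1",
    "https://www.nimbleway.com/blog/tag/news",
    "https://www.nimbleway.com/pricing",
]

for url in urls:
    included = any(re.search(p, url) for p in include_paths)
    excluded = any(re.search(p, url) for p in exclude_paths)
    print(f"{url}: crawled={included and not excluded}")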
max_discovery_depth
integer
default: 5
Control how many “clicks away” from the starting page the crawler can go.
  • Min: 1
  • Max: 20
  • Default: 5

ignore_query_parameters
boolean
default: false
Treat URLs that differ only in query parameters as the same page, preventing duplicate crawls.

callback
object
Get notified when your crawl completes or as pages are discovered. Configuration:
  • url (required) - String | Webhook URL to receive notifications
  • headers - Object | Custom headers for authentication
  • metadata - Object | Extra data to include in payloads
  • events - Array | Which events trigger notifications: started, page, completed, failed
Example:
{
	"callback": {
		"url": "https://your-server.com/webhook",
		"headers": {},
		"metadata": {},
		"events": ["started", "completed"]
	}
}
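To consume these notifications you need an HTTP endpoint that Nimble can reach. A minimal receiver sketch using Python’s standard library; the payload fields read below (crawl_id, status) are assumptions based on the crawl object elsewhere on this page, not a documented webhook schema:
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the notification body.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")

        # Field names here are assumptions, not a documented schema.
        print("crawl:", payload.get("crawl_id"), "status:", payload.get("status"))

        self.send_response(200)
        self.end_headers()

# Expose this server publicly (e.g. via a tunnel) and use its URL
# as callback.url in your crawl request.
HTTPServer(("", 8000), WebhookHandler).serve_forever()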
country
string
default: "ALL"
Crawl the site as if you’re browsing from a specific country. Use ISO Alpha-2 country codes like US, GB, FR, DE, CA, JP, etc. Use ALL for random country selection.

locale
string
Set the language preference for crawling. Uses the LCID standard. Locale examples:
  • en-US - English (United States)
  • en-GB - English (United Kingdom)
  • fr-FR - French (France)
  • de-DE - German (Germany)
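The Usage examples below don’t cover geo-targeting, so here is a brief sketch combining country and locale in the same crawl.run call, using only the parameters documented above:
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

# Crawl as if browsing from Germany, preferring German-language content.
result = nimble.crawl.run(
    url="https://www.nimbleway.com",
    country="DE",
    locale="de-DE",
    limit=100
)

print(f"Crawl ID: {result.crawl_id}")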

Usage

Basic crawl

Crawl a website using default settings:
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.crawl.run(url="https://www.nimbleway.com")

print(f"Crawl started with ID: {result.crawl_id}")
print(f"Status: {result.status}")

Filter with URL patterns

Use include and exclude patterns to control which URLs are crawled:
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.crawl.run(
    url="https://www.nimbleway.com",
    include_paths=["/blog/.*", "/use-cases/.*"],
    exclude_paths=[".*/careers/.*"],
    limit=500
)

print(f"Crawl ID: {result.crawl_id}")
print(f"Status: {result.status}")

Crawl entire domain

Allow the crawler to follow all internal links beyond the starting path:
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.crawl.run(
    url="https://www.nimbleway.com/blog",
    crawl_entire_domain=True,
    limit=2000
)

print(f"Crawl ID: {result.crawl_id}")
print(f"Status: {result.status}")

Crawl with extraction

Extract structured data from each page during the crawl:
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.crawl.run(
    url="https://www.nimbleway.com",
    limit=500,
    extract_options={
        "driver": "vx10",
        "parse": True,
        "formats": ["html", "markdown"]
    }
)

print(f"Crawl ID: {result.crawl_id}")
print(f"Status: {result.status}")

Combined parameters

Crawl with multiple parameters for precise control:
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.crawl.run(
    url="https://www.nimbleway.com",
    name="Nimble Website Crawl",
    sitemap="include",
    allow_subdomains=True,
    include_paths=["/use-cases/.*"],
    limit=1000,
    callback={
        "url": "https://your-server.com/webhook",
        "events": ["completed"]
    }
)

print(f"Crawl started: {result.crawl_id}")
print(f"Status: {result.status}")

Managing Crawls

1. List crawls

Get all your crawls, optionally filtered by status:
  • Available status filters: pending, in_progress, completed, failed, canceled

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

# List all crawls
my_crawls = nimble.crawl.list()

for crawl in my_crawls.data:
    print(f"Crawl {crawl.crawl_id}: {crawl.name} - {crawl.status}")
    
2. Get crawl status (by crawl_id)

Check progress and get the list of task IDs for a specific crawl:
# crawl_id comes from the create-crawl response (result.crawl_id)
my_crawl = nimble.crawl.status(crawl_id)

print(f"Status: {my_crawl.status}")
3. Get task results

Use the task_id from the crawl status response to fetch extracted content for each page:
# my_crawl.tasks (from step 2) contains the list of tasks from the status response
for task in my_crawl.tasks:
    if task.status == "completed":
        task_result = nimble.tasks.get(task.task_id)

        print(f"URL: {task_result['url']}")
        print(f"HTML length: {len(task_result['data'].get('html', ''))}")

Example task result:
{
    "url": "https://www.nimbleway.com/blog/post",
    "task_id": "ec89b1f7-1cf2-40eb-91b4-78716093f9ed",
    "status": "success",
    "task": {
        "id": "ec89b1f7-1cf2-40eb-91b4-78716093f9ed",
        "state": "success",
        "created_at": "2026-02-09T23:15:43.549Z",
        "modified_at": "2026-02-09T23:16:39.094Z",
        "account_name": "your-account"
    },
    "data": {
        "html": "<!DOCTYPE html>...",
        "markdown": "# Page Title\n\nContent...",
        "headers": { ... }
    },
    "metadata": {
        "query_time": "2026-02-09T23:15:43.549Z",
        "query_duration": 1877,
        "response_parameters": {
            "input_url": "https://www.nimbleway.com/blog/post"
        },
		"driver": "vx6"
    },
    "status_code": 200
}
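Because results persist after the crawl completes, you can also fetch them later and write them to disk. A short sketch that saves each page’s markdown, assuming the crawl was created with extract_options requesting the markdown format and the same response shape as above:
from pathlib import Path

out = Path("crawl_output")
out.mkdir(exist_ok=True)

for task in my_crawl.tasks:
    if task.status != "completed":
        continue
    task_result = nimble.tasks.get(task.task_id)
    markdown = task_result["data"].get("markdown")
    if markdown:
        # One file per page, named by task ID.
        (out / f"{task.task_id}.md").write_text(markdown)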
4. Cancel crawl

Stop a running crawl. Completed tasks remain available for result retrieval:
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

response = nimble.crawl.terminate(crawl_id)

print(response)  # {"status": "canceled"}

Response Fields

When you use Crawl, you receive:
  • Async operation - Crawl jobs run in the background, check status or receive webhooks
  • Progress tracking - Monitor total, pending, completed, and failed counts
  • Task-based results - Each page becomes a task with extractable content
  • Webhook support - Get notified in real-time as pages are processed

Create Crawl Response

Returns an immediate response with crawl job details:
{
    "crawl_id": "e3ca2ff1-b82a-472b-b1a9-ef4d29cc549f",
    "name": null,
    "url": "https://www.nimbleway.com",
    "status": "queued",
    "account_name": "your-account",
    "created_at": "2026-02-09T23:15:40.785Z",
    "updated_at": "2026-02-09T23:15:40.785Z",
    "completed_at": null,
    "crawl_options": {
        "sitemap": "include",
        "crawl_entire_domain": false,
        "limit": 10,
        "max_discovery_depth": 5,
        "ignore_query_parameters": false,
        "allow_external_links": false,
        "allow_subdomains": false
    },
    "extract_options": null
}
| Field | Type | Description |
| --- | --- | --- |
| crawl_id | string | Unique identifier for the crawl job |
| name | string | Optional name you assigned to the crawl |
| url | string | Starting URL for the crawl |
| status | string | queued, running, succeeded, failed, canceled |
| account_name | string | Your account identifier |
| created_at | string | Timestamp when crawl was created |
| updated_at | string | Timestamp of last status update |
| completed_at | string | Timestamp when crawl completed (null if in progress) |
| crawl_options | object | Configuration settings applied to this crawl |
| extract_options | object | Extraction settings (null if not configured) |

Get Crawl Status by ID Response

Returns the crawl object wrapped in a crawl key, with progress counters and task list:
{
    "crawl": {
        "crawl_id": "e3ca2ff1-b82a-472b-b1a9-ef4d29cc549f",
        "name": null,
        "url": "https://www.nimbleway.com",
        "status": "succeeded",
        "account_name": "your-account",
        "total": 10,
        "pending": 0,
        "completed": 10,
        "failed": 0,
        "created_at": "2026-02-09T23:15:40.785Z",
        "updated_at": "2026-02-09T23:17:08.083Z",
        "completed_at": "2026-02-09T23:17:08.079Z",
        "crawl_options": {
            "sitemap": "include",
            "crawl_entire_domain": false,
            "limit": 10,
            "max_discovery_depth": 5
        },
        "extract_options": null,
        "tasks": [
            {
                "task_id": "ec89b1f7-1cf2-40eb-91b4-78716093f9ed",
                "status": "completed",
                "updated_at": "2026-02-09T23:16:39.094Z",
                "created_at": "2026-02-09T23:15:43.549Z"
            },
            {
                "task_id": "3f6c136c-4bb5-44af-a21b-c8f1db708c2f",
                "status": "completed",
                "updated_at": "2026-02-09T23:16:45.033Z",
                "created_at": "2026-02-09T23:15:42.966Z"
            }
        ]
    }
}
| Field | Type | Description |
| --- | --- | --- |
| crawl.crawl_id | string | Unique identifier for the crawl job |
| crawl.status | string | Current crawl status |
| crawl.total | integer | Total URLs discovered |
| crawl.pending | integer | URLs waiting to be processed |
| crawl.completed | integer | Successfully processed URLs |
| crawl.failed | integer | Failed URL extractions |
| crawl.tasks | array | List of individual page tasks |
| crawl.tasks[].task_id | string | Task ID to use with GET /v1/tasks/{id}/results |
| crawl.tasks[].status | string | pending, processing, completed, failed |

SDK and API methods

| Method | Availability | Description |
| --- | --- | --- |
| nimble.crawl.run(url=..., ...) | Python SDK | Create a new crawl job |
| GET /v1/crawl | REST API | List all crawls with pagination |
| GET /v1/crawl/{crawl_id} | REST API | Get crawl status and task list |
| DELETE /v1/crawl/{crawl_id} | REST API | Stop a running crawl |
| GET /v1/tasks/{task_id}/results | REST API | Get extracted content for a page |
The Python SDK currently supports creating crawl jobs via nimble.crawl.run(). For crawl management operations (listing, status, cancellation) and retrieving task results, use the REST API endpoints listed above.
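For completeness, here is a sketch of the management endpoints over raw REST using requests; the endpoint paths come from the table above, while the base URL and the Authorization header scheme are placeholders and assumptions (confirm them for your account):
import requests

BASE = "https://api.nimbleway.com"  # placeholder base URL
HEADERS = {"Authorization": "Bearer YOUR-API-KEY"}  # assumed auth scheme

crawl_id = "e3ca2ff1-b82a-472b-b1a9-ef4d29cc549f"  # from the create-crawl response

# Get crawl status and its task list.
status = requests.get(f"{BASE}/v1/crawl/{crawl_id}", headers=HEADERS).json()

# Fetch extracted content for the first completed task.
for task in status["crawl"]["tasks"]:
    if task["status"] == "completed":
        result = requests.get(
            f"{BASE}/v1/tasks/{task['task_id']}/results", headers=HEADERS
        ).json()
        print(result["url"], result["status"])
        break

# Stop the crawl if it is still running.
requests.delete(f"{BASE}/v1/crawl/{crawl_id}", headers=HEADERS)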

Use cases

Full Site Data Collection

Extract data from hundreds or thousands of pages across an entire website

Product Catalog Scraping

Gather all product information from e-commerce sites automatically

Content Archiving

Create complete snapshots of websites for analysis or backup

Price Monitoring

Track pricing across entire catalogs over time

Real-world examples

Scenario: You need to gather all product information from a competitor’s online store.

How Crawl helps:
  • Discovers all product pages through sitemaps and navigation
  • Extracts product details, prices, and descriptions from each page
  • Handles pagination and category structures automatically
  • Filters out cart, checkout, and account pages

Result: Complete product catalog data without manual URL collection.

Scenario: You’re migrating a blog to a new platform and need all content.

How Crawl helps:
  • Finds all blog posts through sitemap and internal links
  • Extracts post content, metadata, and images
  • Excludes tag pages, author archives, and pagination
  • Preserves URL structure for redirects

Result: Complete content export ready for migration.

Scenario: You want to create an offline backup of documentation.

How Crawl helps:
  • Maps entire documentation structure
  • Extracts content from all pages
  • Maintains hierarchy and navigation structure
  • Captures code examples and technical content

Result: Complete documentation archive for offline access.

Scenario: You need to track competitor pricing across their entire catalog.

How Crawl helps:
  • Discovers all product pages automatically
  • Extracts pricing information from each page
  • Runs on a schedule to track changes over time
  • Handles dynamic pricing and regional variations

Result: Comprehensive price intelligence data.

Scenario: You’re auditing a website’s content for SEO optimization.

How Crawl helps:
  • Discovers all indexable pages
  • Extracts titles, meta descriptions, and headings
  • Identifies orphaned pages and broken links
  • Maps internal linking structure

Result: Complete SEO audit data for analysis.

Crawl vs Map

| Need | Use |
| --- | --- |
| Extract content from pages | Crawl |
| Deep link following | Crawl |
| Complex filtering patterns | Crawl |
| Webhook notifications | Crawl |
| Quick URL discovery only | Map - completes in seconds |
| URL list with titles/descriptions | Map |
Tip: Use Map first - Discover URLs quickly with Map, then use Crawl to extract content from the pages you need.
