Parsing Schema gives you full control over data extraction using a comprehensive parser syntax. Define exact data structures for predictable, low-cost extraction from HTML, JSON, XML, and network captures. Parsers are the complete recipe for processing web content into structured data. They combine:

selectors - identify elements
extractors - extract data from elements
post-processors (optional) - transform the output

When to use

Use parsing schema when you need:

Predictable extraction: Same selectors extract same data every time
Full control: Specify exact selector paths and data types
High volume: Process large datasets efficiently

Parsers may break when page structure or selectors change. Monitor source pages and update parsers as needed.

Parameters

parse - Required

parse

boolean

default:"false"

required

Must be set to true to enable parsing. This tells Nimble you want to extract structured data using the parser you define. When disabled, you’ll just get raw HTML without structured extraction.

Example:

"parse": true

parser

object

Your custom extraction recipe that defines exactly what data to pull from the page and how to structure it.

Parser structure:

Each field in your parser is a key-value pair where:

Key - The name of the field in your output (like "product_name" or "price")
Value - An object that describes how to extract that field

Every parser needs:

type - What kind of parser to use
- terminal - Extract a single value (like one price)
- terminal_list - Extract multiple values (like a list of image URLs)
- schema - Extract a nested object (like product details)
- schema_list - Extract a list of objects (like multiple products)
- or - Try multiple strategies, use first one that works
- and - Combine multiple extraction strategies
- const - Return a fixed value
selector - How to find the element on the page
- Use CSS selectors (.product-name, #price, etc.)
- Or XPath expressions and JSON paths for XML and JSON content
extractor - What data to grab from the element
- text - The text content
- attr - An attribute value (like href or src)
- json - Parse JSON data
- raw - The raw HTML
post_processor (optional) - Transform the data
- Convert to number, format dates, clean text, etc.

Think of it as: “Find THIS element, grab THIS data from it, and format it like THIS”

Simple example:

"parser": {
  "title": {
    "type": "terminal",
    "selector": {
      "type": "css",
      "css_selector": "h1.product-title"
    },
    "extractor": {
      "type": "text"
    }
  }
}

List example:

"parser": {
  "images": {
    "type": "terminal_list",
    "selector": {
      "type": "css",
      "css_selector": ".product-gallery img"
    },
    "extractor": {
      "type": "attr",
      "attr": "src"
    }
  }
}
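
Because every terminal field repeats the same selector and extractor nesting, a small helper can reduce boilerplate when you build parsers in Python. The following is a minimal sketch: `css_field` is a hypothetical helper name, not part of the Nimble SDK, and it only assembles the dictionary shape shown in the two examples above.

```python
from typing import Any, Dict, Optional


def css_field(css_selector: str, attr: Optional[str] = None, many: bool = False) -> Dict[str, Any]:
    """Build a terminal or terminal_list parser for a CSS selector (hypothetical helper)."""
    # Use the attr extractor when an attribute name is given, otherwise extract text.
    extractor: Dict[str, Any] = {"type": "attr", "attr": attr} if attr else {"type": "text"}
    return {
        "type": "terminal_list" if many else "terminal",
        "selector": {"type": "css", "css_selector": css_selector},
        "extractor": extractor,
    }


parser = {
    "title": css_field("h1.product-title"),
    "images": css_field(".product-gallery img", attr="src", many=True),
}
```

The resulting `parser` dictionary has the same structure as the hand-written examples, so it can be passed wherever a parser object is expected.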

Usage

Example parser structure:

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.extract(
    url="https://www.example.com/product",
    parse=True,
    parser={
        "product_name": {  # key - output field name
            "type": "terminal",  # parser type
            "selector": {  # how to find the element
                "type": "css",
                "css_selector": ".product-title"
            },
            "extractor": {  # what to extract
                "type": "text"
            }
        },
        "price": {
            "type": "terminal",
            "selector": {
                "type": "css",
                "css_selector": ".price-value"
            },
            "extractor": {
                "type": "text",
                "post_processor": {  # optional: transform the data
                    "type": "number"
                }
            }
        }
    }
)

print(result)

Example Output

{
  "status": "success",
  "data": {
    "parsing": {
      "product_name": "Wireless Headphones",
      "price": 79.99
    }
  }
}
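
Assuming the response is available as a dictionary shaped like the example output above, the parsed fields sit under `data.parsing`. The sketch below reads them from a literal sample instead of a live call; `sample_result` is an illustrative name, and the real SDK return type may differ.

```python
sample_result = {
    "status": "success",
    "data": {"parsing": {"product_name": "Wireless Headphones", "price": 79.99}},
}

# Use .get() so a missing key yields an empty dict instead of raising KeyError.
parsed = sample_result.get("data", {}).get("parsing", {})
name = parsed.get("product_name")
price = parsed.get("price")
print(f"{name}: {price:.2f}")
```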

Parser Types

Supported parsing types:

terminal - Returns a single terminal/literal as output.
terminal_list - Returns a list of literals instead of a single literal.
schema - Returns a dictionary/JSON according to its field parsers.
schema_list - Returns a list of dictionaries/JSONs instead of a single dictionary/JSON.
or - Tries a sequence of parsers and returns the result of the first parser that returns a non-null value
and - Runs a sequence of schema parsers and merges their results into a single output. All parsers execute on the same input, and results are combined (first non-null value wins for overlapping keys).
const - Always returns its value regardless of the input. Useful for adding static data to your output.

terminal

The most basic parser.

{
  "type": "terminal",
  "selector": { ... },
  "extractor": { ... }  // Defaults to raw extractor if not specified
}

Show Example: Extract product name

{
  "type": "terminal",
  "selector": {
    "type": "css",
    "css_selector": ".product-name"
  },
  "extractor": {
    "type": "text"
  }
}

terminal_list

Returns a list of literals instead of a single literal.

{
  "type": "terminal_list",
  "selector": { ... },
  "extractor": { ... }  // Defaults to raw extractor if not specified
}

Show Example: Extract all image URLs

{
  "type": "terminal_list",
  "selector": {
    "type": "css",
    "css_selector": ".product-gallery img"
  },
  "extractor": {
    "type": "attr",
    "attr": "src"
  }
}

schema

This is the most commonly used parser for structured data extraction.

{
  "type": "schema",
  "selector": { ... },  // Optional
  "fields": {
    "field_name": { /* Parser */ }
  }
}

Show Example: Extract product details

{
  "type": "schema",
  "selector": {
    "type": "css",
    "css_selector": ".product-card"
  },
  "fields": {
    "name": {
      "type": "terminal",
      "selector": {
        "type": "css",
        "css_selector": ".product-name"
      },
      "extractor": {
        "type": "text"
      }
    },
    "price": {
      "type": "terminal",
      "selector": {
        "type": "css",
        "css_selector": ".price"
      },
      "extractor": {
        "type": "text"
      }
    }
  }
}

schema_list

Returns a list of dictionaries/JSONs instead of a single dictionary/JSON.

{
  "type": "schema_list",
  "selector": { ... },  // Optional
  "fields": {
    "field_name": { /* Parser */ }
  }
}

Show Example: With position tracking

The optional position attribute adds an index field to each item in the output list.

{
  "type": "schema_list",
  "selector": { ... },
  "fields": {
    "field_name": { /* Parser */ }
  },
  "position": {
    "field_name": "index",
    "start_from": 1  // Optional, defaults to 0
  }
}
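
The effect of the position attribute can be pictured in plain Python. This is only a local illustration of the behavior described above, not Nimble's implementation; `add_position` is a hypothetical name.

```python
from typing import Any, Dict, List


def add_position(items: List[Dict[str, Any]], field_name: str, start_from: int = 0) -> List[Dict[str, Any]]:
    """Return copies of the items with an index field added, counting from start_from."""
    return [{**item, field_name: i} for i, item in enumerate(items, start=start_from)]


products = [{"name": "A"}, {"name": "B"}]
numbered = add_position(products, "index", start_from=1)
print(numbered)
```

With `start_from` set to 1, the first item gets index 1 and the second gets index 2.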

Show Example: Extract product list

{
  "type": "schema_list",
  "selector": {
    "type": "css",
    "css_selector": ".product-item"
  },
  "fields": {
    "name": {
      "type": "terminal",
      "selector": {
        "type": "css",
        "css_selector": ".product-name"
      },
      "extractor": {
        "type": "text"
      }
    },
    "price": {
      "type": "terminal",
      "selector": {
        "type": "css",
        "css_selector": ".price"
      },
      "extractor": {
        "type": "text"
      }
    }
  }
}

or

Useful for handling variations in page structure.

{
  "type": "or",
  "parsers": [
    {
      /* Parser 1 */
    },
    {
      /* Parser 2 */
    },
    {
      /* Parser 3 */
    }
  ]
}

Show Example: Try multiple price selectors

{
  "type": "or",
  "parsers": [
    {
      "type": "terminal",
      "selector": {
        "type": "css",
        "css_selector": ".sale-price"
      },
      "extractor": {
        "type": "text"
      }
    },
    {
      "type": "terminal",
      "selector": {
        "type": "css",
        "css_selector": ".regular-price"
      },
      "extractor": {
        "type": "text"
      }
    }
  ]
}
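
The "first non-null result wins" rule can be sketched locally as follows. The code is an illustration under the assumption that a failed strategy yields `None`; the `page` dictionary is a stand-in for a real document, and `first_non_null` is a hypothetical name.

```python
from typing import Any, Callable, Iterable, Optional


def first_non_null(strategies: Iterable[Callable[[], Optional[Any]]]) -> Optional[Any]:
    """Run strategies in order and return the first result that is not None."""
    for strategy in strategies:
        result = strategy()
        if result is not None:
            return result
    return None


# Stand-in for a page that has a regular price but no sale price.
page = {".regular-price": "$24.00"}

price = first_non_null([
    lambda: page.get(".sale-price"),
    lambda: page.get(".regular-price"),
])
```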

and

Runs a sequence of schema parsers and merges their results into a single output. All parsers execute on the same input, and results are combined (first non-null value wins for overlapping keys).

{
  "type": "and",
  "parsers": [
    {
      /* Schema Parser 1 */
    },
    {
      /* Schema Parser 2 */
    }
  ]
}
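
The merge rule (first non-null value wins for overlapping keys) can be sketched as below. This is a plain-Python illustration of the described behavior rather than the actual implementation; `merge_first_non_null` is a hypothetical name.

```python
from typing import Any, Dict, List


def merge_first_non_null(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge dictionaries in order, keeping the first non-null value for each key."""
    merged: Dict[str, Any] = {}
    for result in results:
        for key, value in result.items():
            # Only fill a key that is still missing or currently null.
            if merged.get(key) is None:
                merged[key] = value
    return merged


merged = merge_first_non_null([
    {"name": "Headphones", "price": None},
    {"name": "Other name", "price": 79.99},
])
```

Here `name` comes from the first parser and `price` from the second, because the first parser returned null for it.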

const

Always returns its value regardless of the input. Useful for adding static data to your output.

{
  "type": "const",
  "value": "some_value"
}

Parsing Selectors

Selectors identify elements (HTML, JSON, XML, Network) in the input web page. Supported selectors:

css - Selects elements matching a CSS selector.
xpath - Enables powerful element selection using XPath expressions. Particularly useful for XML documents like RSS feeds and sitemaps.
json - Extracts JSON elements from the page. All subsequent selectors and extractors receive JSON instead of HTML.
sequence - Combines multiple selectors in sequence. Useful for chaining different selector types.
parent - Traverses up the DOM tree (for HTML) or context hierarchy (for JSON). Useful when you need to select a parent element after finding a specific child.
root - Returns the original page (document). Often used with JSON selector to access fields like network_capture, url, or html.

css

Selects elements matching a CSS selector.

{
  "type": "css",
  "css_selector": "div.price"
}

Show Examples:

// Select by class
{ "type": "css", "css_selector": ".product-name" }

// Select by ID
{ "type": "css", "css_selector": "#main-content" }

// Select by attribute
{ "type": "css", "css_selector": "a[data-product-id]" }

// Complex selector
{ "type": "css", "css_selector": "div.product > h2.title" }

xpath

Enables powerful element selection using XPath expressions. Particularly useful for XML documents like RSS feeds and sitemaps.

{
  "type": "xpath",
  "path": "//book[@category='fiction']"
}

Show Common XPath patterns:

//element - Select all elements with the given name
/root/child - Select child elements of root
//element[@attr='value'] - Select by attribute value
//element[position()=1] - Select first element
//*[local-name()='element'] - Select ignoring namespaces

Show Example: Parse XML sitemap

{
  "type": "terminal_list",
  "selector": {
    "type": "xpath",
    "path": "//*[local-name()='loc']"
  },
  "extractor": {
    "type": "text"
  }
}
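
To see why the sitemap example matches on the local name, it helps to try the same idea locally. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and has no `local-name()` function, so it compares local names by hand. The sample sitemap is made up for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>"""

root = ET.fromstring(SITEMAP)

# Namespaced tags look like "{namespace}loc"; keep only the part after "}".
locs = [
    (element.text or "").strip()
    for element in root.iter()
    if element.tag.rsplit("}", 1)[-1] == "loc"
]
print(locs)
```

A plain `//loc` lookup would miss these elements because they live in the sitemap namespace.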

Show Example: Select nested RSS elements

{
  "type": "sequence",
  "sequence": [
    {
      "type": "xpath",
      "path": "//item"
    },
    {
      "type": "xpath",
      "path": ".//title"
    }
  ]
}

json

Extracts JSON elements from the page. All subsequent selectors and extractors receive JSON instead of HTML.

{
  "type": "json",
  "path": "nested.keys.in.the.json" // jsonpath
}

Show With coercion filter:

The coercion_filter field provides advanced control when dealing with multiple JSON objects. It uses JSONPath expressions to filter specific JSON objects.

{
  "type": "json",
  "coercion_filter": "$[1]", // Get the second JSON object
  "path": "nested.keys"
}

Show Example: Extract specific JSON by type

{
  "type": "json",
  "coercion_filter": "$[?(@.type=='product')]",
  "path": "name"
}

Show Example: Extract from embedded JSON in script tag

{
  "type": "terminal",
  "selector": {
    "type": "sequence",
    "sequence": [
      {
        "type": "css",
        "css_selector": "script[type='application/ld+json']"
      },
      {
        "type": "json",
        "path": "$.offers.price"
      }
    ]
  },
  "extractor": {
    "type": "raw"
  }
}
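
The same two-step idea (find the script element, then read a path from its JSON) can be reproduced with the Python standard library to check a path before putting it in a parser. This is a local sketch, not how Nimble processes the page; `LdJsonCollector` and the sample HTML are illustrative.

```python
import json
from html.parser import HTMLParser


class LdJsonCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_ld_json = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ld_json = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append each one.
        if self._in_ld_json:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ld_json = False


HTML = """<html><head>
<script type="application/ld+json">{"@type": "Product", "offers": {"price": "17.31"}}</script>
</head><body></body></html>"""

collector = LdJsonCollector()
collector.feed(HTML)
product = json.loads(collector.blocks[0])
price = product["offers"]["price"]  # same target as the path $.offers.price
```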

sequence

Combines multiple selectors in sequence. Useful for chaining different selector types.

{
  "type": "sequence",
  "sequence": [
    {
      /* Selector 1 */
    },
    {
      /* Selector 2 */
    },
    {
      /* Selector 3 */
    }
  ]
}

Show Example: Select JSON from HTML element

{
  "type": "sequence",
  "sequence": [
    {
      "type": "css",
      "css_selector": "script#product-data"
    },
    {
      "type": "json",
      "path": "$.product"
    }
  ]
}

parent

Traverses up the DOM tree (for HTML) or context hierarchy (for JSON). Useful when you need to select a parent element after finding a specific child.

{
  "type": "parent",
  "times": 1 // Number of levels to traverse (default: 1)
}

Show Example: Find parent container

{
  "type": "sequence",
  "sequence": [
    {
      "type": "css",
      "css_selector": "span.price"
    },
    {
      "type": "parent",
      "times": 2
    }
  ]
}
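
Climbing two levels from a matched child can be tried locally as follows. The sketch uses `xml.etree.ElementTree` on a small well-formed snippet; since ElementTree keeps no parent pointers, it builds a child-to-parent lookup first. The markup and the `climb` helper are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

SNIPPET = '<div class="card"><div class="pricing"><span class="price">$5</span></div></div>'
root = ET.fromstring(SNIPPET)

# Map every child element to its parent so we can walk upward.
parent_of = {child: parent for parent in root.iter() for child in parent}


def climb(element, times=1):
    """Return the ancestor that is `times` levels above the element."""
    for _ in range(times):
        element = parent_of[element]
    return element


price_span = root.find(".//span[@class='price']")
container = climb(price_span, times=2)
```

With `times` set to 2, the result is the outer `card` element rather than the immediate `pricing` wrapper.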

root

Returns the original page (document). Often used with JSON selector to access fields like network_capture, url, or html.

{
  "type": "root"
}

Show Example: Access page URL

{
  "type": "terminal",
  "selector": {
    "type": "sequence",
    "sequence": [
      {
        "type": "root"
      },
      {
        "type": "json",
        "path": "$.url"
      }
    ]
  },
  "extractor": {
    "type": "raw"
  }
}

Show Example: Parse network capture data

{
  "type": "terminal",
  "selector": {
    "type": "sequence",
    "sequence": [
      {
        "type": "root"
      },
      {
        "type": "json",
        "path": "$.network_capture"
      }
    ]
  },
  "extractor": {
    "type": "raw"
  }
}

Parsing Extractors

Extractors specify what data to extract from the selected element. Supported extractors:

text - Extracts the text content of the element. Works with both HTML (CSS selectors) and XML (XPath selectors).
- strip (optional, boolean) - If set to false, leading and trailing whitespace is preserved in the text. Default is true.
- separator (optional, string) - Specifies a separator string to use when joining text from different child elements. When extracting text from nested HTML elements, this separator will be inserted between text from different elements. If not specified, text from different elements is concatenated without a separator.
attr - Extracts an attribute value from the element. Works with both HTML and XML elements. Common attributes:
- href - Links
- src - Images, scripts
- data-* - Custom data attributes
- class - CSS classes
- id - Element IDs
json - Extracts JSON content using JSONPath.
raw - Extracts an element as-is without coercion. JSON stays as JSON, strings stay as strings. Useful for advanced parsing with complex JSON selectors.

If no extractor is specified, the raw extractor is used by default.

text

Extracts the text content of the element. This extractor works with both HTML elements (from CSS selectors) and XML elements (from XPath selectors). You can use strip=false to keep leading and trailing whitespace characters. The default is to remove them.

The text extractor supports both HTML elements (BeautifulSoup Tag) from CSS selectors and XML elements (lxml Element) from XPath selectors. This allows you to use the same extractor regardless of whether you’re parsing HTML or XML documents.

Basic usage:

{
  "type": "text",
  "regex": "string" // Optional (deprecated), will return 1st match
}

Show With strip parameter:

{
  "type": "text",
  "regex": "string", // Optional (deprecated), will return 1st match
  "strip": false
}

Show With separator parameter:

{
  "type": "text",
  "regex": "string", // Optional (deprecated), will return 1st match
  "separator": " | "
}
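
The difference between the default concatenation, a separator, and `strip` can be approximated locally. This sketch uses `xml.etree.ElementTree` on made-up markup and only mirrors the behavior described above; the exact whitespace handling of the real extractor may differ.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring("<div>  <span>Blue</span><span>Large</span>  </div>")

# Text pieces from the element and its children, skipping whitespace-only pieces.
parts = [text.strip() for text in element.itertext() if text.strip()]

joined_default = "".join(parts)       # no separator: pieces are concatenated
joined_sep = " | ".join(parts)        # separator inserted between pieces
raw_text = "".join(element.itertext())  # whitespace kept, similar to strip=false
```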

Show Example with XPath Selector

{
  "type": "terminal",
  "selector": {
    "type": "xpath",
    "path": "//book/title"
  },
  "extractor": {
    "type": "text",
    "strip": true
  }
}

attr

Extracts an attribute value from the element. Works with both HTML and XML elements. Common attributes: href, src, data-*, class, id

{
  "type": "attr",
  "attr": "href"
}

Show Example: Extract image URL

{
  "type": "terminal",
  "selector": {
    "type": "css",
    "css_selector": "img.product-image"
  },
  "extractor": {
    "type": "attr",
    "attr": "src"
  }
}

json

Extracts JSON content using JSONPath.

{
  "type": "json",
  "path": "nested.keys.in.the.json"
}

raw

Extracts an element as-is without coercion. JSON stays as JSON, strings stay as strings. Useful for advanced parsing with complex JSON selectors.

{
  "type": "raw"
}

Parsing Post Processors

Post processors transform extractor output. Define them in the extractor’s post_processor field. Supported post processors:

url - Converts relative URLs to absolute URLs based on the page origin.
regex - Transforms output using a regular expression. The optional group parameter extracts a specific capturing group (defaults to 0).
format - Formats input into a string using Python’s str.format, where the input is available as {data}.
date - Formats dates to ISO format or custom format.
boolean - Transforms output to boolean based on conditions: contains, exists, or regex. Use not: true to reverse the result.
number - Coerces output to a number (int or float). Handles formatted numbers like "1.5M" → 1500000 or "2,100" → 2100.
country - Converts country names to country codes.
sequence - Applies multiple post processors in sequence.

url

Converts relative URLs to absolute URLs based on the page origin.

{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "url"
    }
  }
}

Show Example:

Input: "/news/article" - Output: "https://www.example.com/news/article"
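
For comparison, Python's standard `urllib.parse.urljoin` resolves relative URLs in the same spirit. This is a local illustration, not the post processor itself; the page URL is an assumed example.

```python
from urllib.parse import urljoin

page_url = "https://www.example.com/section/page.html"

root_relative = urljoin(page_url, "/news/article")
path_relative = urljoin(page_url, "photo.jpg")
already_absolute = urljoin(page_url, "https://cdn.example.com/a.png")
```

A root-relative path is resolved against the origin, a bare filename against the page's directory, and an absolute URL is returned unchanged.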

regex

Transforms output using a regular expression. The optional group parameter extracts a specific capturing group (defaults to 0).

{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "regex",
      "regex": "\\d+",
      "group": 0 // Optional
    }
  }
}

Show Example: Extract numbers

{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "regex",
      "regex": "\\d+\\.\\d+"
    }
  }
}

Show Example: Extract capturing group

{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "regex",
      "regex": "Price: (\\d+)\\.(\\d+)",
      "group": 1
    }
  }
}
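
The meaning of `group` can be checked with Python's `re` module, which uses the same numbering: group 0 is the whole match and groups 1 and up are the capturing groups. The input string is a made-up example.

```python
import re

match = re.search(r"Price: (\d+)\.(\d+)", "Price: 50.25 USD")

whole = match.group(0)    # entire match
dollars = match.group(1)  # first capturing group
cents = match.group(2)    # second capturing group
```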

format

Formats input into a string using Python’s str.format, where the input is available as {data}.

{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "format",
      "format": "${data}"
    }
  }
}

Show Example:

- Input: 5.00 - Output: “$5.00”
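
Since the template follows Python's `str.format`, it can be tried directly in a Python shell. The sketch assumes the extracted value arrives as the string "5.00"; literal braces in a template must be doubled.

```python
template = "${data}"
formatted = template.format(data="5.00")

# Literal braces are written as {{ and }} in str.format templates.
braced = "{{{data}}}".format(data="5.00")
```

Note that a float input would be rendered with Python's default formatting, so `5.00` as a number becomes "$5.0".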

date

Formats dates to ISO format or custom format.

{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "date",
      "format": "%d/%m/%y" // Optional, defaults to ISO format
    }
  }
}

Show Example:

Input: "5 days ago" - Output (no format): "2024-07-29T00:00:00" - Output (with format %d/%m/%y): "29/07/24"
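
Assuming the format string uses strftime-style directives (as the `%d/%m/%y` example suggests), the outputs can be previewed with Python's `datetime`. The reference date of 3 August 2024 is an assumption chosen so that "5 days ago" lands on 29 July 2024.

```python
from datetime import datetime, timedelta

now = datetime(2024, 8, 3)          # assumed "current" date for the relative input
published = now - timedelta(days=5)

iso = published.isoformat()                    # default ISO output
two_digit_year = published.strftime("%d/%m/%y")   # %y is a two-digit year
four_digit_year = published.strftime("%d/%m/%Y")  # %Y is a four-digit year
```

Use `%Y` rather than `%y` when the output should carry the full year.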

boolean

Transforms output to boolean based on conditions: contains, exists, or regex. Use not: true to reverse the result.

Show Contains condition:

{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "boolean",
      "condition": "contains",
      "contains": "InStock"
    }
  }
}

Show Exists condition:

{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "boolean",
      "condition": "exists"
    }
  }
}

Show Regex condition:

{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "boolean",
      "condition": "regex",
      "regex": "\\d+"
    }
  }
}

Show With negation:

{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "boolean",
      "condition": "contains",
      "contains": "OutOfStock",
      "not": true
    }
  }
}
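
The three conditions and the negation flag can be modeled in a few lines of Python. This is a sketch of the described behavior only: `to_boolean` is a hypothetical name, and it assumes "exists" means the extracted value is not null.

```python
import re
from typing import Optional


def to_boolean(value: Optional[str], condition: str, contains: str = "",
               regex: str = "", negate: bool = False) -> bool:
    """Evaluate a contains / exists / regex condition, optionally negated."""
    if condition == "exists":
        result = value is not None
    elif condition == "contains":
        result = value is not None and contains in value
    elif condition == "regex":
        result = value is not None and re.search(regex, value) is not None
    else:
        raise ValueError(f"unknown condition: {condition}")
    return not result if negate else result


availability = "https://schema.org/InStock"
in_stock = to_boolean(availability, "contains", contains="InStock")
not_sold_out = to_boolean(availability, "contains", contains="OutOfStock", negate=True)
```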

number

Coerces output to a number (int or float). Handles formatted numbers like "1.5M" → 1500000 or "2,100" → 2100.

{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "number",
      "locale": "en", // Optional: locale for number parsing
      "force_type": "float" // Optional: "int" or "float"
    }
  }
}

Show Example with German locale:

{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "number",
      "locale": "de"
    }
  }
}

Input: "1.000,50" → Output: 1000.50
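
Locale matters because the roles of "." and "," are swapped between English and German number formats. The sketch below shows the idea with explicit separators; `parse_number` is a hypothetical helper and does not cover suffixes such as "M".

```python
def parse_number(text: str, decimal_sep: str = ".", group_sep: str = ",") -> float:
    """Parse a formatted number given its decimal and grouping separators."""
    # Drop grouping separators first, then normalize the decimal separator.
    cleaned = text.strip().replace(group_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


german = parse_number("1.000,50", decimal_sep=",", group_sep=".")
english = parse_number("2,100")
```

Parsing "1.000,50" with English separators would fail, which is why the locale has to match the source page.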

country

Converts country names to country codes.

{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "country"
    }
  }
}

Show Example:

Input: "United States" - Output: "US"

sequence

Applies multiple post processors in sequence.

{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "sequence",
      "sequence": [
        {
          "type": "regex",
          "regex": "\\d+\\.\\d+"
        },
        {
          "type": "number"
        }
      ]
    }
  }
}

Show Example:

Input: "The price is $50.25!" - After regex: "50.25" - After number: 50.25
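
Chaining works like function composition: each step receives the previous step's output. The sketch below reproduces the regex-then-number example locally; the helper names are illustrative and not part of any API.

```python
import re
from functools import reduce
from typing import Optional


def extract_decimal(value: Optional[str]) -> Optional[str]:
    """Return the first decimal number found in the text, or None."""
    if value is None:
        return None
    match = re.search(r"\d+\.\d+", value)
    return match.group(0) if match else None


def to_number(value: Optional[str]) -> Optional[float]:
    """Convert a numeric string to float, passing None through."""
    return float(value) if value is not None else None


steps = [extract_decimal, to_number]
result = reduce(lambda acc, step: step(acc), steps, "The price is $50.25!")
```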

Complete Examples

Parsing a BBC News Article

This example demonstrates parsing a complete BBC news article about a three-legged cat, showing how to extract structured data from HTML using various parser types, selectors, and extractors.

Show Complete Walkthrough

Target URL: https://www.bbc.com/news/articles/cervlxymly2o

Target Schema:

{
  "url": "string",
  "title": "string",
  "date": "string",
  "author": {
    "name": "string",
    "organization": "string"
  },
  "images": ["string"],
  "paragraphs": ["string"]
}

Field-by-Field Breakdown

Show Parsing: url

The URL is the canonical link for the page, typically found in the HTML head tag under a link element with rel="canonical".

HTML Structure:

<head>
  <link rel="canonical" href="https://www.bbc.com/news/articles/cervlxymly2o" />
</head>

Parser:

{
  "type": "terminal",
  "description": "Extracts the canonical URL from the page's head section",
  "selector": {
    "type": "css",
    "css_selector": "link[rel='canonical']"
  },
  "extractor": {
    "type": "attr",
    "attr": "href"
  }
}

Explanation:

Selector: link[rel='canonical'] selects the first link element with rel="canonical"
Extractor: attr with href extracts the URL from the href attribute

Show Parsing: title

The article title is contained in an h1 element within a headline block.

HTML Structure:

<article>
  <div data-component="headline-block" class="sc-18fde0d6-0 eeiVGB">
    <h1 class="sc-518485e5-0 bWszMR">
      Three-legged cat 'brings town together'
    </h1>
  </div>
</article>

Parser:

{
  "type": "terminal",
  "description": "Main headline of the article",
  "selector": {
    "type": "css",
    "css_selector": "div[data-component='headline-block'] h1"
  },
  "extractor": {
    "type": "text"
  }
}

Explanation:

Selector: div[data-component='headline-block'] h1 targets the h1 inside the headline block
Extractor: text extracts the text content from the element

Show Parsing: date

The publication date is in a time element and needs to be formatted to ISO format.

HTML Structure:

<div data-testid="byline-new" class="sc-2b5e3b35-0 dWFSHg">
  <div class="sc-2b5e3b35-1 jTEdni">
    <time class="sc-2b5e3b35-2 fkLXLN">29 July 2024</time>
  </div>
</div>

Parser:

{
  "type": "terminal",
  "description": "Publication date in ISO format",
  "selector": {
    "type": "css",
    "css_selector": "div[data-testid='byline-new'] time"
  },
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "date"
    }
  }
}

Explanation:

Selector: div[data-testid='byline-new'] time targets the time element
Extractor: text with date post-processor converts “29 July 2024” to “2024-07-29T00:00:00”

This field requires JavaScript rendering. Add render options to your API request to wait for this element.

Show Parsing: author (nested schema)

The author information contains both name and organization, requiring a nested schema parser.

HTML Structure:

<div data-testid="byline-new-contributors" class="sc-2b5e3b35-12 bRrXa-D">
  <div class="sc-2b5e3b35-5 bpnWmT">
    <div>
      <span class="sc-2b5e3b35-7 bZCrck">Martin Heath</span>
      <div class="sc-2b5e3b35-8 gxaSLA">
        <span>BBC News, Northamptonshire</span>
      </div>
    </div>
  </div>
</div>

Parser:

{
  "type": "schema",
  "description": "Author information with name and organization",
  "selector": {
    "type": "css",
    "css_selector": "div[data-testid='byline-new-contributors']"
  },
  "fields": {
    "name": {
      "type": "terminal",
      "selector": {
        "type": "css",
        "css_selector": "span[class]"
      },
      "extractor": {
        "type": "text"
      }
    },
    "organization": {
      "type": "terminal",
      "selector": {
        "type": "css",
        "css_selector": "span:not([class])"
      },
      "extractor": {
        "type": "text"
      }
    }
  }
}

Explanation:

Schema Parser: Returns a nested object with multiple fields
Selector Nesting: Parent selector scopes child selectors to the byline-new-contributors div
Name Selector: span[class] selects spans with a class attribute
Organization Selector: span:not([class]) selects spans without a class attribute

Show Parsing: images (list)

Extract all image URLs from the article, converting relative URLs to absolute.

HTML Structure:

<article>
  <img src="/news/480/cpsprodpb/2a87/live/321fae30.jpg.webp" />
  <img src="/news/480/cpsprodpb/a8c2/live/904194b0.jpg.webp" />
  <!-- more images -->
</article>

Parser:

{
  "type": "terminal_list",
  "description": "All article images with absolute URLs",
  "selector": {
    "type": "css",
    "css_selector": "article img"
  },
  "extractor": {
    "type": "attr",
    "attr": "src",
    "post_processor": {
      "type": "url"
    }
  }
}

Explanation:

terminal_list: Returns an array of values instead of a single value
Selector: article img selects all img elements within article
Extractor: attr with src gets the image source
Post-processor: url converts relative URLs to absolute (e.g., /news/... → https://www.bbc.com/news/...)

Show Parsing: paragraphs (list)

Extract all article paragraphs as an array of strings.

HTML Structure:

<article>
  <div data-component="text-block" class="sc-18fde0d6-0 dlWCEZ">
    <p class="sc-eb7bd5f6-0 fYAfXe">
      A three-legged cat has captured a town's imagination...
    </p>
    <p class="sc-eb7bd5f6-0 fYAfXe">
      The people of Daventry, Northamptonshire, love taking photographs...
    </p>
    <!-- more paragraphs -->
  </div>
</article>

Parser:

{
  "type": "terminal_list",
  "description": "Article content paragraphs",
  "selector": {
    "type": "css",
    "css_selector": "article div[data-component='text-block'] p"
  },
  "extractor": {
    "type": "text"
  }
}

Explanation:

terminal_list: Returns an array of paragraph texts
Selector: article div[data-component='text-block'] p selects all p elements in text blocks
Extractor: text extracts the text content from each paragraph

Complete Parser

{
  "type": "schema",
  "description": "Parses a BBC news article into structured data",
  "fields": {
    "url": {
      "type": "terminal",
      "description": "Canonical URL of the article",
      "selector": {
        "type": "css",
        "css_selector": "link[rel='canonical']"
      },
      "extractor": {
        "type": "attr",
        "attr": "href"
      }
    },
    "title": {
      "type": "terminal",
      "description": "Main headline of the article",
      "selector": {
        "type": "css",
        "css_selector": "div[data-component='headline-block'] h1"
      },
      "extractor": {
        "type": "text"
      }
    },
    "date": {
      "type": "terminal",
      "description": "Publication date in ISO format",
      "selector": {
        "type": "css",
        "css_selector": "div[data-testid='byline-new'] time"
      },
      "extractor": {
        "type": "text",
        "post_processor": {
          "type": "date"
        }
      }
    },
    "author": {
      "type": "schema",
      "description": "Author information",
      "selector": {
        "type": "css",
        "css_selector": "div[data-testid='byline-new-contributors']"
      },
      "fields": {
        "name": {
          "type": "terminal",
          "selector": {
            "type": "css",
            "css_selector": "span[class]"
          },
          "extractor": {
            "type": "text"
          }
        },
        "organization": {
          "type": "terminal",
          "selector": {
            "type": "css",
            "css_selector": "span:not([class])"
          },
          "extractor": {
            "type": "text"
          }
        }
      }
    },
    "images": {
      "type": "terminal_list",
      "description": "All article images",
      "selector": {
        "type": "css",
        "css_selector": "article img"
      },
      "extractor": {
        "type": "attr",
        "attr": "src",
        "post_processor": {
          "type": "url"
        }
      }
    },
    "paragraphs": {
      "type": "terminal_list",
      "description": "Article content paragraphs",
      "selector": {
        "type": "css",
        "css_selector": "article div[data-component='text-block'] p"
      },
      "extractor": {
        "type": "text"
      }
    }
  }
}

Example Output

{
  "url": "https://www.bbc.com/news/articles/cervlxymly2o",
  "title": "Three-legged cat 'brings town together'",
  "date": "2024-07-29T00:00:00",
  "author": {
    "name": "Martin Heath",
    "organization": "BBC News, Northamptonshire"
  },
  "images": [
    "https://ichef.bbci.co.uk/news/480/cpsprodpb/2a87/live/321fae30-4c01-11ef-b2d2-cdb23d5d7c5b.jpg.webp",
    "https://ichef.bbci.co.uk/news/480/cpsprodpb/a8c2/live/904194b0-4c01-11ef-b2d2-cdb23d5d7c5b.jpg.webp",
    "https://ichef.bbci.co.uk/news/480/cpsprodpb/7579/live/9ecae4f0-4c01-11ef-aebc-6de4d31bf5cd.jpg.webp"
  ],
  "paragraphs": [
    "A three-legged cat has captured a town's imagination with his appearances in shops and offices.",
    "The people of Daventry, Northamptonshire, love taking photographs of the 14-year-old feline and documenting his travels on social media.",
    "Funds have been raised to buy a street sign with his name on it, and souvenir Salem T-shirts could follow."
  ]
}

API Request Example

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.extract(
    url="https://www.bbc.com/news/articles/cervlxymly2o",
    parse=True,
    render=True,
    render_flow=[{
        "wait_for": {
            "selectors": ["div[data-testid='byline-new'] time"]
        }
    }],
    parser={
        "type": "schema",
        "description": "Parses a BBC news article into structured data",
        "fields": {
            # ... (full parser structure as shown above)
        }
    }
)

print(result)

This parser can be reused for any BBC news article following the same structure - just change the URL!
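
One way to reuse the parser is to keep the request arguments in a small function and vary only the URL. The sketch below builds the keyword arguments without calling the API; `build_request` is a hypothetical helper, and the argument names are the ones used in the request example above.

```python
from typing import Any, Dict


def build_request(url: str, parser: Dict[str, Any],
                  wait_selector: str = "div[data-testid='byline-new'] time") -> Dict[str, Any]:
    """Assemble extract() keyword arguments for one article URL (hypothetical helper)."""
    return {
        "url": url,
        "parse": True,
        "render": True,
        "render_flow": [{"wait_for": {"selectors": [wait_selector]}}],
        "parser": parser,
    }


# Placeholder parser; in practice this is the complete parser shown earlier.
article_parser = {"type": "schema", "fields": {}}

request = build_request("https://www.bbc.com/news/articles/cervlxymly2o", article_parser)
# result = nimble.extract(**request)  # requires a configured Nimble client
```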

Parsing Embedded JSON from Etsy.com Product Page

This example demonstrates parsing structured data from embedded JSON-LD (JSON for Linked Data) within an HTML page. Many websites embed JSON-LD in their HTML to help search engines understand their content - we can leverage this for easier, more reliable parsing.

What is LD+JSON? JSON-LD (JSON for Linked Data) is a format for structuring data in a machine-readable way. It’s often embedded in webpages using <script type="application/ld+json"> tags to provide search engines with detailed information about products, articles, events, and more.

Show Complete Walkthrough

Target URL: https://www.etsy.com/il-en/listing/1487833925/crochet-pattern-flower-cat-hat-crochet

Finding Embedded JSONs: Search for the keyword "json" in the page’s source code. Look for <script type="application/ld+json"> tags.

The Embedded JSON

The Etsy product page contains this LD+JSON (simplified):

{
  "@type": "Product",
  "@context": "https://schema.org",
  "url": "https://www.etsy.com/il-en/listing/1487833925/...",
  "name": "CROCHET PATTERN - Flower Cat Hat...",
  "sku": "1487833925",
  "description": "***This is a CROCHET PATTERN...",
  "brand": {
    "@type": "Brand",
    "name": "HatsonCatsCrochet"
  },
  "image": [
    {
      "@type": "ImageObject",
      "contentURL": "https://i.etsystatic.com/.../il_fullxfull.jpg",
      "thumbnail": "https://i.etsystatic.com/.../il_340x270.jpg"
    }
  ],
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.9",
    "reviewCount": 233
  },
  "offers": {
    "@type": "Offer",
    "price": "17.31",
    "priceCurrency": "ILS",
    "availability": "https://schema.org/InStock"
  }
}

Target Schema:

{
  "url": "string",
  "name": "string",
  "brand": "string",
  "category": "string",
  "sku": "string",
  "description": "string",
  "price": "number",
  "currency": "string",
  "image_urls": ["string"],
  "rating_score": "number",
  "rating_count": "number",
  "is_available": "boolean"
}

Selecting the JSON

First, we need to select the embedded JSON script element:

{
  "type": "schema",
  "selector": {
    "type": "css",
    "css_selector": "script[type='application/ld+json']:-soup-contains(Product)"
  },
  "fields": { ... }
}

The :-soup-contains(Product) pseudo-class narrows the match to the script whose text contains "Product". This picks out the block with "@type": "Product" when multiple LD+JSON scripts exist on the page.
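As a rough local illustration of what this selection step does (not Nimble's implementation), the sketch below uses only Python's standard library to collect every LD+JSON script and keep the first one whose text mentions Product. The LdJsonCollector class, the select_product_ld_json helper, and the sample HTML are hypothetical and exist only for this example.

```python
import json
from html.parser import HTMLParser


class LdJsonCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_ld_json = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ld_json = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script content can arrive in several chunks, so accumulate it.
        if self._in_ld_json:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ld_json = False


def select_product_ld_json(html):
    """Return the first LD+JSON block whose text contains "Product", parsed."""
    collector = LdJsonCollector()
    collector.feed(html)
    collector.close()
    for block in collector.blocks:
        if "Product" in block:  # same idea as :-soup-contains(Product)
            return json.loads(block)
    return None


# Made-up page with two LD+JSON scripts; only the second describes a product.
sample_html = """
<html><head>
<script type="application/ld+json">{"@type": "BreadcrumbList"}</script>
<script type="application/ld+json">{"@type": "Product", "sku": "1487833925"}</script>
</head></html>
"""

product = select_product_ld_json(sample_html)
print(product["sku"])
```

Running this kind of check against a saved copy of the page is a quick way to confirm that the text filter singles out the intended script before you put the selector into a parser.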

Field-by-Field Breakdown

Simple top-level fields (url, name, category, sku, description)

These fields exist at the top level of the JSON and can be extracted directly using a JSON path.

Parser for url:

{
  "type": "terminal",
  "extractor": {
    "type": "json",
    "path": "url",
    "post_processor": {
      "type": "url"
    }
  }
}

Parser for name:

{
  "type": "terminal",
  "extractor": {
    "type": "json",
    "path": "name"
  }
}

Explanation:

JSON Extractor: Uses JSONPath expressions to navigate the JSON structure
Path: Simple field names for top-level values (e.g., "url", "name", "sku")
Post-processor: url post-processor ensures URLs are properly formatted

The same pattern applies to category, sku, and description fields.

Nested fields (brand, price, currency, rating_score, rating_count)

These fields are nested within objects in the JSON. We use dot notation in JSONPath to access them.

Parser for brand:

In the JSON, brand is an object:

{
  "brand": {
    "@type": "Brand",
    "name": "HatsonCatsCrochet"
  }
}

We want the name field inside brand:

{
  "type": "terminal",
  "extractor": {
    "type": "json",
    "path": "brand.name"
  }
}

Parser for price:

{
  "type": "terminal",
  "extractor": {
    "type": "json",
    "path": "offers.price",
    "post_processor": {
      "type": "number"
    }
  }
}

Parser for rating_score:

{
  "type": "terminal",
  "extractor": {
    "type": "json",
    "path": "aggregateRating.ratingValue",
    "post_processor": {
      "type": "number"
    }
  }
}

Explanation:

Nested Path: Use dot notation to access nested fields (e.g., "brand.name", "offers.price")
Number Post-processor: Converts string numbers to actual number types
JSONPath: Follow the object hierarchy: offers.price means “get the price field from the offers object”
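To make the dot notation concrete, here is a minimal Python sketch of how a dotted path walks nested objects and how a numeric string becomes a number. The resolve_path and to_number helpers are hypothetical stand-ins written for this example. They are not part of the parser engine, and the real number post-processor may behave differently on edge cases.

```python
def resolve_path(data, path):
    """Follow a dot-separated path through nested dicts; None if a key is missing."""
    current = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def to_number(value):
    """Rough stand-in for the number post-processor: numeric string -> float."""
    return float(value) if value is not None else None


# Trimmed copy of the LD+JSON shown earlier in this example.
ld_json = {
    "brand": {"@type": "Brand", "name": "HatsonCatsCrochet"},
    "aggregateRating": {"ratingValue": "4.9", "reviewCount": 233},
    "offers": {"price": "17.31", "priceCurrency": "ILS"},
}

brand = resolve_path(ld_json, "brand.name")
price = to_number(resolve_path(ld_json, "offers.price"))
rating = to_number(resolve_path(ld_json, "aggregateRating.ratingValue"))
print(brand, price, rating)
```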

Array fields (image_urls)

The image_urls field requires extracting specific values from an array of objects.

JSON Structure:

{
  "image": [
    {
      "@type": "ImageObject",
      "contentURL": "https://i.etsystatic.com/.../image1.jpg",
      "thumbnail": "https://i.etsystatic.com/.../thumb1.jpg"
    },
    {
      "@type": "ImageObject",
      "contentURL": "https://i.etsystatic.com/.../image2.jpg",
      "thumbnail": "https://i.etsystatic.com/.../thumb2.jpg"
    }
  ]
}

Parser:

{
  "type": "terminal_list",
  "selector": {
    "type": "json",
    "path": "image[*]"
  },
  "extractor": {
    "type": "json",
    "path": "contentURL",
    "post_processor": {
      "type": "url"
    }
  }
}

Explanation:

terminal_list: Returns an array of strings instead of a single value
Selector: image[*] selects all elements in the image array
- The [*] notation tells JSONPath to iterate over all array elements
- Without [*], you’d get a single-element array containing the entire image array
Extractor: From each image object, extract the contentURL field
Result: Array of image URLs: ["url1.jpg", "url2.jpg", ...]
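The split between selector and extractor can be pictured with plain Python lists. This is only a sketch of the idea, using placeholder example.com URLs and hypothetical variable names: the selector step yields one match per array element, and the extractor step reads contentURL from each match.

```python
# Placeholder data shaped like the "image" array above.
image_json = {
    "image": [
        {"@type": "ImageObject", "contentURL": "https://example.com/image1.jpg"},
        {"@type": "ImageObject", "contentURL": "https://example.com/image2.jpg"},
    ]
}

# Selector step: "image[*]" yields one match per array element.
selected = list(image_json["image"])

# Extractor step: pull "contentURL" out of each selected element.
image_urls = [item["contentURL"] for item in selected]

# Without [*], the selector would match the array itself, exactly once.
selected_without_wildcard = [image_json["image"]]

print(image_urls)
print(len(selected), len(selected_without_wildcard))
```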

Boolean conversion (is_available)

The availability is stored as an enum string that needs to be converted to a boolean.

JSON Structure:

{
  "offers": {
    "availability": "https://schema.org/InStock"
  }
}

Parser:

{
  "type": "terminal",
  "extractor": {
    "type": "json",
    "path": "offers.availability",
    "post_processor": {
      "type": "boolean",
      "condition": "contains",
      "contains": "InStock"
    }
  }
}

Explanation:

Path: Navigate to offers.availability
Boolean Post-processor: Converts the string to boolean
- condition: "contains" checks if the string contains a specific substring
- contains: "InStock" looks for “InStock” in the value
- Returns true if found, false otherwise
Result: "https://schema.org/InStock" → true
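The conversion amounts to a substring test. The sketch below mirrors it in Python with a hypothetical contains_to_bool helper. It assumes the comparison is case-sensitive, which the text above does not state.

```python
def contains_to_bool(value, needle):
    """Approximates a boolean post-processor with condition "contains"."""
    # Assumption: the match is a plain, case-sensitive substring test.
    return isinstance(value, str) and needle in value


in_stock = contains_to_bool("https://schema.org/InStock", "InStock")
out_of_stock = contains_to_bool("https://schema.org/OutOfStock", "InStock")
missing = contains_to_bool(None, "InStock")
print(in_stock, out_of_stock, missing)
```

The second call shows why the needle matters: a shorter needle such as "Stock" would also match the OutOfStock value.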

Complete Parser

{
  "type": "schema",
  "description": "Parses Etsy product from embedded LD+JSON",
  "selector": {
    "type": "css",
    "css_selector": "script[type='application/ld+json']:-soup-contains(Product)"
  },
  "fields": {
    "url": {
      "type": "terminal",
      "extractor": {
        "type": "json",
        "path": "url",
        "post_processor": {
          "type": "url"
        }
      }
    },
    "brand": {
      "type": "terminal",
      "extractor": {
        "type": "json",
        "path": "brand.name"
      }
    },
    "name": {
      "type": "terminal",
      "extractor": {
        "type": "json",
        "path": "name"
      }
    },
    "category": {
      "type": "terminal",
      "extractor": {
        "type": "json",
        "path": "category"
      }
    },
    "sku": {
      "type": "terminal",
      "extractor": {
        "type": "json",
        "path": "sku"
      }
    },
    "description": {
      "type": "terminal",
      "extractor": {
        "type": "json",
        "path": "description"
      }
    },
    "price": {
      "type": "terminal",
      "extractor": {
        "type": "json",
        "path": "offers.price",
        "post_processor": {
          "type": "number"
        }
      }
    },
    "currency": {
      "type": "terminal",
      "extractor": {
        "type": "json",
        "path": "offers.priceCurrency"
      }
    },
    "image_urls": {
      "type": "terminal_list",
      "selector": {
        "type": "json",
        "path": "image[*]"
      },
      "extractor": {
        "type": "json",
        "path": "contentURL",
        "post_processor": {
          "type": "url"
        }
      }
    },
    "rating_score": {
      "type": "terminal",
      "extractor": {
        "type": "json",
        "path": "aggregateRating.ratingValue",
        "post_processor": {
          "type": "number"
        }
      }
    },
    "rating_count": {
      "type": "terminal",
      "extractor": {
        "type": "json",
        "path": "aggregateRating.reviewCount",
        "post_processor": {
          "type": "number"
        }
      }
    },
    "is_available": {
      "type": "terminal",
      "extractor": {
        "type": "json",
        "path": "offers.availability",
        "post_processor": {
          "type": "boolean",
          "condition": "contains",
          "contains": "InStock"
        }
      }
    }
  }
}

Example Output

{
  "url": "https://www.etsy.com/listing/1487833925/crochet-pattern-flower-cat-hat-crochet",
  "brand": "HatsonCatsCrochet",
  "name": "CROCHET PATTERN - Flower Cat Hat Crochet Pattern Digital PDF, Sunflower Pet Hat Crochet Pattern, Cat Hat Crochet Pattern",
  "category": "Craft Supplies & Tools < Patterns & How To < Patterns & Blueprints",
  "sku": "1487833925",
  "description": "***This is a CROCHET PATTERN - you will get a PDF document online...",
  "price": 4.2,
  "currency": "EUR",
  "image_urls": [
    "https://i.etsystatic.com/42395346/r/il/31e0a6/5935941239/il_fullxfull.5935941239_bnss.jpg",
    "https://i.etsystatic.com/42395346/r/il/22aeda/5935940175/il_fullxfull.5935940175_3koh.jpg",
    "https://i.etsystatic.com/42395346/r/il/6994bb/5887867454/il_fullxfull.5887867454_9d7z.jpg"
  ],
  "rating_score": 4.9,
  "rating_count": 233,
  "is_available": true
}

API Request Example

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.extract(
    url="https://www.etsy.com/il-en/listing/1487833925/crochet-pattern-flower-cat-hat-crochet",
    parse=True,
    parser={
        "type": "schema",
        "selector": {
            "type": "css",
            "css_selector": "script[type='application/ld+json']:-soup-contains(Product)"
        },
        "fields": {
            # ... (full parser structure as shown above)
        }
    }
)

print(result)

Key Takeaways

Advantages of Parsing LD+JSON:

Cleaner data: JSON is already structured, no need to navigate complex HTML
More reliable: Less likely to break than HTML selectors when pages are redesigned
Richer data: Often contains data not visible on the page
Standardized: Follows Schema.org standards across many websites

JSONPath Tips:

Use . for nested objects: brand.name
Use [*] for arrays: image[*]
Combine them: reviews[*].author.name
Test JSONPath expressions at jsonpath.com

Parsing XML Document with XPath

XPath provides a powerful query language for parsing XML documents like RSS feeds, sitemaps, product catalogs, and other structured XML data. Unlike CSS selectors designed for HTML, XPath is specifically built for XML navigation.

When to use XPath:

Parsing RSS/Atom feeds
Extracting data from XML sitemaps
Processing XML APIs and data exports
Handling namespaced XML documents

Parsing an RSS Feed

RSS feeds are a common XML format for syndicating content. Let’s parse a typical RSS feed structure.

XML Structure:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example RSS Feed</title>
    <link>https://example.com</link>
    <description>A sample RSS feed</description>
    <item>
      <title>Getting Started with XPath</title>
      <link>https://example.com/xpath-guide</link>
      <description>Learn how to use XPath for XML parsing</description>
      <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Advanced XML Techniques</title>
      <link>https://example.com/xml-advanced</link>
      <description>Deep dive into XML parsing strategies</description>
      <pubDate>Tue, 02 Jan 2024 14:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>

Target Schema:

{
  "feed_title": "string",
  "articles": [
    {
      "title": "string",
      "link": "string",
      "description": "string",
      "published_date": "string"
    }
  ]
}

Complete Parser:

{
  "type": "schema",
  "description": "Parses an RSS feed into structured data",
  "fields": {
    "feed_title": {
      "type": "terminal",
      "description": "The title of the RSS feed",
      "selector": {
        "type": "xpath",
        "path": "/rss/channel/title"
      },
      "extractor": {
        "type": "text"
      }
    },
    "articles": {
      "type": "schema_list",
      "description": "List of articles from the RSS feed",
      "selector": {
        "type": "xpath",
        "path": "//item"
      },
      "fields": {
        "title": {
          "type": "terminal",
          "selector": {
            "type": "xpath",
            "path": ".//title"
          },
          "extractor": {
            "type": "text"
          }
        },
        "link": {
          "type": "terminal",
          "selector": {
            "type": "xpath",
            "path": ".//link"
          },
          "extractor": {
            "type": "text"
          }
        },
        "description": {
          "type": "terminal",
          "selector": {
            "type": "xpath",
            "path": ".//description"
          },
          "extractor": {
            "type": "text"
          }
        },
        "published_date": {
          "type": "terminal",
          "selector": {
            "type": "xpath",
            "path": ".//pubDate"
          },
          "extractor": {
            "type": "text",
            "post_processor": {
              "type": "date"
            }
          }
        }
      }
    }
  }
}

Key Points:

Absolute XPath: /rss/channel/title starts from document root
Descendant selector: //item finds all item elements anywhere in the document
Relative XPath: .//title uses . to refer to the current context (the item element)
Date post-processor: Converts RFC 822 date format to ISO format
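For checking the paths locally before running them through the API, the same extraction can be approximated with Python's standard library. This is a sketch under a few assumptions: ElementTree supports only a subset of XPath, so the absolute path is written relative to the root element, and email.utils.parsedate_to_datetime stands in for the date post-processor, whose exact output format may differ.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Shortened copy of the RSS feed shown above.
rss_xml = """<rss version="2.0">
  <channel>
    <title>Example RSS Feed</title>
    <item>
      <title>Getting Started with XPath</title>
      <link>https://example.com/xpath-guide</link>
      <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Advanced XML Techniques</title>
      <link>https://example.com/xml-advanced</link>
      <pubDate>Tue, 02 Jan 2024 14:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss_xml)  # root is the <rss> element

feed = {
    # Equivalent of /rss/channel/title, written relative to the root element.
    "feed_title": root.findtext("channel/title"),
    "articles": [
        {
            # Paths starting with "." are evaluated against the current <item>.
            "title": item.findtext(".//title"),
            "link": item.findtext(".//link"),
            # RFC 822 date string -> ISO 8601 string.
            "published_date": parsedate_to_datetime(item.findtext(".//pubDate")).isoformat(),
        }
        for item in root.findall(".//item")
    ],
}
print(feed)
```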

Parsing XML Sitemaps with Namespaces

XML sitemaps often include namespaces, which require special handling with the local-name() function.

XML Structure:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page1</loc>
    <lastmod>2024-01-01</lastmod>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/page2</loc>
    <lastmod>2024-01-02</lastmod>
    <priority>1.0</priority>
  </url>
</urlset>

Parser:

{
  "type": "schema",
  "description": "Parses an XML sitemap",
  "fields": {
    "urls": {
      "type": "schema_list",
      "selector": {
        "type": "xpath",
        "path": "//*[local-name()='url']"
      },
      "fields": {
        "location": {
          "type": "terminal",
          "selector": {
            "type": "xpath",
            "path": ".//*[local-name()='loc']"
          },
          "extractor": {
            "type": "text"
          }
        },
        "last_modified": {
          "type": "terminal",
          "selector": {
            "type": "xpath",
            "path": ".//*[local-name()='lastmod']"
          },
          "extractor": {
            "type": "text",
            "post_processor": {
              "type": "date"
            }
          }
        },
        "priority": {
          "type": "terminal",
          "selector": {
            "type": "xpath",
            "path": ".//*[local-name()='priority']"
          },
          "extractor": {
            "type": "text",
            "post_processor": {
              "type": "number"
            }
          }
        }
      }
    }
  }
}

Handling Namespaces: The local-name() function matches on the element name alone and ignores its namespace. A prefixed path such as //ns:url requires the prefix to be bound to the namespace URI, whereas //*[local-name()='url'] selects the elements regardless of namespace.

Without local-name(): //ns:url (requires namespace prefix)
With local-name(): //*[local-name()='url'] (works regardless of namespace)
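The effect of a default namespace is easy to reproduce locally. The following sketch uses Python's xml.etree.ElementTree, which does not implement local-name(), so a small hypothetical local_name helper strips the namespace part of each tag instead. It shows that an unqualified name matches nothing, while matching on the local name finds every url element.

```python
import xml.etree.ElementTree as ET

sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page1</loc>
    <lastmod>2024-01-01</lastmod>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/page2</loc>
    <lastmod>2024-01-02</lastmod>
    <priority>1.0</priority>
  </url>
</urlset>"""


def local_name(element):
    """Strip the "{namespace}" prefix ElementTree puts on namespaced tags."""
    return element.tag.rsplit("}", 1)[-1]


def children_by_local_name(element):
    """Map each child's local name to its text content."""
    return {local_name(child): child.text for child in element}


root = ET.fromstring(sitemap_xml)

# A plain name finds nothing because every tag lives in the sitemap namespace.
unqualified_matches = root.findall(".//url")

# Matching on the local name works regardless of the namespace.
urls = [
    children_by_local_name(node)
    for node in root.iter()
    if local_name(node) == "url"
]
print(len(unqualified_matches), urls)
```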

Parsing XML with Filtering (Book Catalog)

XPath predicates allow powerful filtering directly in the selector.

XML Structure:

<?xml version="1.0" encoding="UTF-8"?>
<catalog>
  <book id="bk101" category="fiction">
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <price>10.99</price>
    <year>1925</year>
  </book>
  <book id="bk102" category="non-fiction">
    <title>A Brief History of Time</title>
    <author>Stephen Hawking</author>
    <price>15.99</price>
    <year>1988</year>
  </book>
  <book id="bk103" category="fiction">
    <title>1984</title>
    <author>George Orwell</author>
    <price>9.99</price>
    <year>1949</year>
  </book>
</catalog>

Example 1: Extract only fiction books

{
  "type": "schema_list",
  "description": "Extract only fiction books",
  "selector": {
    "type": "xpath",
    "path": "//book[@category='fiction']"
  },
  "fields": {
    "id": {
      "type": "terminal",
      "selector": {
        "type": "xpath",
        "path": "."
      },
      "extractor": {
        "type": "attr",
        "attr": "id"
      }
    },
    "title": {
      "type": "terminal",
      "selector": {
        "type": "xpath",
        "path": ".//title"
      },
      "extractor": {
        "type": "text"
      }
    },
    "author": {
      "type": "terminal",
      "selector": {
        "type": "xpath",
        "path": ".//author"
      },
      "extractor": {
        "type": "text"
      }
    },
    "price": {
      "type": "terminal",
      "selector": {
        "type": "xpath",
        "path": ".//price"
      },
      "extractor": {
        "type": "text",
        "post_processor": {
          "type": "number"
        }
      }
    }
  }
}

Key Points:

[@category='fiction'] is an XPath predicate that filters elements
. in XPath refers to the current context node (the selected book element)
Attribute extractor with attr: "id" extracts the id attribute value

Example 2: Find a specific book by ID

{
  "type": "schema",
  "description": "Extract a specific book by ID",
  "selector": {
    "type": "xpath",
    "path": "//book[@id='bk101']"
  },
  "fields": {
    "title": {
      "type": "terminal",
      "selector": {
        "type": "xpath",
        "path": ".//title"
      },
      "extractor": {
        "type": "text"
      }
    },
    "author": {
      "type": "terminal",
      "selector": {
        "type": "xpath",
        "path": ".//author"
      },
      "extractor": {
        "type": "text"
      }
    }
  }
}

The predicate [@id='bk101'] selects only the book with that specific ID.

Example 3: Books priced under $12

{
  "type": "schema_list",
  "description": "Extract books priced under $12",
  "selector": {
    "type": "xpath",
    "path": "//book[price < 12]"
  },
  "fields": {
    "title": {
      "type": "terminal",
      "selector": {
        "type": "xpath",
        "path": ".//title"
      },
      "extractor": {
        "type": "text"
      }
    },
    "price": {
      "type": "terminal",
      "selector": {
        "type": "xpath",
        "path": ".//price"
      },
      "extractor": {
        "type": "text",
        "post_processor": {
          "type": "number"
        }
      }
    }
  }
}

XPath predicates support numeric comparisons: [price < 12] filters books by price.
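The three predicates can be sanity-checked against the catalog with Python's standard library. Treat this as an approximation: xml.etree.ElementTree supports attribute predicates but not numeric comparisons, so the price filter is emulated in Python here, whereas a full XPath engine evaluates [price < 12] directly.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the book catalog shown above.
catalog_xml = """<catalog>
  <book id="bk101" category="fiction">
    <title>The Great Gatsby</title>
    <price>10.99</price>
  </book>
  <book id="bk102" category="non-fiction">
    <title>A Brief History of Time</title>
    <price>15.99</price>
  </book>
  <book id="bk103" category="fiction">
    <title>1984</title>
    <price>9.99</price>
  </book>
</catalog>"""

catalog = ET.fromstring(catalog_xml)

# Example 1: attribute predicate, supported by ElementTree directly.
fiction_titles = [
    book.findtext("title")
    for book in catalog.findall(".//book[@category='fiction']")
]

# Example 2: a single book selected by its id attribute.
gatsby = catalog.find(".//book[@id='bk101']").findtext("title")

# Example 3: ElementTree has no numeric comparison, so emulate [price < 12].
under_12 = [
    book.findtext("title")
    for book in catalog.findall(".//book")
    if float(book.findtext("price")) < 12
]
print(fiction_titles, gatsby, under_12)
```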

Common XPath Patterns

Selecting Elements:

Pattern	Description	Example
`//element`	All elements with that name, anywhere in the document	`//book`
`/root/child`	Direct child from root	`/catalog/book`
`.//element`	Descendants of current node	`.//title`
`..`	Parent of current node	`..`
`.`	Current node	`.`

Filtering with Predicates:

Pattern	Description	Example
`[@attr='value']`	Element with attribute value	`//book[@category='fiction']`
`[position()=1]` or `[1]`	First element	`//book[1]`
`[last()]`	Last element	`//item[last()]`
`[price < 10]`	Numeric comparison	`//book[price < 10]`
`[@attr]`	Has attribute	`//book[@id]`

Handling Namespaces:

Pattern	Description	Example
`local-name()='element'`	Ignore namespace	`//*[local-name()='url']`
`name()='prefix:element'`	Match with prefix	`//*[name()='atom:link']`

Advanced Patterns:

Pattern	Description
`//book[@category='fiction' and price < 12]`	Multiple AND conditions
`//book[@category='fiction' or @category='science']`	OR conditions
`//book[contains(@id, 'bk10')]`	String contains check
`//book[position() > 1 and position() < 5]`	Range selection

API Request Example

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.extract(
    url="https://example.com/feed.xml",
    parse=True,
    parser={
        "type": "schema",
        "fields": {
            "feed_title": {
                "type": "terminal",
                "selector": {
                    "type": "xpath",
                    "path": "/rss/channel/title"
                },
                "extractor": {
                    "type": "text"
                }
            },
            "articles": {
                "type": "schema_list",
                "selector": {
                    "type": "xpath",
                    "path": "//item"
                },
                "fields": {
                    # ... (fields as shown above)
                }
            }
        }
    }
)

print(result)

Best Practices for XML Parsing

Use Relative XPath in Nested Parsers

When working with nested schema parsers, use relative XPath expressions (starting with .//) to keep selectors scoped to the parent element. This makes parsers more maintainable and performant.

Namespace Handling

When parsing XML with namespaces (like sitemaps, Atom feeds), use local-name() to ignore namespaces unless you need to distinguish between elements with the same name in different namespaces.

Element-Only Results

XPath selectors only return Element nodes. Use the text or attr extractors to get data from the selected elements. Don’t try to select text nodes directly with //title/text().

Combine with Post-Processors

Use post-processors to convert extracted text to appropriate data types:

date for timestamps and dates
number for numeric values
url for making relative URLs absolute

Combining XPath with Other Selectors

You can combine XPath with CSS selectors using sequence selectors:

{
  "type": "sequence",
  "sequence": [
    {
      "type": "css",
      "css_selector": "script[type='application/xml']"
    },
    {
      "type": "xpath",
      "path": "//item"
    }
  ]
}

This first uses CSS to find a script tag containing XML, then applies XPath to parse that XML.
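The two-stage idea can be mimicked with the standard library: one step isolates the script's text, and a second step parses that text as XML and queries it. The XmlScriptExtractor class and the sample page below are hypothetical, and the sketch says nothing about how the sequence selector is implemented internally.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class XmlScriptExtractor(HTMLParser):
    """Step 1 stand-in: grab the text of the first <script type="application/xml">."""

    def __init__(self):
        super().__init__()
        self.xml_text = None
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and self.xml_text is None:
            self._capturing = dict(attrs).get("type") == "application/xml"
            if self._capturing:
                self.xml_text = ""

    def handle_data(self, data):
        # Script content is delivered as raw text, possibly in several chunks.
        if self._capturing:
            self.xml_text += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._capturing = False


# Made-up page with XML embedded in a script tag.
page_html = """<html><body>
<script type="application/xml"><feed><item>First</item><item>Second</item></feed></script>
</body></html>"""

extractor = XmlScriptExtractor()
extractor.feed(page_html)
extractor.close()

# Step 2 stand-in: run an XPath-style query against the embedded XML.
items = [node.text for node in ET.fromstring(extractor.xml_text).findall(".//item")]
print(items)
```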

Parsing Network API Calls from Target.com Product Page

Modern web applications often load data dynamically through API calls rather than embedding it directly in HTML. Nimble’s network capture feature records these API responses, allowing you to parse structured JSON data directly from backend endpoints - often cleaner and more reliable than parsing the rendered HTML.

What is Network Capture? Network capture records API calls made by the browser while loading a page. This gives you access to the raw JSON responses from backend services, which often contain more complete data than what’s visible in the HTML.

Target URL: https://www.target.com/p/A-87562588

When you visit a Target product page, the browser makes an API call to /pdp_client_v1 that returns comprehensive product data in JSON format. Instead of parsing the complex HTML, we can extract this data directly from the captured API response.

How Network Capture Works

Enable network capture with render_options.capture_network_calls: true
Access captured data using the root selector followed by JSON path to network_capture
Filter by URL pattern to find the specific API call
Parse the JSON response using standard JSON extractors

Target Schema:

{
  "title": "string",
  "image_url": "string",
  "price": "number",
  "return_policy_best_guest": "string",
  "children_product_titles": ["string"]
}

Understanding network_capture Structure

The network_capture array contains all recorded API responses:

{
  "network_capture": [
    {
      "url": "https://redsky.target.com/redsky_aggregations/v1/web/pdp_client_v1?...",
      "method": "GET",
      "status": 200,
      "response_body": {
        // Full JSON response from the API
      }
    }
  ]
}

Accessing Network Capture Data

Use the root selector with JSON path to access network_capture:

{
  "type": "sequence",
  "sequence": [
    {
      "type": "root"
    },
    {
      "type": "json",
      "coercion_filter": "$.network_capture[?(@.url =~ /.*pdp_client_v1.*/)]",
      "path": "response_body.data"
    }
  ]
}

Explanation:

Root selector: Returns to the document root (escapes from HTML context)
JSON selector: Navigates into the network_capture array
coercion_filter: Uses JSONPath to filter API calls by URL pattern
- $.network_capture[?(...)] iterates through captured requests
- @.url =~ /.*pdp_client_v1.*/ matches URLs containing “pdp_client_v1”
path: Navigates into the response body to the data field
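In plain Python terms, the filter-then-navigate behavior looks roughly like the sketch below. The select_response helper, the example.com URLs, and the response contents are all made up for illustration. Only the overall shape of network_capture follows the structure shown earlier.

```python
import re

# Made-up capture with two recorded calls; only the second is the product API.
captured = {
    "network_capture": [
        {
            "url": "https://example.com/api/recommendations?item=1",
            "status": 200,
            "response_body": {"data": {"items": []}},
        },
        {
            "url": "https://example.com/api/pdp_client_v1?key=abc",
            "status": 200,
            "response_body": {"data": {"product": {"id": "87562588"}}},
        },
    ]
}


def select_response(capture, url_pattern, path):
    """Pick the first captured call whose URL matches, then walk a dot path."""
    for call in capture["network_capture"]:
        if re.search(url_pattern, call["url"]):
            current = call
            for key in path.split("."):
                current = current[key]
            return current
    return None


data = select_response(captured, r"pdp_client_v1", "response_body.data")
print(data)
```

A pattern that is too loose can match an unrelated call that happens to be recorded first, so it helps to make the URL fragment as specific as the endpoint allows.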

Field-by-Field Breakdown

Parsing: title

The product title is nested under product.item.product_description in the API response.

API Response Structure:

{
  "data": {
    "product": {
      "item": {
        "product_description": {
          "title": "Apple Watch Series 9 GPS 45mm Midnight Aluminum Case..."
        }
      }
    }
  }
}

Parser:

{
  "type": "terminal",
  "description": "Product title from API response",
  "selector": {
    "type": "sequence",
    "sequence": [
      {
        "type": "root"
      },
      {
        "type": "json",
        "coercion_filter": "$.network_capture[?(@.url =~ /.*pdp_client_v1.*/)]",
        "path": "response_body.data.product.item.product_description.title"
      }
    ]
  },
  "extractor": {
    "type": "raw"
  }
}

Explanation:

Sequence selector: First goes to root, then navigates into network_capture
coercion_filter: Finds the specific API call containing product data
path: Full JSONPath to the title field in the nested response
raw extractor: Returns the string value as-is

Parsing: image_url

The primary product image URL is nested within the enrichment data.

API Response Structure:

{
  "data": {
    "product": {
      "item": {
        "enrichment": {
          "images": {
            "primary_image_url": "https://target.scene7.com/is/image/Target/..."
          }
        }
      }
    }
  }
}

Parser:

{
  "type": "terminal",
  "description": "Primary product image URL",
  "selector": {
    "type": "sequence",
    "sequence": [
      {
        "type": "root"
      },
      {
        "type": "json",
        "coercion_filter": "$.network_capture[?(@.url =~ /.*pdp_client_v1.*/)]",
        "path": "response_body.data.product.item.enrichment.images.primary_image_url"
      }
    ]
  },
  "extractor": {
    "type": "raw",
    "post_processor": {
      "type": "url"
    }
  }
}

Explanation:

JSONPath: Navigate deep into the nested structure to find the image URL
url post-processor: Ensures the URL is properly formatted and absolute

Parsing: price

The current price is stored as a formatted string in the product’s formatted_current_price field.

API Response Structure:

{
  "data": {
    "product": {
      "price": {
        "formatted_current_price": "$399.00"
      }
    }
  }
}

Parser:

{
  "type": "terminal",
  "description": "Current product price as number",
  "selector": {
    "type": "sequence",
    "sequence": [
      {
        "type": "root"
      },
      {
        "type": "json",
        "coercion_filter": "$.network_capture[?(@.url =~ /.*pdp_client_v1.*/)]",
        "path": "response_body.data.product.price.formatted_current_price"
      }
    ]
  },
  "extractor": {
    "type": "raw",
    "post_processor": {
      "type": "sequence",
      "sequence": [
        {
          "type": "regex",
          "regex": "[\\d,]+\\.\\d+"
        },
        {
          "type": "number"
        }
      ]
    }
  }
}

Explanation:

JSONPath: Navigate to the formatted_current_price field
regex post-processor: Extract numeric value from formatted string “$399.00” → “399.00”
number post-processor: Convert string to actual number type: “399.00” → 399.00
sequence: Chain post-processors for multi-step transformation
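The same two-step chain can be tried out in Python with the regular expression from the parser. The parse_price helper is hypothetical, and dropping the thousands separator before conversion is an assumption about how the number step treats commas.

```python
import re


def parse_price(formatted):
    """Regex step, then number step, mirroring the two-stage post-processor."""
    match = re.search(r"[\d,]+\.\d+", formatted)
    if match is None:
        return None
    # Assumption: thousands separators are dropped before conversion.
    return float(match.group(0).replace(",", ""))


print(parse_price("$399.00"))
print(parse_price("$1,299.50"))
print(parse_price("$399"))
```

Note that this pattern requires a decimal part, so a value such as "$399" does not match. If the source can return whole-dollar strings, the pattern would need an optional fraction.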

Parsing: return_policy_best_guest (with JSONPath filtering)

The return policy is buried within an array of bullet points. We need to filter to find the specific bullet containing return information.

API Response Structure:

{
  "data": {
    "product": {
      "item": {
        "product_description": {
          "bullet_descriptions": [
            "<B>Returns:</B> This item must be returned within 30 days...",
            "<B>Packaging:</B> Shows what's inside...",
            "<B>Warranty:</B> 1 Year Limited Warranty..."
          ]
        }
      }
    }
  }
}

Parser:

{
  "type": "terminal",
  "description": "Return policy extracted from bullet points",
  "selector": {
    "type": "sequence",
    "sequence": [
      {
        "type": "root"
      },
      {
        "type": "json",
        "coercion_filter": "$.network_capture[?(@.url =~ /.*pdp_client_v1.*/)]",
        "path": "response_body.data.product.item.product_description.bullet_descriptions[?(@=~/.*Returns:.*/)]"
      }
    ]
  },
  "extractor": {
    "type": "raw",
    "post_processor": {
      "type": "regex",
      "regex": "<B>Returns:</B>\\s*(.*)",
      "group": 1
    }
  }
}

Explanation:

JSONPath with filter: bullet_descriptions[?(@=~/.*Returns:.*/)] filters array elements
- [?(...)] is a JSONPath filter predicate
- @ represents the current array element
- =~ is the regex match operator
- /.*Returns:.*/ matches strings containing “Returns:”
regex post-processor: Extracts text after the “Returns:” label
- <B>Returns:</B>\\s*(.*) captures everything after the label
- Capture group 1 holds only the policy text, without the HTML tags

JSONPath Filtering: Use [?(@=~/pattern/)] to filter arrays by regex pattern, or [?(@.field=='value')] to filter by field value.
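Both steps, the array filter and the capture group, can be reproduced with Python's re module on the sample bullets from the response above. The variable names are illustrative only.

```python
import re

bullets = [
    "<B>Returns:</B> This item must be returned within 30 days...",
    "<B>Packaging:</B> Shows what's inside...",
    "<B>Warranty:</B> 1 Year Limited Warranty...",
]

# Filter step: keep only elements matching /.*Returns:.*/
matching = [bullet for bullet in bullets if re.search(r"Returns:", bullet)]

# Post-processor step: capture group 1 holds the text after the label.
match = re.search(r"<B>Returns:</B>\s*(.*)", matching[0])
return_policy = match.group(1)
print(return_policy)
```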

Parsing: children_product_titles (extracting from array)

Extract titles of all child products (variations) from a nested array structure.

API Response Structure:

{
  "data": {
    "product": {
      "children": [
        {
          "item": {
            "product_description": {
              "title": "Apple Watch Series 9 GPS 41mm Midnight Aluminum Case..."
            }
          }
        },
        {
          "item": {
            "product_description": {
              "title": "Apple Watch Series 9 GPS 45mm Starlight Aluminum Case..."
            }
          }
        }
      ]
    }
  }
}

Parser:

{
  "type": "terminal_list",
  "description": "Titles of all product variations",
  "selector": {
    "type": "sequence",
    "sequence": [
      {
        "type": "root"
      },
      {
        "type": "json",
        "coercion_filter": "$.network_capture[?(@.url =~ /.*pdp_client_v1.*/)]",
        "path": "response_body.data.product.children[*].item.product_description"
      }
    ]
  },
  "extractor": {
    "type": "json",
    "path": "title"
  }
}

Explanation:

terminal_list: Returns an array of values instead of a single value
JSONPath: children[*] iterates over all elements in the children array
- [*] expands to each child product
- Continues navigation: .item.product_description
json extractor: From each product_description object, extract the title field
Result: Array of all child product titles

Array Iteration: The [*] notation in JSONPath makes the selector return multiple elements, which terminal_list collects into an array.

Complete Parser

{
  "type": "schema",
  "description": "Parses Target product from network API call",
  "fields": {
    "title": {
      "type": "terminal",
      "description": "Product title",
      "selector": {
        "type": "sequence",
        "sequence": [
          {
            "type": "root"
          },
          {
            "type": "json",
            "coercion_filter": "$.network_capture[?(@.url =~ /.*pdp_client_v1.*/)]",
            "path": "response_body.data.product.item.product_description.title"
          }
        ]
      },
      "extractor": {
        "type": "raw"
      }
    },
    "image_url": {
      "type": "terminal",
      "description": "Primary product image",
      "selector": {
        "type": "sequence",
        "sequence": [
          {
            "type": "root"
          },
          {
            "type": "json",
            "coercion_filter": "$.network_capture[?(@.url =~ /.*pdp_client_v1.*/)]",
            "path": "response_body.data.product.item.enrichment.images.primary_image_url"
          }
        ]
      },
      "extractor": {
        "type": "raw",
        "post_processor": {
          "type": "url"
        }
      }
    },
    "price": {
      "type": "terminal",
      "description": "Current price",
      "selector": {
        "type": "sequence",
        "sequence": [
          {
            "type": "root"
          },
          {
            "type": "json",
            "coercion_filter": "$.network_capture[?(@.url =~ /.*pdp_client_v1.*/)]",
            "path": "response_body.data.product.price.formatted_current_price"
          }
        ]
      },
      "extractor": {
        "type": "raw",
        "post_processor": {
          "type": "sequence",
          "sequence": [
            {
              "type": "regex",
              "regex": "[\\d,]+\\.\\d+"
            },
            {
              "type": "number"
            }
          ]
        }
      }
    },
    "return_policy_best_guest": {
      "type": "terminal",
      "description": "Return policy text",
      "selector": {
        "type": "sequence",
        "sequence": [
          {
            "type": "root"
          },
          {
            "type": "json",
            "coercion_filter": "$.network_capture[?(@.url =~ /.*pdp_client_v1.*/)]",
            "path": "response_body.data.product.item.product_description.bullet_descriptions[?(@=~/.*Returns:.*/)]"
          }
        ]
      },
      "extractor": {
        "type": "raw",
        "post_processor": {
          "type": "regex",
          "regex": "<B>Returns:</B>\\s*(.*)",
          "group": 1
        }
      }
    },
    "children_product_titles": {
      "type": "terminal_list",
      "description": "Titles of product variations",
      "selector": {
        "type": "sequence",
        "sequence": [
          {
            "type": "root"
          },
          {
            "type": "json",
            "coercion_filter": "$.network_capture[?(@.url =~ /.*pdp_client_v1.*/)]",
            "path": "response_body.data.product.children[*].item.product_description"
          }
        ]
      },
      "extractor": {
        "type": "json",
        "path": "title"
      }
    }
  }
}

Example Output

{
  "title": "Apple Watch Series 9 GPS 45mm Midnight Aluminum Case with Midnight Sport Band - M/L",
  "image_url": "https://target.scene7.com/is/image/Target/GUEST_3ad473cc-8f21-44c8-85ca-b9a1dee1806c",
  "price": 399.0,
  "return_policy_best_guest": "This item must be returned within 30 days of the date it was purchased in store, shipped, delivered by a Shipt shopper, or made ready for pickup.",
  "children_product_titles": [
    "Apple Watch Series 9 GPS 41mm Midnight Aluminum Case with Midnight Sport Band - S/M",
    "Apple Watch Series 9 GPS 41mm Midnight Aluminum Case with Midnight Sport Band - M/L",
    "Apple Watch Series 9 GPS 45mm Starlight Aluminum Case with Starlight Sport Band - S/M",
    "Apple Watch Series 9 GPS 45mm Starlight Aluminum Case with Starlight Sport Band - M/L"
  ]
}
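
To make the link between the parser and this output more concrete, the sketch below re-creates the post-processing of the `price` and `return_policy_best_guest` fields in plain Python. The input strings are invented samples, the helper names are hypothetical, and stripping thousands separators before number conversion is an assumption; Nimble's own implementation may behave differently.

```python
import re


def regex_post_processor(value, pattern, group=0):
    """Return the requested group of the first match, or None if nothing matches."""
    match = re.search(pattern, value)
    return match.group(group) if match else None


def number_post_processor(value):
    """Convert a numeric string to float (comma stripping is an assumption)."""
    return float(value.replace(",", ""))


# Invented sample values, shaped like what the selectors above would return.
formatted_price = "$399.00"
bullet = "<B>Returns:</B> This item must be returned within 30 days."

# price: regex -> number, mirroring the "sequence" post processor.
price = number_post_processor(
    regex_post_processor(formatted_price, r"[\d,]+\.\d+")
)

# return_policy_best_guest: regex with capture group 1.
return_policy = regex_post_processor(bullet, r"<B>Returns:</B>\s*(.*)", group=1)

print(price)
print(return_policy)
```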

API Request Example

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.extract(
    url="https://www.target.com/p/A-87562588",
    parse=True,
    render=True,
    render_options={
        "capture_network_calls": True
    },
    parser={
        "type": "schema",
        "description": "Parses Target product from network API call",
        "fields": {
            "title": {
                "type": "terminal",
                "selector": {
                    "type": "sequence",
                    "sequence": [
                        {
                            "type": "root"
                        },
                        {
                            "type": "json",
                            "coercion_filter": "$.network_capture[?(@.url =~ /.*pdp_client_v1.*/)]",
                            "path": "response_body.data.product.item.product_description.title"
                        }
                    ]
                },
                "extractor": {
                    "type": "raw"
                }
            },
            # ... (other fields as shown above)
        }
    }
)

print(result)

Important: Network capture requires both render: true and render_options.capture_network_calls: true. If either setting is missing, the network_capture field is not available to the parser.
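
Because this requirement is easy to miss, a small client-side check can catch it before a request is sent. The following is a minimal sketch, not part of the Nimble SDK: `check_request` and `uses_network_capture` are hypothetical helpers, and the request is modeled as a plain dict holding the same keyword arguments used in the example above.

```python
import json


def uses_network_capture(parser):
    """Return True if any part of the parser mentions network_capture."""
    return "network_capture" in json.dumps(parser)


def check_request(request):
    """Raise ValueError if the parser reads network_capture but capture is off."""
    if not uses_network_capture(request.get("parser", {})):
        return
    render_options = request.get("render_options") or {}
    if not (request.get("render") and render_options.get("capture_network_calls")):
        raise ValueError(
            "parser reads network_capture: set render=True and "
            "render_options={'capture_network_calls': True}"
        )


request = {
    "url": "https://www.target.com/p/A-87562588",
    "parse": True,
    "render": True,
    "render_options": {"capture_network_calls": True},
    "parser": {
        "type": "terminal",
        "selector": {
            "type": "json",
            "coercion_filter": "$.network_capture[?(@.url =~ /.*pdp_client_v1.*/)]",
            "path": "response_body.data.product.item.product_description.title",
        },
        "extractor": {"type": "raw"},
    },
}

check_request(request)  # passes silently; the dict could then be sent as nimble.extract(**request)
```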

Key Takeaways

Advantages of Parsing Network Calls:

Cleaner data: API responses contain structured JSON, not cluttered HTML
More complete: Backend APIs often return data not visible in the rendered page
More stable: API response structures change less frequently than HTML layouts
Better performance: Parse smaller JSON payloads instead of large HTML documents
Access to internal data: Capture data from authenticated or dynamic API endpoints

Network Capture Best Practices:

URL filtering: Use regex patterns in coercion_filter to find specific API calls
- $.network_capture[?(@.url =~ /.*api_endpoint.*/)]
Method filtering: Filter by HTTP method if needed
- $.network_capture[?(@.method=='POST')]
Multiple filters: Combine conditions with &&
- $.network_capture[?(@.url =~ /.*products.*/ && @.status==200)]
Inspect first: Check browser DevTools Network tab to identify API endpoints
Test JSONPath: Use online JSONPath evaluators to test your path expressions
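
Before writing a coercion_filter, it can help to reason about what each filter would keep. The sketch below emulates the three filter styles above in plain Python against an invented list of captured calls. The entries and the `filter_calls` helper are hypothetical; only the `url`, `method`, and `status` keys are taken from the filter examples.

```python
import re

# Invented entries; keys mirror those used in the filters above.
network_capture = [
    {"url": "https://example.com/api/products?id=1", "method": "GET", "status": 200},
    {"url": "https://example.com/api/products?id=2", "method": "GET", "status": 500},
    {"url": "https://example.com/api/cart", "method": "POST", "status": 200},
]


def filter_calls(calls, url_pattern=None, method=None, status=None):
    """Keep calls that satisfy every condition that was provided (logical AND)."""
    kept = []
    for call in calls:
        if url_pattern is not None and not re.search(url_pattern, call["url"]):
            continue
        if method is not None and call["method"] != method:
            continue
        if status is not None and call["status"] != status:
            continue
        kept.append(call)
    return kept


# Like [?(@.url =~ /.*products.*/)]
by_url = filter_calls(network_capture, url_pattern=r".*products.*")

# Like [?(@.method=='POST')]
by_method = filter_calls(network_capture, method="POST")

# Like [?(@.url =~ /.*products.*/ && @.status==200)]
combined = filter_calls(network_capture, url_pattern=r".*products.*", status=200)

print(len(by_url), len(by_method), len(combined))
```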

When to Use Network Capture:

Single-page applications (SPAs) that load data via JavaScript
Pages with lazy-loaded or infinite scroll content
E-commerce sites with dynamic pricing and inventory
Social media feeds and comment sections
Any page where data is fetched from GraphQL or REST APIs

Best Practices

Use specific selectors

Prefer specific selectors over generic ones:

// ✅ Good
{ "type": "css", "css_selector": ".product-card .price-value" }

// ❌ Avoid
{ "type": "css", "css_selector": ".price" }

Leverage fallback logic with `or` parser

Handle page variations gracefully:

{
  "type": "or",
  "parsers": [
    {
      "type": "terminal",
      "selector": { "type": "css", "css_selector": ".new-layout" },
      "extractor": { "type": "text" }
    },
    {
      "type": "terminal",
      "selector": { "type": "css", "css_selector": ".old-layout" },
      "extractor": { "type": "text" }
    }
  ]
}
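
The fallback behavior can be pictured as a first-success loop. This is a rough sketch of the idea, assuming the `or` parser returns the first non-empty result and treats missing elements as failures; `run_or` and the dict standing in for a page are hypothetical and not Nimble's implementation.

```python
def run_or(parsers, page):
    """Return the first non-empty result, or None if every strategy fails."""
    for parser in parsers:
        result = parser(page)
        if result not in (None, "", []):
            return result
    return None


# Each strategy looks up one selector; a dict stands in for the real page.
strategies = [
    lambda page: page.get(".new-layout"),
    lambda page: page.get(".old-layout"),
]

old_page = {".old-layout": "Legacy title"}
new_page = {".new-layout": "Current title", ".old-layout": "Legacy title"}

print(run_or(strategies, old_page))
print(run_or(strategies, new_page))
```

Ordering matters in this model: the strategy for the most common layout goes first, so the fallbacks run only when it finds nothing.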

Chain post processors for complex transformations

Use the sequence post processor to chain multi-step transformations, where each step receives the output of the previous one:

{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "sequence",
      "sequence": [
        { "type": "regex", "regex": "\\d+\\.\\d+" },
        { "type": "number" },
        { "type": "format", "format": "${data} USD" }
      ]
    }
  }
}
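
The chain above can be traced by hand with a small Python model in which each step receives the previous step's output. This is a sketch under stated assumptions: the helper names are hypothetical, the input string is invented, and replacing `${data}` with the current value is an assumed reading of the format step.

```python
import re


def regex_step(pattern):
    """Build a step that returns the first match of pattern, or None."""
    def step(value):
        match = re.search(pattern, value)
        return match.group(0) if match else None
    return step


def format_step(template):
    """Build a step that substitutes the current value for ${data} (assumed syntax)."""
    def step(value):
        return template.replace("${data}", str(value))
    return step


def apply_sequence(value, steps):
    """Feed the value through each step in order."""
    for step in steps:
        value = step(value)
    return value


steps = [
    regex_step(r"\d+\.\d+"),    # "Price: $19.99" -> "19.99"
    float,                      # "19.99" -> 19.99
    format_step("${data} USD"), # 19.99 -> "19.99 USD"
]

result = apply_sequence("Price: $19.99", steps)
print(result)
```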

Add descriptions for documentation

Use the description field to record what each parser extracts:

{
  "type": "terminal",
  "description": "Extracts product price and converts to number",
  "selector": { ... },
  "extractor": { ... }
}
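
On larger parsers, a short script can list the nodes that still lack a description. The sketch below is a hypothetical helper, not an SDK feature; it assumes child parsers live under `fields` (as in the schema examples) and `parsers` (as in the `or` example).

```python
PARSER_TYPES = {
    "terminal", "terminal_list", "schema", "schema_list", "or", "and", "const",
}


def missing_descriptions(parser, path="parser"):
    """Return paths of parser nodes that have no non-empty description."""
    missing = []
    if parser.get("type") in PARSER_TYPES and not parser.get("description"):
        missing.append(path)
    # Nested parsers keyed by field name (schema, schema_list).
    for name, child in parser.get("fields", {}).items():
        missing.extend(missing_descriptions(child, f"{path}.fields.{name}"))
    # Nested parsers held in a list (or, and).
    for index, child in enumerate(parser.get("parsers", [])):
        missing.extend(missing_descriptions(child, f"{path}.parsers[{index}]"))
    return missing


example = {
    "type": "schema",
    "description": "Product details",
    "fields": {
        "title": {"type": "terminal", "extractor": {"type": "text"}},
        "price": {
            "type": "terminal",
            "description": "Extracts product price and converts to number",
            "extractor": {"type": "text"},
        },
    },
}

print(missing_descriptions(example))
```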

Use relative paths in nested contexts

When working within a nested selector, use relative paths:

{
  "type": "schema",
  "selector": {
    "type": "css",
    "css_selector": ".product-card"
  },
  "fields": {
    "name": {
      "type": "terminal",
      "selector": {
        "type": "css",
        "css_selector": ".name" // Relative to .product-card
      },
      "extractor": { "type": "text" }
    }
  }
}
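
The effect of a relative selector can be seen with Python's standard library. This sketch uses `xml.etree.ElementTree` and its limited XPath support as a stand-in for CSS selectors, on an invented, well-formed snippet. It illustrates scoping only and does not show how Nimble evaluates selectors.

```python
import xml.etree.ElementTree as ET

# Invented markup: one .name outside any card, one inside each card.
html = """
<div>
  <span class="name">Site header</span>
  <div class="product-card"><span class="name">Widget</span></div>
  <div class="product-card"><span class="name">Gadget</span></div>
</div>
""".strip()

root = ET.fromstring(html)

# Document-wide search matches every .name, including the header.
all_names = [el.text for el in root.findall(".//span[@class='name']")]

# Scoped search: select each card first, then look for .name inside it.
card_names = []
for card in root.findall(".//div[@class='product-card']"):
    name = card.find("./span[@class='name']")
    card_names.append(name.text)

print(all_names)
print(card_names)
```

Scoping the inner selector to the card keeps the unrelated header out of the results, which is the same reason `.name` is resolved relative to `.product-card` in the parser above.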
