Parsing Schema gives you full control over data extraction using a comprehensive parser syntax. Define exact data structures for predictable, low-cost extraction from HTML, JSON, XML, and network captures. Parsers are the complete recipe for processing web content into structured data. They combine:
  • selectors - identify elements
  • extractors - extract data from elements
  • post-processors (optional) - transform the output

When to use

Use parsing schema when you need:
  • Predictable extraction: Same selectors extract same data every time
  • Full control: Specify exact selector paths and data types
  • High volume: Process large datasets efficiently
Parsers may break when page structure or selectors change. Monitor source pages and update parsers as needed.

Parameters

parse
boolean
default:"false"
required
Must be set to true to enable parsing. This tells Nimble you want to extract structured data using the parser you define. When disabled, you’ll just get raw HTML without structured extraction.
Example:
"parse": true
parser
object
Your custom extraction recipe that defines exactly what data to pull from the page and how to structure it.
Parser structure:
Each field in your parser is a key-value pair where:
  • Key - The name of the field in your output (like "product_name" or "price")
  • Value - An object that describes how to extract that field
Every parser needs:
  1. type - What kind of parser to use
    • terminal - Extract a single value (like one price)
    • terminal_list - Extract multiple values (like a list of image URLs)
    • schema - Extract a nested object (like product details)
    • schema_list - Extract a list of objects (like multiple products)
    • or - Try multiple strategies, use first one that works
    • and - Combine multiple extraction strategies
    • const - Return a fixed value
  2. selector - How to find the element on the page
    • Use CSS selectors (.product-name, #price, etc.)
    • Or XPath expressions and JSON paths for XML and JSON content
  3. extractor - What data to grab from the element
    • text - The text content
    • attr - An attribute value (like href or src)
    • json - Parse JSON data
    • raw - The raw HTML
  4. post_processor (optional) - Transform the data
    • Convert to number, format dates, clean text, etc.
Think of it as: “Find THIS element, grab THIS data from it, and format it like THIS”
Simple example:
"parser": {
  "title": {
    "type": "terminal",
    "selector": {
      "type": "css",
      "css_selector": "h1.product-title"
    },
    "extractor": {
      "type": "text"
    }
  }
}
List example:
"parser": {
  "images": {
    "type": "terminal_list",
    "selector": {
      "type": "css",
      "css_selector": ".product-gallery img"
    },
    "extractor": {
      "type": "attr",
      "attr": "src"
    }
  }
}

Usage

Example parser structure:
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")

result = nimble.extract(
    url="https://www.example.com/product",
    parse=True,
    parser={
        "product_name": {  # key - output field name
            "type": "terminal",  # parser type
            "selector": {  # how to find the element
                "type": "css",
                "css_selector": ".product-title"
            },
            "extractor": {  # what to extract
                "type": "text"
            }
        },
        "price": {
            "type": "terminal",
            "selector": {
                "type": "css",
                "css_selector": ".price-value"
            },
            "extractor": {
                "type": "text",
                "post_processor": {  # optional: transform the data
                    "type": "number"
                }
            }
        }
    }
)

print(result)

Example Output

{
  "status": "success",
  "data": {
    "parsing": {
      "product_name": "Wireless Headphones",
      "price": 79.99
    }
  }
}
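Because parsers can break when a page changes, it may help to check each response for missing fields. The sketch below is a minimal illustration that assumes the response can be handled as a dictionary shaped like the example output above; `missing_fields` is a hypothetical helper, not part of the Nimble SDK.

```python
import json

# Response copied from the example output above.
raw_response = """
{
  "status": "success",
  "data": {
    "parsing": {
      "product_name": "Wireless Headphones",
      "price": 79.99
    }
  }
}
"""


def missing_fields(response: dict, required: list[str]) -> list[str]:
    """Return the required field names that are absent or null in the parsed output."""
    parsed = response.get("data", {}).get("parsing", {})
    return [name for name in required if parsed.get(name) is None]


response = json.loads(raw_response)
print(response["data"]["parsing"]["price"])  # 79.99
print(missing_fields(response, ["product_name", "price", "rating"]))  # ['rating']
```

A non-empty list is a hint that a selector no longer matches and the parser may need updating.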

Parser Types

Supported parsing types:
  • terminal - Returns a single terminal/literal as output.
  • terminal_list - Returns a list of literals instead of a single literal.
  • schema - Returns a dictionary/JSON according to its field parsers.
  • schema_list - Returns a list of dictionaries/JSONs instead of a single dictionary/JSON.
  • or - Tries a sequence of parsers and returns the result of the first parser that returns a non-null value
  • and - Runs a sequence of schema parsers and merges their results into a single output. All parsers execute on the same input, and results are combined (first non-null value wins for overlapping keys).
  • const - Always returns its value regardless of the input. Useful for adding static data to your output.

terminal

The most basic parser. Returns a single terminal/literal as output.
{
  "type": "terminal",
  "selector": { ... },
  "extractor": { ... }  // Defaults to raw extractor if not specified
}

terminal_list

Returns a list of literals instead of a single literal.
{
  "type": "terminal_list",
  "selector": { ... },
  "extractor": { ... }  // Defaults to raw extractor if not specified
}

schema

Returns a dictionary/JSON according to its field parsers. This is the most commonly used parser for structured data extraction.
{
  "type": "schema",
  "selector": { ... },  // Optional
  "fields": {
    "field_name": { /* Parser */ }
  }
}

schema_list

Returns a list of dictionaries/JSONs instead of a single dictionary/JSON.
{
  "type": "schema_list",
  "selector": { ... },  // Optional
  "fields": {
    "field_name": { /* Parser */ }
  }
}
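As a rough sketch, a schema_list parser can be assembled in Python before being passed as the parser argument shown in the Usage section. The helpers `css` and `text_field` and the class names are hypothetical and exist only to keep the dictionary short.

```python
import json


def css(selector: str) -> dict:
    """Build a CSS selector object in the shape used throughout this page."""
    return {"type": "css", "css_selector": selector}


def text_field(selector: str) -> dict:
    """Build a terminal parser that extracts the text of one element."""
    return {"type": "terminal", "selector": css(selector), "extractor": {"type": "text"}}


# One output object per element matching .product-card (class names are hypothetical).
products_parser = {
    "products": {
        "type": "schema_list",
        "selector": css(".product-card"),
        "fields": {
            "name": text_field(".name"),
            "price": text_field(".price-value"),
        },
    }
}

print(json.dumps(products_parser, indent=2))
```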

or

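Tries a sequence of parsers and returns the result of the first parser that returns a non-null value. Useful for handling variations in page structure.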
{
  "type": "or",
  "parsers": [
    {/* Parser 1 */},
    {/* Parser 2 */},
    {/* Parser 3 */}
  ]
}
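The first-non-null behavior can be pictured with a small local sketch. This is not Nimble's implementation; `run_or` and the two stand-in parsers are hypothetical and only mimic the described semantics.

```python
from typing import Any, Callable, Optional

Parser = Callable[[dict], Optional[Any]]


def run_or(parsers: list[Parser], page: dict) -> Optional[Any]:
    """Return the first non-null result, trying parsers in order."""
    for parser in parsers:
        result = parser(page)
        if result is not None:
            return result
    return None


# Stand-ins for two terminal parsers that target different layouts.
new_layout = lambda page: page.get("new_title")
old_layout = lambda page: page.get("old_title")

print(run_or([new_layout, old_layout], {"old_title": "Wireless Headphones"}))
# Wireless Headphones
```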

and

Runs a sequence of schema parsers and merges their results into a single output. All parsers execute on the same input, and results are combined (first non-null value wins for overlapping keys).
{
  "type": "and",
  "parsers": [
    {/* Schema Parser 1 */},
    {/* Schema Parser 2 */}
  ]
}
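The merge rule (first non-null value wins for overlapping keys) can be approximated locally as follows. `run_and` is a hypothetical helper that operates on already-extracted dictionaries, assuming each schema parser has produced one.

```python
def run_and(results: list[dict]) -> dict:
    """Merge schema results; the first non-null value wins for each key."""
    merged: dict = {}
    for result in results:
        for key, value in result.items():
            if merged.get(key) is None:
                merged[key] = value
    return merged


first = {"name": "Wireless Headphones", "price": None}
second = {"name": "Headphones", "price": 79.99, "currency": "USD"}
print(run_and([first, second]))
# {'name': 'Wireless Headphones', 'price': 79.99, 'currency': 'USD'}
```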

const

Always returns its value regardless of the input. Useful for adding static data to your output.
{
  "type": "const",
  "value": "some_value"
}

Parsing Selectors

Selectors identify elements (HTML, JSON, XML, Network) in the input web page. Supported selectors:
  • css - Selects elements matching a CSS selector.
  • xpath - Enables powerful element selection using XPath expressions. Particularly useful for XML documents like RSS feeds and sitemaps.
  • json - Extracts JSON elements from the page. All subsequent selectors and extractors receive JSON instead of HTML.
  • sequence - Combines multiple selectors in sequence. Useful for chaining different selector types.
  • parent - Traverses up the DOM tree (for HTML) or context hierarchy (for JSON). Useful when you need to select a parent element after finding a specific child.
  • root - Returns the original page (document). Often used with JSON selector to access fields like network_capture, url, or html.

css

Selects elements matching a CSS selector.
{
  "type": "css",
  "css_selector": "div.price"
}

xpath

Enables powerful element selection using XPath expressions. Particularly useful for XML documents like RSS feeds and sitemaps.
{
  "type": "xpath",
  "path": "//book[@category='fiction']"
}

json

Extracts JSON elements from the page. All subsequent selectors and extractors receive JSON instead of HTML.
{
  "type": "json",
  "path": "nested.keys.in.the.json" // jsonpath
}

sequence

Combines multiple selectors in sequence. Useful for chaining different selector types.
{
  "type": "sequence",
  "sequence": [
    {/* Selector 1 */},
    {/* Selector 2 */},
    {/* Selector 3 */}
  ]
}
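One plausible chain, sketched here as a Python dictionary, starts from the root document and then follows a JSON path into it. The path "url" is taken from the fields mentioned for the root selector on this page; the field name `page_url` is hypothetical, and the exact path syntax should be checked against a real response.

```python
import json

# Assumed chain: start from the full document, then follow a JSON path into it.
url_selector = {
    "type": "sequence",
    "sequence": [
        {"type": "root"},
        {"type": "json", "path": "url"},
    ],
}

page_url_parser = {
    "page_url": {
        "type": "terminal",
        "selector": url_selector,
        "extractor": {"type": "raw"},
    }
}

print(json.dumps(page_url_parser, indent=2))
```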

parent

Traverses up the DOM tree (for HTML) or context hierarchy (for JSON). Useful when you need to select a parent element after finding a specific child.
{
  "type": "parent",
  "times": 1 // Number of levels to traverse (default: 1)
}

root

Returns the original page (document). Often used with JSON selector to access fields like network_capture, url, or html.
{
  "type": "root"
}

Parsing Extractors

Extractors specify what data to extract from the selected element. Supported extractors:
  • text - Extracts the text content of the element. Works with both HTML (CSS selectors) and XML (XPath selectors).
    • strip (optional, boolean) - If set to false, leading and trailing whitespace is preserved in the text. Default is true.
    • separator (optional, string) - Specifies a separator string to use when joining text from different child elements. When extracting text from nested HTML elements, this separator will be inserted between text from different elements. If not specified, text from different elements is concatenated without a separator.
  • attr - Extracts an attribute value from the element. Works with both HTML and XML elements. Common attributes:
    • href - Links
    • src - Images, scripts
    • data-* - Custom data attributes
    • class - CSS classes
    • id - Element IDs
  • json - Extracts JSON content using JSONPath.
  • raw - Extracts an element as-is without coercion. JSON stays as JSON, strings stay as strings. Useful for advanced parsing with complex JSON selectors.
If no extractor is specified, the raw extractor is used by default.

text

Extracts the text content of the element. The same extractor works with HTML elements selected by CSS (BeautifulSoup Tag) and XML elements selected by XPath (lxml Element), so you can use it whether you’re parsing HTML or XML documents. Leading and trailing whitespace is removed by default; set strip to false to keep it.
Basic usage:
{
  "type": "text",
  "regex": "string" // Optional (deprecated); returns the first match
}
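The strip and separator options described for this extractor can be pictured with a local approximation. `join_text` is a hypothetical helper, and it assumes trimming applies to the joined string rather than to each fragment, which this page does not specify.

```python
# Extractor options as documented above: keep whitespace, join child text with " | ".
text_extractor = {"type": "text", "strip": False, "separator": " | "}


def join_text(parts: list[str], separator: str = "", strip: bool = True) -> str:
    """Local approximation: join text fragments, then optionally trim the ends."""
    joined = separator.join(parts)
    return joined.strip() if strip else joined


fragments = ["  Wireless", "Headphones  "]
print(repr(join_text(fragments)))  # 'WirelessHeadphones'
print(repr(join_text(fragments, separator=" | ", strip=False)))  # '  Wireless | Headphones  '
```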

attr

Extracts an attribute value from the element. Works with both HTML and XML elements. Common attributes: href, src, data-*, class, id
{
  "type": "attr",
  "attr": "href"
}

json

Extracts JSON content using JSONPath.
{
  "type": "json",
  "path": "nested.keys.in.the.json"
}

raw

Extracts an element as-is without coercion. JSON stays as JSON, strings stay as strings. Useful for advanced parsing with complex JSON selectors.
{
  "type": "raw"
}

Parsing Post Processors

Post processors transform extractor output. Define them in the extractor’s post_processor field. Supported post processors:
  • url - Converts relative URLs to absolute URLs based on the page origin.
  • regex - Transforms output using a regular expression. The optional group parameter extracts a specific capturing group (defaults to 0).
  • format - Formats input into a string using Python’s str.format, where the input is available as {data}.
  • date - Formats dates to ISO format or custom format.
  • boolean - Transforms output to boolean based on conditions: contains, exists, or regex. Use not: true to reverse the result.
  • number - Coerces output to a number (int or float). Handles formatted numbers like "1.5M" → 1500000 or "2,100" → 2100.
  • country - Converts country names to country codes.
  • sequence - Applies multiple post processors in sequence.

url

Converts relative URLs to absolute URLs based on the page origin.
{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "url"
    }
  }
}

regex

Transforms output using a regular expression. The optional group parameter extracts a specific capturing group (defaults to 0).
{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "regex",
      "regex": "\\d+",
      "group": 0 // Optional
    }
  }
}
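The effect of the group parameter can be sketched with Python's re module. This assumes the post processor returns the first match, as the text extractor's regex option does; `apply_regex` is a hypothetical helper, not a Nimble function.

```python
import re
from typing import Optional


def apply_regex(value: str, pattern: str, group: int = 0) -> Optional[str]:
    """Return the requested group of the first match, or None when nothing matches."""
    match = re.search(pattern, value)
    return match.group(group) if match else None


print(apply_regex("SKU-48213 in stock", r"\d+"))  # 48213
print(apply_regex("SKU-48213 in stock", r"SKU-(\d+)", group=1))  # 48213
print(apply_regex("SKU-48213 in stock", r"SKU-(\d+)"))  # SKU-48213
```

Group 0 is the whole match, so a capturing group is only needed when the surrounding text should be dropped.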

format

Formats input into a string using Python’s str.format, where the input is available as {data}.
{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "format",
      "format": "${data}"
    }
  }
}
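In str.format, only the braces mark a replacement field, so the "$" in the template above is a literal dollar sign. A quick check in plain Python:

```python
template = "${data}"  # "$" is literal; {data} is the replacement field
print(template.format(data="79.99"))  # $79.99
print("{data} USD".format(data=79.99))  # 79.99 USD
```

Literal braces in a template would need to be doubled ("{{" and "}}") under str.format rules.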

date

Formats dates to ISO format or custom format.
{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "date",
      "format": "%d/%m/%y" // Optional, defaults to ISO format
    }
  }
}
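The format string above uses strftime-style directives. Assuming it describes the output pattern, the difference between the ISO default and the custom format looks like this in plain Python (the sample date is made up):

```python
from datetime import datetime

parsed = datetime(2024, 3, 5, 14, 30)
print(parsed.isoformat())  # 2024-03-05T14:30:00
print(parsed.strftime("%d/%m/%y"))  # 05/03/24
```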

boolean

Transforms output to boolean based on conditions: contains, exists, or regex. Use not: true to reverse the result.
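This page gives no configuration sample for the boolean post processor, so the following is only a local approximation of the three conditions. `to_boolean` and its parameter names are hypothetical; `negate` stands in for the documented not option because not is a reserved word in Python.

```python
import re
from typing import Optional


def to_boolean(value: Optional[str], *, contains: Optional[str] = None,
               regex: Optional[str] = None, negate: bool = False) -> bool:
    """Approximate the three conditions: contains, regex, or plain existence."""
    if contains is not None:
        result = value is not None and contains in value
    elif regex is not None:
        result = value is not None and re.search(regex, value) is not None
    else:
        result = value is not None  # "exists": any extracted value counts
    return not result if negate else result


print(to_boolean("In stock", contains="stock"))  # True
print(to_boolean(None))  # False
print(to_boolean("Sold out", regex=r"(?i)in stock", negate=True))  # True
```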

number

Coerces output to a number (int or float). Handles formatted numbers like "1.5M" → 1500000 or "2,100" → 2100.
{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "number",
      "locale": "en", // Optional: locale for number parsing
      "force_type": "float" // Optional: "int" or "float"
    }
  }
}
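To make the coercion concrete, here is a rough local approximation of how strings such as "1.5M" and "2,100" could be turned into numbers. It assumes en-style separators and a small suffix table; `to_number` is hypothetical and does not reproduce Nimble's locale handling.

```python
import re
from typing import Union

SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def to_number(text: str) -> Union[int, float]:
    """Parse strings such as "1.5M" or "2,100" (en-style separators assumed)."""
    cleaned = text.strip().replace(",", "")
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)([KMB]?)", cleaned, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a recognizable number: {text!r}")
    value = float(match.group(1)) * SUFFIXES.get(match.group(2).upper(), 1)
    return int(value) if value.is_integer() else value


print(to_number("1.5M"))  # 1500000
print(to_number("2,100"))  # 2100
print(to_number("79.99"))  # 79.99
```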

country

Converts country names to country codes.
{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "country"
    }
  }
}

sequence

Applies multiple post processors in sequence.
{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "sequence",
      "sequence": [
        {
          "type": "regex",
          "regex": "\\d+\\.\\d+"
        },
        {
          "type": "number"
        }
      ]
    }
  }
}
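Conceptually, a sequence feeds the output of each step into the next. The sketch below mimics the regex-then-number chain above with plain Python functions; `run_sequence` and the sample input are hypothetical.

```python
import re
from functools import reduce
from typing import Any, Callable

Step = Callable[[Any], Any]


def run_sequence(steps: list[Step], value: Any) -> Any:
    """Feed the value through each step in order."""
    return reduce(lambda current, step: step(current), steps, value)


steps: list[Step] = [
    lambda text: re.search(r"\d+\.\d+", text).group(0),  # regex step
    float,                                                # number step
]

print(run_sequence(steps, "Now only $79.99!"))  # 79.99
```

Order matters: running the number step first would fail here, because the raw text is not yet a clean numeric string.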

Complete Examples

Parsing a BBC News Article

This example demonstrates parsing a complete BBC news article about a three-legged cat, showing how to extract structured data from HTML using various parser types, selectors, and extractors.

Parsing Embedded JSON from Etsy.com Product Page

This example demonstrates parsing structured data from JSON-LD (JSON for Linking Data) embedded in an HTML page. Many websites embed JSON-LD in their HTML to help search engines understand their content, and the same data can be used for easier, more reliable parsing.
What is JSON-LD? JSON-LD is a format for structuring data in a machine-readable way. It’s often embedded in webpages using <script type="application/ld+json"> tags to provide search engines with detailed information about products, articles, events, and more.
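To show what such an embedded block looks like and how it can be read, here is a standard-library sketch run against a made-up page. It illustrates the idea only; it is not the Nimble parser for Etsy, and the product data is invented.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self) -> None:
        super().__init__()
        self._in_json_ld = False
        self._buffer: list[str] = []
        self.items: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_json_ld = False


page_html = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Ceramic Mug", "offers": {"price": "18.00"}}
</script>
</head><body><h1>Ceramic Mug</h1></body></html>
"""

collector = JsonLdCollector()
collector.feed(page_html)
product = collector.items[0]
print(product["name"], product["offers"]["price"])  # Ceramic Mug 18.00
```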

Parsing XML Document with XPath

XPath provides a powerful query language for parsing XML documents like RSS feeds, sitemaps, product catalogs, and other structured XML data. Unlike CSS selectors designed for HTML, XPath is specifically built for XML navigation.
When to use XPath:
  • Parsing RSS/Atom feeds
  • Extracting data from XML sitemaps
  • Processing XML APIs and data exports
  • Handling namespaced XML documents
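For a feel of how an XPath predicate selects nodes, the sketch below runs a query similar to the xpath selector example against a made-up catalog. It uses Python's xml.etree.ElementTree, which supports only a subset of XPath and requires a relative path such as ".//book" on an element, so it is an approximation of what a full XPath engine accepts.

```python
import xml.etree.ElementTree as ET

catalog_xml = """
<catalog>
  <book category="fiction"><title>Dune</title><price>9.99</price></book>
  <book category="science"><title>Cosmos</title><price>12.50</price></book>
  <book category="fiction"><title>Emma</title><price>7.25</price></book>
</catalog>
"""

root = ET.fromstring(catalog_xml)
fiction_titles = [
    book.findtext("title")
    for book in root.findall(".//book[@category='fiction']")
]
print(fiction_titles)  # ['Dune', 'Emma']
```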

Parsing Network API Calls from Target.com Product Page

Modern web applications often load data dynamically through API calls rather than embedding it directly in HTML. Nimble’s network capture feature records these API responses, allowing you to parse structured JSON data directly from backend endpoints - often cleaner and more reliable than parsing the rendered HTML.
What is Network Capture? Network capture records API calls made by the browser while loading a page. This gives you access to the raw JSON responses from backend services, which often contain more complete data than what’s visible in the HTML.

Best Practices

Use specific selectors

Prefer specific selectors over generic ones:
// ✅ Good
{ "type": "css", "css_selector": ".product-card .price-value" }

// ❌ Avoid
{ "type": "css", "css_selector": ".price" }

Leverage fallback logic with the or parser

Handle page variations gracefully:
{
  "type": "or",
  "parsers": [
    {
      "type": "terminal",
      "selector": { "type": "css", "css_selector": ".new-layout" },
      "extractor": { "type": "text" }
    },
    {
      "type": "terminal",
      "selector": { "type": "css", "css_selector": ".old-layout" },
      "extractor": { "type": "text" }
    }
  ]
}

Chain post processors for complex transformations

Use the sequence post processor for multi-step transformations:
{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "sequence",
      "sequence": [
        { "type": "regex", "regex": "\\d+\\.\\d+" },
        { "type": "number" },
        { "type": "format", "format": "${data} USD" }
      ]
    }
  }
}

Add descriptions for documentation

{
  "type": "terminal",
  "description": "Extracts product price and converts to number",
  "selector": { ... },
  "extractor": { ... }
}

Use relative paths in nested contexts

When working within a nested selector, use relative paths:
{
  "type": "schema",
  "selector": {
    "type": "css",
    "css_selector": ".product-card"
  },
  "fields": {
    "name": {
      "type": "terminal",
      "selector": {
        "type": "css",
        "css_selector": ".name" // Relative to .product-card
      },
      "extractor": { "type": "text" }
    }
  }
}