# Parsing Schema

> Precise data extraction with powerful parser syntax

Parsing Schema gives you full control over data extraction using a comprehensive parser syntax. Define exact data structures for predictable, low-cost extraction from HTML, JSON, XML, and network captures.

Parsers are the complete recipe for processing web content into structured data. They combine:

* selectors - identify elements
* extractors - extract data from elements
* post-processors (optional) - transform the output

# **When to use**

Use parsing schema when you need:

* **Predictable extraction**: The same selectors extract the same data every time
* **Full control**: Specify exact selector paths and data types
* **High volume**: Process large datasets efficiently

Parsers may break when page structure or selectors change. Monitor source pages and update parsers as needed.

# **Parameters**

**`parse`**: **Must be set to `true`** to enable parsing. This tells Nimble you want to extract structured data using the parser you define. When disabled, you'll just get raw HTML without structured extraction.

**Example:**

```json theme={"system"}
"parse": true
```

**`parser`**: Your custom extraction recipe that defines exactly what data to pull from the page and how to structure it.

**Parser structure:** Each field in your parser is a key-value pair where:

* **Key** - The name of the field in your output (like `"product_name"` or `"price"`)
* **Value** - An object that describes how to extract that field

**Every parser needs:**

1.
`type` - What kind of parser to use * `terminal` - Extract a single value (like one price) * `terminal_list` - Extract multiple values (like a list of image URLs) * `schema` - Extract a nested object (like product details) * `schema_list` - Extract a list of objects (like multiple products) * `or` - Try multiple strategies, use first one that works * `and` - Combine multiple extraction strategies * `const` - Return a fixed value 2. `selector` - How to find the element on the page * Use CSS selectors (`.product-name`, `#price`, etc.) * Or XPath, JSON paths for other data types 3. `extractor` - What data to grab from the element * `text` - The text content * `attr` - An attribute value (like `href` or `src`) * `json` - Parse JSON data * `raw` - The raw HTML 4. `post_processor` (optional) - Transform the data * Convert to number, format dates, clean text, etc. **Think of it as:** "Find THIS element, grab THIS data from it, and format it like THIS" **Simple example:** ```json theme={"system"} "parser": { "title": { "type": "terminal", "selector": { "type": "css", "css_selector": "h1.product-title" }, "extractor": { "type": "text" } } } ``` **List example:** ```json theme={"system"} "parser": { "images": { "type": "terminal_list", "selector": { "type": "css", "css_selector": ".product-gallery img" }, "extractor": { "type": "attr", "attr": "src" } } } ``` ### Usage **Example parser structure:** ```python Python theme={"system"} from nimble_python import Nimble nimble = Nimble(api_key="YOUR-API-KEY") result = nimble.extract( url="https://www.example.com/product", parse=True, parser={ "product_name": { # key - output field name "type": "terminal", # parser type "selector": { # how to find the element "type": "css", "css_selector": ".product-title" }, "extractor": { # what to extract "type": "text" } }, "price": { "type": "terminal", "selector": { "type": "css", "css_selector": ".price-value" }, "extractor": { "type": "text", "post_processor": { # optional: transform the 
                    # data
                    "type": "number"
                }
            }
        }
    }
)

print(result)
```

```javascript Node theme={"system"}
import Nimble from "@nimble-way/nimble-js";

const nimble = new Nimble({ apiKey: "YOUR-API-KEY" });

const result = await nimble.extract({
  url: "https://www.example.com/product",
  parse: true,
  parser: {
    product_name: { // key - output field name
      type: "terminal", // parser type
      selector: { // how to find the element
        type: "css",
        css_selector: ".product-title",
      },
      extractor: { // what to extract
        type: "text",
      },
    },
    price: {
      type: "terminal",
      selector: {
        type: "css",
        css_selector: ".price-value",
      },
      extractor: {
        type: "text",
        post_processor: { // optional: transform the data
          type: "number",
        },
      },
    },
  },
});

console.log(result);
```

```bash cURL theme={"system"}
curl -X POST 'https://sdk.nimbleway.com/v1/extract' \
--header 'Authorization: Bearer YOUR-API-KEY' \
--header 'Content-Type: application/json' \
--data-raw '{
  "url": "https://www.example.com/product",
  "parse": true,
  "parser": {
    "product_name": {
      "type": "terminal",
      "selector": { "type": "css", "css_selector": ".product-title" },
      "extractor": { "type": "text" }
    },
    "price": {
      "type": "terminal",
      "selector": { "type": "css", "css_selector": ".price-value" },
      "extractor": {
        "type": "text",
        "post_processor": { "type": "number" }
      }
    }
  }
}'
```

### **Example Output**

```json theme={"system"}
{
  "status": "success",
  "data": {
    "parsing": {
      "product_name": "Wireless Headphones",
      "price": 79.99
    }
  }
}
```

## Parser Types

Supported parsing types:

* `terminal` - Returns a single terminal/literal as output.
* `terminal_list` - Returns a list of literals instead of a single literal.
* `schema` - Returns a dictionary/JSON according to its field parsers.
* `schema_list` - Returns a list of dictionaries/JSONs instead of a single dictionary/JSON.
* `or` - Tries a sequence of parsers and returns the result of the first parser that returns a non-null value.
* `and` - Runs a sequence of schema parsers and merges their results into a single output.
All parsers execute on the same input, and results are combined (first non-null value wins for overlapping keys). * `const` - Always returns its `value` regardless of the input. Useful for adding static data to your output. ### terminal * The most basic parser. ```json theme={"system"} { "type": "terminal", "selector": { ... }, "extractor": { ... } // Defaults to raw extractor if not specified } ``` ```json theme={"system"} { "type": "terminal", "selector": { "type": "css", "css_selector": ".product-name" }, "extractor": { "type": "text" } } ``` ### terminal\_list Returns a list of literals instead of a single literal. ```json theme={"system"} { "type": "terminal_list", "selector": { ... }, "extractor": { ... } // Defaults to raw extractor if not specified } ``` ```json theme={"system"} { "type": "terminal_list", "selector": { "type": "css", "css_selector": ".product-gallery img" }, "extractor": { "type": "attr", "attr": "src" } } ``` ### schema This is the most commonly used parser for structured data extraction. ```json theme={"system"} { "type": "schema", "selector": { ... }, // Optional "fields": { "field_name": { /* Parser */ } } } ``` ```json theme={"system"} { "type": "schema", "selector": { "type": "css", "css_selector": ".product-card" }, "fields": { "name": { "type": "terminal", "selector": { "type": "css", "css_selector": ".product-name" }, "extractor": { "type": "text" } }, "price": { "type": "terminal", "selector": { "type": "css", "css_selector": ".price" }, "extractor": { "type": "text" } } } } ``` ### schema\_list Returns a list of dictionaries/JSONs instead of a single dictionary/JSON. ```json theme={"system"} { "type": "schema_list", "selector": { ... }, // Optional "fields": { "field_name": { /* Parser */ } } } ``` The optional `position` attribute adds an index field to each item in the output list. ```json theme={"system"} { "type": "schema_list", "selector": { ... 
}, "fields": { "field_name": { /* Parser */ } }, "position": { "field_name": "index", "start_from": 1 // Optional, defaults to 0 } } ``` ```json theme={"system"} { "type": "schema_list", "selector": { "type": "css", "css_selector": ".product-item" }, "fields": { "name": { "type": "terminal", "selector": { "type": "css", "css_selector": ".product-name" }, "extractor": { "type": "text" } }, "price": { "type": "terminal", "selector": { "type": "css", "css_selector": ".price" }, "extractor": { "type": "text" } } } } ``` ### or Useful for handling variations in page structure. ```json theme={"system"} { "type": "or", "parsers": [ { /* Parser 1 */ }, { /* Parser 2 */ }, { /* Parser 3 */ } ] } ``` ```json theme={"system"} { "type": "or", "parsers": [ { "type": "terminal", "selector": { "type": "css", "css_selector": ".sale-price" }, "extractor": { "type": "text" } }, { "type": "terminal", "selector": { "type": "css", "css_selector": ".regular-price" }, "extractor": { "type": "text" } } ] } ``` ### and Runs a sequence of schema parsers and merges their results into a single output. All parsers execute on the same input, and results are combined (first non-null value wins for overlapping keys). ```json theme={"system"} { "type": "and", "parsers": [ { /* Schema Parser 1 */ }, { /* Schema Parser 2 */ } ] } ``` ### const Always returns its `value` regardless of the input. Useful for adding static data to your output. ```json theme={"system"} { "type": "const", "value": "some_value" } ``` ## Parsing Selectors Selectors identify elements (HTML, JSON, XML, Network) in the input web page. Supported selectors: * `css` - Selects elements matching a [CSS selector](https://www.w3schools.com/css/css_selectors.asp). * `xpath` - Enables powerful element selection using [XPath expressions](https://www.w3schools.com/xml/xpath_intro.asp). Particularly useful for XML documents like RSS feeds and sitemaps. * `json` - Extracts JSON elements from the page. 
All subsequent selectors and extractors receive JSON instead of HTML. * `sequence` - Combines multiple selectors in sequence. Useful for chaining different selector types. * `parent` - Traverses up the DOM tree (for HTML) or context hierarchy (for JSON). Useful when you need to select a parent element after finding a specific child. * `root` - Returns the original page (document). Often used with JSON selector to access fields like `network_capture`, `url`, or `html`. ### css Selects elements matching a [CSS selector](https://www.w3schools.com/css/css_selectors.asp). ```json theme={"system"} { "type": "css", "css_selector": "div.price" } ``` ```json theme={"system"} // Select by class { "type": "css", "css_selector": ".product-name" } // Select by ID { "type": "css", "css_selector": "#main-content" } // Select by attribute { "type": "css", "css_selector": "a[data-product-id]" } // Complex selector { "type": "css", "css_selector": "div.product > h2.title" } ``` ### xpath Enables powerful element selection using [XPath expressions](https://www.w3schools.com/xml/xpath_intro.asp). Particularly useful for XML documents like RSS feeds and sitemaps. ```json theme={"system"} { "type": "xpath", "path": "//book[@category='fiction']" } ``` * `//element` - Select all elements with the given name - `/root/child` - Select child elements of root - `//element[@attr='value']` - Select by attribute value - `//element[position()=1]` - Select first element - `//*[local-name()='element']` - Select ignoring namespaces ```json theme={"system"} { "type": "terminal_list", "selector": { "type": "xpath", "path": "//*[local-name()='loc']" }, "extractor": { "type": "text" } } ``` ```json theme={"system"} { "type": "sequence", "sequence": [ { "type": "xpath", "path": "//item" }, { "type": "xpath", "path": ".//title" } ] } ``` ### json Extracts JSON elements from the page. All subsequent selectors and extractors receive JSON instead of HTML. 
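To make the idea of a JSON `path` concrete, here is a minimal Python sketch of how a dotted path could be walked through a parsed JSON document. The `resolve_path` helper is hypothetical and written only for illustration; it is not Nimble's implementation, and the service's JSONPath support is likely broader than this simple dotted form.

```python
from typing import Any


def resolve_path(document: Any, path: str) -> Any:
    """Walk a dotted path (optionally prefixed with "$.") through dicts and lists.

    Hypothetical helper for illustration; returns None when a step is missing.
    """
    if path.startswith("$."):
        path = path[2:]
    current = document
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            current = current[int(key)]
        else:
            return None
    return current


# Sample data, invented for this sketch
product = {"name": "Wireless Headphones", "offers": {"price": 79.99, "currency": "USD"}}

price = resolve_path(product, "$.offers.price")
missing = resolve_path(product, "offers.discount")
print(price, missing)
```

A missing key resolves to `None` here, which is one plausible way for a selector to signal "nothing found" to a surrounding `or` parser; treat that behavior as an assumption of the sketch.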
```json theme={"system"} { "type": "json", "path": "nested.keys.in.the.json" // jsonpath } ``` The `coercion_filter` field provides advanced control when dealing with multiple JSON objects. It uses JSONPath expressions to filter specific JSON objects. ```json theme={"system"} { "type": "json", "coercion_filter": "$[1]", // Get the second JSON object "path": "nested.keys" } ``` ```json theme={"system"} { "type": "json", "coercion_filter": "$[?(@.type=='product')]", "path": "name" } ``` ```json theme={"system"} { "type": "terminal", "selector": { "type": "sequence", "sequence": [ { "type": "css", "css_selector": "script[type='application/ld+json']" }, { "type": "json", "path": "$.offers.price" } ] }, "extractor": { "type": "raw" } } ``` ### sequence Combines multiple selectors in sequence. Useful for chaining different selector types. ```json theme={"system"} { "type": "sequence", "sequence": [ { /* Selector 1 */ }, { /* Selector 2 */ }, { /* Selector 3 */ } ] } ``` ```json theme={"system"} { "type": "sequence", "sequence": [ { "type": "css", "css_selector": "script#product-data" }, { "type": "json", "path": "$.product" } ] } ``` ### parent Traverses up the DOM tree (for HTML) or context hierarchy (for JSON). Useful when you need to select a parent element after finding a specific child. ```json theme={"system"} { "type": "parent", "times": 1 // Number of levels to traverse (default: 1) } ``` ```json theme={"system"} { "type": "sequence", "sequence": [ { "type": "css", "css_selector": "span.price" }, { "type": "parent", "times": 2 } ] } ``` ### root Returns the original page (document). Often used with JSON selector to access fields like `network_capture`, `url`, or `html`. 
```json theme={"system"} { "type": "root" } ``` ```json theme={"system"} { "type": "terminal", "selector": { "type": "sequence", "sequence": [ { "type": "root" }, { "type": "json", "path": "$.url" } ] }, "extractor": { "type": "raw" } } ``` ```json theme={"system"} { "type": "terminal", "selector": { "type": "sequence", "sequence": [ { "type": "root" }, { "type": "json", "path": "$.network_capture" } ] }, "extractor": { "type": "raw" } } ``` ## Parsing Extractors Extractors specify what data to extract from the selected element. Supported extractors: * `text` - Extracts the text content of the element. Works with both HTML (CSS selectors) and XML (XPath selectors). * `strip` (optional, boolean) - If it is set to `false`, leading and trailing whitespaces are preserved in the text. Default is `true`. * `separator` (optional, string) - Specifies a separator string to use when joining text from different child elements. When extracting text from nested HTML elements, this separator will be inserted between text from different elements. If not specified, text from different elements is concatenated without a separator. * `attr` - Extracts an attribute value from the element. Works with both HTML and XML elements. common attr: * `href` - Links * `src` - Images, scripts * `data-*` - Custom data attributes * `class` - CSS classes * `id` - Element IDs * `json` - Extracts JSON content using JSONPath. * `raw` - Extracts an element as-is without coercion. JSON stays as JSON, strings stay as strings. Useful for advanced parsing with complex JSON selectors. If no extractor is specified, the **raw** extractor is used by default. ### text Extracts the text content of the element. This extractor works with both HTML elements (from CSS selectors) and XML elements (from XPath selectors). You can use `strip=false` to keep leading and trailing whitespace characters. The default is to remove them. 
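As a rough illustration of the `strip` and `separator` options, the following Python sketch joins text fragments from child elements and optionally trims the result. The `join_text` function is a hypothetical stand-in, not the extractor's actual code; in particular, whether whitespace is trimmed per fragment or only on the final string is an assumption here.

```python
from typing import Iterable


def join_text(fragments: Iterable[str], strip: bool = True, separator: str = "") -> str:
    """Join child-element text with `separator`, then trim the result if `strip` is set."""
    text = separator.join(fragments)
    return text.strip() if strip else text


# Sample fragments, invented for this sketch
fragments = ["  Wireless ", "Headphones  "]

stripped = join_text(fragments)                 # default: surrounding whitespace removed
preserved = join_text(fragments, strip=False)   # whitespace kept as-is
separated = join_text(["Wireless Headphones", "$79.99"], separator=" | ")
print(repr(stripped), repr(preserved), repr(separated))
```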
The text extractor supports both HTML elements (BeautifulSoup Tag) from CSS selectors and XML elements (lxml Element) from XPath selectors. This allows you to use the same extractor regardless of whether you're parsing HTML or XML documents. **Basic usage:** ```json theme={"system"} { "type": "text", "regex": "string" // Optional (deprecated), will return 1st match } ``` ```json theme={"system"} { "type": "text", "regex": "string", // Optional (deprecated), will return 1st match "strip": false } ``` ```json theme={"system"} { "type": "text", "regex": "string", // Optional (deprecated), will return 1st match "separator": " | " } ``` ```json theme={"system"} { "type": "terminal", "selector": { "type": "xpath", "path": "//book/title" }, "extractor": { "type": "text", "strip": true } } ``` ### attr Extracts an attribute value from the element. Works with both HTML and XML elements. Common attributes: `href` ,`src` ,`data-*` ,`class` , `id` ```json theme={"system"} { "type": "attr", "attr": "href" } ``` ```json theme={"system"} { "type": "terminal", "selector": { "type": "css", "css_selector": "img.product-image" }, "extractor": { "type": "attr", "attr": "src" } } ``` ### json Extracts JSON content using JSONPath. ```json theme={"system"} { "type": "json", "path": "nested.keys.in.the.json" } ``` ### raw Extracts an element as-is without coercion. JSON stays as JSON, strings stay as strings. Useful for advanced parsing with complex JSON selectors. ```json theme={"system"} { "type": "raw" } ``` ## Parsing Post Processors Post processors transform extractor output. Define them in the extractor's `post_processor` field. Supported post-procession options: * `url` - Converts relative URLs to absolute URLs based on the page origin. * `regex` - Transforms output using a [regular expression](https://regexr.com/). The optional `group` parameter extracts a specific capturing group (defaults to 0). 
* `format` - Formats input into a string using Python's str.format, where the input is available as `{data}`. * `date` - Formats dates to ISO format or custom format. * `boolean` - Transforms output to boolean based on conditions: `contains`, `exists`, or `regex`. Use `not: true` to reverse the result. * `number` - Coerces output to a number (int or float). Handles formatted numbers like `"1.5M"` → `1500000` or `"2,100"` → `2100`. * `country` - Converts country names to country codes. * `sequence` - Applies multiple post processors in sequence. ### url Converts relative URLs to absolute URLs based on the page origin. ```json theme={"system"} { "extractor": { "type": "text", "post_processor": { "type": "url" } } } ``` * Input: `"/news/article"` - Output: `"https://www.example.com/news/article"` ### regex Transforms output using a [regular expression](https://regexr.com/). The optional `group` parameter extracts a specific capturing group (defaults to 0). ```json theme={"system"} { "extractor": { "type": "text", "post_processor": { "type": "regex", "regex": "\\d+", "group": 0 // Optional } } } ``` ```json theme={"system"} { "extractor": { "type": "text", "post_processor": { "type": "regex", "regex": "\\d+\\.\\d+" } } } ``` ```json theme={"system"} { "extractor": { "type": "text", "post_processor": { "type": "regex", "regex": "Price: (\\d+)\\.(\\d+)", "group": 1 } } } ``` ### format Formats input into a string using Python's str.format, where the input is available as `{data}`. ```json theme={"system"} { "extractor": { "type": "text", "post_processor": { "type": "format", "format": "${data}" } } } ``` - Input: 5.00 - Output: "\$5.00" ### date Formats dates to ISO format or custom format. 
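The two output modes can be sketched with Python's standard `datetime` module. This is only an approximation under stated assumptions: the sample input string and the parsing pattern `%d %B %Y` are chosen for the sketch, and the real post-processor evidently accepts far more input shapes (including relative dates) than `strptime` does.

```python
from datetime import datetime

raw = "29 July 2024"  # sample input, assumed for this sketch

# Parse the human-readable date (month names depend on the active locale)
parsed = datetime.strptime(raw, "%d %B %Y")

iso_output = parsed.isoformat()                 # default ISO-style output
four_digit_year = parsed.strftime("%d/%m/%Y")   # %Y -> four-digit year
two_digit_year = parsed.strftime("%d/%m/%y")    # %y -> two-digit year

print(iso_output, four_digit_year, two_digit_year)
```

Note the difference between `%Y` and `%y` in strftime-style format strings: the former yields `2024`, the latter `24`.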
```json theme={"system"}
{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "date",
      "format": "%d/%m/%y" // Optional, defaults to ISO format
    }
  }
}
```

* Input: `"5 days ago"`
* Output (no format): `"2024-07-29T00:00:00"`
* Output (with format `%d/%m/%y`): `"29/07/24"`

### boolean

Transforms output to boolean based on conditions: `contains`, `exists`, or `regex`. Use `not: true` to reverse the result.

```json theme={"system"}
{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "boolean",
      "condition": "contains",
      "contains": "InStock"
    }
  }
}
```

```json theme={"system"}
{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "boolean",
      "condition": "exists"
    }
  }
}
```

```json theme={"system"}
{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "boolean",
      "condition": "regex",
      "regex": "\\d+"
    }
  }
}
```

```json theme={"system"}
{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "boolean",
      "condition": "contains",
      "contains": "OutOfStock",
      "not": true
    }
  }
}
```

### number

Coerces output to a number (int or float). Handles formatted numbers like `"1.5M"` → `1500000` or `"2,100"` → `2100`.

```json theme={"system"}
{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "number",
      "locale": "en", // Optional: locale for number parsing
      "force_type": "float" // Optional: "int" or "float"
    }
  }
}
```

```json theme={"system"}
{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "number",
      "locale": "de"
    }
  }
}
```

* Input: `"1.000,50"`
* Output: `1000.50`

### country

Converts country names to country codes.

```json theme={"system"}
{
  "extractor": {
    "type": "text",
    "post_processor": {
      "type": "country"
    }
  }
}
```

* Input: `"United States"`
* Output: `"US"`

### sequence

Applies multiple post processors in sequence.
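Chaining can be pictured as feeding each step's output into the next. The sketch below mimics a `regex` step followed by a `number` step in plain Python; the names `regex_step`, `number_step`, and `run_sequence` are hypothetical and exist only for this illustration, and the number handling covers only plain digits with optional thousands commas.

```python
import re
from typing import Callable, List, Optional, Union

Value = Union[str, int, float, None]


def regex_step(pattern: str, group: int = 0) -> Callable[[Value], Optional[str]]:
    """Build a step that returns the requested capture group of the first match."""
    def step(value: Value) -> Optional[str]:
        if value is None:
            return None
        match = re.search(pattern, str(value))
        return match.group(group) if match else None
    return step


def number_step(value: Value) -> Union[int, float, None]:
    """Coerce a numeric string to int or float (simplified: commas removed)."""
    if value is None:
        return None
    cleaned = str(value).replace(",", "")
    return float(cleaned) if "." in cleaned else int(cleaned)


def run_sequence(value: Value, steps: List[Callable[[Value], Value]]) -> Value:
    """Apply each step to the output of the previous one."""
    for step in steps:
        value = step(value)
    return value


result = run_sequence("The price is $50.25!", [regex_step(r"\d+\.\d+"), number_step])
reviews = run_sequence("2,100 reviews", [regex_step(r"[\d,]+"), number_step])
print(result, reviews)
```

Ordering matters in such a chain: running the number coercion before the regex would fail on the surrounding text, so the narrowing step comes first.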
```json theme={"system"} { "extractor": { "type": "text", "post_processor": { "type": "sequence", "sequence": [ { "type": "regex", "regex": "\\d+\\.\\d+" }, { "type": "number" } ] } } } ``` * Input: `"The price is $50.25!"` - After regex: `"50.25"` - After number: `50.25` ## Complete Examples ### Parsing a BBC News Article This example demonstrates parsing a complete BBC news article about a three-legged cat, showing how to extract structured data from HTML using various parser types, selectors, and extractors. **Target URL:** `https://www.bbc.com/news/articles/cervlxymly2o` **Target Schema:** ```json theme={"system"} { "url": "string", "title": "string", "date": "string", "author": { "name": "string", "organization": "string" }, "images": ["string"], "paragraphs": ["string"] } ``` ### Field-by-Field Breakdown The URL is the canonical link for the page, typically found in the HTML `head` tag under a `link` element with `rel="canonical"`. **HTML Structure:** ```html theme={"system"} ``` **Parser:** ```json theme={"system"} { "type": "terminal", "description": "Extracts the canonical URL from the page's head section", "selector": { "type": "css", "css_selector": "link[rel='canonical']" }, "extractor": { "type": "attr", "attr": "href" } } ``` **Explanation:** * **Selector:** `link[rel='canonical']` selects the first link element with `rel="canonical"` * **Extractor:** `attr` with `href` extracts the URL from the href attribute The article title is contained in an `h1` element within a headline block. **HTML Structure:** ```html theme={"system"}

<div data-component="headline-block">
  <h1>Three-legged cat 'brings town together'</h1>
</div>

``` **Parser:** ```json theme={"system"} { "type": "terminal", "description": "Main headline of the article", "selector": { "type": "css", "css_selector": "div[data-component='headline-block'] h1" }, "extractor": { "type": "text" } } ``` **Explanation:** * **Selector:** `div[data-component='headline-block'] h1` targets the h1 inside the headline block * **Extractor:** `text` extracts the text content from the element The publication date is in a `time` element and needs to be formatted to ISO format. **HTML Structure:** ```html theme={"system"}

<div data-testid="byline-new">
  <time>29 July 2024</time>
</div>

``` **Parser:** ```json theme={"system"} { "type": "terminal", "description": "Publication date in ISO format", "selector": { "type": "css", "css_selector": "div[data-testid='byline-new'] time" }, "extractor": { "type": "text", "post_processor": { "type": "date" } } } ``` **Explanation:** * **Selector:** `div[data-testid='byline-new'] time` targets the time element * **Extractor:** `text` with `date` post-processor converts "29 July 2024" to "2024-07-29T00:00:00" This field requires JavaScript rendering. Add render options to your API request to wait for this element. The author information contains both name and organization, requiring a nested schema parser. **HTML Structure:** ```html theme={"system"}

<div data-testid="byline-new-contributors">
  <span class="...">Martin Heath</span>
  <span>BBC News, Northamptonshire</span>
</div>

``` **Parser:** ```json theme={"system"} { "type": "schema", "description": "Author information with name and organization", "selector": { "type": "css", "css_selector": "div[data-testid='byline-new-contributors']" }, "fields": { "name": { "type": "terminal", "selector": { "type": "css", "css_selector": "span[class]" }, "extractor": { "type": "text" } }, "organization": { "type": "terminal", "selector": { "type": "css", "css_selector": "span:not([class])" }, "extractor": { "type": "text" } } } } ``` **Explanation:** * **Schema Parser:** Returns a nested object with multiple fields * **Selector Nesting:** Parent selector scopes child selectors to the byline-new-contributors div * **Name Selector:** `span[class]` selects spans with a class attribute * **Organization Selector:** `span:not([class])` selects spans without a class attribute Extract all image URLs from the article, converting relative URLs to absolute. **HTML Structure:** ```html theme={"system"}

``` **Parser:** ```json theme={"system"} { "type": "terminal_list", "description": "All article images with absolute URLs", "selector": { "type": "css", "css_selector": "article img" }, "extractor": { "type": "attr", "attr": "src", "post_processor": { "type": "url" } } } ``` **Explanation:** * **terminal\_list:** Returns an array of values instead of a single value * **Selector:** `article img` selects all img elements within article * **Extractor:** `attr` with `src` gets the image source * **Post-processor:** `url` converts relative URLs to absolute (e.g., `/news/...` → `https://www.bbc.com/news/...`) Extract all article paragraphs as an array of strings. **HTML Structure:** ```html theme={"system"}

<article>
  <div data-component="text-block">
    <p>A three-legged cat has captured a town's imagination...</p>
    <p>The people of Daventry, Northamptonshire, love taking photographs...</p>
  </div>
</article>

``` **Parser:** ```json theme={"system"} { "type": "terminal_list", "description": "Article content paragraphs", "selector": { "type": "css", "css_selector": "article div[data-component='text-block'] p" }, "extractor": { "type": "text" } } ``` **Explanation:** * **terminal\_list:** Returns an array of paragraph texts * **Selector:** `article div[data-component='text-block'] p` selects all p elements in text blocks * **Extractor:** `text` extracts the text content from each paragraph ### Complete Parser ```json theme={"system"} { "type": "schema", "description": "Parses a BBC news article into structured data", "fields": { "url": { "type": "terminal", "description": "Canonical URL of the article", "selector": { "type": "css", "css_selector": "link[rel='canonical']" }, "extractor": { "type": "attr", "attr": "href" } }, "title": { "type": "terminal", "description": "Main headline of the article", "selector": { "type": "css", "css_selector": "div[data-component='headline-block'] h1" }, "extractor": { "type": "text" } }, "date": { "type": "terminal", "description": "Publication date in ISO format", "selector": { "type": "css", "css_selector": "div[data-testid='byline-new'] time" }, "extractor": { "type": "text", "post_processor": { "type": "date" } } }, "author": { "type": "schema", "description": "Author information", "selector": { "type": "css", "css_selector": "div[data-testid='byline-new-contributors']" }, "fields": { "name": { "type": "terminal", "selector": { "type": "css", "css_selector": "span[class]" }, "extractor": { "type": "text" } }, "organization": { "type": "terminal", "selector": { "type": "css", "css_selector": "span:not([class])" }, "extractor": { "type": "text" } } } }, "images": { "type": "terminal_list", "description": "All article images", "selector": { "type": "css", "css_selector": "article img" }, "extractor": { "type": "attr", "attr": "src", "post_processor": { "type": "url" } } }, "paragraphs": { "type": "terminal_list", "description": 
"Article content paragraphs", "selector": { "type": "css", "css_selector": "article div[data-component='text-block'] p" }, "extractor": { "type": "text" } } } } ``` ### Example Output ```json theme={"system"} { "url": "https://www.bbc.com/news/articles/cervlxymly2o", "title": "Three-legged cat 'brings town together'", "date": "2024-07-29T00:00:00", "author": { "name": "Martin Heath", "organization": "BBC News, Northamptonshire" }, "images": [ "https://ichef.bbci.co.uk/news/480/cpsprodpb/2a87/live/321fae30-4c01-11ef-b2d2-cdb23d5d7c5b.jpg.webp", "https://ichef.bbci.co.uk/news/480/cpsprodpb/a8c2/live/904194b0-4c01-11ef-b2d2-cdb23d5d7c5b.jpg.webp", "https://ichef.bbci.co.uk/news/480/cpsprodpb/7579/live/9ecae4f0-4c01-11ef-aebc-6de4d31bf5cd.jpg.webp" ], "paragraphs": [ "A three-legged cat has captured a town's imagination with his appearances in shops and offices.", "The people of Daventry, Northamptonshire, love taking photographs of the 14-year-old feline and documenting his travels on social media.", "Funds have been raised to buy a street sign with his name on it, and souvenir Salem T-shirts could follow." ] } ``` ### API Request Example ```python Python theme={"system"} from nimble_python import Nimble nimble = Nimble(api_key="YOUR-API-KEY") result = nimble.extract( url="https://www.bbc.com/news/articles/cervlxymly2o", parse=True, render=True, render_flow=[{ "wait_for": { "selectors": ["div[data-testid='byline-new'] time"] } }], parser={ "type": "schema", "description": "Parses a BBC news article into structured data", "fields": { # ... 
            # (full parser structure as shown above)
        }
    }
)

print(result)
```

```javascript Node theme={"system"}
import Nimble from "@nimble-way/nimble-js";

const nimble = new Nimble({ apiKey: "YOUR-API-KEY" });

const result = await nimble.extract({
  url: "https://www.bbc.com/news/articles/cervlxymly2o",
  parse: true,
  render: true,
  render_flow: [
    {
      wait_for: {
        selectors: ["div[data-testid='byline-new'] time"],
      },
    },
  ],
  parser: {
    type: "schema",
    description: "Parses a BBC news article into structured data",
    fields: {
      // ... (full parser structure as shown above)
    },
  },
});

console.log(result);
```

```bash cURL theme={"system"}
curl -X POST 'https://sdk.nimbleway.com/v1/extract' \
--header 'Authorization: Bearer YOUR-API-KEY' \
--header 'Content-Type: application/json' \
--data-raw '{
  "url": "https://www.bbc.com/news/articles/cervlxymly2o",
  "parse": true,
  "render": true,
  "render_flow": [{
    "wait_for": {
      "selectors": ["div[data-testid=\"byline-new\"] time"]
    }
  }],
  "parser": {
    "type": "schema",
    "description": "Parses a BBC news article into structured data",
    "fields": { ... }
  }
}'
```

This parser can be reused for any BBC news article that follows the same structure - just change the URL!

### Parsing Embedded JSON from an Etsy.com Product Page

This example demonstrates parsing structured data from embedded JSON-LD (JSON for Linking Data) within an HTML page. Many websites embed JSON-LD in their HTML to help search engines understand their content, and we can use it for easier, more reliable parsing.

**What is JSON-LD?** JSON-LD is a format for structuring data in a machine-readable way. It's often embedded in webpages using `