Parsing Templates

Nimble Labs Beta Feature

Parsing templates allow users to accurately extract specific snippets or key data points from a webpage. By using industry-standard CSS selectors, parsing templates can extract data precisely from almost any webpage.

Parsing templates also come with a variety of options designed to make data extraction a breeze, such as built-in support for tables, JSON output, and custom objects. They make use of the same framework used by other popular parsing libraries such as Beautiful Soup, making for an easy and familiar experience.

When using parsing templates, it's important to monitor changes in the source webpage's structure and their effect on your templates. Nimble does not maintain or update custom parsing templates.

Quick start examples

In the following examples, we'll use a single page throughout to demonstrate how to parse specific snippets in a real-world situation. The page is the ESPN NBA page for the Boston Celtics, as it appeared circa June 2023. It may have changed since this guide was written, and should be treated only as an example.

In the following examples, we'll demonstrate how to extract the name of the team, the team standings table, and the titles and link URLs of the articles displayed on the page.

Example 1 - extracting text

We'll start off by parsing out the name of the team, as it appears in the top left of the page. If we examine the HTML surrounding the title, we see the following structure:

<h1 class="ClubhouseHeader__Name ttu flex items-start n2">
    <span class="flex flex-wrap">
        <span class="db pr3 nowrap">Boston</span>
        <span class="db fw-bold">Celtics</span>
    </span>
</h1>

The class ClubhouseHeader__Name is unique, and appears only once on this page, so we can use it to target the name of the team accurately. Although the H1 container has several spans inside, we'll be able to parse out just the contents and get the name of the team.

To get the name of the team, we'll use the following request:

curl -X POST 'https://api.webit.live/api/v1/realtime/web' \
--header 'Authorization: Basic <credential string>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "https://www.espn.com/nba/team/_/name/bos/boston-celtics",
    "parse": true,
    "format": "json",
    "render": true,
    "country": "US",
    "parser": {
        "team_name": {
            "type": "item",
            "selectors": [".ClubhouseHeader__Name"],
            "extractor": "text"
        }
    }
}'

Firstly, notice that we've set parse to true and format to json - these are required for parsing templates to work correctly.

Next, let's examine the parser setting, where the parsing template itself is defined.

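On the first line, we've set the name of the first parsing template: team_name. The same name appears in the response, attached to the parsing output. Within the team_name template, we've set three parameters:

  • type - set to item, which returns the contents of the first element matched by the selector.

  • selectors - set to .ClubhouseHeader__Name, the class that uniquely identifies the team name heading.

  • extractor - set to text, which returns only the text of the matched element, without the surrounding HTML.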

The output for this request returned:

{
    "status": "success",
    "query_time": "2023-06-06T14:06:44.986Z",
    "status_code": 200,
    "html_content": "...",
    "headers": {
        ...
    },
    "parsing": {
        "team_name": "BostonCeltics"
    },
    "url": "https://www.espn.com/nba/team/_/name/bos/boston-celtics"
}
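For readers calling the API from code, the request body above can also be assembled programmatically and the parsed value read back from the response. The following Python sketch mirrors the curl example; the helper names build_request_body and get_parsed are hypothetical, and the sample response is a trimmed copy of the output shown above rather than a live result.

```python
import json

def build_request_body(url, parser, render=True, country="US"):
    """Return the JSON body for a request that uses an inline parsing template."""
    return {
        "url": url,
        "parse": True,     # required for parsing templates
        "format": "json",  # required for parsing templates
        "render": render,
        "country": country,
        "parser": parser,
    }

def get_parsed(response, template_name):
    """Return the output of one parsing template, or None if it is absent."""
    return response.get("parsing", {}).get(template_name)

team_name_template = {
    "team_name": {
        "type": "item",
        "selectors": [".ClubhouseHeader__Name"],
        "extractor": "text",
    }
}

body = build_request_body(
    "https://www.espn.com/nba/team/_/name/bos/boston-celtics",
    team_name_template,
)
print(json.dumps(body, indent=4))

# Trimmed copy of the response shown above (not fetched from the API).
sample_response = {
    "status": "success",
    "status_code": 200,
    "parsing": {"team_name": "BostonCeltics"},
}
print(get_parsed(sample_response, "team_name"))
```

The template name is the only link between the request and the response, so keeping it in one place (here, the dictionary key) avoids mismatches when templates are renamed.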

In this example, we defined the parsing template inline, that is, within the request body. However, we recommend uploading your parsing templates and referring to them in each call for a smoother experience at scale.

See the Implementing Parsing Templates section further down for more information.

Example 2 - extracting tables

Having parsed the name of the team, we're now interested in parsing additional data. On the right side of the page, we can see the standings table, and wish to add that to our request. When we examine the table, we see the following HTML structure:

<section class="Card TeamStandings">
  ...
  <table style="border-collapse:collapse;border-spacing:0" class="Table Table--align-right">
    <colgroup class="Table__Colgroup">
      <col class="Table__Column">
      <col class="Table__Column">
      <col class="Table__Column">
      <col class="Table__Column">
      <col class="Table__Column">
      <col class="Table__Column">
    </colgroup>
    <thead class="Table__THEAD">
      <tr class="Table__TR Table__even">
        <th title="" class="Table__TH">Team</th>
        <th title="" class="Table__TH">W</th>
        <th title="" class="Table__TH">L</th>
        <th title="" class="Table__TH">PCT</th>
        <th title="" class="Table__TH">GB</th>
        <th title="" class="Table__TH">STRK</th>
      </tr>
    </thead>
    <tbody class="Table__TBODY">
      <tr class="Table__TR Table__TR--sm Table__even" data-idx="0">
        <td class="Table__TD">
          <a class="AnchorLink fw-bold" tabindex="0" href="/nba/team/_/name/bos/boston-celtics">Boston</a>
        </td>
        <td class="fw-bold clr-gray-01 Table__TD">
          <span class="fw-bold clr-gray-01">57</span>
        </td>
        <td class="fw-bold clr-gray-01 Table__TD">
          <span class="fw-bold clr-gray-01">25</span>
        </td>
        ...
    </tbody>
  </table>
</section>

To target this table, we'll combine two selectors. The .TeamStandings class is unique and defines a narrow scope, and the table selector then selects the table inside the element with that class.

To get the standings table and the team name, we'll use the following request:

curl -X POST 'https://api.webit.live/api/v1/realtime/web' \
--header 'Authorization: Basic <credential string>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "https://www.espn.com/nba/team/_/name/bos/boston-celtics",
    "parse": true,
    "format": "json",
    "render": true,
    "country": "US",
    "parser": {
        "team_name": {
            "type": "item",
            "selectors": [".ClubhouseHeader__Name"],
            "extractor": "text"
        },
        "team_standings": {
            "type": "table",
            "selectors": [".TeamStandings table"],
            "extractor": "text"
        }
    }
}'

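We can use multiple parsing templates by defining each one with a unique name and separating them with commas. For the team_standings parsing template, we've set the following parameters:

  • type - set to table, which converts an HTML table into JSON using the table headers as keys.

  • selectors - set to .TeamStandings table, which targets the table element inside the .TeamStandings section.

  • extractor - set to text, as in the previous example.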

This request returned:

{
    "status": "success",
    "query_time": "2023-06-06T14:06:44.986Z",
    "status_code": 200,
    "html_content": "...",
    "headers": {
        ...
    },
    "parsing": {
        "team_name": "BostonCeltics",
        "team_standings": [
            {
                "GB": "-",
                "L": "25",
                "PCT": ".695",
                "STRK": "W3",
                "Team": "Boston",
                "W": "57"
            },
            {
                "GB": "3",
                "L": "28",
                "PCT": ".659",
                "STRK": "W2",
                "Team": "Philadelphia",
                "W": "54"
            },
            {
                "GB": "10",
                "L": "35",
                "PCT": ".573",
                "STRK": "L2",
                "Team": "NY Knicks",
                "W": "47"
            },
            {
                "GB": "12",
                "L": "37",
                "PCT": ".549",
                "STRK": "L1",
                "Team": "Brooklyn",
                "W": "45"
            },
            {
                "GB": "16",
                "L": "41",
                "PCT": ".500",
                "STRK": "W1",
                "Team": "Toronto",
                "W": "41"
            }
        ]
    },
    "url": "https://www.espn.com/nba/team/_/name/bos/boston-celtics"
}
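Note that every cell in the parsed table is returned as a string. If numeric values are needed downstream, they have to be converted by the caller. The following Python sketch shows one way to do that for two of the rows above; the function name to_records and the output field names are hypothetical.

```python
def to_records(standings):
    """Convert string cells from a parsed standings table into typed values."""
    records = []
    for row in standings:
        wins, losses = int(row["W"]), int(row["L"])
        records.append({
            "team": row["Team"],
            "wins": wins,
            "losses": losses,
            # Recomputed from W and L instead of parsing the ".695" string.
            "win_pct": round(wins / (wins + losses), 3),
        })
    return records

# Two rows copied from the response above.
standings = [
    {"GB": "-", "L": "25", "PCT": ".695", "STRK": "W3", "Team": "Boston", "W": "57"},
    {"GB": "3", "L": "28", "PCT": ".659", "STRK": "W2", "Team": "Philadelphia", "W": "54"},
]

records = to_records(standings)
for record in records:
    print(record)
```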

Example 3 - extracting repeating elements

Our example page contains many articles that are relevant to the Boston Celtics, appearing in the center of the page. We might be interested in parsing out these articles. Doing each one individually would take a lot of time, and would be prone to breaks as new articles are published.

This is where lists come in. Setting type to list instead of item returns not just the first match, but all the elements that match a selector. When we examine the HTML of the page around an article, we see the following structure:

<section>
  <article class="contentItem cf relative overflow-hidden mb3 br-5 overflow-hidden bg-clr-white">
    <header class="contentItem__header" style="border-top-color: rgb(0, 101, 50);">
      ...
    </header>
    <div>
      <div class="ResponsiveWrapper">
        <div class="contentItem__content--layoutLg contentItem__content overflow-hidden contentItem__content--standard hasImage hasVideo contentItem__content--fullWidth flex contentItem__content--media" aria-label="Why JWill and Max want patience with Tatum and Brown" style="height: auto;">
          <div class="contentItem__contentWrapper relative flex flex-column contentWrapper">
            <div class="ColorBorder absolute top-0 left-0 right-0" style="background-color: rgb(165, 166, 167);"></div>
            <ul class="contentItem__meta"></ul>
            <h2 class="contentItem__title">
              <span class="Truncate Truncate--collapsed">
                <span>Why JWill and Max want patience with Tatum and Brown</span>
              </span>
            </h2>
          ...
  </article>
</section>

We can see that all of the articles are within a section container, and each article is inside an article container. Furthermore, the title of each article has a consistent class, contentItem__title.

This structure is consistent across all articles, so we can use a list type to ask for all of the titles of all of the articles:

curl -X POST 'https://api.webit.live/api/v1/realtime/web' \
--header 'Authorization: Basic <credential string>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "https://www.espn.com/nba/team/_/name/bos/boston-celtics",
    "parse": true,
    "format": "json",
    "render": true,
    "country": "US",
    "parser": {
        "team_name": {
            "type": "item",
            "selectors": [".ClubhouseHeader__Name"],
            "extractor": "text"
        },
        "team_standings": {
            "type": "table",
            "selectors": [".TeamStandings table"],
            "extractor": "text"
        },
        "articles": {
            "type": "list",
            "selectors": ["section > article .contentItem__title"],
            "extractor": "text"
        }
    }
}'

The list type will now provide us with a list of all the matched elements, and the text extractor cleans the HTML out and parses just the contents, providing us with the following result:

{
    "status": "success",
    "query_time": "2023-06-06T14:06:44.986Z",
    "status_code": 200,
    "html_content": "...",
    "headers": {
        ...
    },
    "parsing": {
        "team_name": "BostonCeltics",
        "team_standings": [
            {
                "GB": "-",
                "L": "25",
                "PCT": ".695",
                "STRK": "W3",
                "Team": "Boston",
                "W": "57"
            },
            ...
        ],
        "articles": [
            "Why JWill and Max want patience with Tatum and Brown",
            "When did the Heat start to win Game 7 over the Celtics? As soon as Game 6 ended",
            "Star deals, coaching plans and an uncertain summer: What lies ahead for Boston",
            ...
        ]
    },
    "url": "https://www.espn.com/nba/team/_/name/bos/boston-celtics"
}

Now that we have the article titles, we can parse the article links in a very similar fashion:

curl -X POST 'https://api.webit.live/api/v1/realtime/web' \
--header 'Authorization: Basic <credential string>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "https://www.espn.com/nba/team/_/name/bos/boston-celtics",
    "parse": true,
    "format": "json",
    "render": true,
    "country": "US",
    "parser": {
        "team_name": {
            "type": "item",
            "selectors": [".ClubhouseHeader__Name"],
            "extractor": "text"
        },
        "team_standings": {
            "type": "table",
            "selectors": [".TeamStandings table"],
            "extractor": "text"
        },
        "articles": {
            "type": "list",
            "selectors": ["section > article .contentItem__title"],
            "extractor": "text"
        },
        "articles_links": {
            "type": "list",
            "selectors": ["section > article a"],
            "extractor": "[href]"
        }
    }
}'

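We've made two modifications in order to target the links:

  • selectors - set to section > article a, which matches the link elements inside each article.

  • extractor - set to [href], which returns the value of each link's href attribute instead of its text.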

The result of this request returns:

{
    "status": "success",
    "query_time": "2023-06-06T14:06:44.986Z",
    "status_code": 200,
    "html_content": "...",
    "headers": {
        ...
    },
    "parsing": {
        "team_name": "BostonCeltics",
        "team_standings": [
            {
                "GB": "-",
                "L": "25",
                "PCT": ".695",
                "STRK": "W3",
                "Team": "Boston",
                "W": "57"
            },
            ...
        ],
        "articles": [
            "Why JWill and Max want patience with Tatum and Brown",
            "When did the Heat start to win Game 7 over the Celtics? As soon as Game 6 ended",
            "Star deals, coaching plans and an uncertain summer: What lies ahead for Boston",
            ...
        ],
        "articles_links": [
            "/video/clip/_/id/37759456",
            "/nba/story/_/id/37757803/when-did-heat-start-win-game-7-celtics-soon-game-6-ended",
            "/nba/story/_/id/37603481/the-celtics-biggest-issues-came-roaring-back-game-7-now",
            ...
        ]
    },
    "url": "https://www.espn.com/nba/team/_/name/bos/boston-celtics"
}
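Because articles and articles_links are returned as two independent lists, the caller has to match titles to links by position, and the links are relative to the site root. The Python sketch below pairs the two lists and resolves the links against the page URL. It assumes each article contributes exactly one title and one link, which the page does not guarantee; the function name pair_articles is hypothetical.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.espn.com/nba/team/_/name/bos/boston-celtics"

def pair_articles(titles, links, base_url=PAGE_URL):
    """Pair titles with links by position and make each link absolute."""
    if len(titles) != len(links):
        # A mismatch means the two selectors did not match one-to-one.
        raise ValueError(f"{len(titles)} titles but {len(links)} links")
    return [
        {"title": title, "url": urljoin(base_url, link)}
        for title, link in zip(titles, links)
    ]

# First two entries of each list from the response above.
titles = [
    "Why JWill and Max want patience with Tatum and Brown",
    "When did the Heat start to win Game 7 over the Celtics? As soon as Game 6 ended",
]
links = [
    "/video/clip/_/id/37759456",
    "/nba/story/_/id/37757803/when-did-heat-start-win-game-7-celtics-soon-game-6-ended",
]

articles = pair_articles(titles, links)
for article in articles:
    print(article["url"], "-", article["title"])
```

The length check is the important part: if an article contains more than one link, the lists drift out of step and positional pairing silently produces wrong results.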

Although parsing out the article titles and links in this way works, it can be a bit cumbersome as more properties are added. What if we wanted the author, publish date, image URL, and more?

This is where objects and object-lists come in. We can use objects to define an article object, with multiple properties defined in its schema. See Advanced example - Using object-list to learn more.

Advanced example - Using object-list

In the previous example, we used the list type to get a list of article titles and article links, but we can achieve the same result more effectively and robustly by using the object-list type instead.

Before we get into object-list, let's understand what an object is.

An object consists of a single selector that identifies the target element on the webpage and a series of fields. Each field has its own type, extractor, and selector. This is useful when extracting complex elements that have several relevant attributes. Use cases covered later in this guide include product cards (name, price, rating) and search result (SERP) listings.

For a practical example, let's look at how we could use objects to collect the articles from our ESPN Celtics page. In the below example, we create an object-list that collects the title, URL, time elapsed since publish, and author for each article:

curl -X POST 'https://api.webit.live/api/v1/realtime/web' \
--header 'Authorization: Basic <credential string>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "https://www.espn.com/nba/team/_/name/bos/boston-celtics",
    "format": "json",
    "render": true,
    "country": "GR",
    "parse": true,
    "parser": {
        "articles": {
            "type": "object-list",
            "selectors": [
                "section > article"
            ],
            "fields": {
                "title": {
                    "type": "item",
                    "selectors": [
                        ".contentItem__title"
                    ],
                    "extractor": "text"
                },
                "link": {
                    "type": "item",
                    "selectors": [
                        "a"
                    ],
                    "extractor": "[href]"
                },
                "time_elapsed": {
                    "type": "item",
                    "selectors": [
                        ".time-elapsed"
                    ],
                    "extractor": "text"
                },
                "author": {
                    "type": "item",
                    "selectors": [".author"],
                    "extractor": "text"
                }
            }
        }
    }
}'

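First, we create a parsing template named articles. Its type is object-list, and its selector, section > article, matches each article container on the page.

Next, we define the fields of the object. Each of title, link, time_elapsed, and author is an item with its own selector, evaluated inside the matched article, and its own extractor (text for every field except link, which uses [href]).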
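Since every field repeats the same three keys, the template can be generated with a couple of small helpers when it is built in code. The Python sketch below reproduces the articles template from the request above; the helper names field and object_list are hypothetical conveniences and not part of the API.

```python
import json

def field(selector, extractor="text"):
    """Build one item field with a single selector."""
    return {"type": "item", "selectors": [selector], "extractor": extractor}

def object_list(selector, fields):
    """Build an object-list template scoped to the given selector."""
    return {"type": "object-list", "selectors": [selector], "fields": fields}

articles_template = {
    "articles": object_list(
        "section > article",
        {
            "title": field(".contentItem__title"),
            "link": field("a", extractor="[href]"),
            "time_elapsed": field(".time-elapsed"),
            "author": field(".author"),
        },
    )
}

# This dictionary is what goes under the "parser" key of the request body.
print(json.dumps(articles_template, indent=4))
```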

The result of the above request is:

{
    "status": "success",
    "query_time": "2023-06-06T14:05:39.129Z",
    "status_code": 200,
    "html_content": "...",
    "headers": {
        ...
    },
    "parsing": {
        "status": "success",
        "articles": [
            {
                "author": "",
                "link": "/video/clip/_/id/37759456",
                "time_elapsed": "1h",
                "title": "Why JWill and Max want patience with Tatum and Brown"
            },
            {
                "author": "Brian Windhorst",
                "link": "/nba/story/_/id/37757803/when-did-heat-start-win-game-7-celtics-soon-game-6-ended",
                "time_elapsed": "7h",
                "title": "When did the Heat start to win Game 7 over the Celtics? As soon as Game 6 ended"
            },
            {
                "author": "Tim Bontemps",
                "link": "/nba/story/_/id/37603481/the-celtics-biggest-issues-came-roaring-back-game-7-now",
                "time_elapsed": "8h",
                "title": "Star deals, coaching plans and an uncertain summer: What lies ahead for Boston"
            },
            ...
        ],
        "entity_type": "Dynamic"
    },
    "url": "https://www.espn.com/nba/team/_/name/bos/boston-celtics"
}

Parsing template syntax

At its core, a parsing template is built of three properties:

  • type (required) - Defines the format of the returned data. For example, setting type to json instructs the parser to structure the extracted data as JSON and return the resulting JSON object.

  • selectors (required) - The CSS selector or selectors of the elements that should be extracted by the parser. Listing more than one selector creates fallback selectors, meaning that if the first selector isn’t found, the parser will look for the second, then the third, etc.

  • extractor - Once an element has been identified by its selector, the extractor defines what part of the element should be returned.
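The fallback behavior of selectors can be pictured as a simple ordered search. The Python sketch below is only an illustration of the rule described above, not the parser's actual implementation; the find callable and the fake_page dictionary stand in for a real page lookup, and the selector names are made up.

```python
def first_matching_selector(selectors, find):
    """Try each selector in order; return (selector, result) for the first match."""
    for selector in selectors:
        result = find(selector)
        if result is not None:
            return selector, result
    return None, None

# Stand-in for a page: maps a selector to the text it would match, if any.
fake_page = {".price--sale": None, ".price": "$50"}

selector, value = first_matching_selector(
    [".price--new", ".price--sale", ".price"],
    fake_page.get,
)
print(selector, value)
```

Ordering selectors from most specific to most general keeps a template working when a page variant drops the more specific class.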

Types

Types define the return format of extracted data. Types are a required field, and can have the following values:

| Value | Description |
| --- | --- |
| item (default) | Returns the contents of the first element matched by the defined CSS selector. |
| list | Returns a list of data points from all the elements that match the CSS selector. |
| json | Returns the contents of the first element (like item), but formatted into JSON. |
| table | Converts an HTML table into JSON, using the headers of the table as keys. Use this type when targeting <table></table> elements. |
| object | Defines a custom object that is populated and returned. The structure/properties of the object are defined using fields (see below). The object is populated using the first element that matches the defined CSS selector. |
| object-list | Returns a list of objects, populated by all the elements that match the defined CSS selector. The structure/properties of the objects are defined using fields (see below). |

Example - list

    ...
    "parser": {
        "template_name": {
            "type": "list",
            "selectors": ["h1,h2,h3"],
            "extractor": "text"
        }
    }

Example - json

    ...
    "parser": {
        "template_name": {
            "type": "json",
            "selectors": ["script[type='application/json']"]
        }
    }

Example - table

    ...
    "parser": {
        "template_name": {
            "type": "table",
            "selectors": [".someTable"]
        }
    }
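To make the table type more concrete, the Python sketch below shows the general idea of turning an HTML table into a list of records keyed by the header cells. It uses only the standard library and is an approximation for illustration; it is not the parser's implementation, and the class name TableToRecords and the sample markup are assumptions.

```python
from html.parser import HTMLParser

class TableToRecords(HTMLParser):
    """Collect <th> cells as keys and each row of <td> cells as one record."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self.records = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            text = "".join(self._cell).strip()
            if tag == "th":
                self.headers.append(text)
            elif self._row is not None:
                self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row:
            # Header rows leave _row empty, so only data rows are recorded.
            self.records.append(dict(zip(self.headers, self._row)))
            self._row = None

SAMPLE_TABLE = """
<table class="someTable">
  <thead><tr><th>Team</th><th>W</th><th>L</th></tr></thead>
  <tbody>
    <tr><td><a href="/x">Boston</a></td><td><span>57</span></td><td><span>25</span></td></tr>
    <tr><td>Philadelphia</td><td>54</td><td>28</td></tr>
  </tbody>
</table>
"""

table_parser = TableToRecords()
table_parser.feed(SAMPLE_TABLE)
print(table_parser.records)
```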

Extractors

When an element matches the defined CSS selectors, the extractor defines which part of the matched element is extracted and returned. Extractors can have three possible values:

| Name | Description |
| --- | --- |
| text (default) | Extracts the text of the selected element. |
| html | Extracts the full inner HTML of the selected element. |
| [attribute-name] | Extracts the value of an HTML attribute of the selected element (e.g. id, href, etc.) |

Extractors can only be used when type is set to one of:

  • item

  • list

  • json

Tables do not use extractors because the structure of the table defines the way the table is parsed. An object, and by extension object-lists, uses “fields” to define the data to be extracted.

Examples

Let's assume the page being parsed is made up of the following HTML:

<html>
    <head>
        <title>parsing demo</title>
    </head>
    <body>
        <div class="article">
            <p>
                Lorem ipsum dolor sit amet, <span>consectetur adipiscing elit.</span> Duis sapien eros, euismod vel magna sodales,
                porttitor tristique mi. Phasellus vel lobortis mi,
                <a href="https://www.somedomain.com">nec pharetra risus.</a>
                Sed quis augue in ligula blandit ullamcorper non et elit. 
            </p>
        </div>
    </body>
</html>

In the below parsing template example, the first template “link” searches for the first link in the page (an element matching the “a” selector), and then uses the [attribute-name] extractor to extract the URL to which the link is pointing.

The second template looks for a div with the class “article” and extracts the full html contents.

	...
	"parser": {
		"link": {
			"selectors": ["a"],
			"extractor": "[href]"
		},
		"article": {
			"selectors": ["div.article"],
			"extractor": "html"
		}
	}

The response for this parsing template processing our example HTML would produce the following output:

{
	...
	"link": "https://www.somedomain.com",
	"article": "<p>Lorem ipsum dolor sit amet, <span>consectetur adipiscing elit.</span> Duis sapien eros, euismod vel magna sodales, porttitor tristique mi. Phasellus vel lobortis mi, <a href=\"https://www.somedomain.com\">nec pharetra risus.</a> Sed quis augue in ligula blandit ullamcorper non et elit. </p>"
}
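The difference between the three extractors can be reproduced locally on a small fragment. The Python sketch below emulates them with the standard library's XML parser, which only works because the fragment is well-formed; real pages generally need a proper HTML parser. The extract function is a hypothetical stand-in, not the parser's code, and the fragment is a shortened version of the example above.

```python
import xml.etree.ElementTree as ET

FRAGMENT = (
    '<div class="article"><p>Lorem ipsum <span>dolor sit</span> amet, '
    '<a href="https://www.somedomain.com">nec pharetra risus.</a></p></div>'
)

def extract(element, extractor="text"):
    """Emulate the text, html, and [attribute-name] extractors on one element."""
    if extractor == "text":
        # All text inside the element, with tags stripped.
        return "".join(element.itertext())
    if extractor == "html":
        # Inner markup: leading text plus each serialized child.
        children = "".join(ET.tostring(child, encoding="unicode") for child in element)
        return (element.text or "") + children
    if extractor.startswith("[") and extractor.endswith("]"):
        # "[href]" -> value of the href attribute.
        return element.get(extractor[1:-1])
    raise ValueError(f"unknown extractor: {extractor}")

root = ET.fromstring(FRAGMENT)
link = root.find(".//a")

print(extract(link, "[href]"))
print(extract(link, "text"))
print(extract(root, "html"))
```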

Objects

Objects allow users to define a customized return structure that can capture data in a way that is more accessible and better represents the data they are trying to collect. For example, when collecting product data, a “product” object can be created, with fields for price, inventory, color, shipping method, and other contextually relevant factors.

Because objects are fully user-defined, different objects can be created for different sources, purposes, or any other use case!

An object has a type, which is always set to object, and selectors, which define the scope (the parent element) within which the fields, each with its own selectors, are matched.

Fields make up the contents of the object, and each one has a title and a selector. The title defines the name of the field, and the selector defines the CSS selector that should be used to populate its value, where that element is a child of the object selector. For example:

    ...
    "product": {
        "type": "object",
        "selectors": [ ".product-card" ],
        "fields": {
            "name": {
                "selectors": [ ".name" ]
            },
            "price": {
                "selectors": [ ".price" ]
            },
            "average_rating": {
                "selectors": [ ".rating" ]
            }
        }
    }

In the above example, an object named “product” would be returned. It would have three children, “name”, “price”, and “average_rating”. The value for name would be extracted from the first element found with the class “name”, where that element is itself a child of the first element found with the class “product-card”.

The above object parsing template would parse this HTML:

<html>
    <head>
        <title>parsing demo</title>
    </head>
    <body>
        <div class="product-card">
            <div class="name">blue jeans</div>
            <div class="price">$50</div>
            <div class="rating">4/5</div>
        </div>
    </body>
</html>

into this output:

{
	...
	"product":{
		"name": "blue jeans",
		"price": "$50",
		"average_rating": "4/5"
	}
}
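The scoping rule (field selectors are resolved inside the element matched by the object's selector) can be tried out locally. The Python sketch below emulates it on the example page with the standard library's XML parser and class-attribute lookups in place of CSS selectors; it is an illustration under those assumptions, and parse_object is a hypothetical helper rather than the parser's implementation.

```python
import xml.etree.ElementTree as ET

PAGE = """<html>
    <head>
        <title>parsing demo</title>
    </head>
    <body>
        <div class="product-card">
            <div class="name">blue jeans</div>
            <div class="price">$50</div>
            <div class="rating">4/5</div>
        </div>
    </body>
</html>"""

def parse_object(root, scope_class, field_classes):
    """Find the first element with scope_class, then resolve each field inside it."""
    scope = root.find(f".//*[@class='{scope_class}']")
    if scope is None:
        return None
    result = {}
    for name, css_class in field_classes.items():
        match = scope.find(f".//*[@class='{css_class}']")
        result[name] = match.text if match is not None else None
    return result

product = parse_object(
    ET.fromstring(PAGE),
    "product-card",
    {"name": "name", "price": "price", "average_rating": "rating"},
)
print(product)
```

Note that the exact-match class lookup used here only works for elements with a single class; a CSS class selector such as .name also matches elements that carry additional classes.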

Object lists

An object list combines objects with lists, allowing users to create custom structures that are populated by multiple matching elements. This can be useful, for example, when collecting product data from a page that has multiple products, or to quickly extract SERP listings.

An object-list uses syntax that is very similar to a regular object, except that type is set to object-list instead of object. For example:

{
    "links": {
        "type": "object-list",
        "selectors": ["a"],
        "fields": {
            "url": {
                "selectors": ["*"],
                "extractor": "[href]"
            },
            "title": {
                "selectors": ["*"],
                "extractor": "text"
            }
        }
    }
}

The above example would parse out all of the links (all of the elements that have an "a" tag) in a webpage. For the following HTML webpage:

<html>
    <head>
        <title>parsing demo</title>
    </head>
    <body>
        <a href="https://www.somelink.com">Some link</a>
        <a href="https://www.anotherlink.com">Another link</a>
    </body>
</html>

The output of our object-list template would be:

{
	...
	"links":[
		{
			"url": "https://www.somelink.com",
			"title": "Some link"
		},
		{
			"url": "https://www.anotherlink.com",
			"title": "Another link"
		},
		...
	]
}

Implementing parsing templates

Parsing templates can be implemented in one of two ways:

  • Inline - The parsing template’s rules are defined within the request body. All of the previous examples shown above have been inline implementations.

  • Upload - Custom parsers can be written separately and uploaded to the WebAPI, and then implemented by passing a parser value instead of being written inline.

We highly recommend uploading your parsing template to increase stability and performance. Parsing templates should only be used inline for testing and development purposes.

Managing parsing templates

Parsing templates can be managed through several API endpoints that allow uploading, viewing, updating, and deleting parsing templates.

Upload a new parsing template

POST https://api.webit.live/api/v1/parsers

{
    "scheme": {
        "type": "item",
        "selectors": [".css-selector"],
        "extractor": "text"
    },
    "name": "parsing template name"
}

Request Body

| Name | Type | Description |
| --- | --- | --- |
| scheme* | String | { /* A valid parser template */} |
| name* | String | A name for the parsing template that will later be used to update, delete, or implement it in requests. |

{
    "id": "<parser id>",
    "message": "parser created",
    "success": true
}
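As a rough illustration, the upload request can be prepared in Python with the standard library. The sketch below only builds the request object and does not send it. It assumes that this endpoint accepts the same Authorization: Basic header as the realtime endpoint, which this guide does not state explicitly; build_upload_request is a hypothetical helper name.

```python
import json
import urllib.request

PARSERS_URL = "https://api.webit.live/api/v1/parsers"

def build_upload_request(name, scheme, credential_string):
    """Prepare (but do not send) a POST request that uploads a parsing template."""
    body = json.dumps({"scheme": scheme, "name": name}).encode("utf-8")
    return urllib.request.Request(
        PARSERS_URL,
        data=body,
        method="POST",
        headers={
            # Assumption: same credentials as the realtime endpoint.
            "Authorization": f"Basic {credential_string}",
            "Content-Type": "application/json",
        },
    )

request = build_upload_request(
    "parsing template name",
    {"type": "item", "selectors": [".css-selector"], "extractor": "text"},
    "<credential string>",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually send it, pass the request to urllib.request.urlopen(request).
```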

View a parsing template

GET https://api.webit.live/api/v1/parsers/{parsing-template-name}

Path Parameters

| Name | Type | Description |
| --- | --- | --- |
| parsing-template-name* | String | The name of the parser to view. |

{
    "id": "<parser id>",
    "name": "My Parser",
    "account": "my_account",
    "schema": { /* The parser's schema */},
    "created_at": "2022-11-16T15:18:24.525Z",
    "created_by": "username@my_account.com",
    "modified_at": "2022-11-16T15:18:24.525Z",
    "modified_by": "username@my_account.com"
}

List all uploaded parsing templates

GET https://api.webit.live/api/v1/parsers

{
    "parsers": [
        {
            "id": "<parser id>",
            "name": "My Parser",
            "account": "my_account",
            "schema": { /* The parser's schema */ },
            "created_at": "2022-11-16T15:18:24.525Z",
            "created_by": "username@my_account.com",
            "modified_at": "2022-11-16T15:18:24.525Z",
            "modified_by": "username@my_account.com"
        },
        {
            "id": "<parser id>",
            "name": "My Second Parser",
            "account": "my_account",
            "schema": { /* The parser's schema */ },
            "created_at": "2022-11-16T15:18:24.525Z",
            "created_by": "username@my_account.com",
            "modified_at": "2022-11-16T18:39:54.015Z",
            "modified_by": "username2@my_account.com"
        }
    ]
}
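Once the listing has been decoded from JSON, it is often convenient to look templates up by name or to find the one that changed most recently. The Python sketch below works on a trimmed copy of the response above (schema and account fields omitted); the helper names are hypothetical, and the timestamp format is inferred from the sample values.

```python
from datetime import datetime

def parse_timestamp(value):
    """Parse timestamps shaped like 2022-11-16T15:18:24.525Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")

def index_by_name(listing):
    """Map each template name to its entry in the listing."""
    return {entry["name"]: entry for entry in listing["parsers"]}

def most_recently_modified(listing):
    """Return the entry with the latest modified_at timestamp."""
    return max(listing["parsers"], key=lambda entry: parse_timestamp(entry["modified_at"]))

# Trimmed copy of the listing shown above.
listing = {
    "parsers": [
        {"name": "My Parser", "modified_at": "2022-11-16T15:18:24.525Z"},
        {"name": "My Second Parser", "modified_at": "2022-11-16T18:39:54.015Z"},
    ]
}

print(sorted(index_by_name(listing)))
print(most_recently_modified(listing)["name"])
```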

Delete a parsing template

DELETE https://api.webit.live/api/v1/parsers/{parsing-template-name}

Path Parameters

| Name | Type | Description |
| --- | --- | --- |
| parsing-template-name* | String | The parsing template to delete. |
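Template names can contain spaces (for example "My Parser" in the listing above), so the name should be percent-encoded before it is placed in the path. The Python sketch below builds the URL and prepares a DELETE request without sending it. As an assumption not confirmed by this guide, it reuses the Authorization: Basic header from the realtime examples; parser_url and build_delete_request are hypothetical helper names.

```python
import urllib.request
from urllib.parse import quote

PARSERS_URL = "https://api.webit.live/api/v1/parsers"

def parser_url(name):
    """Return the URL for one template, with the name percent-encoded."""
    return f"{PARSERS_URL}/{quote(name, safe='')}"

def build_delete_request(name, credential_string):
    """Prepare (but do not send) a DELETE request for the named template."""
    return urllib.request.Request(
        parser_url(name),
        method="DELETE",
        # Assumption: same credentials as the realtime endpoint.
        headers={"Authorization": f"Basic {credential_string}"},
    )

delete_request = build_delete_request("My Parser", "<credential string>")
print(delete_request.get_method(), delete_request.full_url)
```

The same parser_url helper applies to the view endpoint, which uses the identical path with GET.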

Update a parsing template

PUT https://api.webit.live/api/v1/parsers

Request Body

| Name | Type | Description |
| --- | --- | --- |
| scheme* | String | { /* A valid parser scheme */} |
| name* | String | The name of the parser to update. |

{
    "parser": "<parser name>",
    "message": "parser updated",
    "success": true
}