Data Parsing

What?

Transforming raw HTML into clean, accurate, and usable data is no easy task. Each website has its own layout and can change without warning, so consistent, accurate extraction calls for a diverse set of powerful tools.

Nimble's Web API includes three built-in tools that help you extract the key data you need easily, reliably, and at scale.

Let's look at each one in more detail and examine some examples to understand when it's right to use each one.

Parsing Templates

Nimble Parsing Templates are an easy-to-use, surgical parsing tool that offers a high degree of control and specificity. They provide a set of functions (called Types, Extractors, and Objects) that users can combine to parse exactly the web data they want.

Parsing Templates offer accuracy and freedom similar to Beautiful Soup, but with significantly less complexity. Their goal is to help users fill gaps left by automated systems when collecting data from unorthodox or highly specialized sources.

Unlike Beautiful Soup, Parsing Templates have a much gentler learning curve and operate seamlessly alongside AI Parsing Skills, so they can be used in parallel with, or independently of, Nimble's other parsing solutions.

Learn more about Parsing Templates ->

Merge Dynamic Parser

The Merge Dynamic feature enables users to combine Nimble's AI-powered parsing with their own custom parsing logic into a single, unified response.

This allows for a highly customizable and flexible approach to data extraction, where the precision and automation of AI parsing can be enhanced or tailored by incorporating specific user-defined parsing rules.

The result is a comprehensive and cohesive data set that aligns perfectly with your unique requirements.

This feature is particularly useful for scenarios where standard AI parsing might need refinement or additional context provided by custom logic, ensuring that the final output meets your exact needs.
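As a purely conceptual illustration of what combining two parsing results into one response means, consider the Python sketch below. It is not Nimble's implementation or response format: `merge_parsed` is a hypothetical helper, the field names are made up, and letting custom values win on conflicting keys is an assumption chosen for the example.

```python
def merge_parsed(ai_fields: dict, custom_fields: dict) -> dict:
    """Combine two parsing results into one dictionary.

    Assumption for this sketch: when both sources define the same key,
    the user-defined (custom) value replaces the AI-parsed one.
    """
    merged = dict(ai_fields)      # copy so the AI result is left untouched
    merged.update(custom_fields)  # custom values override on key conflicts
    return merged


# Hypothetical field names, used only to show the merge behaviour.
ai_result = {"title": "Example product", "price": "10"}
custom_result = {"price": "9.99", "sku": "X1"}
unified = merge_parsed(ai_result, custom_result)
```

Here `unified` holds the AI-parsed `title`, the custom `sku`, and the custom `price`, while both input dictionaries remain unchanged.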

Learn more about Merge Dynamic ->

Nimble AI Parsing Skills (Beta)

Nimble AI Parsing Skills empower engineers to easily parse web data from any webpage into accurate, consistent JSON structures. By combining HTML-trained LLMs with classical parsing techniques, AI Parsing Skills make it possible to parse any quantity and variety of web pages in real time.

  • Automatic mode: In Automatic mode, no user input is needed at all. Simply enable parsing, and our system does the rest. Behind the scenes, Nimble uses its built-in collection of generic parsing skills to extract data from webpages. Results are generally good, but may vary from page to page.

  • Skills mode (coming soon): In Skills mode, the user creates a simple, plain-English schema that guides the creation of custom parsers, also called Skills.

Why?

  1. Enhanced Accuracy: LLMs are adept at understanding the context and structure of web content, enabling them to parse complex web data more accurately than traditional parsing tools. This results in higher-quality data extraction, particularly from sophisticated web pages, including pages whose structure changes.

  2. Scalability: AI models can handle a wide range of website layouts and structures without needing specific rules for each site. This scalability makes it easier to process data from a broad spectrum of sources with minimal setup time.

  3. Continuity: Unlike traditional parsers that require pre-defined schemas and are often brittle to changes in web page design, AI-based parsing adapts to changes in webpage layouts and content schemes, reducing the need for frequent manual updates.

  4. Efficiency: By automating the structuring of data into usable formats, this feature saves significant time and effort that would otherwise be spent on manual data cleaning and organization. This allows users to focus on analysis and insights rather than data preprocessing.

  5. Integration Readiness: The structured data output from AI Parsing is readily integrable into various data analysis tools and applications, enhancing the workflow from data collection to actionable insights.

Which tool is right for me?

Each tool has its own advantages and disadvantages. The table below should help clarify the features of each tool and help you decide which is right for you. Remember that these tools can operate in parallel within each request; we encourage users to try each one and experiment to get the best results.

AI Parsing Skills
Parsing Templates
Merge Dynamic

Fully-automated

Manual control

Auto-healing

Easy to use

CSS Selector targeting

Additional Information

  • Supported by realtime (except cloud delivery), asynchronous, and batch requests.

  • Supported endpoints: Web, SERP, Maps, and eCommerce.

  • Unsupported endpoints: Social.

Request Option

Enable Parsing

To run a Nimble API request that requires data parsing (HTML -> JSON), set the parse parameter to true. Behind the scenes, the Nimble AI Parser dynamically parses the webpage's HTML content into a structured data format (JSON).

Data Formatting

To receive the Nimble API response as JSON (instead of HTML), include the parameter "format": JSON in the body of the request. JSON is the default value of the format parameter, so it does not need to be set manually, but it is configurable.

Parameter | Required | Description
parse | Optional (default = false) | Enum: true | false. True: the page's content is parsed and returned in JSON format. False: the response includes page headers and raw data (without parsing).
format | Optional (default = JSON) | Enum: JSON | HTML. The data response format. If HTML is requested and an error occurs, the response is JSON containing the error message.

When parse is set to true, format must be set to JSON (which is the default format).
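The parse/format rule above can be checked on the client before a request is sent. The Python sketch below is illustrative only: `build_parse_payload` and `ALLOWED_FORMATS` are hypothetical names rather than part of Nimble's API, and lowercasing the format value is an assumption based on the lowercase "json" used in the example request on this page.

```python
import json

# Values taken from the parameter table; lowercase spelling is an assumption.
ALLOWED_FORMATS = {"json", "html"}


def build_parse_payload(url: str, parse: bool = False, fmt: str = "json") -> str:
    """Return a JSON request body, enforcing that parse=true needs format=json."""
    fmt_normalized = fmt.lower()
    if fmt_normalized not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}, got {fmt!r}")
    if parse and fmt_normalized != "json":
        # Documented constraint: parsing is only available with JSON output.
        raise ValueError("parse=true requires format=json")
    return json.dumps({"url": url, "parse": parse, "format": fmt_normalized})
```

Calling `build_parse_payload("https://www.google.com", parse=True)` yields a body equivalent to the one in the example request, while `parse=True` combined with `fmt="html"` raises a `ValueError` before anything is sent.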

Example Request

  • The format parameter is included here for clarity; it can be omitted because JSON is the default value.

curl -X POST 'https://api.webit.live/api/v1/realtime/web' \
--header 'Authorization: Basic <credential string>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "https://www.google.com",
    "parse": true,
    "format": "json"
}'
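For readers calling the API from Python rather than curl, the same request might be built as follows. This is a sketch that uses only the standard library: `build_realtime_request` is a hypothetical helper name, and the credential is assumed to be the same ready-made string that goes after "Basic" in the curl header.

```python
import json
import urllib.request

# Endpoint copied from the curl example.
REALTIME_WEB_URL = "https://api.webit.live/api/v1/realtime/web"


def build_realtime_request(target_url: str, credential: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that asks for parsed JSON output."""
    body = json.dumps(
        {"url": target_url, "parse": True, "format": "json"}
    ).encode("utf-8")
    return urllib.request.Request(
        REALTIME_WEB_URL,
        data=body,
        headers={
            "Authorization": f"Basic {credential}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen` and decode the response body with `json.loads`. Building the request separately from sending it keeps the payload easy to inspect and test without network access.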

Next Steps

Dive into the full guides for each of Nimble's parsing solutions:
