Realtime, Async & Batch Request

What?

Real Time Request

A real-time scraping request executes the scraping task on the target domain and delivers the response as soon as the desired data is received.

  • Depending on timeout settings and any rendering delays (while data loads), the response is returned to the user as soon as it becomes available.

  • Supported Endpoints: Web, SERP, Maps, eCommerce and Social

Async Request

An asynchronous scraping request initiates the scraping task on the target domain and operates independently of the user's immediate session.

  • The user doesn't wait for a response directly after the request. Instead, once the scraping is complete and the data is ready, the user is notified through a callback URL or a specified cloud storage URL.

  • This method allows the scraping process to run in the background, enabling the user to continue with other tasks and receive the scraped data once it's available, without having to manage or monitor the ongoing process actively.

  • Supported Endpoints: Web, SERP, Maps, eCommerce and Social

Batch Request

The Batch Processing feature allows users to submit up to 1,000 URLs in a single batch request, significantly improving efficiency and reducing the time needed for large-scale data collection tasks:

  • Supports custom settings for each URL in a batch, including different geolocations, rendering options, and parsing templates, to meet specific data collection requirements.

  • Offers asynchronous processing, enabling data collection tasks to run in the background without interrupting other operations, and providing flexibility in handling large volumes of data.

  • Integrates with cloud storage solutions for automated data delivery, facilitating seamless workflow integration and immediate access to collected data.

  • Supported Endpoints: Web, SERP, Maps, eCommerce and Social

Why?

Real Time Request

  • Immediate & Accurate Data Access: Real-time scraping retrieves data as it is currently displayed on websites. This is crucial for obtaining the most up-to-date information, whether for monitoring prices, stock levels, news updates, or social sentiment. This immediacy ensures that decisions are based on the latest data available.

  • Event-Driven Responses: By scraping data in real time, you can trigger actions based on specific conditions or changes detected on the target website. For instance, receiving alerts when a product goes on sale, when a competitor changes their pricing, or when new content is posted, enabling prompt and relevant responses.

Async Request

  • Non-blocking Operations: Async requests allow other processes to run concurrently while the data request is being handled. This means that your application doesn't have to pause and wait for the data retrieval to complete, enhancing overall performance and user experience, especially in applications that handle large volumes of data or require high responsiveness.

  • Efficiency in Handling Multiple Requests: With async requests, you can send out multiple data requests simultaneously rather than waiting for each to complete sequentially. This is particularly useful in scenarios like data scraping, API data retrieval, or loading data from different sources, as it significantly speeds up the overall process.

  • Scalability: Asynchronous processing is more scalable as it better manages resources and handles increases in workload. For services that expect a high volume of requests or need to scale dynamically, async requests ensure that the system remains responsive and performant under varying loads.

Batch Request

  • Increased Efficiency: Handling up to 1,000 URLs in a single batch request streamlines operations by reducing the number of individual requests needed. This consolidation minimizes overhead associated with setting up and tearing down connections, thereby enhancing the overall efficiency of data processing.

  • Asynchronous Processing with Flexible Data Handling: The batch processing feature operates asynchronously, allowing users to execute other tasks while their request processes. With options to use a callback URL or direct cloud storage for results, this feature enhances workflow efficiency and offers versatile data management, seamlessly integrating with various systems and needs.

  • Reduced Resource Consumption: With fewer HTTP connections needed for the same amount of work, your server resources are better utilized. This not only optimizes server performance but also potentially lowers costs related to bandwidth and computing power.

  • Simplified Management: Managing one request with multiple URLs is simpler and more straightforward than handling numerous single-URL requests. This simplifies the workflow for developers and reduces the complexity of code needed to handle large volumes of data.

Asynchronous Request Process

Unlike real-time requests, asynchronous requests do not return the result directly to the client. Instead, an asynchronous request produces a “task” that runs in the background and delivers the resulting data to a cloud storage bucket and/or callback URL. Tasks go through four stages:

| Status | Description |
| --- | --- |
| pending | The task is still being processed. |
| uploading | The results are being uploaded to the destination repository. |
| success | The task completed and the results were stored in the destination repository. |
| failed | Nimble was unable to complete the task; no file was created in the repository. |
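
As a rough illustration of how a client might model these statuses, the sketch below defines them as an enum and marks success and failed as final. The names TaskState and is_terminal are hypothetical helpers, not part of the Nimble API.

```python
from enum import Enum


class TaskState(str, Enum):
    """The four task statuses listed in the table above."""
    PENDING = "pending"
    UPLOADING = "uploading"
    SUCCESS = "success"
    FAILED = "failed"


# Assumption: a task no longer changes once it reaches one of these states.
TERMINAL_STATES = {TaskState.SUCCESS, TaskState.FAILED}


def is_terminal(state: str) -> bool:
    """Return True when the given state string is final (success or failed)."""
    return TaskState(state) in TERMINAL_STATES
```

A polling loop or callback handler can then stop as soon as is_terminal(task["state"]) is true.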

Delivery Methods

Nimble API supports three delivery methods: real-time, Cloud, and Push/Pull.

For real-time delivery, see our page on performing a real-time URL request. To use Cloud or Push/Pull delivery, use an asynchronous request instead. Asynchronous requests also have the added benefit of running in the background, so you can continue working without waiting for the job to complete.

Request Option

Example - Realtime Request

A simple real-time request uses the following syntax:

Path: https://api.webit.live/api/v1/realtime/...

| Parameter | Required | Description |
| --- | --- | --- |
| url | Required | URL. The page or resource to be fetched. Note: when using a URL with a query string, encode the URL and place it at the end of the query string. |

curl -X POST 'https://api.webit.live/api/v1/realtime/web' \
--header 'Authorization: Basic <credential string>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "https://www.example.com"
}'
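
For readers calling the API from code rather than curl, the same request could be sketched in Python with only the standard library. The function name build_realtime_request is a hypothetical helper; the endpoint, headers, and body mirror the curl example above, and the credential string is passed through unchanged.

```python
import json
import urllib.request

REALTIME_WEB = "https://api.webit.live/api/v1/realtime/web"


def build_realtime_request(target_url: str, credential: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        REALTIME_WEB,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {credential}",
            "Content-Type": "application/json",
        },
    )


# Sending the request performs a real network call, so it is left commented out:
# request = build_realtime_request("https://www.example.com", "<credential string>")
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```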

Example - Async Request

A simple async request uses the following syntax:

Path: https://api.webit.live/api/v1/async/...

| Parameter | Required | Description |
| --- | --- | --- |
| storage_type | Optional (default = push/pull) | ENUM: s3, gs. Use s3 for Amazon S3 and gs for Google Cloud Platform. Leave blank to enable Push/Pull delivery. |
| storage_url | Optional (default = push/pull) | Repository URL: s3://Your.Bucket.Name/your/object/name/prefix/ - Output will be saved to TASK_ID.json. Leave blank to enable Push/Pull delivery. |
| callback_url | Optional | A URL to call back once the data is delivered. The WebAPI will send a POST request to the callback_url with the task details once the task is complete (this “notification” will not include the requested data). |
| storage_compress | Optional (default = false) | When set to true, the response saved to the storage_url will be compressed using GZIP format. This can help reduce storage size and improve data transfer efficiency. If not set or set to false, the response will be saved in its original uncompressed format. |

Please add Nimble's system/service user to your GCS or S3 bucket to ensure that data can be delivered successfully.

curl -X POST 'https://api.webit.live/api/v1/async/web' \
--header 'Authorization: Basic <credential string>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "https://www.example.com",
    "method": "GET",
    "parse": true,
    "render": false,
    "storage_type": "s3",
    "storage_url" : "s3://Your.Repository.Path/",
    "callback_url": "https://your.callback.url/path"
}'
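
The request body can also be assembled programmatically. The following is a minimal sketch, assuming the delivery parameters are simply omitted when Push/Pull delivery is wanted; build_async_payload is a hypothetical helper name, and the check on storage_type only reflects the two values listed in the table above.

```python
from typing import Optional


def build_async_payload(
    url: str,
    storage_type: Optional[str] = None,
    storage_url: Optional[str] = None,
    callback_url: Optional[str] = None,
    storage_compress: bool = False,
) -> dict:
    """Assemble the JSON body for an async request, omitting unset options."""
    if storage_type is not None and storage_type not in ("s3", "gs"):
        raise ValueError("storage_type must be 's3' or 'gs'")
    payload = {"url": url}
    if storage_type is not None:
        payload["storage_type"] = storage_type
    if storage_url is not None:
        payload["storage_url"] = storage_url
    if callback_url is not None:
        payload["callback_url"] = callback_url
    if storage_compress:
        payload["storage_compress"] = True
    return payload
```

Calling build_async_payload("https://www.example.com") with no delivery options yields a body containing only the url, which per the table above falls back to Push/Pull delivery.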

Initial Response

In response to triggering an asynchronous request, the details of the created task are returned, which can later be used to check its status. The response contains the Task ID, as well as other information, and is structured as follows:

{
    "status": "success",
    "task": {
        "id": "Task_ID",
        "state": "pending",
        "output_url": "s3://Your.Repository.Path/Task_ID.json",
        "callback_url": "https://your.callback.url/path",
        "status_url": "https://api.webit.live/api/v1/tasks/Task_ID",
        "created_at": "0000-00-00T00:00:00.000Z",
        "modified_at": "0000-00-00T00:00:00.000Z",
        "input": {
            "parse": "true",
            "render": "false",
            "storage_url": "s3://Your.Repository.Path/",
            "storage_type": "s3",
            "url": "https://www.example.com",
            "callback_url": "https://your.callback.url/path"
        }
    }
}
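
A client typically only needs a few fields from this response. The sketch below pulls out the task ID, state, and status URL; TaskRef and parse_initial_response are hypothetical names, and treating any top-level status other than "success" as an error is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass
class TaskRef:
    """The fields needed to follow up on a created task."""
    task_id: str
    state: str
    status_url: str


def parse_initial_response(body: str) -> TaskRef:
    """Extract the task reference from the initial async response body."""
    data = json.loads(body)
    # Assumption: anything other than "success" means the task was not created.
    if data.get("status") != "success":
        raise RuntimeError(f"request was not accepted: {data.get('status')!r}")
    task = data["task"]
    return TaskRef(task_id=task["id"], state=task["state"], status_url=task["status_url"])
```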

Retrieving Task Status (Async)

To check the status of an asynchronous task, use the endpoint https://api.webit.live/api/v1/tasks/<task_id>

Example Request

curl -X GET 'https://api.webit.live/api/v1/tasks/Task_ID' \
--header 'Authorization: Basic <credential string>'
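
When no callback_url is used, a client can poll this endpoint until the task reaches a final state. The following is a minimal sketch under stated assumptions: fetch_task and wait_for_task are hypothetical helpers, and the polling interval and maximum number of checks are arbitrary values, not limits documented by Nimble.

```python
import json
import time
import urllib.request
from typing import Callable

TASKS_URL = "https://api.webit.live/api/v1/tasks/"


def fetch_task(task_id: str, credential: str) -> dict:
    """Perform the GET request from the curl example and decode the JSON body."""
    request = urllib.request.Request(
        TASKS_URL + task_id,
        headers={"Authorization": f"Basic {credential}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_task(
    fetch: Callable[[], dict],
    interval: float = 5.0,
    max_checks: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() repeatedly until the task state is success or failed."""
    for _ in range(max_checks):
        task = fetch()["task"]
        if task["state"] in ("success", "failed"):
            return task
        sleep(interval)
    raise TimeoutError("task did not reach a final state in time")


# Usage (performs real network calls, so it is left commented out):
# task = wait_for_task(lambda: fetch_task("Task_ID", "<credential string>"))
```

Passing the fetch step in as a callable keeps the loop independent of the HTTP client, which also makes it easy to exercise without network access.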

The response object has the same structure as the Task Completion object that is sent to the callback_url upon task completion.

Example Response

A POST request containing the following information is sent to the callback_url once the task is complete:

{
    "status": "success",
    "task": {
        "id": "Task_ID",
        "state": "success",
        "output_url": "s3://Your.Repository.Path/Task_ID.json",
        "callback_url": "https://your.callback.url/path",
        "status_url": "https://api.webit.live/api/v1/tasks/Task_ID",
        "created_at": "0000-00-00T00:00:00.000Z",
        "modified_at": "0000-00-00T00:00:00.000Z",
        "input": {
            ...
        }
    }
}
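
On the receiving side, a callback endpoint only has to accept this POST and read the task details. The sketch below uses Python's built-in http.server; summarize_notification and CallbackHandler are hypothetical names, and replying with HTTP 200 is an assumption, since the expected acknowledgement is not specified here.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_notification(body: bytes) -> dict:
    """Pick the task ID, state, and output location out of a callback body."""
    task = json.loads(body)["task"]
    return {
        "id": task["id"],
        "state": task["state"],
        "output_url": task.get("output_url"),
    }


class CallbackHandler(BaseHTTPRequestHandler):
    """Minimal receiver for task-completion notifications."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        info = summarize_notification(self.rfile.read(length))
        print(f"task {info['id']} finished with state {info['state']}")
        # Assumption: a plain 200 response is an acceptable acknowledgement.
        self.send_response(200)
        self.end_headers()


# To run locally (blocks until interrupted):
# HTTPServer(("0.0.0.0", 8080), CallbackHandler).serve_forever()
```

Because the notification does not include the requested data, the handler would normally go on to read the file at output_url.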

Asynchronous requests also have methods for handling upload failures. For more information, see the Nimble Web API Documentation.
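
Once the output file has been downloaded from the bucket, it may or may not be GZIP-compressed depending on storage_compress. The sketch below handles both cases by checking for the GZIP magic bytes; decode_task_output is a hypothetical helper, and it assumes the stored file contains JSON, as the TASK_ID.json file name suggests.

```python
import gzip
import json

# Every GZIP stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def decode_task_output(raw: bytes):
    """Decode a downloaded output file, decompressing it first if needed."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))
```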

Example - Batch Request

A simple batch processing request uses the following syntax:

Path: https://api.webit.live/api/v1/batch/...

  • Supports up to 1,000 URLs within a single batch request.

| Parameter | Required | Description |
| --- | --- | --- |
| requests | Only when batch processing is required | Object array. Allows for defining custom parameters for each request within the batch. Any of the parameters below can be used in an individual request. |
| storage_type | Optional (default = push/pull) | ENUM: s3, gs. Use s3 for Amazon S3 and gs for Google Cloud Platform. Leave blank to enable Push/Pull delivery. |
| storage_url | Optional (default = push/pull) | Repository URL: s3://Your.Bucket.Name/your/object/name/prefix/ - Output will be saved to TASK_ID.json. Leave blank to enable Push/Pull delivery. |
| callback_url | Optional | A URL to call back once the data is delivered. The WebAPI will send a POST request to the callback_url with the task details once the task is complete (this “notification” will not include the requested data). |
| storage_compress | Optional (default = false) | When set to true, the response saved to the storage_url will be compressed using GZIP format. This can help reduce storage size and improve data transfer efficiency. If not set or set to false, the response will be saved in its original uncompressed format. |

Please add Nimble's system/service user to your GCS or S3 bucket to ensure that data can be delivered successfully.

Example #1 - collecting data from multiple URLs

In this first example, we'll collect data from several unique URLs. To do so, we set the URLs we want to collect in the url fields of the requests object.

curl -X POST 'https://api.webit.live/api/v1/batch/web' \
--header 'Authorization: Basic <credential string>' \
--header 'Content-Type: application/json' \
--data-raw '{ 
    "requests": [
        { "url": "https://www.finance.com" },
        { "url": "https://www.travel.com" },
        { "url": "https://www.socialmedia.com" }
    ],
    "storage_type": "s3",
    "storage_url": "s3://Your.Repository.Path/",
    "callback_url": "https://your.callback.url/path"
}'
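
When the URL list is longer than the 1,000-URL limit of a single batch request, it has to be split across several requests. The following sketch shows one way to do that; build_batches is a hypothetical helper, and any keyword arguments are copied into each batch body as shared defaults.

```python
MAX_URLS_PER_BATCH = 1000  # limit for a single batch request


def build_batches(urls: list[str], **defaults) -> list[dict]:
    """Split a URL list into batch request bodies of at most 1,000 URLs each."""
    batches = []
    for start in range(0, len(urls), MAX_URLS_PER_BATCH):
        chunk = urls[start:start + MAX_URLS_PER_BATCH]
        batches.append({"requests": [{"url": u} for u in chunk], **defaults})
    return batches
```

For example, a list of 2,500 URLs produces three bodies with 1,000, 1,000, and 500 requests, each of which can be posted to the batch endpoint separately.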

Parameters that are placed outside the requests object, such as storage_type, storage_url, and callback_url, are automatically applied as defaults to all defined requests.

If a parameter is set both inside and outside the requests object, the value inside the request overrides the one outside.
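
The override rule can be illustrated client-side with a small merge. This is only a sketch of the behavior described above, not Nimble's server implementation; effective_requests is a hypothetical helper that shows which parameters each request would end up with.

```python
def effective_requests(batch: dict) -> list[dict]:
    """Apply top-level defaults to each request; per-request values win."""
    defaults = {key: value for key, value in batch.items() if key != "requests"}
    # Later keys override earlier ones, so the request's own values take priority.
    return [{**defaults, **request} for request in batch["requests"]]


batch = {
    "requests": [
        {"url": "https://www.finance.com", "country": "US"},
        {"url": "https://www.travel.com"},
    ],
    "country": "CA",
    "storage_type": "s3",
}
for request in effective_requests(batch):
    print(request["url"], request["country"], request["storage_type"])
```

Here the first request keeps its own country (US), while the second inherits the default (CA); both inherit storage_type.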

Example #2 - collecting multiple URLs from multiple countries

In this example, we'll collect multiple URLs with a different country set for each URL. To do so, we'll take advantage of the requests object, which allows us to set any parameter inside each request:

curl -X POST 'https://api.webit.live/api/v1/batch/web' \
--header 'Authorization: Basic <credential string>' \
--header 'Content-Type: application/json' \
--data-raw '{ 
    "requests": [
        { "url": "https://www.finance.com", "country": "US", "locale": "en-US" },
        { "url": "https://www.travel.com",  "country": "FR", "locale": "fr" },
        { "url": "https://www.socialmedia.com",  "country": "DE", "locale": "de" },
        { "url": "https://www.searchengine.com" }
    ],
    "country": "CA", 
    "locale": "en-CA",
    "storage_type": "s3",
    "storage_url": "s3://Your.Repository.Path/",
    "callback_url": "https://your.callback.url/path"
}'

For the above request, each URL would be requested from the corresponding country. The last URL, "https://www.searchengine.com", does not have a country set in its request, and thus will default to the country defined outside the requests object (CA - Canada). If no default country had been set, the request would have used a randomly selected country.

Example #3 - collecting the same URL from different countries

Any parameter can be defined inside and/or outside the requests object. We can take advantage of this in some cases by setting our URL once as a default and setting various other parameters in the requests object. For example:

curl -X POST 'https://api.webit.live/api/v1/batch/web' \
--header 'Authorization: Basic <credential string>' \
--header 'Content-Type: application/json' \
--data-raw '{ 
    "requests": [
        { "country": "US", "locale": "en-US" },
        { "country": "FR", "locale": "fr" },
        { "country": "DE", "locale": "de" }
    ],
    "url": "https://www.finance.com",
    "storage_type": "s3",
    "storage_url": "s3://Your.Repository.Path/",
    "callback_url": "https://your.callback.url/path"
}'

In the above example, the URL "https://www.finance.com" would be requested three times - once from the US, once from France, and once from Germany.
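
The same pattern can be generated from a list of per-request variations. The sketch below is a hypothetical helper (same_url_batch) that sets the URL once as a default and places each variation in the requests array.

```python
def same_url_batch(url: str, variants: list[dict], **defaults) -> dict:
    """Build a batch body that requests one URL once per variant."""
    if not variants:
        raise ValueError("at least one variant is required")
    return {"requests": [dict(variant) for variant in variants], "url": url, **defaults}


body = same_url_batch(
    "https://www.finance.com",
    [
        {"country": "US", "locale": "en-US"},
        {"country": "FR", "locale": "fr"},
        {"country": "DE", "locale": "de"},
    ],
    storage_type="s3",
    storage_url="s3://Your.Repository.Path/",
)
```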

Like asynchronous tasks, the status of a batch is available for 24 hours. Batch progress can be checked as described below.

Checking batch progress and status

To check the progress of a batch, use the endpoint https://api.webit.live/api/v1/batches/<batch_id>/progress

curl -X GET 'https://api.webit.live/api/v1/batches/<batch_id>/progress' \
--header 'Authorization: Basic <credential string>'

Response

The progress of a batch is reported as a fraction between 0 and 1.

{
    "status": "success",
    "completed": false,
    "progress": 0.333333
}

Once a batch is finished, its progress will be reported as “1”.

{
    "status": "success",
    "completed": true,
    "progress": 1
}
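
A client can turn the progress response into a readable status line. This is a minimal sketch based on the two example responses above; describe_progress is a hypothetical helper name.

```python
def describe_progress(response: dict) -> str:
    """Format a batch progress response, showing progress as a percentage."""
    percent = f"{response['progress']:.1%}"
    if response["completed"]:
        return f"complete: {percent}"
    return f"in progress: {percent}"
```

With the first example response, describe_progress returns "in progress: 33.3%".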

Retrieving Batch Summary

Once a batch has finished, it’s possible to return a summary of the completed tasks by using the following endpoint: https://api.webit.live/api/v1/batches/<batch_id>

Example Request

curl -X GET 'https://api.webit.live/api/v1/batches/7a07a96d-c402-4d98-a17f-4ecb390d11a3' \
--header 'Authorization: Basic <credential string>'

The response object lists the status of the overall batch, as well as the individual tasks and their details:

Example Response

{
    "status": "success",
    "tasks": [
        {
            "batch_id": "7a07a96d-c402-4d98-a17f-4ecb390d11a3",
            "id": "2e508d43-8b02-4fc0-96c7-0968ab454a0c",
            "state": "success",
            "query_time": "2023-01-01T12:00:00.007Z",
            "output_url": "s3://Your.Repository.Path/2e508d43-8b02-4fc0-96c7-0968ab454a0c.json",
            "callback_url": "https://your.callback.url/path",
            "status_url": "https://[base_url]/api/v1/tasks/2e508d43-8b02-4fc0-96c7-0968ab454a0c",
            "created_at": "2022-07-24T08:09:23.205Z",
            "modified_at": "2022-07-24T08:10:27.244Z",
            "input": {
                ...
            }
        },
        {
            "batch_id": "7a07a96d-c402-4d98-a17f-4ecb390d11a3",
            "id": "63cc3bd5-01b4-4787-90a2-f382b9960c77",
            "state": "success",
            "query_time": "2023-01-01T12:00:00.007Z",
            "output_url": "s3://Your.Repository.Path/63cc3bd5-01b4-4787-90a2-f382b9960c77.json",
            "callback_url": "https://your.callback.url/path",
            "status_url": "https://[base_url]/api/v1/tasks/63cc3bd5-01b4-4787-90a2-f382b9960c77",
            "created_at": "2022-07-24T08:09:23.205Z",
            "modified_at": "2022-07-24T08:10:27.973Z",
            "input": {
                ...
            }
        },
        {
            "batch_id": "7a07a96d-c402-4d98-a17f-4ecb390d11a3",
            "id": "4cb39bbf-5580-4c50-8ed4-4a7905e2ec52",
            "state": "success",
            "query_time": "2023-01-01T12:00:00.007Z",
            "output_url": "s3://Your.Repository.Path/4cb39bbf-5580-4c50-8ed4-4a7905e2ec52.json",
            "callback_url": "https://your.callback.url/path",
            "status_url": "https://[base_url]/api/v1/tasks/4cb39bbf-5580-4c50-8ed4-4a7905e2ec52",
            "created_at": "2022-07-24T08:09:23.205Z",
            "modified_at": "2022-07-24T08:10:30.292Z",
            "input": {
                ...
            }
        }
    ],
    "completed": true,
    "progress": 1
}
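
To act on a summary like this, a client might count tasks per state and collect the output locations of the successful ones. The sketch below assumes the response structure shown above; summarize_batch is a hypothetical helper name.

```python
from collections import Counter


def summarize_batch(summary: dict) -> dict:
    """Count tasks by state and list outputs of successful tasks and failed IDs."""
    tasks = summary["tasks"]
    return {
        "counts": dict(Counter(task["state"] for task in tasks)),
        "outputs": [task["output_url"] for task in tasks if task["state"] == "success"],
        "failed_ids": [task["id"] for task in tasks if task["state"] == "failed"],
    }
```

The failed_ids list gives a starting point for resubmitting only the tasks that did not complete.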
