Realtime, Async & Batch Requests
What?
Real Time Request
A real-time scraping request executes the scraping task on the target domain and delivers the response immediately after the desired data is received. The exact timing depends on timeout settings and may be affected by rendering delays while the page data loads, but the response is returned to the user as soon as it becomes available.
Async Request
An asynchronous scraping request initiates the scraping task on the target domain and operates independently of the user's immediate session.
The user doesn't wait for a response directly after the request. Instead, once the scraping is complete and the data is ready, the user is notified through a callback URL or a specified cloud storage URL.
This method allows the scraping process to run in the background, enabling the user to continue with other tasks and receive the scraped data once it's available, without having to manage or monitor the ongoing process actively.
Batch Request
The Batch Processing feature allows users to perform queries of up to 1,000 URLs in a single batch request, significantly improving efficiency and reducing the time needed for large-scale data collection tasks:
Supports custom settings for each URL in a batch, including different geolocations, rendering options, and parsing templates, to meet specific data collection requirements.
Offers asynchronous processing, enabling data collection tasks to run in the background without interrupting other operations, and providing flexibility in handling large volumes of data.
Integrates with cloud storage solutions for automated data delivery, facilitating seamless workflow integration and immediate access to collected data.
Why?
Real Time Request
Immediate & Accurate Data Access: Real-time scraping retrieves data as it is currently displayed on websites. This is crucial for obtaining the most up-to-date information, whether for monitoring prices, stock levels, news updates, or social sentiment. This immediacy ensures that decisions are based on the latest data available.
Event-Driven Responses: By scraping data in real time, you can trigger actions based on specific conditions or changes detected on the target website. For instance, receiving alerts when a product goes on sale, when a competitor changes their pricing, or when new content is posted, enabling prompt and relevant responses.
Async Request
Non-blocking Operations: Async requests allow other processes to run concurrently while the data request is being handled. This means that your application doesn't have to pause and wait for the data retrieval to complete, enhancing overall performance and user experience, especially in applications that handle large volumes of data or require high responsiveness.
Efficiency in Handling Multiple Requests: With async requests, you can send out multiple data requests simultaneously rather than waiting for each to complete sequentially. This is particularly useful in scenarios like data scraping, API data retrieval, or loading data from different sources, as it significantly speeds up the overall process.
Scalability: Asynchronous processing is more scalable as it better manages resources and handles increases in workload. For services that expect a high volume of requests or need to scale dynamically, async requests ensure that the system remains responsive and performant under varying loads.
Batch Request
Increased Efficiency: Handling up to 1,000 URLs in a single batch request streamlines operations by reducing the number of individual requests needed. This consolidation minimizes overhead associated with setting up and tearing down connections, thereby enhancing the overall efficiency of data processing.
Asynchronous Processing with Flexible Data Handling: The batch processing feature operates asynchronously, allowing users to execute other tasks while their request processes. With options to use a callback URL or direct cloud storage for results, this feature enhances workflow efficiency and offers versatile data management, seamlessly integrating with various systems and needs.
Reduced Resource Consumption: With fewer HTTP connections needed for the same amount of work, your server resources are better utilized. This not only optimizes server performance but also potentially lowers costs related to bandwidth and computing power.
Simplified Management: Managing one request with multiple URLs is simpler and more straightforward than handling numerous single-URL requests. This simplifies the workflow for developers and reduces the complexity of code needed to handle large volumes of data.
Asynchronous Request Process
Unlike real-time requests, asynchronous requests do not return the result directly to the client. Instead, an asynchronous request produces a “task” that runs in the background and delivers the resulting data to a cloud storage bucket and/or callback URL. Tasks go through four stages:
pending: The task is still being processed.
uploading: The results are being uploaded to the destination repository.
success: The task was completed and the result was stored in the destination repository.
failed: Nimble was unable to complete the task; no file was created in the repository.
Delivery Methods
Nimble API supports three delivery methods: real-time, cloud storage (Amazon S3 or Google Cloud Storage), and Push/Pull.
For real-time delivery, see our page on performing a real-time URL request. To use Cloud or Push/Pull delivery, use an asynchronous request instead. Asynchronous requests also have the added benefit of running in the background, so you can continue working without waiting for the job to complete.
Request Options
Example - Realtime Request
A simple real-time request uses the following syntax:
Path: https://api.webit.live/api/v1/realtime/...

url (Required)
URL: The page or resource to be fetched. Note: when using a URL with a query string, encode the URL and place it at the end of the query string.
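For illustration, here is a minimal Python sketch of a real-time call. It assumes the Web API accepts a JSON POST body with Basic authentication using your Nimble account credentials; the credentials, target URL, and the exact resource path under /realtime/ (elided above as "/...") are placeholders, not confirmed values.

```python
# Hedged sketch of a real-time request; credentials, the resource path under
# /realtime/, and the target URL are placeholders.
import requests

API_BASE = "https://api.webit.live/api/v1"
REALTIME_PATH = "/realtime/..."  # replace "..." with the documented resource path

payload = {
    # "url" is the only required parameter; remember to encode query strings.
    "url": "https://www.example.com/",
}

resp = requests.post(
    API_BASE + REALTIME_PATH,
    json=payload,
    auth=("NIMBLE_USERNAME", "NIMBLE_PASSWORD"),  # assumption: Basic auth
    timeout=120,
)
resp.raise_for_status()
data = resp.json()  # real-time requests return the scraped data directly
print(data)
```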
Example - Async Request
A simple async request uses the following syntax:
Path: https://api.webit.live/api/v1/async/...

storage_type (Optional, default = Push/Pull)
ENUM: s3 | gs. Use s3 for Amazon S3 and gs for Google Cloud Platform. Leave blank to enable Push/Pull delivery.

storage_url (Optional, default = Push/Pull)
Repository URL: s3://Your.Bucket.Name/your/object/name/prefix/ - output will be saved to TASK_ID.json. Leave blank to enable Push/Pull delivery.

callback_url (Optional)
A URL to call back once the data is delivered. The Web API will send a POST request to the callback_url with the task details once the task is complete (this notification will not include the requested data).

storage_compress (Optional, default = false)
When set to true, the response saved to the storage_url will be compressed using GZIP format. This can help reduce storage size and improve data transfer efficiency. If not set or set to false, the response will be saved in its original uncompressed format.
Please add Nimble's system/service user to your GCS or S3 bucket to ensure that data can be delivered successfully.
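As a rough illustration, the sketch below shows an async request that delivers results to an S3 bucket and notifies a callback URL. The bucket, callback address, credentials, target URL, and the resource path under /async/ are placeholders, and Basic authentication is assumed.

```python
# Hedged sketch of an async request with cloud delivery; all names below are
# placeholders, and Basic authentication is an assumption.
import requests

API_BASE = "https://api.webit.live/api/v1"
ASYNC_PATH = "/async/..."  # replace "..." with the documented resource path

payload = {
    "url": "https://www.example.com/",
    "storage_type": "s3",                               # s3 | gs
    "storage_url": "s3://your-bucket/nimble/results/",  # output saved as TASK_ID.json
    "storage_compress": True,                           # GZIP the stored response
    "callback_url": "https://your-service.example.com/nimble-callback",
}

resp = requests.post(
    API_BASE + ASYNC_PATH,
    json=payload,
    auth=("NIMBLE_USERNAME", "NIMBLE_PASSWORD"),  # assumption: Basic auth
    timeout=30,
)
resp.raise_for_status()
task = resp.json()
# The initial response describes the created task (including its ID) rather
# than the scraped data; keep the task ID to check its status later.
print(task)
```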
Initial Response
In response to triggering an asynchronous request, the details of the created task are returned, which can later be used to check its status. The response contains the Task ID, as well as other information, and is structured as follows:
Retrieving Task Status (Async)
To check the status of an asynchronous task, use the following endpoint:
https://api.webit.live/api/v1/tasks/<task_id>
Example Request
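For instance, a hedged Python sketch of the status check; the task ID and credentials are placeholders, and Basic authentication is assumed.

```python
# Hedged sketch of a task status check; task_id and credentials are placeholders.
import requests

task_id = "YOUR_TASK_ID"  # returned in the initial async response
resp = requests.get(
    f"https://api.webit.live/api/v1/tasks/{task_id}",
    auth=("NIMBLE_USERNAME", "NIMBLE_PASSWORD"),  # assumption: Basic auth
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # task details, including pending / uploading / success / failed
```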
The response object has the same structure as the Task Completion object that is sent to the callback_url upon task completion.
Example Response
Once the task is complete, a POST request containing the following information will be sent to the callback_url:
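As a rough illustration, a minimal receiver for that notification might look like the sketch below. The route, port, and use of Flask are assumptions, and the notification body is treated as opaque JSON since its exact schema is documented separately.

```python
# Hedged sketch of a callback receiver; the route and port are placeholders,
# and the notification payload is handled as opaque JSON.
from flask import Flask, request

app = Flask(__name__)

@app.route("/nimble-callback", methods=["POST"])
def nimble_callback():
    notification = request.get_json(force=True)
    # The notification contains task details only; the scraped data itself
    # must be fetched from the configured storage location.
    print("Task finished:", notification)
    return "", 204

if __name__ == "__main__":
    app.run(port=8000)
```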
Asynchronous requests also have methods for handling upload failures. For more information, see the Nimble Web API Documentation.
Example - Batch Request
A simple batch processing request uses the following syntax:
Path: https://api.webit.live/api/v1/batch/...
Supports up to 1,000 URLs within a single batch request.

requests (Required for batch processing)
Object array: allows defining custom parameters for each request within the batch. Any of the parameters below can be used in an individual request.

storage_type (Optional, default = Push/Pull)
ENUM: s3 | gs. Use s3 for Amazon S3 and gs for Google Cloud Platform. Leave blank to enable Push/Pull delivery.

storage_url (Optional, default = Push/Pull)
Repository URL: s3://Your.Bucket.Name/your/object/name/prefix/ - output will be saved to TASK_ID.json. Leave blank to enable Push/Pull delivery.

callback_url (Optional)
A URL to call back once the data is delivered. The Web API will send a POST request to the callback_url with the task details once the task is complete (this notification will not include the requested data).

storage_compress (Optional, default = false)
When set to true, the response saved to the storage_url will be compressed using GZIP format. This can help reduce storage size and improve data transfer efficiency. If not set or set to false, the response will be saved in its original uncompressed format.
Please add Nimble's system/service user to your GCS or S3 bucket to ensure that data can be delivered successfully.
Example #1 - collecting data from multiple URLs
In this first example, we'll collect data from several unique URLs. To do so, we set the URLs we want to collect in the url fields of the requests object.
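A hedged Python sketch of such a batch request follows; the bucket, callback, credentials, target URLs, and the resource path under /batch/ are placeholders, and Basic authentication is assumed.

```python
# Hedged sketch of a batch request with shared defaults; all names are placeholders.
import requests

API_BASE = "https://api.webit.live/api/v1"
BATCH_PATH = "/batch/..."  # replace "..." with the documented resource path

payload = {
    # Per-URL requests; each entry may also carry its own parameters.
    "requests": [
        {"url": "https://exampleone.com"},
        {"url": "https://exampletwo.com"},
        {"url": "https://examplethree.com"},
    ],
    # Parameters outside "requests" act as defaults for every request above.
    "storage_type": "s3",
    "storage_url": "s3://your-bucket/nimble/batch-results/",
    "callback_url": "https://your-service.example.com/nimble-callback",
}

resp = requests.post(
    API_BASE + BATCH_PATH,
    json=payload,
    auth=("NIMBLE_USERNAME", "NIMBLE_PASSWORD"),  # assumption: Basic auth
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # batch details, including the batch id used to track progress
```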
Parameters that are placed outside the requests object, such as storage_type, storage_url, and callback_url, are automatically applied as defaults to all defined requests. If a parameter is set both inside and outside the requests object, the value inside the request overrides the one outside.
Example #2 - collecting multiple URLs from multiple countries
In this example, we'll collect multiple URLs with a different country set for each URL. To do so, we'll take advantage of the requests object, which allows us to set any parameter inside each request:
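A hedged sketch of such a payload is shown below; the country parameter name and the exact URLs and country codes are assumptions for illustration.

```python
# Hedged sketch of a per-request country batch payload; the "country"
# parameter name, URLs, and country codes are assumptions for illustration.
payload = {
    "requests": [
        {"url": "https://exampleone.com", "country": "US"},
        {"url": "https://exampletwo.com", "country": "FR"},
        {"url": "https://examplethree.com", "country": "DE"},
        {"url": "https://examplefour.com"},  # no country: falls back to the default below
    ],
    "country": "CA",  # default applied to requests that do not set their own country
}
```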
For the above request, each URL would be requested from the corresponding country. "examplefour.com" does not have a country set in its request, and thus will default to the country defined outside the requests object (CA - Canada). If no default country had been set, a randomly selected country would have been used.
Example #3 - collecting the same URL from different countries
Any parameter can be defined inside and/or outside the requests object. We can take advantage of this in some cases by setting our URL once as a default and setting various other parameters in the requests object. For example:
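The sketch below is a hedged illustration of that pattern; again, the country parameter name is an assumption.

```python
# Hedged sketch: the URL is set once as a default, while each entry in
# "requests" overrides only the country. The "country" name is an assumption.
payload = {
    "url": "https://exampleone.com",  # default URL applied to every request
    "requests": [
        {"country": "US"},
        {"country": "FR"},
        {"country": "DE"},
    ],
}
```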
In the above example, the URL "exampleone.com" would be requested three times - once from the US, once from France, and once from Germany.
Like asynchronous tasks, the status of a batch is available for 24 hours, and the user can check the batch's progress as shown below.
Checking batch progress and status
https://api.webit.live/api/v1/batches/<batch_id>/progress
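A hedged Python sketch of the progress check; the batch ID and credentials are placeholders, and Basic authentication is assumed.

```python
# Hedged sketch of a batch progress check; batch_id and credentials are placeholders.
import requests

batch_id = "YOUR_BATCH_ID"  # returned when the batch was created
resp = requests.get(
    f"https://api.webit.live/api/v1/batches/{batch_id}/progress",
    auth=("NIMBLE_USERNAME", "NIMBLE_PASSWORD"),  # assumption: Basic auth
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # batch progress; "1" indicates the batch has finished
```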
Response
The progress of a batch is reported in percentages.
Once a batch is finished, its progress will be reported as “1”.
Retrieving Batch Summary
Once a batch has finished, it's possible to return a summary of the completed tasks by using the following endpoint:
https://api.webit.live/api/v1/batches/<batch-id>
Example Request
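For instance, a hedged Python sketch; the batch ID and credentials are placeholders, and Basic authentication is assumed.

```python
# Hedged sketch of a batch summary request; batch_id and credentials are placeholders.
import requests

batch_id = "YOUR_BATCH_ID"
resp = requests.get(
    f"https://api.webit.live/api/v1/batches/{batch_id}",
    auth=("NIMBLE_USERNAME", "NIMBLE_PASSWORD"),  # assumption: Basic auth
    timeout=30,
)
resp.raise_for_status()
summary = resp.json()
print(summary)  # overall batch status plus per-task details
```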
The response object lists the status of the overall batch, as well as the individual tasks and their details:
Example Response