Scrapy
Scrapy is a popular open-source web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It is written in Python and provides a complete toolset for scraping tasks.
Scrapy simplifies the process of writing complex spiders, which are programs that browse the Web and extract data based on a set of instructions. It is highly extensible: custom functionality can be added through middlewares, extensions, and item pipelines. It handles a wide range of scraping and crawling tasks, from simple data extraction to large-scale web mining.
Configuration
Setting Up Your Nimble Account
If you haven't already, create a Nimble account to get access to the Web API.
Configure Scrapy Settings
The first step is to install Nimble's Scrapy middleware using pip:
Next, configure your Scrapy project to interact with Nimble's Web API by updating your settings.py:
Then, add the Nimble middleware to your downloader middlewares:
Ensure that the Nimble middleware is configured to run before the default scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware, which is enabled by default in DOWNLOADER_MIDDLEWARES_BASE with an order of 590. In practice, this means giving the Nimble middleware an order value lower than 590.
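The ordering rule can be sketched as a settings.py fragment. This is only an illustration: the dotted middleware path and the value 570 are placeholders chosen for the example, not values confirmed by Nimble's documentation, so copy the real class path from the middleware's own instructions. Scrapy calls process_request() on downloader middlewares in increasing order, so any value below 590 places the middleware ahead of HttpCompressionMiddleware.

```python
# settings.py (sketch)

# ASSUMPTION: placeholder path, not the confirmed class path of Nimble's
# middleware. Replace it with the path given in the middleware's documentation.
NIMBLE_MIDDLEWARE_PATH = "your_nimble_package.middlewares.NimbleMiddleware"

# Order of Scrapy's built-in HttpCompressionMiddleware
# in DOWNLOADER_MIDDLEWARES_BASE.
HTTP_COMPRESSION_ORDER = 590

DOWNLOADER_MIDDLEWARES = {
    # 570 is an arbitrary example; any value below 590 runs this
    # middleware's process_request() before HttpCompressionMiddleware's.
    NIMBLE_MIDDLEWARE_PATH: 570,
}

# Sanity check that fails fast if the order is raised by mistake.
assert DOWNLOADER_MIDDLEWARES[NIMBLE_MIDDLEWARE_PATH] < HTTP_COMPRESSION_ORDER
```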
Basic Usage
Middleware Handling
With the middleware configured, every request sent from your Scrapy spiders will automatically pass through the Nimble Web API. There's no need for additional changes in your spider code for basic usage.
Advanced Features
Real-time URL Requests
The Nimble Web API enhances your scraping capabilities with options for real-time URL requests. This feature allows for dynamic content rendering, geolocated requests, and more.
To use these features, add the relevant options to the meta dictionary of your request. Here’s how you can specify these options:
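As a rough sketch of the pattern, the helper below builds a meta dictionary for a single request. The key names nimble_render and nimble_country and the helper build_nimble_meta are assumptions made for illustration; the option names the middleware actually recognizes should be taken from Nimble's documentation.

```python
from typing import Any, Dict, Optional


def build_nimble_meta(render: bool = False,
                      country: Optional[str] = None) -> Dict[str, Any]:
    """Build a per-request meta dict (hypothetical helper).

    ASSUMPTION: the key names below are illustrative placeholders,
    not confirmed option names of the Nimble middleware.
    """
    meta: Dict[str, Any] = {"nimble_render": render}  # assumed key name
    if country is not None:
        # Normalize the country code to upper case, e.g. "us" -> "US".
        meta["nimble_country"] = country.upper()  # assumed key name
    return meta


meta = build_nimble_meta(render=True, country="us")

# Inside a spider callback, the dict is attached to the request through
# Scrapy's standard `meta` argument:
#     yield scrapy.Request("https://example.com", meta=meta, callback=self.parse)
```

Building the dictionary in one place keeps the option names out of individual spiders, so they only need to be corrected once if they differ from the assumed ones.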
Development Environment Setup
Python Environment
It's recommended to use pyenv for managing Python versions and creating an isolated development environment:
Now, your development environment is set up, and you're ready to develop your Scrapy project with Nimble's Web API.