This blog is about integrating Django with Scrapy. The goal is to retrieve publicly available data from the web using Scrapy and then build a service on top of it using Django.
Let’s dive right in with an overview of web scraping and Scrapy.
What is scraping?
Scraping, also known as web data extraction or web harvesting, is the process of mining data from websites. It is done by a scraper: software that simulates human interaction with a web page to retrieve the information you want (e.g. images, text, videos).
A scraper makes a GET request to a website and parses the HTML response. It then searches the HTML for the required data and repeats the process until all the desired data has been collected.
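To make this concrete, here is a minimal sketch of that request-and-parse loop outside of any framework, using the requests and BeautifulSoup libraries (the URL and the h2 selector are placeholders rather than a real target):

import requests
from bs4 import BeautifulSoup

# Make a GET request to the page (placeholder URL).
response = requests.get("https://example.com")

# Parse the HTML response and search it for the data we want.
soup = BeautifulSoup(response.text, "html.parser")
for heading in soup.find_all("h2"):
    print(heading.get_text(strip=True))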
There are many reasons to do web scraping, such as lead generation and market analysis. However, when scraping websites, you must always be careful not to violate the terms and conditions of the sites you are scraping, or to violate anyone’s privacy. This is why scraping is thought to be a little controversial, but it can save many hours of searching through sites and logging data manually. All of those hours saved mean a great deal of money saved.
Scrapy
Scrapy is a Python-based framework that is widely used for scraping. It allows you to define data structures, write extraction logic, and plug in pre- and post-processing pipelines that act on the requests and responses of a crawl. It also provides built-in XPath/CSS selectors to extract the desired data, and gives you control over the speed and rate at which you make requests to a given site/domain so you can stay within a site’s rules.
Websites tend to have countermeasures against excessive requests, so Scrapy randomises the delay between requests by default, which can help you avoid getting banned. Scrapy can also be used for automated testing and monitoring.
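For example, the delay and its randomisation can be tuned in a project’s settings (the values below are illustrative, not recommendations):

# settings.py
# Wait roughly 2 seconds between requests to the same domain.
DOWNLOAD_DELAY = 2
# Scale each delay by a random factor between 0.5 and 1.5 (on by default).
RANDOMIZE_DOWNLOAD_DELAY = True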
Scrapy and Django
As per our goal, we need to extract data from publicly available sites using Scrapy, save it directly to the database used by our service, and serve it in our application.
Scrapy used to have a built-in class called DjangoItem, which is now an easy-to-use external library. The DjangoItem library provides an item class that uses the fields defined in a Django model, just by specifying which model it relates to. The class also provides a method to create and populate a Django model instance with the item data from the pipeline. This library allows us to integrate Scrapy and Django easily and means we also have access to all the data directly in the admin!
Spiders
Spiders are the core building blocks of Scrapy. A simple scraper consists of a spider, an item (which defines the structure of the data to be extracted), scraper settings, and middlewares and pipelines if required, depending on the requirements.
Scrapy has a start_requests method which generates a request with the URL. When Scrapy fetches a website according to the request, it passes the response to the callback method specified in the request object. The callback method can generate an item from the response data or generate another request.
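As a rough sketch (the URLs and selectors are placeholders), a spider with an explicit start_requests and a callback looks like this:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # Generate the initial request; the URL is a placeholder.
        yield scrapy.Request("https://example.com", callback=self.parse)

    def parse(self, response):
        # The callback can generate an item from the response data...
        yield {"title": response.css("title::text").get()}
        # ...or generate another request, e.g. to follow a "next" link.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)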
What happens behind the scenes? Every time we start a Scrapy task, we start a crawler to do it. The spider defines how to perform the crawl (i.e. which links to follow). The crawler has an engine that drives its flow. When a crawler starts, it gets the spider from its queue, which means a crawler can have more than one spider. The crawler then starts the spider, and the engine schedules its requests to crawl the webpages. Middlewares, organised in chains, process the requests and responses as they pass through the engine.
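To illustrate the request side of that chain, a minimal downloader middleware (a sketch; the class name and header value are placeholders) looks like this:

class CustomHeaderMiddleware:
    # Called for every request that flows through the chain.
    def process_request(self, request, spider):
        request.headers["User-Agent"] = "my-crawler (+https://example.com)"
        # Returning None lets the request continue to the next middleware.
        return None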
Selectors/Extractors
Scrapy provides built-in XPath and CSS selectors to extract the specific data you need or the links you want to follow in your crawler.
XPath selects nodes in XML documents (and can also be used on HTML documents), while CSS is a language for applying styles to HTML documents; CSS selectors use HTML classes and id attributes to select the data within tags. In the background, Scrapy uses the cssselect library to transform CSS selectors into XPath selectors.
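For example, the same selection can usually be written either way inside a spider callback (the class name title is a generic illustration):

# CSS: select <a> tags by class, then take the text inside them.
titles_css = response.css("a.title::text").getall()
# XPath: the equivalent node selection.
titles_xpath = response.xpath('//a[@class="title"]/text()').getall()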
Items and Pipeline
Items produce the output; they are used to structure the data parsed by the spider. The Item Pipeline is where the data is processed once the items have been extracted by the spiders. Here we can run tasks such as validation and storing items in a database.
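A plain Scrapy item (before bringing Django in) simply declares a field per piece of data the spider extracts; this is a generic sketch:

import scrapy

class ArticleItem(scrapy.Item):
    # One Field per piece of data the spider will fill in.
    title = scrapy.Field()
    url = scrapy.Field()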
Steps
- Create a Django project
- Set up your Django project along with the app
- Define the models according to the data you want to extract
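For example, assuming a Django app called app and that we are scraping articles (the model and field names are illustrative):

app/models.py

from django.db import models

class ModelItem(models.Model):
    # Fields mirror the data the spider will extract.
    title = models.CharField(max_length=255)
    url = models.URLField()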
- Install Scrapy
- Create a Scrapy project using the command
scrapy startproject scraper
- Connect your Django model to the Scrapy project’s items using DjangoItem
scraper/items.py
from scrapy_djangoitem import DjangoItem
from app.models import ModelItem

class ScrapyItem(DjangoItem):
    django_model = ModelItem
- Set up the crawler and the extractor to scrape the data
- The spider should return data items as defined in the Django model, for example:
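Here is a sketch of such a spider, reusing the ScrapyItem class defined above (the file name, URL, and selectors are placeholders):

scraper/spiders/article_spider.py

import scrapy
from scraper.items import ScrapyItem

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = ["https://example.com"]

    def parse(self, response):
        for article in response.css("article"):
            # The field names must match the fields on the Django model.
            yield ScrapyItem(
                title=article.css("h2::text").get(),
                url=article.css("a::attr(href)").get(),
            )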
- Create a pipeline to store the extracted items in the DB
scraper/pipelines.py
class ScrapyItemPipeline(object):
    def process_item(self, item, spider):
        item.save()
        return item
- Enable the pipeline in the settings
scraper/settings.py
ITEM_PIPELINES = {"scraper.pipelines.ScrapyItemPipeline": 300}
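Note that for the Django models to be importable from inside the Scrapy project, Django itself must be set up before the spiders run. A common pattern (assuming the Django project’s settings module is called project.settings; adjust to your own) is to bootstrap it at the top of scraper/settings.py:

import os
import django

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "project.settings")
django.setup()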
- Once you run the spider, the extracted data items will be connected to your Django model and the pipeline will save them to the DB
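The spider is run by name from inside the Scrapy project, e.g. for the illustrative spider sketched above:

scrapy crawl articles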
There are a lot of applications that use Scrapy to gather publicly available data. It basically automates the process of gathering large amounts of data.
I hope this blog is useful for beginners looking to build small applications on top of publicly available data. Thanks!