[Tech Blog] Scrapy with Django Integration

This blog is about integrating Django with Scrapy. The goal is to retrieve public data available on the web using Scrapy, and then build a service on top of it using Django.

Let’s just dive right into it with an overview of web scraping and Scrapy.

What is scraping?

Scraping, also known as web data extraction or web harvesting, is the process of extracting data from websites. A scraper is software that simulates human interaction with a web page to retrieve the desired information (e.g. images, text, videos).

The scraper makes a GET request to a website and parses the HTML response. It then searches the HTML for the required data and repeats the process until all the desired data has been collected.
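
To make this concrete, here is a minimal sketch of that GET-and-parse loop using only Python’s standard library; the URL is a placeholder, and real scrapers are rarely this simple:

    from html.parser import HTMLParser
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        # collect the href of every <a> tag found while parsing
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(value for name, value in attrs if name == "href")

    html = urlopen("https://example.com").read().decode("utf-8")  # the GET request
    parser = LinkExtractor()
    parser.feed(html)  # search the HTML for the data we want
    print(parser.links)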

There are many reasons to do web scraping, such as lead generation and market analysis. However, when scraping websites you must always be careful not to violate the terms and conditions of the sites you scrape, or anyone’s privacy. This is why scraping is thought to be a little controversial, but it can save many hours of searching through sites and logging data manually, and all of those hours saved mean a great deal of money saved.

Scrapy

Scrapy is a Python-based framework that is widely used for scraping. It lets you define data structures, write extraction logic, and hook pre- and post-processing pipelines into the request/response cycle of a web request. It also provides built-in XPath/CSS selectors to extract the desired data, and gives you control over the speed and rate at which you make requests to a given site/domain so you can stay within its rules.

Websites tend to have countermeasures against excessive requests, so Scrapy randomises the time between requests by default, which can help you avoid getting banned. Scrapy can also be used for automated testing and monitoring.
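
This behaviour is controlled through the project settings. These are real Scrapy options; the values below are only illustrative:

    # settings.py
    DOWNLOAD_DELAY = 2                  # wait ~2 seconds between requests to the same site
    RANDOMIZE_DOWNLOAD_DELAY = True     # the default: vary the delay between 0.5x and 1.5x
    CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap parallel requests per domain
    ROBOTSTXT_OBEY = True               # respect each site's robots.txt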

Scrapy and Django

As per our goal, we need to extract data from publicly available sites using Scrapy, save it directly to the DB for our service, and serve it in our application.

Scrapy used to have a built-in class called DjangoItem, which is now an easy-to-use external library. The DjangoItem library provides an item class that reuses the fields defined in a Django model: you simply specify which model the item relates to. The class also provides a method to create and populate a Django model instance with the item data from the pipeline. This library lets us integrate Scrapy and Django easily, and means we also get access to all the data directly in the Django admin!

Spiders

Spiders are the core building blocks of Scrapy. A simple scraper consists of a spider, an item (which defines the structure of the data to be extracted), the scraper settings, and middlewares and pipelines if the requirements call for them.

Scrapy calls the spider’s start_requests method, which generates the initial requests with their URLs. When Scrapy fetches a website for a request, it passes the response to the callback method specified in the request object. The callback method can generate an item from the response data or generate another request.
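
Here is a small sketch of this request/callback cycle; the spider name, URL and selectors are placeholders for illustration:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            # generate the initial request with its callback
            yield scrapy.Request("https://example.com/page/1", callback=self.parse)

        def parse(self, response):
            # the callback can generate items from the response data...
            for quote in response.css("div.quote"):
                yield {"text": quote.css("span.text::text").get()}
            # ...or generate another request
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)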

What happens behind the scenes? Every time we start a Scrapy task, we start a crawler to do it. The spider defines how to perform the crawl (i.e. which links to follow). The crawler has an engine to drive its flow. When a crawler starts, it gets a spider from its queue, which means a crawler can have more than one spider. The crawler then starts the spider, and the engine schedules it to crawl the webpage. The engine middlewares drive the flow of the crawler; they are organised in chains to process requests and responses.

Selectors/Extractors

Scrapy provides built-in XPath and CSS selectors to extract the specific data you need, or the links you want to follow in your crawler.

XPath is a language for selecting nodes in XML documents (and it can also be used with HTML), while CSS is a language for applying styles to HTML documents; CSS selectors use HTML classes, ids and tag names to select the data within tags. In the background, Scrapy uses the cssselect library to transform these CSS selectors into XPath selectors.
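
For instance, the same piece of data can be selected either way. A standalone sketch using Scrapy’s Selector:

    from scrapy.selector import Selector

    sel = Selector(text='<div class="title"><a href="/post/1">Hello</a></div>')

    sel.css("div.title a::text").get()                 # 'Hello' (CSS)
    sel.xpath('//div[@class="title"]/a/text()').get()  # 'Hello' (XPath)
    sel.css("div.title a::attr(href)").get()           # '/post/1'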

Items and Pipeline

Items produce the output: they are used to structure the data parsed by the spider. The item pipeline is where the data is processed once the items have been extracted by the spiders. Here we can run tasks such as validating items and storing them in a database.
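
As a sketch, a plain Scrapy item and a validation pipeline might look like this; the field names and the validation rule are hypothetical:

    import scrapy
    from scrapy.exceptions import DropItem

    class PostItem(scrapy.Item):
        # the structure of the data the spider will fill in
        title = scrapy.Field()
        url = scrapy.Field()

    class ValidationPipeline:
        def process_item(self, item, spider):
            # drop incomplete items before they reach the database
            if not item.get("title"):
                raise DropItem("missing title")
            return item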

Steps

  • Create a Django project
  • Set up your Django project along with the app
  • Define the models according to the data you want to extract
  • Install scrapy
  • Create a scrapy project using the command scrapy startproject scraper
  • Connect your Django model to the scraper project’s items using DjangoItem
scraper/items.py
      from scrapy_djangoitem import DjangoItem
      from app.models import ModelItem

      # the item reuses the fields defined on the Django model
      class ScrapyItem(DjangoItem):
          django_model = ModelItem
  • Set up the crawler and the extractor to scrape the data
  • The spider should return data items as defined in the Django model (see the spider sketch after this list)
  • Create a pipeline to store the extracted items in the DB
scraper/pipelines.py
      class ScrapyItemPipeline(object):
          def process_item(self, item, spider):
              # DjangoItem.save() creates and populates the Django model instance
              item.save()
              return item
  • Enable the pipeline in the settings
scraper/settings.py
      ITEM_PIPELINES = {"scraper.pipelines.ScrapyItemPipeline": 300}
  • Once you start your crawl, the extracted data items will be connected to your Django model, and the pipeline will save them to the DB
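
For completeness, here is a sketch of what the spider itself might look like. The start URL, CSS selectors and field names are placeholders, and it assumes the Django model behind ScrapyItem has title and url fields:

scraper/spiders/example.py
      import scrapy
      from scraper.items import ScrapyItem

      class ExampleSpider(scrapy.Spider):
          name = "example"
          start_urls = ["https://example.com"]

          def parse(self, response):
              for entry in response.css("div.entry"):
                  # field names must match the Django model behind ScrapyItem
                  yield ScrapyItem(
                      title=entry.css("h2::text").get(),
                      url=entry.css("a::attr(href)").get(),
                  )

Running scrapy crawl example then sends each yielded item through the pipeline, which saves it as a row in your Django database.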

There are a lot of applications that use Scrapy to gather publicly available data. It essentially automates the process of gathering large amounts of data.

I hope this blog is useful for beginners looking to build small applications on top of publicly available data. Thanks!
