In this post I’ll describe how I downloaded 1000 room listings per day from a popular website and extracted the information I needed (such as price, description and title).
This is the first in a series of 3 posts on my project Room Prices in Vancouver. Make sure to read it for nice insights about the room situation in Vancouver!
The amount of data that circulates on the web is enormous, and a lot of that data is the property of big companies. If you’re like me, you probably don’t own any data, and one of the ways of pulling it from the web is scraping.
Scraping involves fetching a web page, extracting data, following links contained in the page and repeating the process from the beginning until we’re satisfied.
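To make that loop concrete, here is a minimal sketch of the fetch, extract, follow, repeat cycle. It runs against a toy in-memory "website" instead of the real network; the page contents, the URLs and the crawl function are all made up for illustration.

```python
from collections import deque

# Toy in-memory "website": URL -> (data found on the page, links found on the page).
# Everything in here is invented sample data.
FAKE_SITE = {
    "/search": (None, ["/room/1", "/room/2"]),
    "/room/1": ({"title": "Sunny room", "price": 650}, ["/search"]),
    "/room/2": ({"title": "Basement suite", "price": 800}, ["/room/1"]),
}


def crawl(start_url, fetch):
    """Fetch a page, extract its data, queue its unseen links, and repeat."""
    queue = deque([start_url])
    seen = {start_url}
    items = []
    while queue:
        url = queue.popleft()
        data, links = fetch(url)
        if data is not None:
            items.append(data)
        for link in links:
            # Skip pages we already visited, otherwise we would loop forever.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return items


rooms = crawl("/search", FAKE_SITE.__getitem__)
```

A real scraper replaces the fetch function with HTTP requests and HTML parsing, which is exactly the part a scraping library takes care of.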
There are quite a few libraries that let you do scraping, in a lot of languages.
In my case I picked Scrapy because of my familiarity with it and because it has a lot of neat features out of the box.
While I’m not going to write a step-by-step tutorial, which is already included in the Scrapy Documentation, I’ll give an overview of the steps involved, highlighting specific points not covered (or buried) in the documentation.
First of all, we need to initialize a Scrapy project, which is basically a collection of components that make up the whole scraper. To create a project template, you run the scrapy startproject command followed by the project name.
Scrapy will create a project skeleton, which consists of configuration files plus a set of components that need to be implemented.
The spider fetches each page, produces an Item instance (that is, the juiced information from the page) and recursively follows links.
The following diagram illustrates the relationships in a friendly and colorful way.
A first step in the crawler development is to define a data structure to contain our data. This goes in the module room_spiders/items.py. The definition is nothing special, just a class with a set of fields:
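As a rough sketch of what such a class can look like, here is an item with the fields mentioned in this post. Note the assumptions: the class name RoomItem is hypothetical, and the item is written as a plain dataclass, which recent Scrapy versions accept as an item type alongside the classic scrapy.Item subclasses with scrapy.Field() attributes.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RoomItem:
    """One scraped room listing; every field is optional because pages vary."""

    title: Optional[str] = None
    price: Optional[float] = None
    description: Optional[str] = None
    latitude: Optional[float] = None   # only present when the page has a map link
    longitude: Optional[float] = None


# Sample values below are made up.
example = RoomItem(title="Sunny room", price=650.0)
example_as_dict = asdict(example)
```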
The core of the scraper is the Spider. To define a Spider in Scrapy we need to create a Python file in the subdirectory room_spiders/spiders/ and code a new class that inherits from
A scraper needs some configuration, such as which pages to fetch and which links to follow. All of this can be specified using the following class attributes:
allowed_domains: a list of domains that we are allowed to scrape.
start_urls: the starting point (or points) of our spider.
rules: a list of Rule instances that specify which URLs to parse, the parsing function to call for each page, and whether to follow the links found there.
The following snippet illustrates those concepts in an example spider (file
When we launch the scrapy executable (with the command scrapy crawl), the spider will match the URLs specified in the rules and call the appropriate parsing function (in this case
The parsing code is implemented in the parse_room method, which takes a Response object as its only argument.
In the Scrapy framework, HTML elements are fetched from the page using the XPath syntax that lets you easily navigate the HTML tags and attributes.
An example is as follows. The title and content were easily extracted by referring to the appropriate id attributes in the page. Notice also that the longitude and latitude of the posting were extracted from a Google Maps link, when present.
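Here is a small sketch of that kind of extraction. Be aware of the assumptions: the HTML snippet, the id values and the format of the map link are invented, and the code uses the limited XPath support in Python's standard xml.etree.ElementTree as a stand-in so it runs on its own. Inside a Scrapy callback you would use response.xpath(...) on the real page instead.

```python
import re
import xml.etree.ElementTree as ET

# Invented, well-formed sample page; real listing pages look different.
PAGE = """
<html>
  <body>
    <h2 id="posting-title">Sunny room - $650</h2>
    <section id="posting-body">Furnished room, utilities included.</section>
    <a id="map-link" href="https://maps.google.com/?q=loc:49.2690,-123.0695">map</a>
  </body>
</html>
"""


def text_by_id(root, element_id):
    """Return the stripped text of the element with the given id, or None."""
    node = root.find(f".//*[@id='{element_id}']")
    if node is None or node.text is None:
        return None
    return node.text.strip()


def coordinates_from_map_link(root):
    """Return (latitude, longitude) parsed from the map link, or None if absent."""
    link = root.find(".//a[@id='map-link']")
    if link is None:
        return None
    match = re.search(r"(-?\d+\.\d+),(-?\d+\.\d+)", link.get("href", ""))
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


root = ET.fromstring(PAGE.strip())
title = text_by_id(root, "posting-title")
description = text_by_id(root, "posting-body")
coordinates = coordinates_from_map_link(root)
```

Returning None when the map link is missing keeps the spider from crashing on listings that have no location.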
Remember also that every change of layout can throw the scraper off completely, so make sure to keep monitoring for errors when you run the spider with
How do we manage to scrape all the pages we need? One approach is to set the option follow=True in the scraping rules, which instructs the scraper to follow links:
However, that simply keeps parsing all the listings available on the website. A better solution is to set follow=False and write multiple start_urls entries, corresponding to the different “paginations” of the listing search page:
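Rather than typing the paginated URLs by hand, you can generate them. The following is a sketch under assumptions: the base URL, the offset query parameter and the page size of 100 are hypothetical, so check how the pagination of your target site actually works.

```python
from urllib.parse import urlencode

BASE_SEARCH_URL = "https://example.com/search/rooms"  # placeholder URL
RESULTS_PER_PAGE = 100  # assumed page size
PAGES_TO_FETCH = 10     # 10 pages x 100 results = 1000 listings


def build_start_urls(base_url, pages, page_size):
    """Return one search URL per results page, using an offset parameter."""
    urls = []
    for page in range(pages):
        query = urlencode({"offset": page * page_size})
        urls.append(f"{base_url}?{query}")
    return urls


start_urls = build_start_urls(BASE_SEARCH_URL, PAGES_TO_FETCH, RESULTS_PER_PAGE)
```

The resulting list can be assigned directly to the start_urls attribute of the spider.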
Now that we have produced an Item from our page, how do we store it?
One simple way to store it is by using the built-in feed exports in Scrapy:
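Feed exports can be switched on from the command line with the -o option of scrapy crawl, or from the project settings. As a sketch of the second option, recent Scrapy versions read a FEEDS dictionary from settings.py; the output file name here is just an example.

```python
# settings.py (fragment)
# Write every scraped item as one JSON object per line.
FEEDS = {
    "rooms.jsonl": {  # example output file name
        "format": "jsonlines",
        "encoding": "utf8",
    },
}
```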
However, I found that appending data is much more efficient when using a Pipeline and saving the items to a database (in my case Postgres).
A Pipeline is just a class defined in pipelines.py that takes an Item as input and outputs another Item. Of course, in between we can have any side effects we want, including additional data storage and logging. In the following snippet we connect to a Postgres database and store the items in a table called
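To give an idea of the shape of such a pipeline, here is a minimal sketch. It deliberately uses sqlite3 from the standard library as a stand-in for Postgres so that it runs without a database server; with Postgres you would swap in a Postgres driver and its connection call. The class name, the rooms table and its columns are assumptions, and items are assumed to be dict-like.

```python
import sqlite3


class RoomStoragePipeline:
    """Store each scraped item as a database row (sqlite3 stands in for Postgres)."""

    def __init__(self, db_path="rooms.db"):  # hypothetical database file
        self.db_path = db_path
        self.connection = None

    def open_spider(self, spider):
        # Called by Scrapy once, when the spider starts.
        self.connection = sqlite3.connect(self.db_path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS rooms "
            "(title TEXT, price REAL, description TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item; the side effect here is the INSERT.
        self.connection.execute(
            "INSERT INTO rooms (title, price, description) VALUES (?, ?, ?)",
            (item.get("title"), item.get("price"), item.get("description")),
        )
        self.connection.commit()
        # Return the item so that any later pipeline still receives it.
        return item

    def close_spider(self, spider):
        # Called by Scrapy once, when the spider finishes.
        self.connection.close()
```

A pipeline only runs if it is listed in the ITEM_PIPELINES setting, for example {"room_spiders.pipelines.RoomStoragePipeline": 300}, where the module path is again an assumption about the project layout.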
I set up a Raspberry Pi 2 to run the spider every day at 23:12 by creating a script with the crawling command and running it with a simple cron job.
12 23 * * * /path/to/scraping/script
If you don’t want the start time to be exact, you can always let your script sleep for a few seconds before starting the scraper. In the following code we sleep between 1 and 10 seconds.
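One way to write this in Python is sketched below. The function name is mine, and the spider name rooms in the usage comment is a placeholder. The sleep and run parameters exist only so the function can be tried without actually waiting or launching Scrapy.

```python
import random
import subprocess
import time


def run_after_random_delay(command, min_seconds=1, max_seconds=10,
                           sleep=time.sleep, run=subprocess.run):
    """Wait a random number of seconds, then run the given command."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    # check=True raises an error if the command exits with a failure code,
    # so a broken crawl does not go unnoticed.
    run(command, check=True)
    return delay


# Usage from the script that cron launches (spider name is a placeholder):
# run_after_random_delay(["scrapy", "crawl", "rooms"])
```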
Spiders are quite powerful and may submit tons of requests in a short period of time. However, this is not good practice, as it burdens the servers, and websites typically adopt countermeasures to prevent excessive load.
Here are a few tips to avoid getting banned:
Set a delay between requests (download_delay in your spider)
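Besides the download_delay attribute on the spider, the same delay can be set for the whole project in settings.py. A small sketch, where the value of 2 seconds is only an illustrative assumption:

```python
# settings.py (fragment)
# Wait roughly this many seconds between requests to the same site.
DOWNLOAD_DELAY = 2
# Vary each wait between 0.5x and 1.5x of DOWNLOAD_DELAY,
# so the requests do not arrive at perfectly regular intervals.
RANDOMIZE_DOWNLOAD_DELAY = True
```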
In the next post we’ll see how to clean the raw data and how to geolocate Vancouver neighbourhoods from latitude and longitude.