In this post I'll describe how I downloaded 1000 room listings per day from a popular website and extracted the information I needed (like price, description and title).
This is the first in a series of 3 posts on my project Room Prices in Vancouver. Make sure to read the whole series for nice insights about the room situation in Vancouver!
The amount of data that circulates on the web is enormous, and a lot of that data is the property of big companies. If you're like me, you probably don't own any data, and one of the ways of pulling it from the web is scraping.
Scraping involves fetching a web page, extracting data, following links contained in the page and repeating the process from the beginning until we’re satisfied.
There are quite a few libraries for scraping, in many languages.
In my case I picked Scrapy because of my familiarity with it and because it has a lot of neat features out of the box.
While I'm not going to do a step-by-step tutorial (you can find one in the Scrapy documentation), I'll give an overview of the steps involved, highlighting specific points that are not covered (or are buried) in the documentation.
First of all, we need to initialize a Scrapy project, which is basically a collection of components that make up the whole scraper.
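To create the project template you run the startproject command; I'll use the project name room_spiders here, which matches the paths that appear later in the post:

scrapy startproject room_spiders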
Scrapy will create a project skeleton, which consists of configuration files plus a set of components that need to be implemented.
The spider fetches each page, produces an Item instance (that is, the juiced information from the page) and recursively follows links. The following diagram illustrates the relationships in a friendly and colorful way.
A first step in the crawler development is to define a data structure to contain our data. This goes in the module room_spiders/items.py. The definition is nothing special: just a class with a set of fields.
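A minimal sketch of what the item could look like; the exact field names are my assumptions, based on the data mentioned in the post (title, description, price, and the coordinates extracted later):

from scrapy.item import Item, Field

class RoomItem(Item):
    # Field names are assumptions based on the data discussed in the post.
    title = Field()
    description = Field()
    price = Field()
    link = Field()
    latitude = Field()
    longitude = Field()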
The core of the scraper is the Spider. To define a Spider in Scrapy we need to create a Python file in the subdirectory room_spiders/spiders/ and code a new class that inherits from scrapy.contrib.spiders.CrawlSpider.
A scraper needs some configuration, such as which pages to fetch and which links to follow. All of this can be specified using the following class attributes:

allowed_domains: a list of domains that we are allowed to scrape.
start_urls: the starting point (or points) of our spider.
rules: a list of Rule instances that specify which URLs to parse, the parsing function to be called for each page, and the link-following behavior.

The following snippet illustrates those concepts in an example spider (file room_spiders/spiders/room_spider.py).
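The version below is a sketch rather than the original code: the spider name, allowed domain and URL patterns are placeholders I made up, while the CrawlSpider base class and the callback name parse_roo match what the post describes:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class RoomSpider(CrawlSpider):
    name = 'rooms'  # placeholder spider name
    allowed_domains = ['example.com']  # placeholder domain
    start_urls = ['http://example.com/rooms/']  # placeholder starting page

    # Parse every page whose URL looks like a room listing
    # (the regular expression is purely illustrative).
    rules = (
        Rule(LinkExtractor(allow=r'/rooms/\d+'), callback='parse_roo'),
    )

    def parse_roo(self, response):
        # Extraction logic shown further down in the post.
        pass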
When we launch the scrapy executable (with the command scrapy crawl), the spider will match the URLs specified in the rules and call the appropriate parsing function (in this case parse_roo).
The parsing code is implemented in the parse_roo method, which takes a Response object as its only argument.
In the Scrapy framework, HTML elements are extracted from the page using XPath syntax, which lets you easily navigate the HTML tags and attributes.
An example is as follows. The title and content were easily extracted by referring to the appropriate id attributes in the page. Notice also that the longitude and latitude of the posting were extracted from a Google Maps link, when present.
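A sketch of what the method could look like: the id attributes, the price selector and the format of the Google Maps link are all assumptions on my part, since the original selectors aren't reproduced here:

import re

from room_spiders.items import RoomItem

def parse_roo(self, response):  # method of the spider class defined earlier
    item = RoomItem()
    # Title, body and price come from elements with hypothetical id attributes.
    item['title'] = response.xpath('//*[@id="posting-title"]/text()').extract()
    item['description'] = response.xpath('//*[@id="posting-body"]//text()').extract()
    item['price'] = response.xpath('//*[@id="posting-price"]/text()').extract()
    item['link'] = response.url
    # Latitude and longitude, when present, from a Google Maps link of the
    # (assumed) form ...?q=loc:49.28,-123.12
    href = response.xpath('//a[contains(@href, "maps.google")]/@href').extract()
    if href:
        match = re.search(r'loc:([-\d.]+),([-\d.]+)', href[0])
        if match:
            item['latitude'] = float(match.group(1))
            item['longitude'] = float(match.group(2))
    return item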
Remember also that any change in the page layout can throw the scraper off completely, so make sure to monitor for errors when you run the spider with scrapy crawl.
How do we manage to scrape all the pages we need? One approach is to set the option follow=True in the scraping rules, which instructs the scraper to follow links.
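For example (reusing the placeholder URL pattern from the spider sketch above):

rules = (
    Rule(LinkExtractor(allow=r'/rooms/\d+'), callback='parse_roo', follow=True),
)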
However, that simply keeps parsing all the listings available on the website. A better solution is to set follow=False and write multiple start_urls entries, corresponding to the different "paginations" of the listing search page.
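For example, assuming a hypothetical search page that paginates with an offset query parameter (the real URL scheme isn't given in the post), the first ten result pages could be listed explicitly:

# One entry per page of search results (hypothetical URL scheme).
start_urls = ['http://example.com/search/rooms?offset=%d' % offset
              for offset in range(0, 1000, 100)]

rules = (
    # Only parse the individual listings linked from those pages; don't follow further.
    Rule(LinkExtractor(allow=r'/rooms/\d+'), callback='parse_roo', follow=False),
)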
Now that we produced an Item from our page, how do we store it?
One simple way to store items is to use the built-in feed exports in Scrapy.
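For example, to dump every scraped item to a JSON file (the spider name rooms is a placeholder):

scrapy crawl rooms -o items.json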
However, I found that appending new items is much more efficient using a Pipeline and saving the items to a database (in my case Postgres).
A Pipeline is just a class defined in pipelines.py that takes an Item as input and outputs another Item. Of course, in between we can have any side effects we want, including additional data storage and logging. In the following snippet we connect to a Postgres database and store the items in a table called raw_data.
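The following is a sketch along those lines, using psycopg2; the connection parameters and the column layout of raw_data are assumptions of mine rather than the actual schema:

import json

import psycopg2

class PostgresPipeline(object):

    def open_spider(self, spider):
        # Connection parameters are placeholders.
        self.connection = psycopg2.connect(host='localhost', dbname='rooms',
                                           user='scraper', password='secret')
        self.cursor = self.connection.cursor()

    def close_spider(self, spider):
        self.connection.commit()
        self.connection.close()

    def process_item(self, item, spider):
        # Store the whole item as JSON in a single column, plus the scrape date.
        self.cursor.execute(
            "INSERT INTO raw_data (scraped_at, data) VALUES (now(), %s)",
            [json.dumps(dict(item))])
        self.connection.commit()
        return item

The pipeline also has to be enabled in settings.py through the ITEM_PIPELINES setting.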
I set up a Raspberry Pi 2 to run the spider every day at 23:12 by creating a script with the crawling command and running it with a simple cron job.
12 23 * * * /path/to/scraping/script
If you don't want the crawl to start at exactly the same time every day, you can let your script sleep for some seconds before starting the scraper. In the following code we sleep between 1 and 10 seconds.
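A minimal version of such a wrapper script in Python (the spider name rooms is a placeholder, and the original script isn't reproduced in the post):

import random
import subprocess
import time

# Sleep between 1 and 10 seconds so the crawl doesn't start at the exact same second every day.
time.sleep(random.randint(1, 10))

# Launch the spider; this assumes the script is run from the Scrapy project directory.
subprocess.call(['scrapy', 'crawl', 'rooms'])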
Spiders are quite powerful and may submit tons of requests in a short period of time. However, this is not good practice, as it burdens the servers, and websites typically adopt countermeasures to prevent excessive load.
Here are a few tips to avoid getting banned:

- throttle your requests (by setting download_delay in your spider)

In the next post we'll see how to clean the raw data and how to geolocate Vancouver neighbourhoods from latitude and longitude.
Go to Part 2: Data Cleaning and Geolocation with Python and Shapely