In this post we’ll see how to clean data and how to deal with geographical information in Python. This post is part of a data science project on room rental prices in Vancouver.
By scraping a room ads website, we collected information about the room listings in Vancouver. A set of records was extracted from each post:
However, data collected from the web (really, from any source) is rarely free of errors, so we need to get rid of basic inconsistencies before starting the statistical analysis.
Additionally, we would like to produce new features that will aid the modeling stage. This matters because a sensible model depends on good, predictive features.
Raw content from websites is usually messy; just look at a typical post title:
In its raw form, the title contains redundant and/or useless information:
To get rid of all these issues it is sufficient to do some filtering with regular expressions. If you don’t know what regular expressions are, the Python regex tutorial is a good place to learn about them.
To apply the operations on our strings, we’ll make use of the excellent Pandas library. Pandas is surprisingly efficient for this kind of processing, as it implements vectorized operations that are concise, performant and easy to use.
The idea is very simple: whenever you have a `pandas.Series` object containing strings, you can access a variety of vectorized string operations through the attribute `str`. In the following code we apply the `replace` operations in sequence.
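As a minimal sketch of what such a chain can look like: the sample titles and the specific patterns below (price tokens, parenthesised text, punctuation, repeated whitespace) are assumptions made for illustration, not the exact rules used on the scraped data.

```python
import pandas as pd

# Hypothetical raw titles; the real ones come from the scraped posts.
titles = pd.Series([
    "$650 / 1br - Near DOWNTOWN!!! furnished room",
    "  ROOM for rent   (Kitsilano) ",
])

cleaned = (
    titles
    .str.lower()
    .str.replace(r"\$\d+", " ", regex=True)        # drop price tokens such as "$650"
    .str.replace(r"\([^)]*\)", " ", regex=True)    # drop text in parentheses
    .str.replace(r"[^a-z0-9 ]", " ", regex=True)   # drop punctuation and symbols
    .str.replace(r"\s+", " ", regex=True)          # collapse repeated whitespace
    .str.strip()
)

print(cleaned.tolist())
```

Each `str.replace` call returns a new Series, so the steps can be chained and reordered freely while you experiment with the patterns.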
Finally, to remove all the html tags from the content we’ll use the fast lxml library and the method
The Google Maps links scraped from the posts usually contain the exact latitude and longitude of the geographic location. While it is possible to use those variables directly in a model, it is much better to transform them into real-world locations, as this simplifies modeling and interpretation.
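To give an idea of how the coordinates could be pulled out of a link, here is a small sketch. The link shape (`.../@<lat>,<lon>,<zoom>z`), the pattern and the helper name `extract_lat_lon` are all assumptions; the pattern has to be adapted to the links actually found in the posts.

```python
import re

# Assumed link shape: ".../@<lat>,<lon>,<zoom>z". Adjust to the real links.
COORD_RE = re.compile(r"@(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)")

def extract_lat_lon(url):
    """Return (lat, lon) as floats, or None when the link has no coordinates."""
    if not url:
        return None
    match = COORD_RE.search(url)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Hypothetical link, for illustration only.
print(extract_lat_lon("https://maps.google.com/maps/preview/@49.2827,-123.1207,16z"))
```

Returning `None` for posts without a usable link keeps the missing cases explicit, so they can be handled later instead of silently becoming wrong coordinates.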
Geographical data is made available through public websites. I was able to obtain geographic files for Vancouver from the OpenData catalog, which provides a huge variety of publicly accessible data sets. To obtain boundaries for the Vancouver neighbourhoods I used the local area boundary data set in KML format.
OpenStreetMap is another highly recommended source of geographical data, especially when you need coast lines, streets and maps in general.
First of all, let’s talk about the kind of data we’re dealing with. What we have is geographic boundaries, that is, a set of polygons that delimit geographical areas. This kind of data is usually referred to as vector data.
Polygons however are not the only kind of vector data that can go in a map. Other kinds of geometrical primitives are points, lines and their combinations. Data formats manage this information in different ways, but the building blocks are generally the same.
My favorite data format is GeoJSON because it’s intuitive and web-friendly. An example is as follows (taken from the official spec). There is a top-level object of type `FeatureCollection`, made of a list of geometric features, for example `Polygon`. Along with those it is possible to store extra properties:
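For illustration, here is a minimal `FeatureCollection` with a single `Polygon` feature, embedded in a Python string and parsed with the standard `json` module. This is not the example from the spec: the coordinates and the `name` property are made up. Note that GeoJSON positions are written as `[longitude, latitude]` and that a polygon ring repeats its first point at the end.

```python
import json

# Made-up feature for illustration; positions are [longitude, latitude].
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [[-123.14, 49.27], [-123.12, 49.27], [-123.12, 49.29],
           [-123.14, 49.29], [-123.14, 49.27]]
        ]
      },
      "properties": {"name": "Example Area"}
    }
  ]
}
"""

collection = json.loads(geojson_text)
feature = collection["features"][0]

print(collection["type"])             # top-level object type
print(feature["geometry"]["type"])    # geometry type of the first feature
print(feature["properties"]["name"])  # extra property stored with the feature
```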
We downloaded the neighbourhoods in KML format (an XML-based format), so how do we convert them to GeoJSON? We can use the software QGIS (available in the Ubuntu repositories). Converting between file formats is quite easy using the tool
To work with GeoJSON in Python we’ll use Shapely, a library for dealing with geometric objects in general (and geographical data in particular).
Loading data in Shapely is pretty easy: each `geometry` attribute (see the above example) can be transformed into a `shape` object. In the following example we load the GeoJSON file in Python and then create shapes from the `geometry` attributes. Finally, we create a `MultiPolygon` object from the extracted shapes, which is a container for multiple polygons.
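A minimal sketch of these steps is shown here. To keep it self-contained, the GeoJSON is an inline string with two made-up square "neighbourhoods" instead of the converted Vancouver file, and the `name` property is an assumption about how the neighbourhood names are stored.

```python
import json

from shapely.geometry import MultiPolygon, shape

# Stand-in for the converted neighbourhood file; names and coordinates are made up.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Polygon",
                  "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]},
     "properties": {"name": "Area A"}},
    {"type": "Feature",
     "geometry": {"type": "Polygon",
                  "coordinates": [[[2, 0], [3, 0], [3, 1], [2, 1], [2, 0]]]},
     "properties": {"name": "Area B"}}
  ]
}
"""

data = json.loads(geojson_text)

# Keep names and shapes in the same order so they can be paired up later.
names = [feature["properties"]["name"] for feature in data["features"]]
shapes = [shape(feature["geometry"]) for feature in data["features"]]

# Container holding all the neighbourhood polygons.
neighbourhoods = MultiPolygon(shapes)
print(names, len(neighbourhoods.geoms))
```

With a real file, `json.loads(geojson_text)` would be replaced by `json.load` on the opened file; the rest stays the same.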
You can use the following command to easily plot the neighbourhood geometries with matplotlib. For example, you can use the function `fill_multipolygon` to plot
The geographical data is basically a set of polygons corresponding to the Vancouver neighbourhoods. To assign each post to a neighbourhood we have to test whether its latitude-longitude point lies inside the corresponding polygon. Geometry operations are Shapely territory.
For each `Polygon` in the variable `shapes`, we check if any of the points is contained using the method `contains`, and we associate the corresponding neighbourhood name.
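The lookup can be sketched as follows. The two square polygons, their names and the helper `find_neighbourhood` are hypothetical stand-ins; in the real data `shapes` would hold the neighbourhood polygons and the point would come from the post’s map link.

```python
from shapely.geometry import Point, Polygon

# Made-up stand-ins for the neighbourhood names and polygons.
names = ["Area A", "Area B"]
shapes = [
    Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
    Polygon([(2, 0), (3, 0), (3, 1), (2, 1)]),
]

def find_neighbourhood(lon, lat):
    """Return the name of the polygon containing the point, or None."""
    point = Point(lon, lat)  # Shapely points are (x, y), i.e. (lon, lat)
    for name, polygon in zip(names, shapes):
        if polygon.contains(point):
            return name
    return None

print(find_neighbourhood(0.5, 0.5))
print(find_neighbourhood(5.0, 5.0))
```

A plain loop is enough for a few dozen polygons. Note that `contains` is false for points lying exactly on a polygon’s boundary.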
The result is an association of each post with its neighbourhood. Unfortunately, many posts don’t include a map; for these we can put a `NaN` value or the string
| Post ID | Title | Neighbourhood |
|---|---|---|
| 38822 | near downtown furnished room | South Cambie |
| 38823 | private master bedroom in amazing building | Downtown |
| 38827 | nice bedroom for rent may west end girl ... | NaN |
| 38828 | room for rent | NaN |
| 38830 | looking for roommate from may 1st to july 31st | Mount Pleasant |
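As a small sketch of how the missing neighbourhoods might be handled with pandas, using two rows from the table above: the column names and the placeholder label `"unknown"` are assumptions for illustration, not values taken from the original analysis.

```python
import numpy as np
import pandas as pd

# Two rows from the table above; column names are hypothetical.
posts = pd.DataFrame(
    {
        "title": ["near downtown furnished room", "room for rent"],
        "neighbourhood": ["South Cambie", np.nan],
    },
    index=[38822, 38828],
)

# Option 1: keep NaN and count how many posts lack a location.
missing = int(posts["neighbourhood"].isna().sum())

# Option 2: replace NaN with an explicit placeholder category.
labelled = posts["neighbourhood"].fillna("unknown")

print(missing, labelled.tolist())
```

Keeping `NaN` makes it easy to drop or count those posts, while an explicit category lets a model treat "no map" as information in its own right.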
In this post we described the process of cleaning data and extracting features. In the next post we’ll proceed with the modeling phase, where we relate the prices to the post features.