In this post we’ll see how to clean data and how to deal with geographical information in Python. This post is part of a data science project on room rental prices in Vancouver.
By scraping a room-ads website, we collected information about room listings in Vancouver, extracting a set of records from each post.
However, data collected from the web (really, from any source) is rarely free of errors, so we need to get rid of basic inconsistencies before starting the statistical analysis.
Additionally, we would like to produce new features that will aid the modeling stage. This matters because strong, predictive features give us the potential to build a sensible model.
Raw content from websites is usually messy; just look at a typical post title.
In its raw form, the title contains redundant and/or useless information:
- HTML tags such as `<b>`

To get rid of all these issues it is sufficient to do some filtering with regular expressions. If you’re not familiar with regular expressions, the Python regex tutorial is a good place to start.
To apply the operations on our strings, we’ll make use of the excellent Pandas library. Pandas is surprisingly efficient for this kind of processing, as it implements vectorized operations that are concise, performant and easy to use.
The idea is very simple: whenever you have a `pandas.Series` object containing strings, you can access a variety of vectorized string operations through the `str` attribute. In the following code we apply the `lower`, `strip` and `replace` operations in sequence.
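A minimal sketch of what that might look like, assuming the scraped posts live in a DataFrame `df` with a `title` column (both names are illustrative):

```python
import pandas as pd

# Illustrative DataFrame holding the scraped post titles.
df = pd.DataFrame({"title": ["  Furnished Room NEAR Downtown!!  ",
                             "Private MASTER Bedroom - Amazing Building "]})

# Vectorized string operations, chained through the .str accessor:
# lowercase, trim whitespace, then drop punctuation with a regex.
df["title"] = (df["title"]
               .str.lower()
               .str.strip()
               .str.replace(r"[^a-z0-9 ]", " ", regex=True))

print(df["title"].tolist())
```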
Finally, to remove all the HTML tags from the content we’ll use the fast lxml library and the `apply` method.
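Something along these lines should do it (the `content` column is an assumption; lxml’s `text_content` keeps only the text of the parsed fragment):

```python
import lxml.html

def strip_tags(raw):
    # Parse the HTML fragment and return only its text content.
    return lxml.html.fromstring(raw).text_content() if raw else ""

# Apply the function element-wise to the raw post bodies.
df["content"] = df["content"].apply(strip_tags)
```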
The Google Maps links scraped from the posts usually contain the exact latitude and longitude of the geographic location. While it is possible to use those variables directly in a model, it is much better to transform them into real-world locations, as this simplifies modeling and interpretation.
Geographical data is made available through public websites. I was able to obtain geographic files for Vancouver from the OpenData catalog, which provides a huge variety of publicly accessible data sets. To obtain boundaries for the Vancouver neighbourhoods I used the local area boundary data set in KML format.
OpenStreetMap is another highly recommended source of geographical data, especially when you need coastlines, streets and maps in general.
First of all, let’s talk about the data we’re dealing with. What we have are geographic boundaries, that is, a set of polygons that delimit a geographical area; this kind of data is usually referred to as vector data.
Polygons however are not the only kind of vector data that can go in a map. Other kinds of geometrical primitives are points, lines and their combinations. Data formats manage this information in different ways, but the building blocks are generally the same.
My favorite data format is GeoJSON, because it’s intuitive and web-friendly. An example, abbreviated from the official spec, is shown below. There is a top-level object of type `FeatureCollection`, made of a list of geometric features, for example `Point`, `LineString` and `Polygon`. Along with those it is possible to store extra properties.
We downloaded the neighbourhoods in KML format (which is a flavor of XML); how do we transform it to GeoJSON? We can use the software QGIS (available in the Ubuntu repositories). Converting between file formats is quite easy with the tool `ogr2ogr`.
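The general form of the conversion command is as follows (the file names are placeholders for your own paths):

```bash
# Convert the KML boundaries to GeoJSON.
ogr2ogr -f GeoJSON local_area_boundary.geojson local_area_boundary.kml
```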
To read GeoJSON in Python we’ll use the Shapely library, which deals with geometric objects in general (and is especially handy for geographical data).
Loading stuff in Shapely is pretty easy: each `geometry` attribute (see the example above) can be transformed into a `shape` object. In the following example we load the GeoJSON file in Python and then create shapes from the `geometry` attributes. Finally, we create a `MultiPolygon` object from the extracted shapes, which is a container for multiple polygons.
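A sketch of the loading step, assuming the converted file is called `local_area_boundary.geojson`, that every feature is a simple polygon, and that the neighbourhood name is stored in each feature’s properties (all assumptions):

```python
import json
from shapely.geometry import shape, MultiPolygon

# Load the converted GeoJSON file.
with open("local_area_boundary.geojson") as f:
    collection = json.load(f)

features = collection["features"]

# Turn each "geometry" attribute into a Shapely shape.
shapes = [shape(feature["geometry"]) for feature in features]

# Keep the neighbourhood names alongside the shapes.
names = [feature["properties"].get("name") for feature in features]

# A MultiPolygon is a convenient container for all the neighbourhood polygons.
neighbourhoods = MultiPolygon(shapes)
```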
You can use the following code to easily plot the neighbourhood geometries with matplotlib. For example, helper functions such as `fill_polygon` and `fill_multipolygon` can be used to plot `Polygon` and `MultiPolygon` objects.
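The original helpers aren’t reproduced here; a minimal version might look like this, filling each polygon’s exterior ring with matplotlib (function signatures are an assumption):

```python
import matplotlib.pyplot as plt

def fill_polygon(ax, polygon, **kwargs):
    # Fill a single Shapely Polygon using its exterior ring coordinates.
    x, y = polygon.exterior.xy
    ax.fill(x, y, **kwargs)

def fill_multipolygon(ax, multipolygon, **kwargs):
    # Fill every Polygon contained in a MultiPolygon.
    for polygon in multipolygon.geoms:
        fill_polygon(ax, polygon, **kwargs)

fig, ax = plt.subplots()
fill_multipolygon(ax, neighbourhoods, alpha=0.5)
ax.set_aspect("equal")
plt.show()
```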
For more interesting plots I recommend the library folium, which is able to overlay polygons on top of a real geographic map. Another option is basemap, or, if you’re feeling javascripty, you can use d3.js.
The geographical data is basically a set of polygons corresponding to the Vancouver neighbourhoods. To assign each post to a neighbourhood we have to test whether its latitude-longitude point falls inside the corresponding polygon. Geometry operations like this are Shapely territory.
For each `Polygon` in the variable `shapes`, we check whether any of the points is contained using the `contains` method, and we associate the corresponding neighbourhood name.
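A sketch of the assignment, assuming `latitude`/`longitude` columns extracted from the Google Maps links (column names are illustrative):

```python
import numpy as np
from shapely.geometry import Point

def find_neighbourhood(lat, lon):
    # Posts without a map link have no coordinates: return NaN.
    if np.isnan(lat) or np.isnan(lon):
        return np.nan
    point = Point(lon, lat)  # Shapely expects (x, y) = (longitude, latitude)
    for name, polygon in zip(names, shapes):
        if polygon.contains(point):
            return name
    return np.nan

df["neigh"] = [find_neighbourhood(lat, lon)
               for lat, lon in zip(df["latitude"], df["longitude"])]
```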
The result is an association of each post with its neighbourhood. Unfortunately many posts don’t include a map; for these we can put a `NaN` value or the string `Unknown`.
| | title | neigh |
|---|---|---|
| 38822 | near downtown furnished room | South Cambie |
| 38823 | private master bedroom in amazing building | Downtown |
| 38827 | nice bedroom for rent may west end girl ... | NaN |
| 38828 | room for rent | NaN |
| 38830 | looking for roommate from may 1st to july 31st | Mount Pleasant |
In this post we described the process of cleaning data and extracting features; in the next post we’ll proceed with the modeling phase, where we relate prices to the post features.
Part 3: Natural Language Modeling and Feature Selection in Python