London Bike Data in PowerPivot
Posted by workerthread on January 24, 2011
One of my favourite places for interesting dataset downloads is the UK Guardian’s Datablog. They have everything from Expenses submitted by Members of Parliament to Bafta Nominations. Recently I noticed this post about the data collected London Bicycle Hire Scheme. Transport for London has provided a downloadable file of all bike hires during the first 5 months of the scheme.
If you take a look on the Guardian site you will see that there have already been some interesting visualisations done using the data, like this one on Tableau Public. The source download contains over 1.4 million rows and for each ride you can see starting and finishing location as well as dates and times, which bike was used and duration of the hire. So I thought, how well would this work in PowerPivot in Excel?
I downloaded the full ZIP file of data – over 156mb text file once unzipped, with a separate file describing the column names. If you want the column titles in place before loading into PowerPivot (so they can be used as is without the need for any renaming) you will need to use a text editor capable of loading the source Raw Data file. Something like NotePad++ works well here.
Firing up Excel and the PowerPivot window, first step is to get the data loaded. The data has a tab character Column Separator and if you have added the column headers into the top row make sure you checked the box – otherwise just rename the columns once imported –
Worth bearing in mind of course that 1.4 million rows is too big to load into Excel directly where you have a row limit of 1 million rows. But 1.4 million rows is fairly trivial for PowerPivot, which can handle much higher data volumes.
First thing I like to do once the data has been imported is take a roam around it to understand what’s on offer and whether any adjustments or cleaning needs to happen. You will see that the start and end time columns need to have their format changed. On load they show as dates (30/12/1899) and we need to set them to a time format to make them useful.
You will also see Start and Finish Station formats, such as “Queen Street,Bank”. I thought it might be more interesting if the start and finish stations could be split into the area and individual bike docking station. Area and docking station are in most (but not all) cases separated with a comma. So we can create calculated columns to split up the data to left or right of the comma with the addition of some conditional logic to handle any data which doesn’t contain a comma. Most of this works pretty much the same way as it in Excel and there is a good article on splitting data into separate columns here.
I’m already getting some useful data, but lets take things a little further –
We have Start and End dates for each ride, so we can follow PowerPivot best Practice and create Date dimensions for each, as linked Excel tables. You will find some very useful articles online on how to do this – I would particularly recommend this excerpt from the invaluable book by Marco Russo and Alberto Ferrari (or buy the actual book – well worth it). Once I have my date dimension joined to the source data I can easily get to day of the week, calculate weekdays or weekends etc and if I want I can use DAX Time Intelligence Functions. I also created calculated columns to show for hour of the day between 0 and 23 and a simple AM or PM column based on that hour.
Next I took a look at adding a linked table which could be used to “band” the journey times. The table I created in Excel looked like this:
and I used the CALCULATE and FILTER functions described here to create a banding column in my data.
Interesting point to note here – take a look at the size of my saved Excel file containing the loaded PowerPivot data and compare it to the size of the source text file I loaded:
43,885 KB in PowerPivot compared to 159,653 KB for the original text file!
Now I can start exploring the data by creating some pivot tables and charts. This Pivot Chart shows total bike hires by hour of the day with a “starting area” slicer and the table below shows the area in which the bike hires ended up, also linked to the same filter:
And this one uses the calculated “banding” column to group journey times by month and starting location:
At this point you can easily find out a lot more – average journey times, which bikes got used the most, maybe some “ratio to parent” calculations to see which docking stations are used most in each area.
To sum up, for me this reinforced how quickly you can get some interesting analyses out of large data sets with one of the most widely used desktop tools out there. If you are interested in trying this for yourself the first thing I would suggest is getting hold of this book (or if you are not a technical book person, maybe get yourself booked on one of the PowerPivot Workshops).
If you try working with the bike data yourself, either in PowerPivot or one of the other analysis or visualisation tools leave a comment on the blog. And hopefully Transport for London will make even more data available – maybe I can try a full year’s worth next time.
Sorry, the comment form is closed at this time.