R: Vectorising all the things
After my last post about finding the distance a date/time is from the weekend, Hadley Wickham suggested I could improve the function by vectorising it…
@markhneedham vectorise with pmin(pmax(dateToLookup - before, 0), pmax(after - dateToLookup, 0)) / dhours(1)
— Hadley Wickham (@hadleywickham) December 14, 2014
…so I thought I’d try and vectorise some of the other functions I’ve written recently and show the two versions.
I found the following articles useful for explaining vectorisation and why you might want to do it:
- Vectorisation in R: Why?
- Chapters 3 and 4 of R Inferno
- Drastic R speed ups via vectorisation and bug fixes
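To make the idea concrete before we start, here’s a toy comparison you can run yourself (a minimal sketch, not from any of the articles above): the loop goes through R’s interpreter once per element, while the vectorised expression hands the whole vector to compiled code in a single call.

xs = 1:1000000

# explicit loop: one interpreted iteration per element
system.time({
  squares = numeric(length(xs))
  for (i in seq_along(xs)) squares[i] = xs[i] ^ 2
})

# vectorised: one call into compiled code does all the work
system.time(squares <- xs ^ 2)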
Let’s get started.
Distance from the weekend
We want to find out how many hours away from the weekend (i.e. the nearest Saturday/Sunday) a particular date/time is. We’ll be using the following libraries and set of date/times:
library(dplyr)
library(lubridate)
library(geosphere)
options("scipen"=100, "digits"=4)

times = ymd_hms("2002-01-01 17:00:00") + c(0:99) * hours(1)
data = data.frame(time = times)
> data %>% head()
                 time
1 2002-01-01 17:00:00
2 2002-01-01 18:00:00
3 2002-01-01 19:00:00
4 2002-01-01 20:00:00
5 2002-01-01 21:00:00
6 2002-01-01 22:00:00
Let’s have a look at the non-vectorised version first:
distanceFromWeekend = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after = ceiling_date(dateToLookup, "week") - days(1)
  timeToBefore = dateToLookup - before
  timeToAfter = after - dateToLookup

  if(timeToBefore < 0 || timeToAfter < 0) {
    0
  } else {
    if(timeToBefore < timeToAfter) {
      timeToBefore / dhours(1)
    } else {
      timeToAfter / dhours(1)
    }
  }
}
Now let’s run it against our data frame:
> system.time(
    data %>%
      mutate(ind = row_number()) %>%
      group_by(ind) %>%
      mutate(dist = distanceFromWeekend(time))
  )
   user  system elapsed
  1.837   0.020   1.884
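Most of that time is the overhead of one interpreted function call per row rather than anything dplyr-specific; the group_by(ind) trick is just a way of forcing that per-row call. Calling the function once per element directly should show much the same cost (a rough check, not from the original post; timings will vary by machine):

# one interpreted call per element, mirroring what the group_by trick does
system.time(sapply(1:nrow(data), function(i) distanceFromWeekend(data$time[i])))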
And now for Hadley’s vectorised version:
distanceFromWeekendVectorised = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after = ceiling_date(dateToLookup, "week") - days(1)
  pmin(pmax(dateToLookup - before, 0), pmax(after - dateToLookup, 0)) / dhours(1)
}

> system.time(data %>% mutate(dist = distanceFromWeekendVectorised(time)))
   user  system elapsed
  0.020   0.001   0.023
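The trick is that pmin and pmax are the element-wise (“parallel”) versions of min and max, so they replace both the if/else and the per-row calls in one go; negative durations, which mean the date/time is inside the weekend, get clamped to 0 by pmax. A quick illustration:

> pmax(c(-2, 5, 1), 0)
[1] 0 5 1
> pmin(c(3, 7), c(4, 2))
[1] 3 2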
Extracting start date
My next example was from cleaning up Google Trends data and extracting the start date from a cell inside a CSV file.
We’ll use this data frame:
googleTrends = read.csv("/Users/markneedham/Downloads/report.csv", row.names=NULL)
names(googleTrends) = c("week", "score")
> googleTrends %>% head(10)
                        week score
1  Worldwide; 2004 - present
2         Interest over time
3                       Week neo4j
4    2004-01-04 - 2004-01-10     0
5    2004-01-11 - 2004-01-17     0
6    2004-01-18 - 2004-01-24     0
7    2004-01-25 - 2004-01-31     0
8    2004-02-01 - 2004-02-07     0
9    2004-02-08 - 2004-02-14     0
10   2004-02-15 - 2004-02-21     0
The non-vectorised version looked like this:
> system.time(
    googleTrends %>%
      mutate(ind = row_number()) %>%
      group_by(ind) %>%
      mutate(dates = strsplit(week, " - "),
             start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character())
  )
   user  system elapsed
  0.215   0.000   0.214
In this case it’s not actually possible to vectorise the code while keeping strsplit, so we need a different approach. Antonios showed me how to do it using substr:
> system.time(googleTrends %>% mutate(start = substr(week, 1, 10) %>% ymd()))
   user  system elapsed
  0.018   0.000   0.017
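substr works here because the start date is always the first 10 characters of the cell. If the layout were less regular we could still stay vectorised with a regular expression; a sketch using base R’s sub (same idea, not from the original post):

# strip everything from " - " onwards, then parse what's left;
# non-date rows (e.g. the header lines) come back as NA
googleTrends %>% mutate(start = sub(" - .*", "", as.character(week)) %>% ymd())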
Calculating haversine distance
I wanted to work out the great-circle distance from a collection of venues to a centre point in London. I started out with this data frame:
centre = c(-0.129581, 51.516578)
venues = read.csv("/tmp/venues.csv")

> venues %>% head()
                       venue   lat      lon
1              Skills Matter 51.52 -0.09911
2                   Skinkers 51.50 -0.08387
3          Theodore Bullfrog 51.51 -0.12375
4 The Skills Matter eXchange 51.52 -0.09923
5               The Guardian 51.53 -0.12234
6            White Bear Yard 51.52 -0.10980
My non-vectorised version looked like this:
> system.time(venues %>%
    mutate(distanceFromCentre = by(venues, 1:nrow(venues), function(row) {
      distHaversine(c(row$lon, row$lat), centre)
    }))
  )
   user  system elapsed
  0.034   0.000   0.033
It’s pretty quick, but we can do better: the distHaversine function can calculate multiple distances in one call if its first argument is a matrix of lon/lat values rather than a vector:
> system.time(
    venues %>% mutate(distanceFromCentre = distHaversine(cbind(venues$lon, venues$lat), centre))
  )
   user  system elapsed
  0.001   0.000   0.001
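Before trusting the speed-up it’s worth checking that the two versions agree; a quick sanity check along these lines (assuming the data frame above):

# row-by-row distances, one call per venue
rowWise = sapply(1:nrow(venues), function(i) distHaversine(c(venues$lon[i], venues$lat[i]), centre))

# matrix version, one call for everything
vectorised = distHaversine(cbind(venues$lon, venues$lat), centre)

# should print TRUE if the matrix version really is equivalent
all.equal(rowWise, vectorised)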
One I can’t figure out…
And finally, I have a function which I can’t figure out how to vectorise, but maybe someone with more R skillz than me can?
I have a data frame containing the cumulative member counts of various NoSQL London groups:
cumulativeMeetupMembers = read.csv("/tmp/cumulativeMeetupMembers.csv")

> cumulativeMeetupMembers %>% sample_n(10)
                               g.name dayMonthYear    n
4734            Hadoop Users Group UK   2013-10-26 1144
4668            Hadoop Users Group UK   2013-08-03  979
4936            Hadoop Users Group UK   2014-07-31 1644
5150                      Hive London   2012-10-15  109
8020        Neo4j - London User Group   2014-03-15  826
7666        Neo4j - London User Group   2012-08-06   78
1030                  Big Data London   2013-03-01 1416
6500        London MongoDB User Group   2013-09-21  952
8290 Oracle Big Data 4 the Enterprise   2012-06-04   61
2584              Data Science London   2012-03-20  285
And I want to find out the number of members for a group on a specific date, e.g. given the following data…
> cumulativeMeetupMembers %>% head(10)
                                           g.name dayMonthYear  n
1  Big Data / Data Science / Data Analytics Jobs    2013-01-29  1
2  Big Data / Data Science / Data Analytics Jobs    2013-02-06 15
3  Big Data / Data Science / Data Analytics Jobs    2013-02-07 28
4  Big Data / Data Science / Data Analytics Jobs    2013-02-10 31
5  Big Data / Data Science / Data Analytics Jobs    2013-02-18 33
6  Big Data / Data Science / Data Analytics Jobs    2013-03-27 38
7  Big Data / Data Science / Data Analytics Jobs    2013-04-16 41
8  Big Data / Data Science / Data Analytics Jobs    2013-07-17 53
9  Big Data / Data Science / Data Analytics Jobs    2013-08-28 58
10 Big Data / Data Science / Data Analytics Jobs    2013-11-11 63
…the number of members for the ‘Big Data / Data Science / Data Analytics Jobs’ group on the 10th November 2013 should be 58.
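In other words, we want the last count recorded strictly before the lookup date, which for this example is the 2013-08-28 row. Using the same filter the memberCount function below uses:

> cumulativeMeetupMembers %>%
    filter(g.name == "Big Data / Data Science / Data Analytics Jobs" & dayMonthYear < "2013-11-10") %>%
    tail(1)
                                         g.name dayMonthYear  n
9 Big Data / Data Science / Data Analytics Jobs   2013-08-28 58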
I created this data frame of groups and random dates:
dates = ymd("2014-09-01") + c(0:9) * weeks(1)
groups = cumulativeMeetupMembers %>% distinct(g.name) %>% select(g.name)

groupsOnDate = merge(dates, groups)
names(groupsOnDate) = c('date', 'name')

> groupsOnDate %>% sample_n(10)
          date                                             name
156 2014-10-06                                  GridGain London
153 2014-09-15                                  GridGain London
70  2014-11-03                                 Couchbase London
185 2014-09-29                            Hadoop Users Group UK
105 2014-09-29                              Data Science London
137 2014-10-13             Equal Experts Technical Meetup Group
360 2014-11-03                         Scale Warriors of London
82  2014-09-08 Data Science & Business Analytics London Meetup
233 2014-09-15                  London ElasticSearch User Group
84  2014-09-22 Data Science & Business Analytics London Meetup
The non-vectorised version looks like this:
memberCount = function(meetupMembers) {
  function(groupName, date) {
    (meetupMembers %>%
       filter(g.name == groupName & dayMonthYear < date) %>%
       do(tail(., 1)))$n
  }
}

findMemberCount = memberCount(cumulativeMeetupMembers)

> system.time(groupsOnDate %>%
    mutate(groupMembers = by(groupsOnDate, 1:nrow(groupsOnDate), function(row) {
      findMemberCount(row$name, as.character(row$date))
    }) %>% cbind() %>% as.vector()
  ))
   user  system elapsed
  2.259   0.005   2.269
The output looks like this:
          date                                     name groupMembers
116 2014-10-06                      DeNormalised London          157
322 2014-09-08                 OpenCredo Tech Workshops            7
71  2014-09-01                  Data Enthusiasts London
233 2014-09-15          London ElasticSearch User Group          614
171 2014-09-01 HPC & GPU Supercomputing Group of London           80
109 2014-10-27                      Data Science London         3632
20  2014-11-03            Big Data Developers in London          708
42  2014-09-08              Big Data Week London Meetup           96
127 2014-10-13          Enterprise Search London Meetup          575
409 2014-10-27                            Women in Data          548
I’ve tried many different approaches but haven’t been able to come up with a version that lets me pass in all the rows to memberCount and calculate the count for each row in one go.
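One direction that might be worth exploring is a rolling join with the data.table package, which can match each lookup date to the most recent earlier observation per group. An untested sketch, so a different library from the dplyr code above:

library(data.table)

members = as.data.table(cumulativeMeetupMembers)
members[, dayMonthYear := as.Date(dayMonthYear)]
setkey(members, g.name, dayMonthYear)

# subtracting a day mimics the strict "<" in memberCount, since the
# rolling join matches dates on or before the lookup date
lookup = data.table(g.name = groupsOnDate$name,
                    dayMonthYear = as.Date(groupsOnDate$date) - 1)

# roll = TRUE takes, for each lookup row, the members row with the same
# g.name and the latest dayMonthYear on or before it
result = members[lookup, roll = TRUE]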
Any ideas/advice/hints welcome!