How to Find the Longest Consecutive Series of Events in SQL
A very interesting problem that can be solved very easily with SQL is to find consecutive series of events in a time series. But what is a consecutive series of events in a time series?
Take Stack Overflow, for example. Stack Overflow has a cool reputation system that uses badges to reward certain behaviour. As a social website, they encourage users to visit the platform every day. As such, two distinct badges are awarded:
Informally, it is obvious what this means. You’ll have to log in on day 1. Then again on day 2. Then again (perhaps several times, it doesn’t matter) on day 3. Forgot to log in on day 4? Ooops. We’ll start counting again.
How to do this in SQL?
On this blog, every problem will find its solution in SQL. So does this. And in order to solve this problem, we’re going to use the awesome Stack Exchange Data Explorer, which exposes a lot of Stack Exchange’s publicly available usage information.
Note that we won’t query the consecutive days of visits, as this information is not made available publicly. Instead, let’s query the consecutive days of posts a user has made.
The backing database is SQL Server, so we can run the following statement:
SELECT DISTINCT CAST(CreationDate AS DATE) AS date FROM Posts WHERE OwnerUserId = ##UserId## ORDER BY 1
… which, for my own UserId
generates something like:
date ---------- 2010-11-26 2010-11-27 2010-11-29 2010-11-30 2010-12-01 2010-12-02 2010-12-03 2010-12-05 2010-12-06 2010-12-07 2010-12-08 2010-12-09 2010-12-13 2010-12-14 ... (769 rows)
(run the statement yourself, here)
As we can see in the data, there have been gaps in the very early days:
date -------------------------------------- 2010-11-26 2010-11-27 <---- Gap here after 2 days 2010-11-29 2010-11-30 2010-12-01 2010-12-02 2010-12-03 <---- Gap here after 5 days 2010-12-05 2010-12-06 2010-12-07 2010-12-08 2010-12-09 <---- Gap here after 5 days 2010-12-13 2010-12-14 ...
Visually, it is very easy to see how many days in a row there were posts without any gaps. But how to do it with SQL?
To simplify the problem, let’s “store” individual queries in common table expressions. The above query, we’ll call dates
:
WITH -- This table contains all the distinct date -- instances in the data set dates(date) AS ( SELECT DISTINCT CAST(CreationDate AS DATE) FROM Posts WHERE OwnerUserId = ##UserId## ) ...
Now, the goal of the resulting query is to put all consecutive dates in the same group, such that we can aggregate over this group. The following query is what we want to write:
SELECT COUNT(*) AS consecutiveDates, MIN(week) AS minDate, MAX(week) AS maxDate FROM groups GROUP BY grp ORDER BY 1 DESC, 2 DESC
We’d like to aggregate each group “grp
” and count the number of dates in the group, as well as find the lowest and the highest date within each group.
Generating groups for consecutive dates
Let’s look at the data again, and to illustrate the idea, we’ll add consecutive row numbers, regardless of the gaps in dates:
row number date -------------------------------- 1 2010-11-26 2 2010-11-27 3 2010-11-29 <-- gap before this row 4 2010-11-30 5 2010-12-01 6 2010-12-02 7 2010-12-03 8 2010-12-05 <-- gap before this row
As you can see, regardless whether there is a gap between dates (two dates are not consecutive), their row numbers will still be consecutive. We can do this with the ROW_NUMBER()
window function, very easily:
SELECT ROW_NUMBER() OVER (ORDER BY date) AS [row number], date FROM dates
Now, let’s check out the following, interesting query:
WITH -- This table contains all the distinct date -- instances in the data set dates(date) AS ( SELECT DISTINCT CAST(CreationDate AS DATE) FROM Posts WHERE OwnerUserId = ##UserId## ), -- Generate "groups" of dates by subtracting the -- date's row number (no gaps) from the date itself -- (with potential gaps). Whenever there is a gap, -- there will be a new group groups AS ( SELECT ROW_NUMBER() OVER (ORDER BY date) AS rn, dateadd(day, -ROW_NUMBER() OVER (ORDER BY date), date) AS grp, date FROM dates ) SELECT * FROM groups ORDER BY rn
The above query yields:
rn grp date --- ---------- ---------- 1 2010-11-25 2010-11-26 2 2010-11-25 2010-11-27 3 2010-11-26 2010-11-29 4 2010-11-26 2010-11-30 5 2010-11-26 2010-12-01 6 2010-11-26 2010-12-02 7 2010-11-26 2010-12-03 8 2010-11-27 2010-12-05 9 2010-11-27 2010-12-06 10 2010-11-27 2010-12-07 11 2010-11-27 2010-12-08 12 2010-11-27 2010-12-09 13 2010-11-30 2010-12-13 14 2010-11-30 2010-12-14
(run the statement yourself, here)
All we did is subtract the row number from the date to get a new date “grp
“. The actual date obtained this way is irrelevant. It’s just an auxiliary value.
What we can guarantee, though, is that for consecutive dates, the value of grp
will be the same because for all consecutive dates, the following two equations yield true:
date2 - date1 = 1 // difference in days between dates rn2 - rn1 = 1 // difference in row numbers
Yet, for non-consecutive dates, while the difference in row numbers is still 1, the difference in days is no longer 1. The groups can now be seen easily:
rn grp date --- ---------- ---------- 1 2010-11-25 2010-11-26 2 2010-11-25 2010-11-27 3 2010-11-26 2010-11-29 4 2010-11-26 2010-11-30 5 2010-11-26 2010-12-01 6 2010-11-26 2010-12-02 7 2010-11-26 2010-12-03 8 2010-11-27 2010-12-05 9 2010-11-27 2010-12-06 10 2010-11-27 2010-12-07 11 2010-11-27 2010-12-08 12 2010-11-27 2010-12-09 13 2010-11-30 2010-12-13 14 2010-11-30 2010-12-14
Thus, the complete query can now be seen here:
WITH -- This table contains all the distinct date -- instances in the data set dates(date) AS ( SELECT DISTINCT CAST(CreationDate AS DATE) FROM Posts WHERE OwnerUserId = ##UserId## ), -- Generate "groups" of dates by subtracting the -- date's row number (no gaps) from the date itself -- (with potential gaps). Whenever there is a gap, -- there will be a new group groups AS ( SELECT ROW_NUMBER() OVER (ORDER BY date) AS rn, dateadd(day, -ROW_NUMBER() OVER (ORDER BY date), date) AS grp, date FROM dates ) SELECT COUNT(*) AS consecutiveDates, MIN(week) AS minDate, MAX(week) AS maxDate FROM groups GROUP BY grp ORDER BY 1 DESC, 2 DESC
And it yields:
consecutiveDates minDate maxDate ---------------- ------------- ------------- 14 2012-08-13 2012-08-26 14 2012-02-03 2012-02-16 10 2013-10-24 2013-11-02 10 2011-05-11 2011-05-20 9 2011-06-30 2011-07-08 7 2012-01-17 2012-01-23 7 2011-06-14 2011-06-20 6 2012-04-10 2012-04-15 6 2012-04-02 2012-04-07 6 2012-03-26 2012-03-31 6 2011-10-27 2011-11-01 6 2011-07-17 2011-07-22 6 2011-05-23 2011-05-28 ...
(run the statement yourself, here)
Bonus query 1: Find consecutive weeks
The fact that we chose the granularity of days in the above query is a random choice. We simply took the timestamp from our time series and “collapsed” it to the desired granularity using a CAST
function:
SELECT DISTINCT CAST(CreationDate AS DATE)
If we want to know the consecutive weeks, we’ll simply change that function to a different expression, e.g.
SELECT DISTINCT datepart(year, CreationDate) * 100 + datepart(week, CreationDate)
This new expression takes the year and the week and generates values like 201503 for week 03 in the year 2015. The rest of the statement remains exactly the same:
WITH weeks(week) AS ( SELECT DISTINCT datepart(year, CreationDate) * 100 + datepart(week, CreationDate) FROM Posts WHERE OwnerUserId = ##UserId## ), groups AS ( SELECT ROW_NUMBER() OVER (ORDER BY week) AS rn, dateadd(day, -ROW_NUMBER() OVER (ORDER BY week), week) AS grp, week FROM weeks ) SELECT COUNT(*) AS consecutiveWeeks, MIN(week) AS minWeek, MAX(week) AS maxWeek FROM groups GROUP BY grp ORDER BY 1 DESC, 2 DESC
And we’ll get the following result:
consecutiveWeeks minWeek maxWeek ---------------- ------- ------- 45 201401 201445 29 201225 201253 25 201114 201138 23 201201 201223 20 201333 201352 16 201529 201544 15 201305 201319 12 201514 201525 12 201142 201153 9 201502 201510 7 201447 201453 7 201321 201327 6 201048 201053 4 201106 201109 3 201329 201331 3 201102 201104 2 201301 201302 2 201111 201112 1 201512 201512
(run the statement yourself, here)
Unsurprisingly, the consecutive weeks span much longer ranges, as I generally use Stack Overflow extensively.
Bonus query 2: Simplify the query using DENSE_RANK()
In a previous article, we’ve shown that SQL Trick: ROW_NUMBER()
is to SELECT
what DENSE_RANK()
is to SELECT DISTINCT
.
If we go back to our consecutive days example, we can rewrite the query to find the distinct dates AND the groups in one go, using DENSE_RANK()
:
WITH groups(date, grp) AS ( SELECT DISTINCT CAST(CreationDate AS DATE), dateadd(day, -DENSE_RANK() OVER (ORDER BY CAST(CreationDate AS DATE)), CAST(CreationDate AS DATE)) AS grp FROM Posts WHERE OwnerUserId = ##UserId## ) SELECT COUNT(*) AS consecutiveDates, MIN(date) AS minDate, MAX(date) AS maxDate FROM groups GROUP BY grp ORDER BY 1 DESC, 2 DESC
(run the statement yourself, here)
If the above doesn’t make sense, I recommend reading our previous article here, which explains it:
Further reading
The above has been one very useful example of using window functions (ROW_NUMBER()
) in SQL. Learn more about window functions in any of the following articles:
- Use this Neat Window Function Trick to Calculate Time Differences in a Time Series
- Probably the Coolest SQL Feature: Window Functions
- Don’t Miss out on Awesome SQL Power with FIRST_VALUE(), LAST_VALUE(), LEAD(), and LAG()
- The Difference Between ROW_NUMBER(), RANK(), and DENSE_RANK()
Reference: | How to Find the Longest Consecutive Series of Events in SQL from our JCG partner Lukas Eder at the JAVA, SQL, AND JOOQ blog. |