Do You Really Understand SQL’s GROUP BY and HAVING clauses?
There are some things in SQL that we simply take for granted without thinking about them properly.
One of these things are the GROUP BY
and the less popular HAVING
clauses. Let’s look at a simple example. For this example, we’ll reiterate the example database we’ve seen in this previous article about the awesome LEAD(), LAG(), FIRST_VALUE(), LAST_VALUE() functions:
CREATE TABLE countries ( code CHAR(2) NOT NULL, year INT NOT NULL, gdp_per_capita DECIMAL(10, 2) NOT NULL, govt_debt DECIMAL(10, 2) NOT NULL );
Before there were window functions, aggregations were made only with GROUP BY
. A typical question that we could ask our database using SQL is:
What are the top 3 average government debts in percent of the GDP for those countries whose GDP per capita was over 40’000 dollars in every year in the last four years.
Whew. Some (academic) business requirements.
In SQL (PostgreSQL dialect), we would write:
select code, avg(govt_debt) from countries where year > 2010 group by code having min(gdp_per_capita) >= 40000 order by 2 desc limit 3
Or, with inline comments
-- The average government debt select code, avg(govt_debt) -- for those countries from countries -- in the last four years where year > 2010 -- yepp, for the countries group by code -- whose GDP p.c. was over 40'000 in every year having min(gdp_per_capita) >= 40000 -- The top 3 order by 2 desc limit 3
The result being:
code avg ------------ JP 193.00 US 91.95 DE 56.00
Remember the 10 easy steps to a complete understanding of SQL:
FROM
generates the data setWHERE
reduces the generated data setGROUP BY
aggregates the reduced data setHAVING
reduces the aggregated data setSELECT
transforms the reduced aggregated data setORDER BY
sorts the transformed data setLIMIT .. OFFSET
frames the sorted data set
… where LIMIT .. OFFSET
may come in very different flavours.
The empty GROUP BY clause
A very special case of GROUP BY
is the explicit or implicit empty GROUP BY
clause. Here’s a question that we could ask our database:
Are there any countries at all with a GDP per capita of more than 50’000 dollars?
And in SQL, we’d write:
select true answer from countries having max(gdp_per_capita) >= 50000
The result being
answer ------ t
You could of course have used the EXISTS
clause instead (please don’t use COUNT(*)
in these cases):
select exists( select 1 from countries where gdp_per_capita >= 50000 );
And we would get, again:
answer ------ t
… but let’s focus on the plain HAVING
clause.
Not everyone knows that HAVING
can be used all by itself, or what it even means to have HAVING
all by itself. Already the SQL 1992 standard allowed for the use of HAVING
without GROUP BY
, but it wasn’t until the introduction of GROUPING SETS
in SQL:1999, when the semantics of this syntax was retroactively unambiguously defined:
7.10 <having clause>
<having clause> ::= HAVING <search condition>Syntax Rules
1) Let HC be the <having clause>. Let TE be the <table expression> that immediately contains
HC. If TE does not immediately contain a <group by clause>, then GROUP BY ( ) is implicit.
That’s interesting. There is an implicit GROUP BY ( )
, if we leave out the explicit GROUP BY
clause. If you’re willing to delve into the SQL standard a bit more, you’ll find:
<group by clause> ::= GROUP BY <grouping specification> <grouping specification> ::= <grouping column reference> | <rollup list> | <cube list> | <grouping sets list> | <grand total> | <concatenated grouping> <grouping set> ::= <ordinary grouping set> | <rollup list> | <cube list> | <grand total> <grand total> ::= <left paren> <right paren>
So, GROUP BY ( )
is essentially grouping by a “grand total”, which is what’s intuitively happening, if we just look for the highest ever GDP per capita:
select max(gdp_per_capita) from countries;
Which yields:
max -------- 52409.00
The above query is also implicitly the same as this one (which isn’t supported by PostgreSQL):
select max(gdp_per_capita) from countries group by ();
The awesome GROUPING SETs
In this section of the article, we’ll be leaving PostgreSQL land, entering SQL Server land, as PostgreSQL shamefully doesn’t implement any of the following (yet).
Now, we cannot understand the grand total (empty GROUP BY ( )
clause), without having a short look at the SQL:1999 standard GROUPING SETS
. Some of you may have heard of CUBE()
or ROLLUP()
grouping functions, which are just syntactic sugar for commonly used GROUPING SETS
. Let’s try to answer this question in a single query:
What are the highest GDP per capita values per year OR per country
In SQL, we’ll write:
select code, year, max(gdp_per_capita) from countries group by grouping sets ((code), (year))
Which yields two concatenated sets of records:
code year max ------------------------ NULL 2009 46999.00 <- grouped by year NULL 2010 48358.00 NULL 2011 51791.00 NULL 2012 52409.00 CA NULL 52409.00 <- grouped by code DE NULL 44355.00 FR NULL 42578.00 GB NULL 38927.00 IT NULL 36988.00 JP NULL 46548.00 RU NULL 14091.00 US NULL 51755.00
That’s kind of nice, isn’t it? It’s essentially just the same thing as this query with UNION ALL
select code, null, max(gdp_per_capita) from countries group by code union all select null, year, max(gdp_per_capita) from countries group by year;
In fact, it’s exactly the same thing, as the latter explicitly concatenates two sets of grouped records… i.e. two GROUPING SETS
. This SQL Server documentation page also explains it very nicely.
And the most powerful of them all: CUBE()
Now, imagine, you’d like to add the “grand total”, and also the highest value per country AND year, producing four different concatenated sets. To limit the results, we’ll also filter out GDPs of less than 48000 for this example:
select code, year, max(gdp_per_capita), grouping_id(code, year) grp from countries where gdp_per_capita >= 48000 group by grouping sets ( (), (code), (year), (code, year) ) order by grp desc;
This nice-looking query will now produce all the possible grouping combinations that we can imagine, including the grand total, in order to produce:
code year max grp --------------------------------- NULL NULL 52409.00 3 <- grand total NULL 2012 52409.00 2 <- group by year NULL 2010 48358.00 2 NULL 2011 51791.00 2 CA NULL 52409.00 1 <- group by code US NULL 51755.00 1 US 2010 48358.00 0 <- group by code and year CA 2012 52409.00 0 US 2012 51755.00 0 CA 2011 51791.00 0 US 2011 49855.00 0
And because this is quite a common operation in reporting and in OLAP, we can simply write the same by using the CUBE()
function:
select code, year, max(gdp_per_capita), grouping_id(code, year) grp from countries where gdp_per_capita >= 48000 group by cube(code, year) order by grp desc;
Compatibility
While the first couple of queries also worked on PostgreSQL, the ones that are using GROUPING SETS
will work only on 4 out of 17 RDBMS currently supported by jOOQ. These are:
- DB2
- Oracle
- SQL Server
- Sybase SQL Anywhere
jOOQ also fully supports the previously mentioned syntaxes. The GROUPING SETS
variant can be written as such:
// Countries is an object generated by the jOOQ // code generator for the COUNTRIES table. Countries c = COUNTRIES; ctx.select( c.CODE, c.YEAR, max(c.GDP_PER_CAPITA), groupingId(c.CODE, c.YEAR).as("grp")) .from(c) .where(c.GDP_PER_CAPITA.ge(new BigDecimal("48000"))) .groupBy(groupingSets(new Field[][] { {}, { c.CODE }, { c.YEAR }, { c.CODE, c.YEAR } })) .orderBy(fieldByName("grp").desc()) .fetch();
… or the CUBE()
version:
ctx.select( c.CODE, c.YEAR, max(c.GDP_PER_CAPITA), groupingId(c.CODE, c.YEAR).as("grp")) .from(c) .where(c.GDP_PER_CAPITA.ge(new BigDecimal("48000"))) .groupBy(cube(c.CODE, c.YEAR)) .orderBy(fieldByName("grp").desc()) .fetch();
… and in the future, we’ll emulate GROUPING SETS
by their equivalent UNION ALL
queries in those databases that do not natively support GROUPING SETS
.
Reference: | Do You Really Understand SQL’s GROUP BY and HAVING clauses? from our JCG partner Lukas Eder at the JAVA, SQL, AND JOOQ blog. |