Please, Run That Calculation in Your RDBMS
There’s one thing that you can do terribly wrong when working with RDBMS. And that thing is not running your calculations in the database, when you should.
We’re not advocating to blindly move all business logic into the database, but when I see a Stack Overflow question like this, I feel the urge to gently remind you of the second item in our popular 10 Common Mistakes Java Developers Make When Writing SQL.
The Stack Overflow question essentially boils down to this (liberally quoted):
From the following medium-sized table, I wish to count the number of documents with status 0 or 1 per application ID:
AppID | DocID | DocStatus ------+-------+---------- 1 | 100 | 0 1 | 101 | 1 2 | 200 | 0 2 | 300 | 1 ... | ... | ...Should I use Hibernate for that?
And the answer: NO! Don’t use Hibernate for that (unless you mean native querying). You should use SQL for that. Es-Queue-El! You have so many trivial options to make your SQL Server help you run this query in a fraction of the time it would take if you loaded all that data into Java memory before aggregating!
For instance (using SQL Server):
Using GROUP BY
This is the most trivial one, but it might not return result in exactly the way you wanted, i.e. different aggregation results are in different rows:
SELECT [AppID], [DocStatus], count(*) FROM [MyTable] GROUP BY [AppID], [DocStatus]
Example on SQLFiddle, returning something like
| APPID | DOCSTATUS | COLUMN_2 | |-------|-----------|----------| | 1 | 0 | 2 | | 2 | 0 | 3 | | 1 | 1 | 3 | | 2 | 1 | 2 |
Using nested selects
This is probably the solution that this particular user was looking for. They probably want each aggregation in a separate column, and one very generic way to achieve this is by using nested selects. Note that this solution might prove to be a bit slow in some databases that have a hard time optimising these things
SELECT [AppID], (SELECT count(*) FROM [MyTable] [t2] WHERE [t1].[AppID] = [t2].[AppID] AND [DocStatus] = 0) [Status_0], (SELECT count(*) FROM [MyTable] [t2] WHERE [t1].[AppID] = [t2].[AppID] AND [DocStatus] = 1) [Status_1] FROM [MyTable] [t1] GROUP BY [AppID]
Example on SQLFiddle, returning something like
| APPID | STATUS_0 | STATUS_1 | |-------|----------|----------| | 1 | 2 | 3 | | 2 | 3 | 2 |
Using SUM()
This solution is probably the optimal one. It is equivalent to the previous one with nested selects, although it only works for simple queries, whereas the nested selects version is more versatile.
SELECT [AppID], SUM(IIF([DocStatus] = 0, 1, 0)) [Status_0], SUM(IIF([DocStatus] = 1, 1, 0)) [Status_1] FROM [MyTable] [t1] GROUP BY [AppID]
Example on SQLFiddle, same result as before
Using PIVOT
This solution is for the SQL Aficionados among yourselves. It uses the T-SQL PIVOT
clause!
SELECT [AppID], [0], [1] FROM ( SELECT [AppID], [DocStatus] FROM [MyTable] ) [t] PIVOT ( count([DocStatus]) FOR [DocStatus] IN ([0], [1]) ) [pvt]
SQL aficionados use PIVOT
Example on SQLFiddle, same result as before
Conclusion
You may freely choose your weapon among the above suggestions, and I’m sure there are more alternatives. All of them will outperform any Java-based aggregation implementation by orders of magnitude, even for trivially small data sets for sure. We’ll say this time and again, and we’ll quote Gavin King time and again for the same thing:
Just because you’re using Hibernate, doesn’t mean you have to use it for everything. A point I’ve been making for about ten years now.
And in our words:
Use SQL whenever appropriate! And that is much more often than you might think!
I agree completely with the setiment in this article: get the database engine to aggregate for you, because that’s what it’s good at. It would be great to see this article extednded with a discussion of execution plans, especially when comparing three different queries that produce the same result. Don’t just say “this solution might prove to be a bit slow” – prove it with a plan! SQL Server 2012 on sqlfiddle is down right now (“can’t connect to datasource [sqlfiddle_mssql2]”). Will try again later to reproduce your queries and plans. “SQL aficionados use PIVOT” – only as a last… Read more »
Yes, that’s a good idea! I’ll do this in a follow-up blog post. It’ll be good to compare SQL Server, Oracle and 1-2 others, maybe PostgreSQL and MySQL.