How to Find the Closest Subset Sum with SQL
I’ve stumbled upon this very interesting question on Stack Overflow, recently. Its title is:
In this article, we’ll compare the user’s imperative approach to the extremely elegant (Oracle) SQL approach. We’ll be making use of any combination of these awesome SQL features:
- Window functions
- FIRST and LAST functions, using Oracle-specific syntax
LATERAL JOIN
(orAPPLY
- Recursive SQL
The problem
The user alhashmiya who had asked this question, was looking for a solution to the problem of finding the “closest” sum of elements in a subset of numbers A to a set of “expected” sums B. More concretely, alhasmiya had the following two tables:
ID ASSIGN_AMT -------------- 1 25150 2 19800 3 27511
And…
ID WORK_AMT ------------ 1 7120 2 8150 3 8255 4 9051 5 1220 6 12515 7 13555 8 5221 9 812 10 6562
The ASSIGN_AMT
value is the “expected” sum. What alhashmiya was looking for is the sum of a subset of WORK_AMT
values A, such that this sum is as close as possible to any of the “expected” sums. There are two ways to understand this problem:
- The possible “closest” sums are restricted to be the sums obtained in a strictly defined order, e.g. ordered by
ID
. An application of this understanding is to find out the exact moment when a well-defined, ordered running total (e.g. bank account balance) has exceeded a certain threshold - The possible “closest” sums are unrestricted. Any unordered subset qualifies to calculate such a sum. An application of this understanding is to find a combination of discrete values to reach a target value as closely as possible, e.g. to optimise a trade.
The second understanding is called the “subset sum problem”, for which there are only exponential algorithms if you’re looking for an exact solution. It is important to notice that this algorithm will NOT scale well at all, regardless of the solution technique!
But let’s look at the simpler understanding first:
1. Calculating a sum of an ordered subset of values
By strictly ordered we mean (in the sense of the question) that we want to order all WORK_AMT
values, e.g. by ID
, and allow only for sums that can appear in this particular order. I.e.
ID WORK_AMT SUBSET_SUM ------------------------ 1 7120 7120 (= WORK_AMT) 2 8150 15270 (= previous SUBSET_SUM + WORK_AMT) 3 8255 23525 (= previous SUBSET_SUM + WORK_AMT) 4 9051 32576 (= ...) 5 1220 33796 6 12515 46311 7 13555 59866 8 5221 65087 9 812 65899 10 6562 72461
The above SUBSET_SUM
value is the sum of all WORK_AMT
values with ID <= "current" ID
. We’ve seen this before on this blog, this is called a running total, and it is best calculated using window functions as follows:
WITH VALS (ID, WORK_AMT) AS ( SELECT 1 , 7120 FROM DUAL UNION ALL SELECT 2 , 8150 FROM DUAL UNION ALL SELECT 3 , 8255 FROM DUAL UNION ALL SELECT 4 , 9051 FROM DUAL UNION ALL SELECT 5 , 1220 FROM DUAL UNION ALL SELECT 6 , 12515 FROM DUAL UNION ALL SELECT 7 , 13555 FROM DUAL UNION ALL SELECT 8 , 5221 FROM DUAL UNION ALL SELECT 9 , 812 FROM DUAL UNION ALL SELECT 10, 6562 FROM DUAL ) SELECT ID, WORK_AMT, SUM (WORK_AMT) OVER (ORDER BY ID) AS SUBSET_SUM FROM VALS ORDER BY ID
The above window function calculates the sum of all WORK_AMT
values that are in the “window” of values where the ID
is smaller or equal to the current ID
.
Finding the “closest” of these sums with quantified comparison predicates
Now, the task at hand is to find for each value ASSIGN_AMT
in 25150
, 19800
, and 27511
the closest value of SUBSET_SUM
. In a way, what we are trying to do is we want to minimise the expression ABS(SUBSET_SUM - ASSIGN_AMT)
.
However, the MIN()
aggregate function won’t help us here, because that will simply return the minimum value of this difference. We want the value of SUBSET_SUM
that produces this difference in the first place.
One solution would be to use a quantified comparison predicate, a rarely used and not well-known comparison operator that works in all SQL databases:
-- The common table expressions are the same as -- in the previous examples WITH ASSIGN(ID, ASSIGN_AMT) AS ( SELECT 1, 25150 FROM DUAL UNION ALL SELECT 2, 19800 FROM DUAL UNION ALL SELECT 3, 27511 FROM DUAL ), VALS (ID, WORK_AMT) AS ( SELECT 1 , 7120 FROM DUAL UNION ALL SELECT 2 , 8150 FROM DUAL UNION ALL SELECT 3 , 8255 FROM DUAL UNION ALL SELECT 4 , 9051 FROM DUAL UNION ALL SELECT 5 , 1220 FROM DUAL UNION ALL SELECT 6 , 12515 FROM DUAL UNION ALL SELECT 7 , 13555 FROM DUAL UNION ALL SELECT 8 , 5221 FROM DUAL UNION ALL SELECT 9 , 812 FROM DUAL UNION ALL SELECT 10, 6562 FROM DUAL ), SUMS (ID, WORK_AMT, SUBSET_SUM) AS ( SELECT VALS.*, SUM (WORK_AMT) OVER (ORDER BY ID) FROM VALS ) -- This is now the interesting part, where we -- calculate the closest sum SELECT ASSIGN.ID, ASSIGN.ASSIGN_AMT, SUBSET_SUM FROM ASSIGN JOIN SUMS ON ABS (ASSIGN_AMT - SUBSET_SUM) <= ALL ( SELECT ABS (ASSIGN_AMT - SUBSET_SUM) FROM SUMS )
The quantified comparison predicate reads intuitively. We’re looking for that specific SUBSET_SUM
whose difference to the “expected” ASSIGN_AMT
is smaller or equal to all the other possible differences.
The above query yields:
ID ASSIGN_AMT SUBSET_SUM -------------------------- 1 25150 23525 2 19800 23525 3 27511 23525
In this case, it’s always the same.
You may have noticed that the solution is not entirely correct in the event where ASSIGN_AMT
is allowed to be zero (let’s ignore the possibility of negative values) – in case of which we’ll produce duplicate values in the JOIN
. This can be achieved when replacing:
UNION ALL SELECT 4 , 0 FROM DUAL
Now, the result is:
ID ASSIGN_AMT SUBSET_SUM -------------------------- 1 25150 23525 2 19800 23525 2 19800 23525 3 27511 23525
One solution is to remove those duplicates using DISTINCT
(which is an anti-pattern. See #6 in this list). A better solution is to make the JOIN
predicate unambiguous by comparing ID
values as well, i.e.:
SELECT ASSIGN.ID, ASSIGN.ASSIGN_AMT, SUBSET_SUM FROM ASSIGN JOIN SUMS ON (ABS (ASSIGN_AMT - SUBSET_SUM), SUMS.ID) <= ALL ( SELECT ABS (ASSIGN_AMT - SUBSET_SUM), ID FROM SUMS )
The above unfortunately doesn’t work in Oracle (yet), which will report an error:
ORA-01796: this operator cannot be used with lists
Oracle supports comparing tuples / row value expressions only with equal comparators, not with less than / greater than comparators – which is a shame. The same query runs smoothlessly in PostgreSQL.
Finding the “closest” of these sums with Oracle’s FIRST function
Oracle has a very interesting function to keep the first (or last) values in a set of aggregate values of a group given any particular ordering, and calculating an aggregate function only on those values within the group. The following SQL statement will illustrate this:
SELECT ASSIGN.ID, ASSIGN.ASSIGN_AMT, MIN (SUBSET_SUM) KEEP ( DENSE_RANK FIRST ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM) ) AS CLOSEST_SUM FROM ASSIGN CROSS JOIN SUMS GROUP BY ASSIGN.ID, ASSIGN.ASSIGN_AMT
Essentially, we’re grouping all values from the SUMS
table for each ASSIGN_AMT
. For each of those groups, we’ll look for the "FIRST"
row that appears when ordering rows within the group by our criteria ABS(ASSIGN_AMT - SUBSET_SUM)
. We "KEEP"
only those rows in the group, and retain from those rows the minimum SUBSET_SUM
.
This query will again yield:
ID ASSIGN_AMT SUBSET_SUM -------------------------- 1 25150 23525 2 19800 23525 3 27511 23525
This is an extremely nice functionality that can come in handy every once in a while.
Remember that we’ve seen a similar feature recently on this blog, when we were looking for FIRST_VALUE
(or LAST_VALUE
) within the PARTITION
of a window. In standard SQL, a similar thing can be achieved using window functions as such:
SELECT DISTINCT ASSIGN.ID, ASSIGN.ASSIGN_AMT, FIRST_VALUE (SUBSET_SUM) OVER ( PARTITION BY ASSIGN.ID ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM) ) AS CLOSEST_SUM FROM ASSIGN CROSS JOIN SUMS
Unfortunately, these solutions all produce duplicates, which we have to remove either via GROUP BY
(KEEP
solution), or via DISTINCT
(FIRST_VALUE
solution).
Finding the “closest” of these sums with LATERAL JOIN
A cleaner solution that doesn’t rely on the removal of duplicates is using Oracle 12c’s new FETCH FIRST
clause along with CROSS JOIN LATERAL
(or CROSS APPLY
, which is the same)
SELECT ASSIGN.ID, ASSIGN.ASSIGN_AMT, CLOSEST_SUM FROM ASSIGN CROSS JOIN LATERAL ( SELECT SUBSET_SUM AS CLOSEST_SUM FROM SUMS ORDER BY ABS (ASSIGN.ASSIGN_AMT - SUBSET_SUM) FETCH FIRST 1 ROW ONLY ) SUMS
What does this mean? We’re essentially joining for each value in ASSIGN
only the FIRST 1 ROW
in SUMS
, ordered by the usual criteria. We need LATERAL
(or APPLY
), because this allows us to reference columns from the left side of the JOIN
expression also in the right side, which wouldn’t be possible otherwise.
The same functionality is supported in SQL Server (only using CROSS APPLY
), or in PostgreSQL (only using CROSS JOIN LATERAL
).
Lateral can be very useful whenever the right hand side of a JOIN
depends on the left hand side. Unlike ordinary joins, this means that the JOIN
order will be set in stone from left to right, and the optimiser has a reduced set of join algorithm options. This is useful in examples like these (with ORDER BY
and FETCH FIRST
), or when joining unnested table-valued functions, which we’ll cover in a future blog post.
2. Calculating a sum of any subset of values
So far, we’ve worked on a simplified version of the problem. This is probably not what alhashmiya meant on their Stack Overflow question. They probably wanted to solve the Subset sum problem, finding the “closest” sum of any subset of WORK_AMT
values.
We’ll use recursive SQL to calculate all the possible sums. First off, let’s remember how recursive SQL works:
In recursive SQL, we need a UNION ALL
query in a common table expression (WITH
clause in Oracle, or WITH RECURSIVE
clause in PostgreSQL). The first subquery of UNION ALL
generates the “seed row(s)” of the recursion, and the second subqeury of UNION ALL
generates the recursion based on a SELECT
that selects from the table being declared, recursively.
So, a naive solution to this subset sum problem can be seen here:
-- Repetition of the previous data WITH ASSIGN (ID, ASSIGN_AMT) AS ( SELECT 1, 25150 FROM DUAL UNION ALL SELECT 2, 19800 FROM DUAL UNION ALL SELECT 3, 27511 FROM DUAL ), WORK (ID, WORK_AMT) AS ( SELECT 1 , 7120 FROM DUAL UNION ALL SELECT 2 , 8150 FROM DUAL UNION ALL SELECT 3 , 8255 FROM DUAL UNION ALL SELECT 4 , 9051 FROM DUAL UNION ALL SELECT 5 , 1220 FROM DUAL UNION ALL SELECT 6 , 12515 FROM DUAL UNION ALL SELECT 7 , 13555 FROM DUAL UNION ALL SELECT 8 , 5221 FROM DUAL UNION ALL SELECT 9 , 812 FROM DUAL UNION ALL SELECT 10, 6562 FROM DUAL ), -- A new way of calculating all possible sums by -- Recursively adding up all the sums SUMS (SUBSET_SUM, MAX_ID) AS ( SELECT WORK_AMT, ID FROM WORK UNION ALL SELECT WORK_AMT + SUBSET_SUM, GREATEST (MAX_ID, WORK.ID) FROM SUMS JOIN WORK ON SUMS.MAX_ID < WORK.ID ) -- The same technique to match the "closest" sum -- As before SELECT ASSIGN.ID, ASSIGN.ASSIGN_AMT, MIN (SUBSET_SUM) KEEP ( DENSE_RANK FIRST ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM) ) AS CLOSEST_SUM FROM SUMS CROSS JOIN ASSIGN GROUP BY ASSIGN.ID, ASSIGN.ASSIGN_AMT
The recursion is simple. In the first subquery of the recursion (the “seed row”), we select each individual row in WORK
:
SELECT WORK_AMT, ID FROM WORK
In the second subquery of the recursion (the “recusion rows”), we join the value of the previous recursion step (SUMS
) with all the remaining values (WORK
), i.e. all the values that have a higher ID
:
SELECT WORK_AMT + SUBSET_SUM, GREATEST (MAX_ID, WORK.ID) FROM SUMS JOIN WORK ON SUMS.MAX_ID < WORK.ID
In this combination, we calculate the intermediate sum (which is also a running total, by the way) and we calculate the highest summed-up ID thus far, to reduce the number of combinations. The latter, we can do because summing is commutative.
The main difference in this solution compared to previous approaches is the fact that we’re now generating a lot (a huge lot) more different values in the SUMS
table.
After a still acceptable 0.112s for 10 different WORK_AMT
values, the database calculated:
ID ASSIGN_AMT CLOSEST_SUM --------------------------- 1 25150 25133 2 19800 19768 3 27511 27488
But beware, as soon as you start adding values to the VALS
table, this algorithm starts exploding in time and space complexity. Running the same query with the following, only slightyl bigger WORK
table already requires 16.3 seconds to yield a result:
WORK(ID, WORK_AMT) AS ( SELECT 1 , 7120 FROM DUAL UNION ALL SELECT 2 , 8150 FROM DUAL UNION ALL SELECT 3 , 8255 FROM DUAL UNION ALL SELECT 4 , 9051 FROM DUAL UNION ALL SELECT 5 , 1220 FROM DUAL UNION ALL SELECT 6 , 12515 FROM DUAL UNION ALL SELECT 7 , 13555 FROM DUAL UNION ALL SELECT 8 , 5221 FROM DUAL UNION ALL SELECT 9 , 812 FROM DUAL UNION ALL SELECT 10, 6562 FROM DUAL UNION ALL SELECT 11, 1234 FROM DUAL UNION ALL SELECT 12, 61 FROM DUAL UNION ALL SELECT 13, 616 FROM DUAL UNION ALL SELECT 14, 2456 FROM DUAL UNION ALL SELECT 15, 5161 FROM DUAL UNION ALL SELECT 16, 414 FROM DUAL UNION ALL SELECT 17, 516 FROM DUAL UNION ALL SELECT 18, 617 FROM DUAL UNION ALL SELECT 19, 146 FROM DUAL ),
And the result would be:
ID ASSIGN_AMT CLOSEST_SUM --------------------------- 1 25150 25150 2 19800 19800 3 27511 27511
Want proof about the actual sum? That’s easy as well with recursive SQL:
-- Repetition of the previous data WITH ASSIGN (ID, ASSIGN_AMT) AS ( SELECT 1, 25150 FROM DUAL UNION ALL SELECT 2, 19800 FROM DUAL UNION ALL SELECT 3, 27511 FROM DUAL ), WORK (ID, WORK_AMT) AS ( SELECT 1 , 7120 FROM DUAL UNION ALL SELECT 2 , 8150 FROM DUAL UNION ALL SELECT 3 , 8255 FROM DUAL UNION ALL SELECT 4 , 9051 FROM DUAL UNION ALL SELECT 5 , 1220 FROM DUAL UNION ALL SELECT 6 , 12515 FROM DUAL UNION ALL SELECT 7 , 13555 FROM DUAL UNION ALL SELECT 8 , 5221 FROM DUAL UNION ALL SELECT 9 , 812 FROM DUAL UNION ALL SELECT 10, 6562 FROM DUAL ), -- A new way of calculating all possible sums by -- Recursively adding up all the sums SUMS (SUBSET_SUM, MAX_ID, CALC) AS ( SELECT WORK_AMT, ID, TO_CHAR(WORK_AMT) FROM WORK UNION ALL SELECT WORK_AMT + SUBSET_SUM, GREATEST (MAX_ID, WORK.ID), CALC || '+' || WORK_AMT FROM SUMS JOIN WORK ON SUMS.MAX_ID < WORK.ID ) -- The same technique to match the "closest" sum -- As before SELECT ASSIGN.ID, ASSIGN.ASSIGN_AMT, MIN (SUBSET_SUM) KEEP ( DENSE_RANK FIRST ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM) ) AS CLOSEST_SUM, MIN (CALC) KEEP ( DENSE_RANK FIRST ORDER BY ABS (ASSIGN_AMT - SUBSET_SUM) ) AS CALCULATION FROM SUMS CROSS JOIN ASSIGN GROUP BY ASSIGN.ID, ASSIGN.ASSIGN_AMT
The above now yields:
ID ASSIGN_AMT CLOSEST_SUM CALCULATION ------------------------------------------------------------ 1 25150 25150 7120 + 8150 + 9051 + 812 2 19800 19800 1220 + 12515 + 5221 + 812 3 27511 27511 8150 + 8255 + 9051 + 1220 + 812
Conclusion
SQL is powerful. Extremely powerful. In this article, we’ve used various techniques to calculate the subset sum problem, or a simplification thereof. We’ve shown how to solve this problem in Oracle or PostgreSQL using a combination of these awesome SQL features:
- Window functions
KEEP FIRST
(in Oracle only)LATERAL JOIN
(orAPPLY
- Recursive SQL
Reference: | How to Find the Closest Subset Sum with SQL from our JCG partner Lukas Eder at the JAVA, SQL, AND JOOQ blog. |