Optimizing Neo4j Cypher queries
Last week, I spent a good number of hours trying to optimize around 20 Cypher queries that were performing disastrously(36866ms to 155575ms) with data from a live system. After some trial and error, and a lot of input from Michael, I was able to figure out generally what needed to be done to the queries to make them perform better- at the end of this, the worst performing query dropped to 521ms on a cold graph and 1GB heap space (and that query has optional paths- not quite sure how to improve that), and the remaining were all under 50ms- quite a large improvement from original numbers.
Hoping that this might help someone else, here is what I did (mostly guess work and largely unscientific)- perhaps Michael can help explain the internals and correct me on some assumptions. The first thing I did was make sure that every cypher query is using parameters- as stated in the Neo4j documentation, this helps in caching of execution plans.
Second, I came across a post in the Neo4j mailing list, where Michael mentioned NOT re-instantiating the ExecutionEngine so that the above said parameterized queries actually do cache. This might seem obvious to many, but it was a fact that easily slipped by, given I have a class called QueryExecutor that contains a method to execute a query with map of parameters, and the method created a new ExecutionEngine for every query. Once this method was written and used many, many times over, it was very easy to forget about. However, this was a very important factor in the overall performance (a mention in the docs would be extremely helpful), and it explained why my queries generally took the same time to execute even when parameterized. Changing this to use a cached ExecutionEngine saw the bottom half of my query time worksheet drop off…0 to 1 ms after being cached- excellent progress.
Now, to get to each query, starting with the worst. I decided to optimize on my local machine, with only 1GB heap space allocated, and on a cold graph. So I ignore the improvement in the query execution after caching- I feel that this is a better way to ensure progress- if the first query hit isn’t improving, then you really haven’t optimized it. That way, if it works really well with limited hardware, I am more confident of it working better in production.
Apart from timing the query in the code, I optimized using the webadmin console. The indicator that the query was awful is that it just would not return and the console would hang. Optimizing so that it didn’t hang was a major improvement in itself. Highly unscientific, but I recommend it.
My first query averaged about 76558ms (the time was obtained by putting in a start and end time around the engine execute method). After the first optimization, it was down to 466ms. Here is the original query:https://gist.github.com/4436272
And this is the optimized one:https://gist.github.com/4436281 There was no need to do this huge match only to filter out results based on the alertDate property on a, so I reduced the match to return the smallest set of data that could be filtered first ie. the path to a.
If you were to execute the bad query up to the first match, you would see something like 20K rows returned. After filtering, they are a mere 450 odd rows. So as you can imagine, the second query quickly trims down the number of results you are potentially working with.
The second change was something I learnt from Michael on the day, when I asked whether it makes sense to do a giant match or keep pruning with subqueries. His response was ‘The question is, are you using match to actually expand the result set by describing patterns or do you just want to make certain that some relationships are there (or not) aka FILTERING. If the latter, you might want to use the path-expression syntax in a where clause.
WHERE a-[:FOO]->b or WHERE NOT(a-[:FOO]->b)'
This took a little getting used to as the way I wrote the MATCH clause was exactly how I’d think of it in my head- but now I am able to differentiate between a match for results required versus match to filter. In the query above, I need (ir) in my results, so there is no need to include (a)-[:alert_for_inspection]->(i) in the match; I can just use it in the WHERE to make sure that a does indeed relate to i.
Here is another example: https://gist.github.com/4436293
Straight away, you can see that we filter the cm relations by date- there is no need for us to first go and match all sorts of stuff if they don’t fall in the date range. So this part of the query can be rewritten to:
start c=node:companies(id={id}) match c <-[parent_company*0..]-(n) with n match n-[:defines_business_process]->(bp)-[:has_cycle]->(cm) where cm.measureDate>={measureStartDate} and cm.measureDate<={measureEndDate}
After that, the next filter is on a- same principle applies:
with cm match (cm)-[:cycle_metric] ->m-[:metric_activity] ->ma-[:metric_unit] -> (u)-[:alert_for_unit]-(a) where a.alertDate=cm.measureDate and a.fromEntityType={type}
This further prunes our result set. Finally add the connections to r (for our results), making sure that paths that do not lead to r but are necessary go into the WHERE clause:
with a,ma match (r) < - [:for_inspection_result]-a-[:alert_for_inspection]- > i where (i) < -[:metric_inspection]-(ma) return a.id as alertId, r.value as resultValue
Here is the full query: https://gist.github.com/4436328 Original time- 33360ms, after optimization- 246ms.
At least for me, most of my queries fell into this pattern, so by day 2, I was able to refactor them very quickly. The interesting thing is, after this, I still felt the response sluggish but the query times printed out in my log were tiny. After elimination, I found that my code actually stuck for a very long time after the query had executed (by executionEngine.execute) but within the first iteration over the resultset.I assume that the results are not collected necessarily during the execute() method, but lazily fetched upon iteration of the resultset- I do not know Cypher internals so I might be completely wrong. But timing the iteration itself serves to point out even more badly written queries.
The other bits and pieces are- ORDER BY adds on a lot of time. If you can do without it, this is the first thing you should drop. DISTINCT also adds time, but in many of my cases, it was difficult to drop.
If you want to check the absence of optional paths, where I typically do MATCH (u)-[r?:has_a]- (a) WHERE NOT (r is null), instead, re-write as MATCH (u)-[other_stuff]-.. WHERE NOT(u-[:has_a]-a) and it performs much better. However where I have optional paths like MATCH X-[o?:optional]-Y WHERE (o is present, match Y to A and B) OR (o is absent, match X to c and d), I was unable to simplify and these queries still take some time as compared to others without optional paths.
Finally- the problem was discovered so late because test data is never quite close to live data. The structure of the graph played a big role- some nodes were heavily connected, others, not so much- and the queries involving those heavily connected nodes were the ones that hurt us most. Try and use production quality data as far as possible to performance test or try and create test data as similar to it.
So, to summarize:
- Always parameterize your queries
- Cache the ExecutionEngine
- Figure out what you’re filtering on- try and apply that filter as early as possible with the least possible matches involved, so that your resultset gets progressively smaller as you move further into the query. Keep measuring both timing and results returned in each subquery so you can decide what goes first when the filter is not obvious
- Examing your MATCH and RETURN clauses- include in the MATCH only those pieces that are required in RETURN. The remaining which would be to filter the results can go into the WHERE
- If you don’t need ORDER BY, drop it yesterday
- If you don’t need DISTINCT, get rid of it too
- Checking the absence/presence of optional paths can be moved from the MATCH into WHERE if you don’t need fancy filtering stuff based on it
- Time not only the execute() on the query, but also the time to iterate through the results
- If your webadmin console hangs, you have done a Bad Thing. Drop various parts of your query to figure out the offending one.
- Try to use live data as far as possible
- Test on a cold graph with miserly resources- you feel so much better when you see it zip past you on production!
Reference: Optimizing Neo4j Cypher queries from our JCG partner Aldrin & Luanne Misquitta at the Thought Bytes blog.