Leveraging new features in Drill 1.2 – ANSI SQL Analytic/Window functions
Today we are very excited to announce the latest version of Apache Drill, 1.2,) as part of the MapR distribution. Drill 1.2 packages for MapR can be downloaded http://doc.mapr.com/display/MapR/Apache+Drill+on+MapR
You can experiment with the release using MapR sandbox and various hands on tutorials at https://www.mapr.com/products/mapr-sandbox-hadoop/download-sandbox-drill
Apache Drill has been gaining significant user adoption and community momentum since its initial Beta availability (Sep’14). Numerous customers have deployed and using Drill in production and they are finding Drill immensely valuable in their environments with a variety of use cases such as Data exploration, Adhoc queries/BI on Hadoop Data lake and JSON data analytics.
The generally available version of Drill, 1.0 was released in May’15 followed by Drill 1.1 in early July’15. Each of these releases add significant new features to Drill’s interactive self service data exploration and adhoc SQL query capabilities and making it enterprise ready in terms of scale and manageability. Drill 1.2 extends on the foundation and raises the bar with advanced SQL support, deeper Hive integration and performance enhancements. Drill 1.2 includes over ~250 bug fixes and several new enhancements including the following.
- New ANSI SQL Analytic/Window functions – Lead/Lag, First_Value/Last_Value, NTile
- Optimized read capabilities on Hive tables
- Support for multiple Hive versions
- Metadata caching to improve query performance on large # parquet files
- Improved rowkey pushdown to HBase/MapR-DB tables
- Drill web UI security
- Drop table command
- Memory handling improvements
In this blog post, I would like to briefly introduce the new analytics capabilities added to Drill namely ANSI SQL compliant Analytic and Window functions and how to get started with these. SQL window functions in Drill include support for PARTITION BY and OVER clauses, variety of aggregated window functions for Sum, Max, Min, Count, Avg and analytic functions such as First_Value, Last_Value, Lead, Lag, NTile, Row_Number, Rank. Window functions are highly versatile and let users cut down on the joins, subqueries, explicit cursors that need to be written and fit naturally to solve a variety of use cases without a lot of coding effort.
In my previous posts Turn raw data to real insights and working with highly dynamic datasets , I have demonstrated various query capabilities in Drill using a sample business reviews demo dataset by Yelp. This post continues to leverage the same dataset to showcase the analytic/window functions.
First, let us start Drill in embedded mode (can use distributed mode as well)
NRentachintala-MAC:bin nrentachintala$ ./drill-embedded Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0 Oct 19, 2015 9:20:03 AM org.glassfish.jersey.server.ApplicationHandler initialize INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26... apache drill 1.2.0 "a drill in the hand is better than two in the bush"
List the available schemas in Drill.
0: jdbc:drill:zk=local> show schemas; +---------------------+ | SCHEMA_NAME | +---------------------+ | INFORMATION_SCHEMA | | cp.default | | dfs.default | | dfs.root | | dfs.tmp | | dfs.yelp | | sys | +---------------------+ 7 rows selected (1.755 seconds)
Switch to using the workspace in which Yelp data is loaded.
0: jdbc:drill:zk=local> use dfs.yelp; +-------+---------------------------------------+ | ok | summary | +-------+---------------------------------------+ | true | Default schema changed to [dfs.yelp] | +-------+---------------------------------------+ 1 row selected (0.129 seconds)
Let us start with exploring one of the datasets available in Yelp dataset – the business information
0: jdbc:drill:zk=local> select * from `business.json` limit 1; +-------------+--------------+-------+------+------------+------+--------------+------+-----------+-------+-------+----------+------------+------+---------------+ | business_id | full_address | hours | open | categories | city | review_count | name | longitude | state | stars | latitude | attributes | type | neighborhoods | +-------------+--------------+-------+------+------------+------+--------------+------+-----------+-------+-------+----------+------------+------+---------------+ | vcNAWiLM4dR7D2nwwJ7nCA | 4840 E Indian School Rd Ste 101 Phoenix, AZ 85018 | {"Tuesday":{"close":"17:00","open":"08:00"},"Friday":{"close":"17:00","open":"08:00"},"Monday":{"close":"17:00","open":"08:00"},"Wednesday":{"close":"17:00","open":"08:00"},"Thursday":{"close":"17:00","open":"08:00"},"Sunday":{},"Saturday":{}} | true | ["Doctors","Health & Medical"] | Phoenix | 7 | Eric Goldberg, MD | -111.983758 | AZ | 3.5 | 33.499313 | {"By Appointment Only":true,"Good For":{},"Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} | business | [] | +-------------+--------------+-------+------+------------+------+--------------+------+-----------+-------+-------+----------+------------+------+---------------+ 1 row selected (0.514 seconds)
Now let us examine usage of few Drill window functions.
First , simply get the top Yelp businesses based on # reviews in each city along with row number.
0: jdbc:drill:zk=local> SELECT name, city, review_count,row_number() . . . . . . . . . . . > OVER (PARTITION BY city ORDER BY review_count DESC) as rownum . . . . . . . . . . . > FROM `business.json` limit 15; +----------------------------------------+------------+---------------+---------+ | name | city | review_count | rownum | +----------------------------------------+------------+---------------+---------+ | Cupz N' Crepes | Ahwatukee | 124 | 1 | | My Wine Cellar | Ahwatukee | 98 | 2 | | Kathy's Alterations | Ahwatukee | 12 | 3 | | McDonald's | Ahwatukee | 7 | 4 | | U-Haul | Ahwatukee | 5 | 5 | | Hi-Health | Ahwatukee | 4 | 6 | | Healthy and Clean Living Environments | Ahwatukee | 4 | 7 | | Active Kids Pediatrics | Ahwatukee | 4 | 8 | | Roberto's Authentic Mexican Food | Anthem | 117 | 1 | | Q to U BBQ | Anthem | 74 | 2 | | Outlets At Anthem | Anthem | 64 | 3 | | Dara Thai | Anthem | 56 | 4 | | Cafe Provence | Anthem | 53 | 5 | | Shanghai Club | Anthem | 50 | 6 | | Two Brothers Kitchen | Anthem | 43 | 7 | +----------------------------------------+------------+---------------+---------+ 15 rows selected (0.67 seconds)
Check how the # reviews for each business compared to the average # reviews across all business in the city.
0: jdbc:drill:zk=local> SELECT name, city,review_count, . . . . . . . . . . . > Avg(review_count) OVER (PARTITION BY City) AS city_reviews_avg . . . . . . . . . . . > FROM `business.json`limit 15; +----------------------------------------+------------+---------------+---------------------+ | name | city | review_count | city_reviews_avg | +----------------------------------------+------------+---------------+---------------------+ | Hi-Health | Ahwatukee | 4 | 32.25 | | My Wine Cellar | Ahwatukee | 98 | 32.25 | | U-Haul | Ahwatukee | 5 | 32.25 | | Cupz N' Crepes | Ahwatukee | 124 | 32.25 | | McDonald's | Ahwatukee | 7 | 32.25 | | Kathy's Alterations | Ahwatukee | 12 | 32.25 | | Healthy and Clean Living Environments | Ahwatukee | 4 | 32.25 | | Active Kids Pediatrics | Ahwatukee | 4 | 32.25 | | Anthem Community Center | Anthem | 4 | 14.492063492063492 | | Scrapbooks To Remember | Anthem | 4 | 14.492063492063492 | | Hungry Howie's Pizza | Anthem | 7 | 14.492063492063492 | | Pinata Nueva | Anthem | 3 | 14.492063492063492 | | Starbucks Coffee Company | Anthem | 13 | 14.492063492063492 | | Pizza Hut | Anthem | 6 | 14.492063492063492 | | Rays Pizza | Anthem | 19 | 14.492063492063492 | +----------------------------------------+------------+---------------+---------------------+ 15 rows selected (0.395 seconds)
Check how the #reviews for each business contribute to the total # of reviews for all businesses in the city.
0: jdbc:drill:zk=local> SELECT name, city,review_count, . . . . . . . . . . . > Sum(review_count) OVER (PARTITION BY City) AS city_reviews_sum . . . . . . . . . . . > FROM `business.json`limit 15; +----------------------------------------+------------+---------------+-------------------+ | name | city | review_count | city_reviews_sum | +----------------------------------------+------------+---------------+-------------------+ | Hi-Health | Ahwatukee | 4 | 258 | | My Wine Cellar | Ahwatukee | 98 | 258 | | U-Haul | Ahwatukee | 5 | 258 | | Cupz N' Crepes | Ahwatukee | 124 | 258 | | McDonald's | Ahwatukee | 7 | 258 | | Kathy's Alterations | Ahwatukee | 12 | 258 | | Healthy and Clean Living Environments | Ahwatukee | 4 | 258 | | Active Kids Pediatrics | Ahwatukee | 4 | 258 | | Anthem Community Center | Anthem | 4 | 913 | | Scrapbooks To Remember | Anthem | 4 | 913 | | Hungry Howie's Pizza | Anthem | 7 | 913 | | Pinata Nueva | Anthem | 3 | 913 | | Starbucks Coffee Company | Anthem | 13 | 913 | | Pizza Hut | Anthem | 6 | 913 | | Rays Pizza | Anthem | 19 | 913 | +----------------------------------------+------------+---------------+-------------------+ 15 rows selected (0.543 seconds)
Now. Lets try slightly complex query. List Top 10 cities and their highest ranked businesses in terms of #reviews. Drill window functions such as rank, dense_rank can be used in these queries.
. . . . . . . . . . . > WITH X . . . . . . . . . . . > AS . . . . . . . . . . . > (SELECT name, city, review_count, . . . . . . . . . . . > RANK() . . . . . . . . . . . > OVER (PARTITION BY city . . . . . . . . . . . > ORDER BY review_count DESC) AS review_rank . . . . . . . . . . . > FROM `business.json`) . . . . . . . . . . . > SELECT X.name, X.city, X.review_count . . . . . . . . . . . > FROM X . . . . . . . . . . . > WHERE X.review_rank =1 ORDER BY review_count DESC LIMIT 10; +-------------------------------------------+-------------+---------------+ | name | city | review_count | +-------------------------------------------+-------------+---------------+ | Mon Ami Gabi | Las Vegas | 4084 | | Studio B | Henderson | 1336 | | Phoenix Sky Harbor International Airport | Phoenix | 1325 | | Four Peaks Brewing Co | Tempe | 1110 | | The Mission | Scottsdale | 783 | | Joe's Farm Grill | Gilbert | 770 | | The Old Fashioned | Madison | 619 | | Cornish Pasty Company | Mesa | 578 | | SanTan Brewing Company | Chandler | 469 | | Yard House | Glendale | 321 | +-------------------------------------------+-------------+---------------+ 10 rows selected (0.49 seconds)
Compare #reviews for each business with the top and bottom review counts in the city.
0: jdbc:drill:zk=local> SELECT name, city, review_count, . . . . . . . . . . . > FIRST_VALUE(review_count) . . . . . . . . . . . > OVER(PARTITION BY city ORDER BY review_count DESC) AS top_review_count, . . . . . . . . . . . > LAST_VALUE(review_count) . . . . . . . . . . . > OVER(PARTITION BY city ORDER BY review_cout DESC) AS bottom_review_count . . . . . . . . . . . > FROM `business.json` limit 15; +----------------------------------------+------------+---------------+-------------------+----------------------+ | name | city | review_count | top_review_count | bottom_review_count | +----------------------------------------+------------+---------------+-------------------+----------------------+ | My Wine Cellar | Ahwatukee | 98 | 124 | 12 | | McDonald's | Ahwatukee | 7 | 124 | 12 | | U-Haul | Ahwatukee | 5 | 124 | 12 | | Hi-Health | Ahwatukee | 4 | 124 | 12 | | Healthy and Clean Living Environments | Ahwatukee | 4 | 124 | 12 | | Active Kids Pediatrics | Ahwatukee | 4 | 124 | 12 | | Cupz N' Crepes | Ahwatukee | 124 | 124 | 12 | | Kathy's Alterations | Ahwatukee | 12 | 124 | 12 | | Q to U BBQ | Anthem | 74 | 117 | 117 | | Dara Thai | Anthem | 56 | 117 | 117 | | Cafe Provence | Anthem | 53 | 117 | 117 | | Shanghai Club | Anthem | 50 | 117 | 117 | | Two Brothers Kitchen | Anthem | 43 | 117 | 117 | | The Tennessee Grill | Anthem | 32 | 117 | 117 | | Dollyrockers Boutique and Salon | Anthem | 30 | 117 | 117 | +----------------------------------------+------------+---------------+-------------------+----------------------+ 15 rows selected (0.516 seconds)
Compare #reviews with the #reviews for the previous and following businesses
0: jdbc:drill:zk=local> SELECT city, review_count, name, . . . . . . . . . . . > LAG(review_count, 1) OVER(PARTITION BY city ORDER BY review_count DESC) . . . . . . . . . . . > AS preceding_count, . . . . . . . . . . . > LEAD(review_count, 1) OVER(PARTITION BY city ORDER BY review_count DESC) . . . . . . . . . . . > AS following_count . . . . . . . . . . . > FROM `business.json` limit 15; +------------+---------------+----------------------------------------+------------------+------------------+ | city | review_count | name | preceding_count | following_count | +------------+---------------+----------------------------------------+------------------+------------------+ | Ahwatukee | 124 | Cupz N' Crepes | null | 98 | | Ahwatukee | 98 | My Wine Cellar | 124 | 12 | | Ahwatukee | 12 | Kathy's Alterations | 98 | 7 | | Ahwatukee | 7 | McDonald's | 12 | 5 | | Ahwatukee | 5 | U-Haul | 7 | 4 | | Ahwatukee | 4 | Hi-Health | 5 | 4 | | Ahwatukee | 4 | Healthy and Clean Living Environments | 4 | 4 | | Ahwatukee | 4 | Active Kids Pediatrics | 4 | null | | Anthem | 117 | Roberto's Authentic Mexican Food | null | 74 | | Anthem | 74 | Q to U BBQ | 117 | 64 | | Anthem | 64 | Outlets At Anthem | 74 | 56 | | Anthem | 56 | Dara Thai | 64 | 53 | | Anthem | 53 | Cafe Provence | 56 | 50 | | Anthem | 50 | Shanghai Club | 53 | 43 | | Anthem | 43 | Two Brothers Kitchen | 50 | 32 | +------------+---------------+----------------------------------------+------------------+------------------+ 15 rows selected (0.518 seconds)
More details and documentation on Window functions and other Drill 1.2 features is available at Drill docs and on MapR docs. Congratulations to the Drill community on another key milestone and look forward for more.
Reference: | Leveraging new features in Drill 1.2 – ANSI SQL Analytic/Window functions from our JCG partner Chase Hooley at the Mapr blog. |