Apache Arrow – New level of Performance and Interoperability for Big Data Analytics

Neeraja Rentachintala, February 22nd, 2016 (Last Updated: February 21st, 2016)


Today we at MapR would like to congratulate Apache Arrow, a cross-system data layer to speed up big data analytics and a brand-new addition to the Apache open source software community, on its announcement as a Top-Level Project.

Scalable, high-performance analytics are top of mind for customers who want to fully realize the business value of big data and Hadoop, beyond using it purely as an infrastructure optimization with cheaper storage and/or as a batch processing system. There has been a tremendous amount of innovation in the open source community over the past couple of years to enable analytics across all layers of the stack, including columnar storage formats (such as Parquet and ORC), in-memory processing layers (such as Drill, Spark, Impala, and Storm), and powerful language APIs (such as Python and R). Arrow is the latest addition to the stack and represents a new type of memory-based data interchange format for use across these systems, applications, and programming languages.

Efficient columnar representation of data, both on disk and in memory, together with columnar processing, is a key technique for achieving performance in analytics workloads. Specifically, columnar processing allows systems to process data at full hardware speed by taking advantage of modern CPU characteristics through vectorized operations and SIMD instructions. Apache Drill is one of the first big data query engines to be columnar both on disk and in memory. The Arrow format has its roots in the in-memory data representation originally developed as part of the Apache Drill project, called Value Vectors. More details are available in the Apache Drill documentation on Value Vectors. With more systems and applications moving towards columnar processing, the Drill data representation has evolved into Apache Arrow to aid such processing across systems. In addition to being a high-performance columnar data format, Arrow can also represent hierarchical and dynamically evolving datasets, which gives it the flexibility needed to handle the variety of big data types being generated by IoT and other modern applications.
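To make the row-versus-column distinction concrete, here is a minimal sketch using only the Python standard library. It is not the Arrow or Drill API, and the field names (`sensor_id`, `temperature`) and sample values are hypothetical; it only illustrates the general idea of storing each field in its own contiguous, fixed-width buffer so that an aggregation scans one buffer instead of walking every record.

```python
from array import array

# Hypothetical row-oriented records: one dict per record.
rows = [
    {"sensor_id": 1, "temperature": 20.5},
    {"sensor_id": 2, "temperature": 22.0},
    {"sensor_id": 3, "temperature": 19.5},
]


def to_columns(records):
    """Pivot row-oriented records into one contiguous typed buffer per field."""
    return {
        # "q" = signed 64-bit integers, "d" = 64-bit floats.
        "sensor_id": array("q", (r["sensor_id"] for r in records)),
        "temperature": array("d", (r["temperature"] for r in records)),
    }


def mean(column):
    """Aggregate a single column by scanning one contiguous buffer."""
    return sum(column) / len(column)


columns = to_columns(rows)
mean_temperature = mean(columns["temperature"])
```

Because every value in a column has the same type and width, a real engine can hand such a buffer to vectorized or SIMD code paths; the pure-Python loop here shows only the layout, not that speedup.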

We are very excited about the potential and possibilities that Apache Arrow can bring to the big data ecosystem. With Apache Arrow as a standard data interchange format, new levels of interoperability open up between the various big data analytics systems and applications. Instead of spending large numbers of CPU cycles serializing and deserializing data to convert it between formats, systems and processes that use a standard format can share data seamlessly. This means customers can deploy these systems in conjunction with each other without incurring that conversion overhead.
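The difference between converting data and sharing it in a common layout can be sketched, under simplifying assumptions, with the Python standard library. This example stays inside a single process and uses `pickle` as a stand-in for any serialization format and `memoryview` as a stand-in for a shared buffer; it does not use Arrow itself, and the variable names are illustrative only.

```python
import pickle
from array import array

# A column of 64-bit floats held by a hypothetical "producer".
values = array("d", [1.0, 2.0, 3.0, 4.0])

# Approach 1: convert between formats.
# The producer serializes and the consumer deserializes, producing a full copy.
payload = pickle.dumps(values)
copied = pickle.loads(payload)

# Approach 2: agree on the layout and share the buffer.
# The consumer reads the producer's memory directly; no bytes are copied.
shared = memoryview(values)

# An update by the producer is visible through the shared view,
# but not in the deserialized copy.
values[0] = 10.0
copy_sees_update = copied[0] == 10.0
view_sees_update = shared[0] == 10.0
total = sum(shared)
```

The shared view costs nothing to create regardless of the column's size, whereas the serialize and deserialize steps scale with the amount of data. A standard in-memory format aims to bring that property to data exchanged between different systems and languages.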

Apache Arrow is a very important initiative for MapR and we look forward to collaborating with the broader big data ecosystem projects towards its next phase of evolution.