Apache Spark is now a top-level project
The Apache Software Foundation (ASF) happily announced that Apache Spark has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying the project’s stability.
Apache Spark is an open source cluster computing framework for fast and flexible large-scale data analysis. Spark has been the talk of the Big Data town for a while, and 2014 was predicted to be the year of Spark.
According to the Spark website’s home page, the engine runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. This is why Cloudera has integrated it into its Hadoop distribution, CDH (Cloudera Distribution including Apache Hadoop). Spark’s success rests not only on its speed as an engine, but also on its rapid evolution since it entered the Apache Incubator in June 2013, with contributions from more than 120 developers at 25 organizations.
Spark’s creators from the University of California, Berkeley, have founded a company called Databricks to commercialize the technology. According to Ion Stoica, CEO at Databricks and Professor at UC Berkeley, the Spark project has made it much easier for organizations to get insights from big data. An open source community has now formed around the project, which can help accelerate the development and adoption of Apache Spark.
One of Spark’s features, according to the article “Apache Spark becomes top-level project,” is that it can run on Hadoop 2.0 YARN. Its companion project, Shark, implements a SQL-on-Hadoop engine that is syntax-compatible with Apache Hive and claims the same 10x/100x performance gains over Hive that Spark claims over raw MapReduce.
Another feature of Spark is that it allows developers to write applications in Java, Python, or Scala. Integrated with Apache Hadoop, Spark is well suited for machine learning, interactive queries, and stream processing, and can read from HDFS, HBase, Cassandra, as well as any Hadoop data source.
Yahoo has congratulated Spark on becoming an Apache top-level project, via Andrew Feng, Distinguished Architect at Yahoo. Feng explained how Yahoo has helped evolve Hadoop and related big-data technologies, including Spark. Yahoo has made significant contributions to the development of Spark, since Apache Hadoop is the foundation of Yahoo’s big-data platform.
Apache Spark software is released under the Apache License v2.0, and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project’s day-to-day operations, including community development and product releases. Documentation and ways to become involved with Apache Spark are offered here.
As for MapReduce, Spark seems set to take the reins as the primary processing framework for new Hadoop workloads as MapReduce fades. Spark appears well suited for next-generation big data applications that require lower-latency queries, real-time processing, or iterative computations on the same data. Spark is technically a standalone project, but it was always designed to work with the Hadoop Distributed File System.
However, there’s still a lot of tooling for MapReduce that Spark doesn’t have yet (e.g., Pig and Cascading), and MapReduce is still quite good for certain batch jobs. Cloudera co-founder and Chief Strategy Officer Mike Olson explained that there are a lot of legacy MapReduce workloads that aren’t going anywhere anytime soon even as Spark takes off.
At the Structure Data conference on March 19-20 in New York, Ion Stoica will be speaking as part of the Structure Data Awards presentation, and the CEOs of Cloudera, Hortonworks, and Pivotal will talk about the future of big data platforms and how they plan to capitalize on it.