A look into the future of Apache Spark 2.0
What’s new in Apache Spark 2.0
During the opening keynote at Spark Summit West, Databricks Co-Founder, CTO, and Spark creator Matei Zaharia discussed the future of Apache Spark 2.0.
Zaharia addressed a sold-out crowd that included the vast majority of the 2,500 attendees and hyped the upcoming release, which stands to be the largest milestone yet for the open-source processing engine.
The release will remain highly compatible with Apache Spark 1.X and includes over 2,000 patches from 280 contributors.
- Structured API improvements (DataFrame, Dataset, SparkSession)
- Structured streaming
- MLlib model export
- MLlib R bindings
- SQL 2003 support
- Scala 2.12 support
- Deep learning libraries
- PyData integration
- Reactive streams
- C# bindings: Mobius
- JS bindings: Eclair JS
- Whole-stage code generation which can fuse across multiple operators
- Optimized input/output
- Structured streaming
- Built on DataFrames; works with event time, windowing, sessions, sources, and sinks
- Supports interactive and batch queries
- Aggregate data in a stream then serve using JDBC
- Change queries at runtime
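To make the "whole-stage code generation" item above a little more concrete, here is a minimal sketch in plain Python, not Spark code. It assumes a toy query (scan, filter even numbers, multiply by 10, sum) and compares an operator-at-a-time pipeline with a hand-fused single loop. All function names are hypothetical and chosen for illustration; Spark generates its fused code internally, and this only mimics the idea.

```python
from typing import Callable, Iterable, Iterator


def scan(rows: Iterable[int]) -> Iterator[int]:
    """Scan operator: hands rows to the next operator one at a time."""
    for row in rows:
        yield row


def filter_op(rows: Iterator[int], pred: Callable[[int], bool]) -> Iterator[int]:
    """Filter operator: passes along only the rows that satisfy pred."""
    for row in rows:
        if pred(row):
            yield row


def project_op(rows: Iterator[int], fn: Callable[[int], int]) -> Iterator[int]:
    """Projection operator: transforms each row with fn."""
    for row in rows:
        yield fn(row)


def sum_operator_at_a_time(data: Iterable[int]) -> int:
    # Every row is handed from one operator object to the next.
    pipeline = project_op(
        filter_op(scan(data), lambda x: x % 2 == 0),
        lambda x: x * 10,
    )
    return sum(pipeline)


def sum_fused(data: Iterable[int]) -> int:
    # Hand-written stand-in for a fused stage: the same three operators
    # collapsed into a single loop with no per-operator hand-off.
    total = 0
    for x in data:
        if x % 2 == 0:
            total += x * 10
    return total


data = list(range(10))
assert sum_operator_at_a_time(data) == sum_fused(data)
```

Both functions compute the same answer; the fused version simply does less bookkeeping per row, which is the kind of saving that fusing across operators is after.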
His presentation was followed by a live demo from Michael Armbrust. It was an impressive showing of the power of Spark using Databricks CE. He went through an analysis of tweets about the current political election, sorting presidential candidates and popular related phrases (cough, cough… emails, bern, and small hands). You can view a sample and use the code to collect the tweets for yourself here: Tweets of the 2016 Election
Apache Spark 2.0 is set to release later this month, but you can view the unstable version here: Apache Spark 2.0
Databricks free community edition now generally available
During the keynote, the Community Edition of Databricks was released to the general public as part of a new series of initiatives from Databricks. “The largest challenge in applying big data is the skills gap,” Zaharia says.
The beta version of the program launched in February during Spark Summit East in New York. According to Databricks, the initial program attracted over 8,000 users, with the top 10% of active users averaging six hours a week and 10,000 commands executed. These numbers are expected to skyrocket with this latest release.
Databricks CE users will have access to a 6GB micro-cluster as well as a cluster manager and the notebook environment to prototype simple applications, according to the release.
You can sign up for the GA release here: Databricks CE
Databricks teams up with UC Berkeley for MOOCs
Databricks, in conjunction with UC Berkeley, has launched five MOOCs focusing on Spark learning. The classes are free on edX and will all be taught on Databricks CE:
- Introduction to Apache Spark: Learn the fundamentals and architecture of Apache Spark, the leading cluster-computing framework among professionals. (Starts June 15th)
- Distributed Machine Learning with Apache Spark: Learn the underlying principles required to develop scalable machine learning pipelines and gain hands-on experience using Apache Spark. (Starts July 6th)
- Big Data Analysis with Apache Spark: Learn how to apply data science techniques using parallel programming in Apache Spark to explore big data. (Starts August 10th)
- Advanced Apache Spark for Data Science and Engineering: Learn common Apache Spark use cases and take a deeper dive into Apache Spark’s architecture. (Starts September 21st)
- Advanced Machine Learning with Apache Spark: Learn how to develop and deploy distributed machine learning pipelines and gain the expertise to write efficient, scalable code in Apache Spark. (Starts November 2nd)
Find out more about the classes here: Data Science and Engineering with Apache Spark.
Spark Summit Day One in Summary
There were plenty of in-depth talks on the first day of Spark Summit West. Many of them shared a common theme of unification. A unified model for both batch and streaming, for example, will make it easier to write batch code and switch to streaming with only a few simple changes.
Some additional takeaway notes:
- dataframes/datasets everywhere
- query optimizer knows more, resulting in much better performance
- provides similar performance across all languages, such as Scala, Python, R, etc.
- streaming now has a model for cleanly handling event time, which, due to delays, deviates from processing time
- there’s incredible growth in the ecosystem, as well as in the size and interest of the community
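The event-time point and the unified batch/streaming model can both be sketched in a few lines of plain Python. This is a conceptual illustration only, not the Spark API: the `Event` type, the 10-second tumbling windows, and the sample timestamps are all assumptions made up for the example. One event (`event_time=9`) arrives late, at `processing_time=21`.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class Event(NamedTuple):
    event_time: int       # seconds: when the event happened at the source
    processing_time: int  # seconds: when the system received it


def window_start(timestamp: int, width: int) -> int:
    """Start of the tumbling window of the given width containing timestamp."""
    return timestamp - timestamp % width


def count_by_event_time(
    events: Iterable[Event], width: int, counts: Optional[Counter] = None
) -> Counter:
    """Count events per event-time window, optionally updating earlier counts."""
    counts = Counter() if counts is None else counts
    for event in events:
        counts[window_start(event.event_time, width)] += 1
    return counts


# Events listed in arrival order; Event(9, 21) was delayed by 12 seconds.
events = [Event(1, 2), Event(5, 6), Event(12, 13), Event(9, 21), Event(25, 26)]

# Batch: one pass over the whole data set.
batch_counts = count_by_event_time(events, width=10)

# Streaming: the same function applied to micro-batches as they arrive.
streaming_counts = Counter()
for micro_batch in (events[:2], events[2:3], events[3:]):
    count_by_event_time(micro_batch, width=10, counts=streaming_counts)

# Grouping by arrival time instead puts the late event in the wrong window.
processing_counts = Counter(window_start(e.processing_time, 10) for e in events)

print(dict(batch_counts))       # {0: 3, 10: 1, 20: 1}
print(dict(processing_counts))  # {0: 2, 10: 1, 20: 2}
assert streaming_counts == batch_counts
```

Grouping by event time gives the same answer whether the data is processed all at once or incrementally, while grouping by processing time misplaces the delayed event. A real engine also has to decide how long to wait for late data, which this sketch ignores.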
Why 47 Degrees for Spark:
47 Degrees is a Databricks certified systems integrator with a proven track record of successfully deploying Spark-based programs to handle real-time streaming data, and helping our clients to do the same.
If you’d like a more detailed talk about a project, consulting needs, or training, you can schedule a meeting by contacting us at [email protected]. Follow us on Twitter for more news from #SparkSummit.