What is Spark?
Spark is a fast, general-purpose cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. A recent survey indicated that 88% of Spark users also use Scala as their language of choice. The two work well together because Spark itself is written in Scala. Apache Spark is also the number one open-source project within the Big Data ecosystem, and with over 1,000 contributors, the technology is improving at an accelerating pace.
47 Degrees provides
Why use Spark?
Apache Spark is quickly assuming its place as the premier technology for handling big data. It features:
- Integrated platform: Spark provides a comprehensive set of integrated data tools, with new tools added almost daily.
- Extensible platform: Designed for extensibility, making it simple to add new components.
- Near real-time: Spark Streaming processes data in mini-batches, yielding scalable, high-throughput, near real-time processing of data streams.
- Speed: Spark includes query optimizers that can optimize across multiple processing steps.
- Clean, rich API: The Spark API supports iterative computation and lends itself to concise code.
- Scalable: Designed to run on large, scalable clusters of nodes.
- Reliable: Recovers from failures using checkpointing or re-computation.
- Interconnectivity: Spark connects to a wide variety of data sources and sinks, including most databases and messaging systems.
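To make the mini-batch idea behind the near real-time feature more concrete, here is a minimal conceptual sketch in plain Python. It does not use the Spark Streaming API. The function names, the sample events, and the one-second batch interval are all hypothetical and chosen only for illustration. The sketch assumes events arrive ordered by timestamp, groups them into fixed-width time windows, and runs the same small job (a word count) on each window.

```python
from collections import Counter
from itertools import groupby


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-width time windows.

    Assumes events are already ordered by timestamp.
    Yields (batch_start_time, [values]) for each non-empty window.
    """
    def window_index(event):
        return int(event[0] // batch_interval)

    for window, group in groupby(events, key=window_index):
        yield window * batch_interval, [value for _, value in group]


def word_counts_per_batch(events, batch_interval):
    """Run the same small batch job (a word count) on every mini-batch."""
    return [
        (start, Counter(word for line in lines for word in line.split()))
        for start, lines in micro_batches(events, batch_interval)
    ]


# Hypothetical stream: (arrival time in seconds, text line)
events = [
    (0.2, "spark streaming"),
    (0.9, "spark"),
    (1.1, "mini batches"),
    (2.5, "spark batches"),
]

results = word_counts_per_batch(events, batch_interval=1.0)
for start, counts in results:
    print(start, dict(counts))
```

Each window is handled as an ordinary small batch job, which is why this style of processing is described as near real-time rather than strictly real-time: results for an event become available only once its batch interval has been processed.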