
Spark On: Let's Code! (Part 3)

This is the final post in our series: Spark On: Let’s Code! While our mini-project has been somewhat focused on the DevOps side, it also carries an important lesson for developers: until our code has gone through a complete deployment cycle onto the infrastructure it will actually run on, we can’t be sure it is fully valid. Therefore, we’re dedicating this last post to discussing how to package our application as a distributable Docker container and how to deploy it in tandem with all the services it needs.

With this in mind, here are the main things we will discuss:

  • The deployment pipeline.
  • Dockerizing our application.
  • Deploying the whole infrastructure with Docker Compose.
  • Deploying the cluster and running the application.

## Deployment Pipeline

We’ve set up our deployment pipeline in four steps; the related bash script can be found here.

Let’s take a look at these steps (a simplified sketch of the script follows the list):

  • Build our application: in the real world, this step would be handled by a Continuous Integration tool such as Travis or Jenkins. Since that’s out of the scope of this example, we’ve simply included it in our deploy.sh script.
  • Spin up all the services, including our app.
  • Execute some additional configuration commands to complete the cluster setup; for instance, creating the checkpoint directory in the Hadoop cluster used by Spark Streaming.
  • Scale out the Spark Worker nodes so that at least two instances are running.
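
Putting these steps together, here is a simplified sketch of what the deploy script does, assembled from the commands shown throughout this post; the actual deploy.sh in the repository may differ in its details:

#!/bin/bash
set -e

# 1. Build the application and its Docker image (normally a CI job)
sbt ";project api;docker"

# 2. Spin up all the services defined in docker-compose.yml, including our app
docker-compose up -d

# 3. Additional configuration: create the Spark Streaming checkpoint directory in HDFS
docker exec -t namenode /usr/local/hadoop/bin/hadoop fs -mkdir /checkpoint

# 4. Scale out the Spark Worker nodes so that at least two instances are running
docker-compose scale spark_worker=2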

## Dockerizing our Application

To dockerize our Akka HTTP/Spark application, we will use sbt-docker, so the Docker image can be built as part of the continuous integration pipeline; for example, each time a new Pull Request is merged into the master branch.

The code:

lazy val dockerSettings = Seq(
  docker <<= docker dependsOn assembly,
  imageNames in docker := Seq(ImageName("47deg/sparkon")),
  dockerfile in docker := {
    val workingDir = s"/opt/sparkOn"
    val artifact = (assemblyOutputPath in assembly).value

    val artifactTargetPath = s"/opt/sparkOn/${artifact.name}"
    val sparkPath = "/usr/local/spark/assembly/target/scala-2.11/spark-assembly-1.5.1-hadoop2.4.0.jar"

    val mainclass = mainClass.in(Compile, packageBin).value.getOrElse(sys.error("Expected exactly one main class"))
    val classpathString = s"$sparkPath:$artifactTargetPath"

    new Dockerfile {
      // Base image
      from("47deg/spark:1.5.1")
      // Maintainer
      maintainer("47 Degrees", "[email protected]")

      // Set working directory
      workDir(workingDir)

      // Add the JAR file
      add(artifact, artifactTargetPath)

      // Run the application with GC logging and memory settings
      cmdRaw(s"java " +
        s"-verbose:gc " +
        s"-XX:+PrintGCDetails " +
        s"-XX:+PrintGCTimeStamps " +
        s"-Xmx2G " +
        s"-XX:MaxPermSize=1G -cp $classpathString $mainclass")
    }
  }
)

With docker dependsOn assembly we set up an sbt task dependency on assembly, so a fat JAR of our project, with all of its dependencies, is created before the Docker build starts.

This Docker build is based on the 47deg/spark:1.5.1 Docker image, which we know contains the Spark assembly. This may sound like a trick, but it saves us from having to publish the Spark JAR to an HDFS or AWS location for our example. We should mention that 47deg/spark:1.5.1 is a custom Spark build compiled for Scala 2.11, given that the officially distributed Spark artifacts are currently compiled against Scala 2.10.

Therefore, we can build our Spark On Docker Image by running this sbt command:

sbt ";project api;docker"

## Docker Compose to deploy our Infrastructure

Docker Compose is a tool for defining and running multi-container applications with Docker. That’s precisely what we want to do: define the multi-container cluster in a single descriptor file. With this in place, we can spin our whole application up with a single command:

docker-compose up -d

Let’s look at the docker-compose.yml file in chunks in the following sections:

### Spark Docker Containers

We will set up a Standalone Cluster which will need at least two docker containers:

spark_master:
  image: 47deg/spark:1.5.1
  ports:
  - "7077:7077"
  - "8080:8080"
  container_name: spark_master
  tty: true
  command: /start-master.sh
spark_worker:
  image: 47deg/spark:1.5.1
  links:
    - spark_master
  command: /start-worker.sh
Through the links entry, we are linking the Spark Worker to the Spark Master container.

Furthermore, we can scale out the Spark Worker nodes of this cluster by running this command:

docker-compose scale spark_worker=5

After that, we will have five Spark Worker nodes available. We can also scale down with a similar command, like this:

docker-compose scale spark_worker=3
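
A quick way to verify that the scaling worked is to list the running containers and open the Spark Master web UI, which the compose file above maps to port 8080 on the Docker host (we assume the docker-machine default IP 192.168.99.100, as used later in this post):

# run from the scripts directory where docker-compose.yml lives
docker-compose ps

# the Spark Master web UI, with the list of registered workers, is at:
# http://192.168.99.100:8080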

### Cassandra and DataStax OpsCenter Docker Containers

As you’re probably aware, a Cassandra cluster always needs at least one seed node. Keeping that in mind, this is the corresponding chunk of the docker-compose descriptor:

opscenter:
  image: 47deg/opscenter
  ports:
  - "8888:8888"
  container_name: opscenter
cassandra_seed:
  image: 47deg/cassandra
  ports:
  - "9042:9042"
  links:
    - opscenter
  container_name: cassandra_seed
  environment:
    - OPS_IP=opscenter
cassandra_slave:
  image: 47deg/cassandra
  links:
    - opscenter
    - cassandra_seed
  environment:
    - OPS_IP=opscenter
    - SEED=cassandra_seed
With this setup, we are setting an environment variable within the cassandra_seed container that contains the opscenter hostname.

During deployment, we need to set up three new docker nodes:

  • Opscenter - to visualize the Cassandra Cluster.
  • One Cassandra seed node.
  • One Cassandra slave node.

As we mentioned earlier, we can scale out the number of available Cassandra nodes in our infrastructure in this manner:

docker-compose scale cassandra_slave=3
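
To check that all the Cassandra nodes have joined the ring, we can run nodetool inside the seed container (assuming nodetool is on the PATH in the 47deg/cassandra image):

docker exec -t cassandra_seed nodetool status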

### Hadoop and YARN Cluster

To deploy the Hadoop cluster, we based our work on the popular Hadoop Docker image from SequenceIQ, which is now part of Hortonworks.

namenode:
  image: 47deg/yarn-cluster
  working_dir: /usr/local/hadoop
  ports:
  - "8088:8088"
  - "50070:50070"
  - "50075:50075"
  container_name: namenode
  command: bash -c "/etc/bootstrap.sh -d -namenode"
datanode:
  image: 47deg/yarn-cluster
  working_dir: /usr/local/hadoop
  links:
    - namenode
  command: /etc/bootstrap.sh -d -datanode

With everything running, we can create our Spark Checkpoint HDFS location as follows:

docker exec -t namenode /usr/local/hadoop/bin/hadoop fs -mkdir /checkpoint
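
We can then verify that the directory exists by listing the HDFS root:

docker exec -t namenode /usr/local/hadoop/bin/hadoop fs -ls /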

### Kafka Cluster

As we discussed in the previous Spark On! entry, we’re using Kafka to feed the web socket with our chosen track words. Therefore, we will automatically deploy three Kafka nodes and a single ZooKeeper coordinator node in our example.

zookeeper:
  image: 47deg/zookeeper
  ports:
    - "2181:2181"
kafka_1:
  image: 47deg/kafka
  ports:
    - "9092"
  links:
    - zookeeper:zk
  environment:
    KAFKA_ADVERTISED_HOST_NAME: 192.168.99.100
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
kafka_2:
  image: 47deg/kafka
  ports:
    - "9092"
  links:
    - zookeeper:zk
  environment:
    KAFKA_ADVERTISED_HOST_NAME: 192.168.99.100
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
kafka_3:
  image: 47deg/kafka
  ports:
    - "9092"
  links:
    - zookeeper:zk
  environment:
    KAFKA_ADVERTISED_HOST_NAME: 192.168.99.100
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock

While we defined three Kafka nodes in this example, we could have defined just a single kafka service and then scaled it out with something like docker-compose scale kafka=3. Either approach is fine in the docker-compose descriptor file.
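
Note that KAFKA_ADVERTISED_HOST_NAME is set to the IP of the Docker host, since that’s the address clients will use to reach the brokers. If you’re running through docker-machine, as we are in this example, you can look that IP up with the following command (assuming your machine is named default):

docker-machine ip default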

### App Docker Build

In this section, we will define the Docker container for our app, based on the image we built in the Dockerizing our Application section above.

sparkon:
  image: 47deg/sparkon
  ports:
  - "9090:9090"
  container_name: sparkon
  links:
    - spark_master
    - cassandra_seed
    - cassandra_slave
    - namenode
    - zookeeper
    - kafka_1
    - kafka_2
    - kafka_3
  env_file: sparkOn.env

Notice we’ve included a few things in here:

  • Links to the previous docker containers (all the services our application needs).
  • A sparkOn.env file, where we have set up all the environment variables needed to run the application, including the Twitter credentials.

Additionally, keep in mind that Docker Compose injects a set of environment variables for each linked container, each one prefixed with the uppercase name of that container. An interesting one, for example, is the network IP address of a specific service; take a look at the docs for more information. In this example, we use one of them to get the IP of the Spark Master and configure our application to run against it: SPARK_MASTER_PORT_7077_TCP_ADDR, which follows the name_PORT_num_protocol_ADDR pattern.
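
A simple way to see exactly which variables Docker Compose injected is to dump the environment of the running app container:

# inspect the variables injected for the linked spark_master and cassandra_seed containers
docker exec -t sparkon env | grep SPARK_MASTER
docker exec -t sparkon env | grep CASSANDRA_SEED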

## Cluster Deployment and Running the Application

We’ve finally reached the last section of our Spark On! series. To start off, we need to define a few environment variables in this config file.

We’ve defined a bash script that deploys all of the cluster dependencies, including the Spark Streaming application, which means we can run it like this:

scripts/deploy.sh

If everything is functioning correctly, we can start the Twitter Streaming as follows:

curl -X "POST" "https://192.168.99.100:9090/twitter-streaming" \
  -H "Content-Type: application/json" \
  -d $'{
  "recreateDatabaseSchema": true,
  "filters": [
    "lambda",
    "scala",
    "akka",
    "spray",
    "play2",
    "playframework",
    "spark",
    "java",
    "python",
    "cassandra",
    "bigdata",
    "47 Degrees",
    "47Degrees",
    "47Deg",
    "programming",
    "chicharrones",
    "cat",
    "dog"
  ]
}'
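
Once the streaming job has been running for a while, we can peek into Cassandra to confirm that data is being written. The keyspace and table names come from the application configuration, so they aren’t shown here; we also assume cqlsh is available inside the 47deg/cassandra image:

docker exec -it cassandra_seed cqlsh -e "DESCRIBE KEYSPACES;"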

## Scaling Out Services

We can scale out services, such as increasing the Spark Workers available:

docker-compose scale spark_worker=5

## Stop and Remove Cluster

We can stop the streaming gracefully, before stopping the cluster:

curl -X "DELETE" "https://192.168.99.100:9090/twitter-streaming"

And then, from the shell:

cd scripts
docker-compose stop
docker-compose rm
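
Keep in mind that docker-compose rm asks for confirmation before deleting the stopped containers; to skip the prompt, for instance inside a script, you can pass the force flag:

docker-compose rm -f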

## Conclusion

Altogether, we can conclude by saying that Spark and the tech stack around it, including Scala and Akka, make a developer’s life much easier when tackling Big Data problems. It also covers all kinds of business software requirements in a Reactive way, providing features like fault tolerance, responsiveness, elasticity, resilience, distribution, and high performance.

We hope you enjoyed this simple project and found it helpful. As always, you can check out the entire code here.

See you soon!
