Comparing energy consumption in Scala projects

Comparing energy consumption in Scala projects

A new release of sbt-energymonitor makes it possible to persist measurements to a remote server somewhere for later analysis. This post will explain how to use this new functionality, and how to do a responsible job analyzing the data after the fact. All examples to follow will be in Scala. Ready? Ok, let’s do it.

Setup

We’re going to analyze some different options we have for adding transactionality to a series of effectful operations. In this case, the operation is going to be to write a random integer with a bunch of keys in a key-value store, read the numbers from all of those keys, and add them up. This problem is admittedly not an especially interesting one. It sort of resembles a poor man’s map-reduce, so it should look familiar enough.

The goal here is not to pick the best way to do this strange task – if it turns out YOLO-ing mutations with the Scala standard library takes less energy than a robust and thread safe transactional model like cats-stm, that doesn’t mean you should always YOLO your mutations. However, if you don’t need safety and robustness, and aren’t dealing with user input (like in tests), it might turn out that YOLO-ing is fine and saves some energy. If you run your tests a lot (e.g., if you use scala-steward and have CI runs going all the time where you don’t really expect there to be any behavioral changes), those small energy savings might be valuable.

To get started, let’s pull the example repo. The repo defines an algebra that requires three methods – setKey, setMany, and getKey, that effectfully set and fetch values for some key.

trait KVStore[F[_], K, V] {
    def setKey(key: K, value: V): F[Unit]
    def setMany(pairs: List[(K, V)]): F[Unit]
    def getKey(key: K): F[Option[V]]
}

We’ll also provide two implementations. One implementation dishonestly suspends mutations in some concurrent F. It looks something like this:

def yolo[F[_]: Sync, K, V] =
  new KvStore[F, K, V] {
    private val underlying: MutableMap[K, V] = MutableMap.empty
    def setKey(key: K, value: V): F[Unit] =
      Sync[F].delay {
        underlying += ((key, value))
      }

    def setMany(pairs: List[(K, V)]): F[Unit] =
      pairs.traverse { case (k, v) =>
        setKey(k, v)
      }.void

    def getKey(key: K): F[Option[V]] =
      Sync[F].delay {
        underlying.get(key)
      }
  }

I’m describing the suspension in F as “dishonest” because, while supposedly we can make a bunch of concurrent calls in F, the underlying mutable map has no knowledge of the concurrent runtime and so we can still run into conflicts.

Our second implementation will use cats-stm against a standard immutable Map. It looks something like this:

def forStm[F[_]: Concurrent, K, V](
    stm: STM[F]
): F[KvStore[stm.Txn, K, V]] = {

  import stm._

  stm.commit(TVar.of(Map.empty[K, V])) map { underlying =>
    new KvStore[stm.Txn, K, V] {
      import stm._

      def setKey(key: K, value: V): Txn[Unit] =
        for {
          curr <- underlying.get
          _ <- underlying.set(curr + ((key, value)))
        } yield ()

      def setMany(pairs: List[(K, V)]): Txn[Unit] =
          pairs.traverse { case (k, v) =>
            setKey(k, v)
          }.void

      def getKey(key: K): Txn[Option[V]] =
          underlying.get.map(_.get(key))

    }
  }

}

There are two important differences between these two implementations. The first is not thread safe and has no rollback behavior in the event that any of the operations fail. The second is both thread safe and transactional because, in the words of the cats-stm introductory docs){:target=”_blank” rel=”noopener noreferrer”}, “we can execute the transaction against a log (similar to a database) and only commit the final states to the TVars if the whole transaction succeeds.” Neat!

Energy Consumption Tests

Now that we have two competing implementations, we can benchmark them. Benchmarking on the JVM requires fighting the JVM’s aggressive optimizations, which are as likely to have an impact on energy consumption as they are on more broadly considered performance. JMH and sbt-jmh address these problems head on. However, I’ll sweep them under the rug here, and rely on running lots of examples with ScalaCheck as a cheap substitute for the warm up and test iterations JMH would provide.

The tests follow a procedure of generating a bunch of pairs, running the pairs through a KvStore implementation, and adding up the results. For example, for the STM implementation, that looks something like this:

  (pairs: List[(String, Int)]) =>
    (
      for {
        stm <- STM.runtime[IO]
        kvStore <- KvStore.forStm[IO, String, Int](stm)
        ints <- stm.commit(
          kvStore.setMany(pairs) >> pairs.flatTraverse { case (k, _) =>
            kvStore.getKey(k).map(_.toList)
          }
        )
      } yield {
        ints
      }
    ).map { ints =>
        ints.sum > 0
      }
     .unsafeRunSync()

With the ScalaCheck default configurations, each test will run 100 times, which should be enough to generate some variation in energy consumption.

Storing energy measurements

With the previous sbt-energymonitor release, we could check energy consumption like so:

> energyMonitorPreSample
> testOnly *STMSpec
> energyMonitorPostSample
[info]  During CI attempt x, this run consumed power from x CPU cores.
[info]
[info]  The total energy consumed in joules was x.
[info]
[info]  In the sampling period, mean power consumption was x watts.

However, the new release provides new powers. Instead of printing to the console, we can now ship the results to a server for later analysis. To do so, we’ll need some environment variables that mimic what would be present in a GitHub Actions setting, namely, GITHUB_REF_NAME, GITHUB_RUN_ATTEMPT, and GITHUB_REPOSITORY. Since we don’t actually have distinct CI runs, we can keep GITHUB_RUN_ATTEMPT at 1 and sample repeatedly. That won’t be relevant here, but we could just as easily set GITHUB_RUN_ATTEMPT to the looping variable if it was important. This short bash script runs the test 100 times for the yolo or stm implementations, setting the tag appropriately.

cmd=""
tag=""

case "${1}" in
  "--stm")
    cmd="testOnly *STMSpec"
    tag="stm"
    ;;

  "--yolo")
    cmd="testOnly *YOLOSpec"
    tag="yolo"
    ;;
esac;

for run in {1..100}; do
    sbt "set energyMonitorPersistenceTag := Some(\"${tag}\");energyMonitorPreSample;kvStore/${cmd};energyMonitorPostSampleHttp";
done;

Visualizing the results

With the results stored elsewhere, it’s now time to do some analysis. With benchmarking results, it’s common to present two numbers, where Number 1 is larger or smaller than Number 2, and conclude that there’s a meaningful difference between the two numbers. That sort of presentation reflects an assumption that the distributions are really well separated, something like this:

Bar chart showing separated distributions between distribution 1 and distribution 2.

If you have the data and separated histograms like that, it’s really easy to tell that, an overwhelming majority of the time, a random draw from distribution 1 will be lower than a random draw from distribution 2.

However, the underlying distributions might not look like that! I ran the tests from the sample repo 216 times for the STM implementation and 206 times for the YOLO implementation, and wound up with energy consumption measurements in pretty overlapping ranges:

Bar chart showing the energy consumption of both STM and YOLO implementations.

Here, the distributions aren’t cleanly separated! While the YOLO implementation looks like it uses less energy (which matches what we’d have expected in advance, since it does less work), we can’t say for sure that values drawn from one distribution are likely to be higher or lower than values drawn from another distribution just by looking. In such a situation, it’s more appropriate to use a statistical test. The test in question is a difference of means test, or a two-sided t-test that you might remember from the first two months of a stats class long ago. Since we have no prior reason to believe the variances are equal, we should perform Welch’s t-test. This is easy with scipy:

import fromscipy.stats import ttest_ind

df = pd.DataFrame(
    requests.get("http://localhost:8080/jisantuc/energy-test-example").json()
)

def significant_difference(df, tag1, tag2):
    ser1 = df[df["tag"] == tag1]["joules"]
    ser2 = df[df["tag"] == tag2]["joules"]
    return ttest_ind(ser1, ser2, equal_var=False)

The significant_difference method returns two values – a test statistic and a p-value. You can read more about the interpretation of this test in the scipy docs.

The p-value for my samples was just about 0 (5.08e-17), so in this case, we can confidently say that we expect the YOLO implementation to consume less energy per run than the STM implementation.

How much less energy? In this case, not a ton – the mean consumption for the YOLO implementation was about 68 joules, and for the STM implementation it was 79 joules, so it would take about 33 million runs to save one kilowatt hour. However, greater differences should be achievable in much more interesting workflows than this one.

Ensure the success of your project

47 Degrees can work with you to help manage the risks of technology evolution, develop a team of top-tier engaged developers, improve productivity, lower maintenance cost, increase hardware utilization, and improve product quality; all while using the best technologies.