Comparing energy consumption in Scala projects
by James Santucci
June 07, 2022
scala • green software • sbt • benchmarking
11 minutes to read.

A new release of sbt-energymonitor
makes it possible to persist measurements to a remote server somewhere for later analysis. This post will explain how to
use this new functionality, and how to do a responsible job analyzing the data
after the fact. All examples to follow will be in Scala. Ready? Ok, let’s do it.
Setup
We’re going to analyze some different options we have for adding transactionality to a series of effectful operations. In this case, the operation is going to be to write a random integer with a bunch of keys in a key-value store, read the numbers from all of those keys, and add them up. This problem is admittedly not an especially interesting one. It sort of resembles a poor man’s map-reduce, so it should look familiar enough.
The goal here is not to pick the best way to do this strange task – if it turns
out YOLO-ing mutations with the Scala standard library takes less energy than a
robust and thread-safe transactional model like cats-stm
, that doesn’t mean you should always YOLO your mutations. However, if you don’t need safety and
robustness, and aren’t dealing with user input (like in tests), it might turn out
that YOLO-ing is fine and saves some energy. If you run your tests a lot (e.g.,
if you use scala-steward
and have CI runs going all the time where you don’t
really expect there to be any behavioral changes), those small energy savings
might be valuable.
To get started, let’s pull the example repo. The repo defines an algebra with three methods – setKey, setMany, and getKey – that effectfully set and fetch values for some key.
trait KvStore[F[_], K, V] {
  def setKey(key: K, value: V): F[Unit]
  def setMany(pairs: List[(K, V)]): F[Unit]
  def getKey(key: K): F[Option[V]]
}
We’ll also provide two implementations. One implementation dishonestly
suspends mutations in some concurrent F
. It looks something like
this:
import cats.effect.Sync
import cats.syntax.all._
import scala.collection.mutable.{Map => MutableMap}

def yolo[F[_]: Sync, K, V]: KvStore[F, K, V] =
  new KvStore[F, K, V] {
    private val underlying: MutableMap[K, V] = MutableMap.empty

    def setKey(key: K, value: V): F[Unit] =
      Sync[F].delay {
        underlying += ((key, value))
      }

    def setMany(pairs: List[(K, V)]): F[Unit] =
      pairs.traverse { case (k, v) =>
        setKey(k, v)
      }.void

    def getKey(key: K): F[Option[V]] =
      Sync[F].delay {
        underlying.get(key)
      }
  }
I’m describing the suspension in F
as “dishonest” because, while supposedly we
can make a bunch of concurrent calls in F
, the underlying mutable map has no
knowledge of the concurrent runtime and so we can still run into conflicts.
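To see what can go wrong, here’s a minimal sketch of the kind of conflict the YOLO store allows. It assumes the constructor above is exposed as KvStore.yolo (a name the snippet above doesn’t actually show) and that the cats-effect runtime is in scope:
import cats.effect.IO
import cats.effect.unsafe.implicits.global
import cats.syntax.all._

// 1000 fibers each read "counter" and write back the value plus one. Nothing
// coordinates the read-modify-write pairs, so updates can interleave and get
// lost, and the final value is frequently less than 1000.
val store = KvStore.yolo[IO, String, Int]

val racyCounter: IO[Option[Int]] =
  store.setKey("counter", 0) >>
    (1 to 1000).toList.parTraverse { _ =>
      store.getKey("counter").flatMap { curr =>
        store.setKey("counter", curr.getOrElse(0) + 1)
      }
    }.void >>
    store.getKey("counter")

// racyCounter.unsafeRunSync() often comes back as Some(n) with n < 1000
Whether that ever matters for the benchmark below is a separate question; the point is just that nothing in the YOLO implementation prevents it.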
Our second implementation will use cats-stm
against a standard immutable
Map
. It looks something like this:
import cats.effect.Concurrent
import cats.syntax.all._
import io.github.timwspence.cats.stm.STM

def forStm[F[_]: Concurrent, K, V](
  stm: STM[F]
): F[KvStore[stm.Txn, K, V]] = {
  import stm._
  stm.commit(TVar.of(Map.empty[K, V])).map { underlying =>
    new KvStore[stm.Txn, K, V] {
      def setKey(key: K, value: V): Txn[Unit] =
        for {
          curr <- underlying.get
          _ <- underlying.set(curr + ((key, value)))
        } yield ()

      def setMany(pairs: List[(K, V)]): Txn[Unit] =
        pairs.traverse { case (k, v) =>
          setKey(k, v)
        }.void

      def getKey(key: K): Txn[Option[V]] =
        underlying.get.map(_.get(key))
    }
  }
}
There are two important differences between these two implementations. The first is not thread-safe and has no rollback behavior in the event that any of the operations fails. The second is both thread-safe and transactional because, in the words of the cats-stm introductory docs, “we can execute the transaction against a log (similar to a database) and only commit the final states to the TVars if the whole transaction succeeds.” Neat!
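To make that concrete, here’s a sketch of the same concurrent counter as before, built on the forStm constructor and relying on cats-effect’s Parallel instance for IO. Because each read-modify-write runs inside a single transaction, conflicting transactions are retried and the final value is always 1000:
import cats.effect.IO
import cats.syntax.all._
import io.github.timwspence.cats.stm.STM

val safeCounter: IO[Option[Int]] =
  STM.runtime[IO].flatMap { stm =>
    import stm._
    KvStore.forStm[IO, String, Int](stm).flatMap { kvStore =>
      // One transaction that reads the current value and writes back value + 1.
      val increment: Txn[Unit] =
        kvStore.getKey("counter").flatMap { curr =>
          kvStore.setKey("counter", curr.getOrElse(0) + 1)
        }

      stm.commit(kvStore.setKey("counter", 0)) >>
        (1 to 1000).toList.parTraverse(_ => stm.commit(increment)).void >>
        stm.commit(kvStore.getKey("counter"))
    }
  }
That safety isn’t free: the transaction log and retries are extra work, which is exactly the kind of difference the energy measurements below should surface.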
Energy Consumption Tests
Now that we have two competing implementations, we can benchmark them. Benchmarking on the JVM requires fighting the JVM’s aggressive optimizations, which are as likely to affect energy consumption as they are to affect performance more broadly. JMH and sbt-jmh address these problems head on. However, I’ll sweep them under the rug here and rely on running lots of examples with ScalaCheck as a cheap substitute for the warm-up and test iterations JMH would provide.
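If the ScalaCheck defaults ever felt too small for that purpose, cranking up the iteration count is a one-liner. This is a hypothetical sketch, not something the example repo does (its specs use the defaults):
import org.scalacheck.{Prop, Properties, Test}

// A hypothetical spec that runs each property 1000 times instead of
// ScalaCheck's default 100, as a crude stand-in for JMH-style warm-up.
object ManyIterationsSpec extends Properties("kv-store") {

  override def overrideParameters(p: Test.Parameters): Test.Parameters =
    p.withMinSuccessfulTests(1000)

  property("sums round-trip through the store") = Prop.forAll {
    (pairs: List[(String, Int)]) =>
      // ... same body as the STM property shown below ...
      true
  }
}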
The tests follow a procedure of generating a bunch of pairs, running the pairs through a KvStore implementation, and adding up the results. For example, for the STM implementation, that looks something like this:
(pairs: List[(String, Int)]) =>
  (
    for {
      stm <- STM.runtime[IO]
      kvStore <- KvStore.forStm[IO, String, Int](stm)
      ints <- stm.commit(
        kvStore.setMany(pairs) >> pairs.flatTraverse { case (k, _) =>
          kvStore.getKey(k).map(_.toList)
        }
      )
    } yield ints
  ).map { ints =>
    ints.sum > 0
  }.unsafeRunSync()
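The YOLO property is almost identical; the only differences are that there’s no STM runtime to allocate and no commit step. Roughly, and again assuming the constructor is exposed as KvStore.yolo:
(pairs: List[(String, Int)]) =>
  {
    // build the mutable-map-backed store and run the same write-then-sum check
    val kvStore = KvStore.yolo[IO, String, Int]
    (kvStore.setMany(pairs) >> pairs.flatTraverse { case (k, _) =>
      kvStore.getKey(k).map(_.toList)
    }).map(_.sum > 0).unsafeRunSync()
  }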
With the ScalaCheck default configurations, each test will run 100 times, which should be enough to generate some variation in energy consumption.
Storing energy measurements
With the previous sbt-energymonitor
release, we could check energy consumption
like so:
> energyMonitorPreSample
> testOnly *STMSpec
> energyMonitorPostSample
[info] During CI attempt x, this run consumed power from x CPU cores.
[info]
[info] The total energy consumed in joules was x.
[info]
[info] In the sampling period, mean power consumption was x watts.
However, the new release provides new powers. Instead of printing to the
console, we can now ship the results to a server for later analysis. To do so,
we’ll need some environment variables that mimic what would be present in a
GitHub Actions setting, namely, GITHUB_REF_NAME
, GITHUB_RUN_ATTEMPT
, and
GITHUB_REPOSITORY
. Since we don’t actually have distinct CI runs, we can keep GITHUB_RUN_ATTEMPT at 1 and sample repeatedly. That won’t be relevant here, but we could just as easily set GITHUB_RUN_ATTEMPT to the looping variable if it were important. This short bash script runs the test 100 times for either the yolo or the stm implementation, setting the tag appropriately.
cmd=""
tag=""
case "${1}" in
"--stm")
cmd="testOnly *STMSpec"
tag="stm"
;;
"--yolo")
cmd="testOnly *YOLOSpec"
tag="yolo"
;;
esac;
for run in {1..100}; do
sbt "set energyMonitorPersistenceTag := Some(\"${tag}\");energyMonitorPreSample;kvStore/${cmd};energyMonitorPostSampleHttp";
done;
Visualizing the results
With the results stored elsewhere, it’s now time to do some analysis. With
benchmarking results, it’s common to present two numbers, where Number 1
is
larger or smaller than Number 2
, and conclude that there’s a meaningful
difference between the two numbers. That sort of presentation reflects an
assumption that the distributions are really well separated, something like
this:
If you have the data and separated histograms like that, it’s really easy to tell that, an overwhelming majority of the time, a random draw from distribution 1 will be lower than a random draw from distribution 2.
However, the underlying distributions might not look like that! I ran the tests from the sample repo 216 times for the STM implementation and 206 times for the YOLO implementation, and wound up with energy consumption measurements in pretty overlapping ranges:
Here, the distributions aren’t cleanly separated! While the YOLO implementation
looks like it uses less energy (which matches what we’d have expected in
advance, since it does less work), we can’t say for sure that values drawn from
one distribution are likely to be higher or lower than values drawn from another
distribution just by looking. In such a situation, it’s more appropriate to use
a statistical test. The test in question is a difference of means test, or a
two-sided t-test that you might remember from the first two months of a stats
class long ago. Since we have no prior reason to believe the variances are
equal, we should perform Welch’s t-test. This is easy with scipy
:
import pandas as pd
import requests
from scipy.stats import ttest_ind

df = pd.DataFrame(
    requests.get("http://localhost:8080/jisantuc/energy-test-example").json()
)

def significant_difference(df, tag1, tag2):
    ser1 = df[df["tag"] == tag1]["joules"]
    ser2 = df[df["tag"] == tag2]["joules"]
    return ttest_ind(ser1, ser2, equal_var=False)
The significant_difference function returns two values – a test statistic and a p-value. You can read more about the interpretation of this test in the scipy docs.
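For reference, the statistic scipy computes here with equal_var=False is Welch’s t,

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} $$

where $\bar{x}_i$, $s_i^2$, and $n_i$ are the sample mean, sample variance, and sample size of each tagged group, and the p-value comes from a t distribution whose degrees of freedom are estimated with the Welch–Satterthwaite equation.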
The p-value for my samples was just about 0 (5.08e-17), so in this case, we can confidently say that we expect the YOLO implementation to consume less energy per run than the STM implementation.
How much less energy? In this case, not a ton – the mean consumption for the YOLO implementation was about 68 joules per sampled run, and for the STM implementation it was about 79 joules. At roughly 11 joules saved per sampled run of 100 tests, it would take about 33 million individual test executions – a bit over 300,000 sampled runs – to save one kilowatt hour (3.6 million joules). However, greater differences should be achievable in much more interesting workflows than this one.