Pitfalls of Mocking in tests

What is mocking

“Mocking” is a word that can mean different things to different people. Other terms you might see are “stubs” or “test doubles.”

Here I’ll use “mocking” to specifically refer to “magical”, macro, reflection-based, or monkey-patch-based mocking.

Examples of mocking:

  • ScalaMock
  • Mockito
  • Python’s mock.patch

When we implement a stripped-down version of an interface to use in tests, for example to return canned data or simply fail with an error, I’ll call that a stub.

Examples of stubs:

  • new MyInterface { def method() = throw new Exception("fail") }
  • new UserRepository { def getUser(id: Int) = User(id, "fred", "flintstone") }

When I talk about using an extra implementation of an interface that honors the semantics of the interface, but is stripped down in terms of complexity or dependencies, I’ll call that a test double, or simulator.

Examples of test doubles and simulators:

  • Using a Map to stand in for a database or file system.
  • Using an in-process version of an external service, like embedded mongo.
  • Using a local version of an external service, such as dynalite or a minio s3.
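To make the distinction between a stub and a test double concrete, here is a minimal sketch. The KeyValueStore interface and both implementations are hypothetical and exist only for this illustration. It uses plain synchronous methods so that it runs without any library.

```scala
object StubVersusDouble {
  // Hypothetical interface, invented for this illustration.
  trait KeyValueStore {
    def get(key: String): Option[String]
    def put(key: String, value: String): Unit
  }

  // A stub: returns canned data and makes no attempt to honor the contract.
  val stub: KeyValueStore = new KeyValueStore {
    def get(key: String): Option[String] = Some("canned")
    def put(key: String, value: String): Unit = ()
  }

  // A test double: a Map stands in for the real storage, so a value
  // written with put can be read back with get. Not thread-safe.
  final class InMemoryStore extends KeyValueStore {
    private var data = Map.empty[String, String]
    def get(key: String): Option[String] = data.get(key)
    def put(key: String, value: String): Unit = { data = data.updated(key, value) }
  }
}
```

The stub answers the same way no matter what was written, while the double behaves like a small but honest implementation of the interface.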

Why mocks aren’t the right choice

  • It’s easy to make mock instances incorrect - there’s no guarantee that a mock is consistent with a correct implementation, which can lead to brittle tests.
  • Mocks discourage good interface design by encouraging testing implementation details instead of contracts.
  • Tests should be “executable documentation,” and mock-based tests don’t read like the code people write for real.
  • Because they can use more-magical features like macros and reflection, they are less portable - e.g., at the time of writing, ScalaMock didn’t support Scala 3 and had no roadmap for it.

Brittle tests

Mock instances are easier to construct when they don’t implement any logic and just return canned results. This becomes an impediment, though, when the way callers invoke their dependencies changes.

For example, consider a case where we have a system that stores snapshots from an event sourced stream.

trait SnapshotStorage {
  def find(id: EntityId): IO[Option[Entity]]
  def update(id: EntityId, entity: Entity): IO[Unit]
}

class EventProcessor(store: SnapshotStorage) {
  def handle(event: Event): IO[Unit] =
    store.update(event.entityId, event.entity)
}

And of course for that we’d have a test:

test("should store new entities") {
  val storage = mock[SnapshotStorage]
  // update returns IO[Unit], so the canned result has to be an IO as well
  (storage.update _).expects(*, *).returning(IO.unit)

  val ep = new EventProcessor(storage)
  ep.handle(testEvent()).attempt.map(result => assert(result.isRight))
}

Now we’d like to update our processor to only perform an update when the event we receive is newer than what’s stored:

class EventProcessor(store: SnapshotStorage) {
  def handle(event: Event): IO[Unit] = {
    def update = store.update(event.entityId, event.entity)
    store.find(event.entityId).flatMap {
      case Some(existing) =>
        if (existing.version <= event.entity.version) update
        else IO.unit
      case None => update
    }
  }
}

Given that code update, our existing test fails even though the behavior it describes still works: the mock has no expectation for the new call to find. We only changed the code in EventProcessor, yet our test code has to change how it interacts with SnapshotStorage. This is exactly the kind of tight coupling that can cripple a codebase over time and increase the cost to the business, by making new changes harder and regressions easier to introduce. In a loosely-coupled system, adding new features shouldn’t make us rewrite the tests for old features.

If instead of mocks we used a test double, the old test would still work without modification.

class TestStorage(data: Ref[IO, Map[EntityId, Entity]]) extends SnapshotStorage {
  def find(id: EntityId): IO[Option[Entity]] =
    data.get.map { entityMap =>
      entityMap.get(id)
    }
  def update(id: EntityId, entity: Entity): IO[Unit] =
    data.update { entityMap =>
      entityMap.updated(id, entity)
    }
}
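As a rough, dependency-free sketch of why the double keeps the test stable, the example below runs one test body against both versions of the processor. It is a simplification, not the code from this article: IO is replaced by plain synchronous calls, EntityId becomes a String, and the names InMemoryStorage, AlwaysStore, StoreIfNewer, and storesNewEntities are made up for the illustration.

```scala
object SnapshotExample {
  // Simplified stand-ins for the domain types (assumptions for this sketch).
  final case class Entity(version: Int, payload: String)
  final case class Event(entityId: String, entity: Entity)

  trait SnapshotStorage {
    def find(id: String): Option[Entity]
    def update(id: String, entity: Entity): Unit
  }

  // Test double: a Map that honors the find/update contract. Not thread-safe.
  final class InMemoryStorage extends SnapshotStorage {
    private var data = Map.empty[String, Entity]
    def find(id: String): Option[Entity] = data.get(id)
    def update(id: String, entity: Entity): Unit = { data = data.updated(id, entity) }
  }

  trait EventProcessor { def handle(event: Event): Unit }

  // First version: always store the entity carried by the event.
  final class AlwaysStore(store: SnapshotStorage) extends EventProcessor {
    def handle(event: Event): Unit = store.update(event.entityId, event.entity)
  }

  // Second version: skip the update when the stored snapshot is newer.
  final class StoreIfNewer(store: SnapshotStorage) extends EventProcessor {
    def handle(event: Event): Unit =
      store.find(event.entityId) match {
        case Some(existing) if existing.version > event.entity.version => ()
        case _ => store.update(event.entityId, event.entity)
      }
  }

  // One test body, written against the contract only: after handling an
  // event for a new entity, that entity can be found in storage.
  def storesNewEntities(mkProcessor: SnapshotStorage => EventProcessor): Boolean = {
    val storage = new InMemoryStorage
    val event = Event("entity-1", Entity(version = 1, payload = "created"))
    mkProcessor(storage).handle(event)
    storage.find("entity-1").contains(event.entity)
  }
}
```

Because the check only talks about what ends up in storage, it doesn’t care whether the processor calls find first.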

Discouraging interface design

One of the ways that magic mocking can discourage interface design is by making it too easy to write tests that are hardcoded to a poorly designed class. Instead of the friction from the class prompting an improvement to the code, it’s easy to gloss over.

For example, this is similar to some code I’ve seen before:

class ElasticsearchRepository(client: ElasticClient) {
  def find(documentId: String)(implicit ec: ExecutionContext, log: Logger): Future[Json] = /* impl */
  def upload(document: Json)(implicit ec: ExecutionContext, log: Logger): Future[String] = /* impl */
}

This code has two issues in particular:

  • Because it doesn’t have an interface, it becomes more difficult to create a test double. You need to supply an ElasticClient even if you don’t intend to use it - leading to pointless boilerplate code in tests, and more potential for coding errors.
  • It complects implementation details (logging and thread pools) with the business domain (searching data in Elasticsearch). A good interface should only contain information about its domain, and thread pools aren’t part of the semantic domain of an Elasticsearch repository.

Refactoring this to an interface resolves these issues:

trait ElasticsearchRepository {
  def find(documentId: String): Future[Json]
  def upload(document: Json): Future[String]
}

This gives us simpler code and less hassle when testing.
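With the interface in place, a test double no longer needs an ElasticClient at all. Here is one possible sketch using only the standard library. Json is aliased to String as a placeholder, and the InMemoryRepository name, the generated ids, and failing with NoSuchElementException for a missing document are all assumptions made for the example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object RepositoryDouble {
  type Json = String // placeholder for a real JSON type (assumption)

  trait ElasticsearchRepository {
    def find(documentId: String): Future[Json]
    def upload(document: Json): Future[String]
  }

  // In-memory double: no client, no thread pool, no logger. Not thread-safe.
  final class InMemoryRepository extends ElasticsearchRepository {
    private var documents = Map.empty[String, Json]

    def find(documentId: String): Future[Json] =
      documents.get(documentId) match {
        case Some(doc) => Future.successful(doc)
        // Assumption: a missing document is reported as a failed Future.
        case None => Future.failed(new NoSuchElementException(documentId))
      }

    def upload(document: Json): Future[String] = {
      val id = s"doc-${documents.size + 1}" // made-up id scheme
      documents = documents.updated(id, document)
      Future.successful(id)
    }
  }

  // Uploads a document and reads it back through the interface.
  def roundTrip(document: Json): Json = {
    val repo: ElasticsearchRepository = new InMemoryRepository
    val id = Await.result(repo.upload(document), 1.second)
    Await.result(repo.find(id), 1.second)
  }
}
```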

Tests as executable documentation

We write tests because we want to gain confidence in the correctness of systems we’re writing, and because we want to reduce the risk of regression. Most tests are at least partially unrealistic, but you get the most confidence and the biggest reduction of risk when the code in tests is exactly the same as the code people write for real.

In our production code, we don’t use mocks - we use plain interfaces, methods, and values. So a test that’s also using the same kinds of interfaces, methods, and values is going to be closer to what actually runs in production.

Suppose we have a UserService that allows us to query our user account information from a database, and we want to write tests for our NotificationService that uses it to format a template with information about the user.

trait UserService {
  def find(id: UserId): IO[User] // fails if not found
  def register(email: EmailAddress, name: String): IO[UserId]
}

trait NotificationService {
  def formatNotification(userId: UserId, template: NotificationTemplate): IO[String]
}

And here’s our implementation of those, backed by a database:

class DbUserService(xa: doobie.Transactor[IO]) extends UserService {
  def find(id: UserId): IO[User] =
    sql"SELECT email, name FROM users WHERE id = ${id}"
      .query[User].unique.transact(xa)

  def register(email: EmailAddress, name: String): IO[UserId] =
    sql"INSERT INTO users (email, name) VALUES ($email, $name)"
      .update.withUniqueGeneratedKeys[UserId]("id").transact(xa)
}

class NotificationServiceImpl(users: UserService) extends NotificationService {
  def formatNotification(userId: UserId, template: NotificationTemplate): IO[String] =
    users.find(userId)
      .map { user =>
        formatTemplate(user, template)
      }

  private def formatTemplate(user: User, template: NotificationTemplate): String = ???
}

Here’s how we might test these, both with and without mocks:

val fred = User(EmailAddress("fred@flintstones.com"), "Fred Flintstone")

test("with scalamock") {
  val users = mock[DbUserService] // 1
  (users.find _).expects(*).returning(IO.pure(fred)) // 2
  val ns = new NotificationServiceImpl(users)
  // fredId can be any UserId here: the wildcard expectation ignores it
  ns.formatNotification(fredId, NotificationTemplate.bowlingNight)
    .attempt
    .map { result => assert(result.isRight) }
}

test("using a test double") {
  // 3: a `UserServiceTestDouble` built on top of Ref+Map can be defined elsewhere

  for {
    users <- UserServiceTestDouble.create
    fredId <- users.register(fred.email, fred.name) // 4

    ns = new NotificationServiceImpl(users)
    result <- ns.formatNotification(fredId, NotificationTemplate.bowlingNight).attempt

  } yield { assert(result.isRight) }
}

From the code sample above:

  1. Here, we are allowed to construct database-based mock instances, despite only wanting to use the interface - this increases chances of writing incorrect or tightly-coupled code.
  2. We set an explicit return value for the “find” method. But how do we know that find is what we need? It’s an implementation detail of NotificationService.
  3. When using a test double, we isolate and abstract over repeated setup code - normal practice for writing healthy code.
  4. We are writing our test using the domain language directly: first a user is registered. Then we notify that user. The notification shouldn’t fail. Contrast this to point 2, where the test has hardcoded knowledge of NotificationService invoking specific methods of its dependencies.
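Point 3 mentions a UserServiceTestDouble defined elsewhere. As a rough idea of what one could look like, here is a dependency-free sketch. It is not the Ref-based version: scala.util.Try stands in for IO, the case class shapes and the sequential id scheme are assumptions, and GreetingService is a hypothetical, much simpler stand-in for NotificationServiceImpl with a plain String template.

```scala
import scala.util.{Failure, Success, Try}

object UserServiceSketch {
  // Assumed shapes for the domain types.
  final case class UserId(value: Int)
  final case class EmailAddress(value: String)
  final case class User(email: EmailAddress, name: String)

  trait UserService {
    def find(id: UserId): Try[User] // fails if not found
    def register(email: EmailAddress, name: String): Try[UserId]
  }

  // Map-backed double that honors the contract: a registered user can be
  // found again, and an unknown id fails. Not thread-safe.
  final class UserServiceTestDouble extends UserService {
    private var users = Map.empty[UserId, User]

    def find(id: UserId): Try[User] =
      users.get(id) match {
        case Some(user) => Success(user)
        case None => Failure(new NoSuchElementException(s"no user with id ${id.value}"))
      }

    def register(email: EmailAddress, name: String): Try[UserId] = {
      val id = UserId(users.size + 1) // made-up id scheme
      users = users.updated(id, User(email, name))
      Success(id)
    }
  }

  // Hypothetical stand-in for NotificationServiceImpl.
  final class GreetingService(users: UserService) {
    def formatNotification(userId: UserId, template: String): Try[String] =
      users.find(userId).map(user => template.replace("{name}", user.name))
  }
}
```

A test can then register a user and format a notification for them, without knowing which UserService methods the notification code calls internally.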

How to avoid mocks

Now that we understand what we want to avoid, what should we do instead?

There are a few key techniques that we can use to arrive at a healthy codebase:

  • Use interfaces
  • Use test doubles and/or simulators
  • Write your test code to the same standard as production code

Use interfaces

The strongest tool we have for writing clean code is to use interfaces.

Keep your interfaces simple and straightforward, and try to keep them isolated. Imagine you’re writing the subcomponent at hand as a library. What would you expect to see if you grabbed the library “off the shelf” from an open source dependency? Write that. Avoid mixing in extra details.

A good rule of thumb for your business interfaces is that they should only mention terms that are relevant in your business domain. Avoid mixing concepts from multiple layers.

Dependencies of classes are usually implementation details and should appear in your class constructor, not in your methods. Examples of this would be: ExecutionContext, loggers, typeclass instances, connection pools, third-party library objects.
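A small sketch of that rule, using only the standard library. Greeter and LoggingGreeter are hypothetical names, and a String => Unit function stands in for a real logger. The interface mentions only the domain, while the ExecutionContext and the logger arrive through the constructor of the implementation.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ConstructorDependencies {
  // The interface only mentions the domain: a name in, a greeting out.
  trait Greeter {
    def greet(name: String): Future[String]
  }

  // Implementation details (logging, thread pool) are constructor parameters.
  final class LoggingGreeter(log: String => Unit)(implicit ec: ExecutionContext)
      extends Greeter {
    def greet(name: String): Future[String] =
      Future {
        log(s"greeting $name")
        s"Hello, $name!"
      }
  }

  // Wires up the implementation once; callers only see Greeter.
  def demo(): (String, String) = {
    implicit val ec: ExecutionContext = ExecutionContext.global
    val logLines = new ConcurrentLinkedQueue[String]()
    val greeter: Greeter = new LoggingGreeter(line => { logLines.add(line); () })
    val greeting = Await.result(greeter.greet("Fred"), 5.seconds)
    (greeting, logLines.peek())
  }
}
```

Code that depends on Greeter never has to supply a thread pool or a logger, and a test double for it needs neither.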

Use test doubles and simulators

Write your tests using simple test doubles and simulators. It’s not a bad thing for you to have tests that invoke an external process! Tools like Docker can make it easy to have external resources like S3 or Postgres available for your tests. There’s no need to reinvent the wheel.

Keep the same coding standards

You want to keep the same coding standards in your tests as in your production code. The same things that make production code readable make tests readable too. Refactor repeated boilerplate, use methods, and stick to interfaces rather than implementation details.

When are magic mocks good?

Everything in software is a tradeoff - it’s very rare for one option to be strictly better than another in every single case. So with that in mind, when would you still want to use magic mocks anyway? What do they offer that we can’t easily get from writing code with interfaces and contracts?

Let’s revisit some of the points about mocks that make them undesirable:

  • They hide design smells by making it easy to construct unrealistic setups, and bypass class dependencies.
  • They make it easier to write simple data stubs for complex and badly designed interfaces.
  • They encourage tests to contain implementation details instead of being based around interface contracts.

So when are those aspects strengths? I think that the best argument for using mocks is to help rehabilitate an unhealthy codebase. If a codebase has poorly designed interfaces, missing tests, and the contracts are undocumented - that’s when mocks can help you bridge the gap. A common technique in rehabilitating unhealthy codebases is to use “characterization tests.” Characterization tests, unlike most others, don’t focus on providing confidence in your design and general code correctness. Instead, they serve to describe the behavior of an unknown system. “I don’t know how this black box works in general, but when I push this button, it lights up.” That’s the kind of characterization test you can write, and by writing those, you gain documentation and regression resistance.

Mocks can help cut the knot of tangled dependencies and write tests that describe how the system functions today. Those tests fill the gap of missing documentation, and they provide insurance against accidentally breaking those behaviors.
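To show the shape of a characterization test, here is a tiny sketch. It uses no mocks: LegacyShipping is an invented piece of “legacy” code and the observed table is hypothetical. The point is only that the expected values are recorded from what the code does today, not derived from a specification.

```scala
object CharacterizationSketch {
  // Hypothetical legacy code whose rules nobody has documented.
  object LegacyShipping {
    def cost(weightKg: Int, express: Boolean): Int = {
      var c = 5 + weightKg * 2
      if (weightKg > 10) c = c - 3
      if (express) c = c * 2
      c
    }
  }

  // "When I push this button, it lights up": inputs paired with the
  // outputs the code produces today.
  val observed: List[((Int, Boolean), Int)] = List(
    (1, false) -> 7,
    (10, false) -> 25,
    (11, false) -> 24,
    (11, true) -> 48
  )

  // Passes as long as the current behavior is preserved.
  def stillBehavesAsObserved: Boolean =
    observed.forall { case ((weight, express), expected) =>
      LegacyShipping.cost(weight, express) == expected
    }
}
```

In a real legacy codebase, the thing under test usually can’t be constructed this easily, and that’s where a mock can stand in for the tangled dependencies.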

That said, this isn’t an ideal state for a codebase to be in long-term. You should identify the path toward a healthy codebase, using mocks as one step along the way. Once they’ve served their purpose and your codebase is back to a good state, with sensible interfaces, contracts, and reusable code structures, plan for your new tests to use a clean style rather than continuing the magic.


We’ve seen how magic mocks in tests can produce brittle test code that adds risk to delivering your features on time and working correctly. To avoid them, we can use interfaces and write smarter test helpers, like test doubles or external dependency simulators. Mocks can still be the right tool to help rehabilitate an unhealthy codebase. Using them judiciously can help quickly add characterization tests to an undocumented codebase, but have a plan for how to stop using them down the line.

Happy testing!
