47 Degrees joins forces with Xebia read more

Adventures with Scala Collections

Adventures with Scala Collections

We’d like to cover some ideas and hopefully, provide some clarification, about the different options available for Scala Collections and which ones we recommend using. Keep in mind that recommendations may vary depending on your particular case. This technical decision is not just for Scala beginners, but experimented developers alike, as both can fall victim to bad practices when using collection structures to manage their data.

Scala Collections Overview

Let’s start with an introduction to the Scala Collections (scala.collection package) in the standard library. In general, two types exist:

  • Mutable: allows you to update, add, or remove elements of a collection as a side effect - located on the scala.collection.mutable package.
  • Immutable: never changes after the collection is created - located on the scala.collection.immutable package.

In most cases, the collections located on the root package scala.collection define common parent interfaces for some mutable and immutable collections. You can find more information in the official Scala docs.

Here are three informative diagrams representing all the collections in the three commented packages. These can also be found in the official Scala docs:

  • Collections in scala.collection package:

Collections in scala.collection package

  • Collections in scala.collection.mutable package:

Collections in scala.collection.mutable package

  • Collections in scala.collection.immutable package:

Collections in scala.collection.immutable package

(All three figures were generated by Matthias at decodified.com).

As you can see, there are many different traits and classes and we can’t cover them all at once. For our purposes today, we will focus in on the most common forms relating to Seq. We won’t cover Set and Map in this post, but the principle ideas and recommendations we do cover are valid and analogous for all of the above.

Seq, List or Vector, that is the question

In most projects, we have to deal with sequences of objects (User from the database, Shop Products, etc.). However, sometimes as developers, we don’t think about which one is a better fit for the topology of our algorithms, so we don’t see this question arise before development begins. We’d like to change that by putting the various options on the table to motivate people to discuss which option is the best course early on.

Let’s figure out whether Seq, List or Vector better match our needs, based on these four main points:

  • Mutability
  • Performance
  • Usability (Pattern matching)
  • Law Abiding

Seq

A base trait for sequences. There isn’t a lot that needs to be said so, let’s discuss our key points.

Seq - Mutability

The first thing to realize is that the default version you’ll be using if you don’t import anything special will be the Seq trait located on scala.collections.Seq. In fact, if you look at the code within the scala package object, we can find:

type Seq[+A] = scala.collection.Seq[A]
val Seq = scala.collection.Seq

Keeping this in mind, let’s put together an example where we’re designing our API and have a method that is receiving a Seq parameter:

def proccessSequence(coll: Seq[Int]) = ...

As you can imagine, our proccessSequence function can receive both Seq versions, the immutable one, scala.collection.immutable.Seq and also the mutable one, scala.collection.mutable.Seq.

Therefore, if we assume we’re defining an API that’s trying to preserve the immutability principle, we’d be incorrect because we’ll need to add the import scala.collection.immutable.Seq along all the Scala files to guarantee this immutability principle. I know, this looks like a boilerplate. That’s why we often assume that Seq is mutable by default.

In my humble opinion, that’s not a good thing. Even I would suggest skipping this section and going on to the next one. I’m kidding! This is an imperative feature, and you need to be aware of the default behavior. The responsibility lies on our shoulders as developers, and the import should be completed if we want to use the immutable version.

The reason for this behavior is explained in detail in the Effective Scala Docs:

The default Traversable, Iterable and Seq types in scope – defined in scala.package – are the scala.collection versions, as opposed to Map and Set – defined in Predef.scala – which are the scala.collection.immutable versions. This means that, for example, the default Seq type can be both the immutable and mutable implementations. Thus, if your method relies on a collection parameter being immutable, and you are using Traversable, Iterable or Seq, you must specifically require/import the immutable variant, otherwise someone may pass you the mutable version.

Finally, to wrap up this section, if you are a Java developer just starting with Scala, the Seq would be comparable to the java.util.List type in Java. So in general, let’s say that a Java interface is equivalent to the Scala traits, in this case, we have:

  • Scala: trait Seq[+A]
  • Java: public interface List<E>

It should be noted that Java doesn’t have any equivalent immutable collections.

Seq - Performance

From now on, we’ll take into account the official Scala docs to measure the collections’ performance. For the duration of this post, we’re going to use the same notation:

Time Complexity Description
C The operation takes (fast) constant time.
eC The operation takes effectively constant time, but this might depend on some assumptions such as maximum length of a vector or distribution of hash keys.
aC The operation takes amortized constant time. Some invocations of the operation might take longer, but if many operations are performed on average only constant time per operation is taken.
Log The operation takes time proportional to the logarithm of the collection size.
L The operation is linear meaning it takes time proportional to the collection size.

Most of the available operations are linear, meaning it will take time proportional to the collection size, L ~ O(n). O(n) is expressing time complexity, for instance, the append operation for Seq will take linear time. You can learn more about Time Complexity here. This means that if we have an infinite collection, some of the linear operations won’t terminate. In contrast, head and tail are efficient, especially the LinearSeq, which is a sub-trait of Seq.

  • head returns the first element of the LinearSeq collection, providing fast access to it.
  • tail returns all elements except the first. Behind the scenes, a new collection without the first element is being returned.

Other optimized and efficient operations are apply, length, and (if mutable) update operations, especially in the in the IndexedSeq, which is also a subtype of Seq. These methods are optimized and run faster under the assumption of fast random access with apply.

So, in general, we can say that Seq is inefficient by nature because it is considered as an infinite collection, and, added with O(n) time complexity, most operations won’t terminate. In other words, the use of Seq is not safe in every case because it’s not lazy like Stream is for example.

Also, it’s important to mention that, Seq has a few limitations which make it an inadequate collection for parallel programming.

### Seq - Usability

In this section we’ll cover how to pattern match our sequences, for example:

scala> :paste
// Entering paste mode (ctrl-D to finish)

def matchAndPlusOneSequence(s: Seq[Int]): Seq[Int] = s match {
  case Seq(x, xs @ _ *) => (x + 1) +: matchAndPlusOneSequence(xs)
  case _ => Seq.empty[Int]
}

// Exiting paste mode, now interpreting.

matchAndPlusOneSequence: (s: Seq[Int])Seq[Int]

scala> matchAndPlusOneSequence(Seq(1, 2, 3, 4, 5, 6))
res0: Seq[Int] = List(2, 3, 4, 5, 6, 7)

At this point, things look good given that we’ve defined some extractors.

Seq - Law Abiding

If we’re thinking in purely functional programming terms, it’s important to mention that Seq doesn’t obey every algebraic law. We might assume that the Seq is mutable, and in that case, it’s impossible to define lawful typeclass instances for mutable structures because reference identity matters. Scala functional libraries such as Scalaz and Cats, don’t have any instances for the Seq collections since they have so many inference problems, and algebraic laws are unmet.

In addition, I would like to add one of Rob Morris’ statements from this discussion, which fits in perfectly here:

Even if you stick with collection.immutable you find that implementations are very unconstrained; some may obey a given typeclass’s laws and others may not. For instance you could define Functor[immutable.Iterable], but it wouldn’t be lawful for Set or Map. I think this is a general problem with defining instances for non-final data types. Now, you might want to argue that any sensible immutable.Iterable/Seq will end up being an okay Foldable since this typeclass has very weak constraints.

Seq - Conclusions

As I’m sure you’ve figured out, Seq is mutable by default and due to the inefficient and significant features, it’s enough to lose trust in Seq.

List

A List is a class used for linked lists representing ordered collections of elements of a specific type. List is an abstract class (recall Seq is a trait), subtype of LinearSeq, as you can see in the previous images.

List - Mutability

Scala List is immutable by default; that’s the main difference in respect to Scala Seq.

If we, once again, look at the code within the scala package object, we can see that the default definition is the immutable version:

type List[+A] = scala.collection.immutable.List[A]
val List = scala.collection.immutable.List

For those of you coming from Java, ArrayList is not the equivalent implementation of Scala List, though it might be easy to think that. As we’ve pointed out previously, Java doesn’t have indistinguishable immutable collections. Therefore, the better match for ArrayList would be the scala.collection.mutable.ArrayBuffer. With that in mind, the Java LinkedList can be considered as the equivalent (and mutable version) of the Scala List collection.

List - Performance

Let’s look at the performance characteristics from the Scala Documentation, where the inmmutable Scala List operations is as follows:

Operation Time Complexity
head C ~ O(1)
tail C ~ O(1)
apply L ~ O(n)
update L ~ O(n)
prepend C ~ O(1)
append L ~ O(n)

Just a reminder, we’re using the same notation that’s used in the Scala Docs:

Time Complexity Description
C The operation takes (fast) constant time.
L The operation is linear meaning it takes time proportional to the collection size.

Apart from fetch the head and the tail operations, it’s important to notice that List is very efficient with the prepends (:: operator).

val mainList = List(3, 2, 1)
val with4 = 4 :: mainList    // re-uses mainList, costs one :: instance
val with42 = 42 :: mainList // also re-uses mainList, cost one :: instance
val shorter = mainList.tail    // costs nothing as it uses the same 2::1::Nil instances as mainList

(Code extracted from the Official API Scala Docs).

Another drawback that deserves a mention is that List doesn’t work with parallel algorithms. Typically, we cannot split a List into multiple segments, or concatenate it back, in an efficient manner. Therefore, this is another problem that might need to be considered before making a decision.

### List - Usability

In this case, the dummy code we’re going to use is similar to Seq example, but the difference is that we are going to use the :: operator, defined as a case class, characterized by a head and a tail for non-empty lists:

scala> :paste
// Entering paste mode (ctrl-D to finish)

def matchAndPlusOneList(list: List[Int]): List[Int] = list match {
  case x :: xs => x + 1 :: matchAndPlusOneList(xs)
  case _ => Nil
}

// Exiting paste mode, now interpreting.

matchAndPlusOneList: (list: List[Int])List[Int]

scala> matchAndPlusOneList(List(1, 2, 3, 4, 5, 6))
res1: List[Int] = List(2, 3, 4, 5, 6, 7)

List - Law Abiding

In both, Cats and Scalaz, we can find instances for the List collection. Both instances are lawful and meet the laws for Traverse, Monoidal, Monad and CoflatMap type classes.

List - Conclusions

Use List if all you need are fast prepends and the List will be used properly (see the pitfalls section). If this is not the case, please, continue reading.

Vector

Vector is a general-purpose, immutable data structure. It provides random access and updates in effectively constant time, as well as very fast append and prepend. Because vectors strike a good balance between fast random selections and fast random functional updates, they are currently the default implementation of immutable indexed sequences. It is backed by a little endian bit-mapped vector trie with a branching factor of 32. Locality is very good, but not contiguous, which is good for very large sequences.

Vector - Mutability

As the definition states, Vector structure is immutable. Let’s take a look at the scala package object definition:

type Vector[+A] = scala.collection.immutable.Vector[A]
val Vector = scala.collection.immutable.Vector

Nice and clean, there isn’t much to add, the Vector collection matches our immutability needs.

If we consider that we don’t have immutable collections in Java, ArrayList would be considered the equivalent structure for the Vector, mainly because of the way that it’s implemented.

Vector - Performance

Vector’s access performance is good across the board, leading the performance characteristics to look like this:

Operation Time Complexity
head eC ~ O(1)
tail eC ~ O(1)
apply eC ~ O(1)
update eC ~ O(1)
prepend eC ~ O(1)
append eC ~ O(1)

Comparatively, the head or tail operations will, in general, be slower than using List type, but not by much. You shouldn’t see that as an advantage towards using List. Taking that into account, this means that all operations will effectively take constant time, but this will depend on a few assumptions such as maximum length of a vector, or distribution of hash keys. Vector often reigns supreme concerning memory efficiency for larger sequences. In other words, using Vector only pays if the collection isn’t small. With collections up to a hundred elements, the overall cost might exceed the gain.

In regards to parallel algorithms, Vector controls the parallelism much better than List, thanks to the capacity to manage the data locality.

### Vector - Usability

So, how can we pattern match a Vector? We can use the same extractor used for the head and tail operations to deconstruct other sequences (this operator is actually located on scala.collection package), the example code looks like this:

scala> :paste
// Entering paste mode (ctrl-D to finish)

def matchAndPlusOneVector(v: Vector[Int]): Vector[Int] = v match {
  case x +: xs => (x + 1) +: matchAndPlusOneVector(xs)
  case _ => Vector[Int]()
}

// Exiting paste mode, now interpreting.

matchAndPlusOneVector: (v: Vector[Int])Vector[Int]

scala> matchAndPlusOneVector(Vector(1, 2, 3, 4, 5, 6))
res2: Vector[Int] = Vector(2, 3, 4, 5, 6, 7)

A few things should be mentioned:

  • In the concatenation we had to add parethesis (x + 1) +: .... It’s mandatory, given we would have left and right associative operators with same precedence + and they cannot be be mixed.
  • Vector[Int]() would be equivalent to Nil, even we could make this comparison Vector[Int]() == Nil, since there isn’t any constraint on the type level.

Vector - Law Abiding

As with List in the previous section, Vector collection has instances for both Cats and Scalaz libraries. Therefore, both instances in their functional libraries are lawful and meet the laws for Traverse, Monoidal, Monad and CoflatMap type classes.

Vector - Conclusions

It’s probably obvious that all of these arguments are pointing to a definite recommendation of breaking the habit of using List, and starting to use the Vector collection, as it’s a better choice in most cases.

The Big Picture

With everything in one place, this table could be helpful in determining the appropriate collection type for your project:

Feature/Operation Seq List Vector
Immutable  
Mutable(*)    
       
head C C eC
tail C C eC
apply L L eC
update L L eC
prepend C C eC
append L L eC
insert      
       
Easy Pattern Match
       
Law Abiding  
       
Java Equivalent java.util.List java.util.LinkedList java.util.ArrayList

(*) Mutable or potentially mutable

For the performance characteristics, each symbol used stands for:

  • C - The operation takes (fast) constant time.
  • eC - The operation takes effectively constant time, but this might depend on some assumptions such as maximum length of a vector or distribution of hash keys.
  • L - The operation is linear meaning it takes time proportional to the collection size.

Remember, you can read more about Performance Characteristics here.

Some Scala Collections Pitfalls

Even List collections are mutable

This might be confusing as this is a contradiction to what we’ve said in the previous sections, but let’s look at Paul Philip’s Gist for an explanation:

scala> :paste
// Entering paste mode (ctrl-D to finish)

import scala.collection.mutable.ListBuffer

class Ummutable[T] private (buf: ListBuffer[T]) {
  def this(xs: T*) = this(ListBuffer() ++= xs)
  def append(ys: T*): this.type = { buf ++= ys ; this }

  val xs: List[T] = buf.toIterable match { case xs: List[T] => xs }
  override def toString = s"Ummutable(xs = ${xs mkString ", "})"
}

// Exiting paste mode, now interpreting.

import scala.collection.mutable.ListBuffer
defined class Ummutable

scala> val um = new Ummutable(1, 2, 3)
um: Ummutable[Int] = Ummutable(xs = 1, 2, 3)

scala> um append (4, 5, 6)
res3: um.type = Ummutable(xs = 1, 2, 3, 4, 5, 6)

Interesting, isn’t it?

Some other well-explained Problems

All of these issues are explained in detail by Daniel Spiewak in one of his gists.

Alternatives

Scanning the Scala Standard Library, we can find a few alternatives that can be helpful depending on our needs. Here is a reference from the Scalaz library:

  • IList: safe, invariant alternative to stdlib List. Most methods on List have a sensible equivalent here, either on the IList interface itself or via typeclass instances (which are the same as those defined for stdlib List). All methods are total and stack-safe.
  • DList: difference lists, a data structure for O(1) append on lists. Based on Data.DList, a Haskell library by Don Stewart. A difference list is a function that given a list, returns the original contents of the difference list prepended at the given list. This structure supports O(1) append and snoc operations on lists, making it very useful for append-heavy uses, such as logging and pretty printing.
  • NonEmptyList: A singly-linked list that is guaranteed to be non-empty.
  • ListT: ListT monad transformer, useful because Monads do not compose.

Conclusion

To safeguard your code, you should make your variables and collection immutable, preventing uncontrolled side effects.

All things considered, utilizing Vector is the best choice unless you have specific needs where another solution is apparent. Business logic dictates that you should generally, return the most specific immutable types you can (List or Vector), this would be the same rule to consider for typeclasses. In addition to this, avoid both Seq and Iterable because it’s assumed that they’re both mutable (avoid all the interfaces if you will).

References

Ensure the success of your project

47 Degrees can work with you to help manage the risks of technology evolution, develop a team of top-tier engaged developers, improve productivity, lower maintenance cost, increase hardware utilization, and improve product quality; all while using the best technologies.