Adventures with Scala Collections
by Juan Pedro Moreno Estudillo
- •
- January 29, 2016
- •
- scala• collections• cats• scalaz• functional programming
- |
- 21 minutes to read.

We’d like to cover some ideas and hopefully, provide some clarification, about the different options available for Scala Collections and which ones we recommend using. Keep in mind that recommendations may vary depending on your particular case. This technical decision is not just for Scala beginners, but experimented developers alike, as both can fall victim to bad practices when using collection structures to manage their data.
Scala Collections Overview
Let’s start with an introduction to the Scala Collections (scala.collection package) in the standard library. In general, two types exist:
- Mutable: allows you to update, add, or remove elements of a collection as a side effect - located on the scala.collection.mutable package.
- Immutable: never changes after the collection is created - located on the scala.collection.immutable package.
In most cases, the collections located on the root package scala.collection define common parent interfaces for some mutable and immutable collections. You can find more information in the official Scala docs.
Here are three informative diagrams representing all the collections in the three commented packages. These can also be found in the official Scala docs:
- Collections in
scala.collection
package:
- Collections in
scala.collection.mutable
package:
- Collections in
scala.collection.immutable
package:
(All three figures were generated by Matthias at decodified.com).
As you can see, there are many different traits and classes and we can’t cover them all at once. For our purposes today, we will focus in on the most common forms relating to Seq
. We won’t cover Set
and Map
in this post, but the principle ideas and recommendations we do cover are valid and analogous for all of the above.
Seq, List or Vector, that is the question
In most projects, we have to deal with sequences of objects (User from the database, Shop Products, etc.). However, sometimes as developers, we don’t think about which one is a better fit for the topology of our algorithms, so we don’t see this question arise before development begins. We’d like to change that by putting the various options on the table to motivate people to discuss which option is the best course early on.
Let’s figure out whether Seq
, List
or Vector
better match our needs, based on these four main points:
- Mutability
- Performance
- Usability (Pattern matching)
- Law Abiding
Seq
A base trait for sequences. There isn’t a lot that needs to be said so, let’s discuss our key points.
Seq - Mutability
The first thing to realize is that the default version you’ll be using if you don’t import anything special will be the Seq trait
located on scala.collections.Seq
. In fact, if you look at the code within the scala package object, we can find:
type Seq[+A] = scala.collection.Seq[A]
val Seq = scala.collection.Seq
Keeping this in mind, let’s put together an example where we’re designing our API and have a method that is receiving a Seq
parameter:
def proccessSequence(coll: Seq[Int]) = ...
As you can imagine, our proccessSequence
function can receive both Seq
versions, the immutable one, scala.collection.immutable.Seq
and also the mutable one, scala.collection.mutable.Seq
.
Therefore, if we assume we’re defining an API that’s trying to preserve the immutability principle, we’d be incorrect because we’ll need to add the import scala.collection.immutable.Seq
along all the Scala files to guarantee this immutability principle. I know, this looks like a boilerplate. That’s why we often assume that Seq is mutable by default.
In my humble opinion, that’s not a good thing. Even I would suggest skipping this section and going on to the next one. I’m kidding! This is an imperative feature, and you need to be aware of the default behavior. The responsibility lies on our shoulders as developers, and the import should be completed if we want to use the immutable version.
The reason for this behavior is explained in detail in the Effective Scala Docs:
The default
Traversable
,Iterable
andSeq
types in scope – defined inscala.package
– are thescala.collection
versions, as opposed toMap
andSet
– defined inPredef.scala
– which are thescala.collection.immutable
versions. This means that, for example, the defaultSeq
type can be both the immutable and mutable implementations. Thus, if your method relies on a collection parameter being immutable, and you are usingTraversable
,Iterable
orSeq
, you must specifically require/import the immutable variant, otherwise someone may pass you the mutable version.
Finally, to wrap up this section, if you are a Java
developer just starting with Scala, the Seq
would be comparable to the java.util.List
type in Java
. So in general, let’s say that a Java interface is equivalent to the Scala traits, in this case, we have:
- Scala:
trait Seq[+A]
- Java:
public interface List<E>
It should be noted that Java doesn’t have any equivalent immutable collections.
Seq - Performance
From now on, we’ll take into account the official Scala docs to measure the collections’ performance. For the duration of this post, we’re going to use the same notation:
Time Complexity | Description |
---|---|
C | The operation takes (fast) constant time. |
eC | The operation takes effectively constant time, but this might depend on some assumptions such as maximum length of a vector or distribution of hash keys. |
aC | The operation takes amortized constant time. Some invocations of the operation might take longer, but if many operations are performed on average only constant time per operation is taken. |
Log | The operation takes time proportional to the logarithm of the collection size. |
L | The operation is linear meaning it takes time proportional to the collection size. |
Most of the available operations are linear, meaning it will take time proportional to the collection size, L ~ O(n). O(n)
is expressing time complexity, for instance, the append
operation for Seq
will take linear time. You can learn more about Time Complexity here. This means that if we have an infinite collection, some of the linear operations won’t terminate. In contrast, head
and tail
are efficient, especially the LinearSeq
, which is a sub-trait of Seq
.
head
returns the first element of theLinearSeq
collection, providing fast access to it.tail
returns all elements except the first. Behind the scenes, a new collection without the first element is being returned.
Other optimized and efficient operations are apply
, length
, and (if mutable) update
operations, especially in the in the IndexedSeq
, which is also a subtype of Seq
. These methods are optimized and run faster under the assumption of fast random access with apply
.
So, in general, we can say that Seq
is inefficient by nature because it is considered as an infinite collection, and, added with O(n) time complexity, most operations won’t terminate. In other words, the use of Seq
is not safe in every case because it’s not lazy like Stream
is for example.
Also, it’s important to mention that, Seq
has a few limitations which make it an inadequate collection for parallel programming.
### Seq - Usability
In this section we’ll cover how to pattern match our sequences, for example:
scala> :paste
// Entering paste mode (ctrl-D to finish)
def matchAndPlusOneSequence(s: Seq[Int]): Seq[Int] = s match {
case Seq(x, xs @ _ *) => (x + 1) +: matchAndPlusOneSequence(xs)
case _ => Seq.empty[Int]
}
// Exiting paste mode, now interpreting.
matchAndPlusOneSequence: (s: Seq[Int])Seq[Int]
scala> matchAndPlusOneSequence(Seq(1, 2, 3, 4, 5, 6))
res0: Seq[Int] = List(2, 3, 4, 5, 6, 7)
At this point, things look good given that we’ve defined some extractors.
Seq - Law Abiding
If we’re thinking in purely functional programming terms, it’s important to mention that Seq
doesn’t obey every algebraic law. We might assume that the Seq
is mutable, and in that case, it’s impossible to define lawful typeclass instances for mutable structures because reference identity matters. Scala functional libraries such as Scalaz and Cats, don’t have any instances for the Seq
collections since they have so many inference problems, and algebraic laws are unmet.
In addition, I would like to add one of Rob Morris’ statements from this discussion, which fits in perfectly here:
Even if you stick with
collection.immutable
you find that implementations are very unconstrained; some may obey a given typeclass’s laws and others may not. For instance you could defineFunctor[immutable.Iterable]
, but it wouldn’t be lawful forSet
orMap
. I think this is a general problem with defining instances for non-final data types. Now, you might want to argue that any sensibleimmutable.Iterable
/Seq
will end up being an okayFoldable
since this typeclass has very weak constraints.
Seq - Conclusions
As I’m sure you’ve figured out, Seq
is mutable by default and due to the inefficient and significant features, it’s enough to lose trust in Seq
.
List
A List
is a class used for linked lists representing ordered collections of elements of a specific type. List
is an abstract class (recall Seq
is a trait), subtype of LinearSeq
, as you can see in the previous images.
List - Mutability
Scala List
is immutable by default; that’s the main difference in respect to Scala Seq
.
If we, once again, look at the code within the scala package object, we can see that the default definition is the immutable version:
type List[+A] = scala.collection.immutable.List[A]
val List = scala.collection.immutable.List
For those of you coming from Java, ArrayList
is not the equivalent implementation of Scala List
, though it might be easy to think that. As we’ve pointed out previously, Java doesn’t have indistinguishable immutable collections. Therefore, the better match for ArrayList
would be the scala.collection.mutable.ArrayBuffer
. With that in mind, the Java LinkedList
can be considered as the equivalent (and mutable version) of the Scala List
collection.
List - Performance
Let’s look at the performance characteristics from the Scala Documentation, where the inmmutable Scala List
operations is as follows:
Operation | Time Complexity |
---|---|
head | C ~ O(1) |
tail | C ~ O(1) |
apply | L ~ O(n) |
update | L ~ O(n) |
prepend | C ~ O(1) |
append | L ~ O(n) |
Just a reminder, we’re using the same notation that’s used in the Scala Docs:
Time Complexity | Description |
---|---|
C | The operation takes (fast) constant time. |
L | The operation is linear meaning it takes time proportional to the collection size. |
Apart from fetch the head
and the tail
operations, it’s important to notice that List
is very efficient with the prepends (::
operator).
val mainList = List(3, 2, 1)
val with4 = 4 :: mainList // re-uses mainList, costs one :: instance
val with42 = 42 :: mainList // also re-uses mainList, cost one :: instance
val shorter = mainList.tail // costs nothing as it uses the same 2::1::Nil instances as mainList
(Code extracted from the Official API Scala Docs).
Another drawback that deserves a mention is that List
doesn’t work with parallel algorithms. Typically, we cannot split a List
into multiple segments, or concatenate it back, in an efficient manner. Therefore, this is another problem that might need to be considered before making a decision.
### List - Usability
In this case, the dummy code we’re going to use is similar to Seq
example, but the difference is that we are going to use the ::
operator, defined as a case class
, characterized by a head
and a tail
for non-empty lists:
scala> :paste
// Entering paste mode (ctrl-D to finish)
def matchAndPlusOneList(list: List[Int]): List[Int] = list match {
case x :: xs => x + 1 :: matchAndPlusOneList(xs)
case _ => Nil
}
// Exiting paste mode, now interpreting.
matchAndPlusOneList: (list: List[Int])List[Int]
scala> matchAndPlusOneList(List(1, 2, 3, 4, 5, 6))
res1: List[Int] = List(2, 3, 4, 5, 6, 7)
List - Law Abiding
In both, Cats and Scalaz, we can find instances for the List
collection. Both instances are lawful and meet the laws for Traverse
, Monoidal
, Monad
and CoflatMap
type classes.
List - Conclusions
Use List
if all you need are fast prepends and the List
will be used properly (see the pitfalls section). If this is not the case, please, continue reading.
Vector
Vector
is a general-purpose, immutable data structure. It provides random access and updates in effectively constant time, as well as very fast append and prepend. Because vectors strike a good balance between fast random selections and fast random functional updates, they are currently the default implementation of immutable indexed sequences. It is backed by a little endian bit-mapped vector trie with a branching factor of 32. Locality is very good, but not contiguous, which is good for very large sequences.
Vector - Mutability
As the definition states, Vector
structure is immutable. Let’s take a look at the scala package object definition:
type Vector[+A] = scala.collection.immutable.Vector[A]
val Vector = scala.collection.immutable.Vector
Nice and clean, there isn’t much to add, the Vector
collection matches our immutability needs.
If we consider that we don’t have immutable collections in Java
, ArrayList
would be considered the equivalent structure for the Vector
, mainly because of the way that it’s implemented.
Vector - Performance
Vector
’s access performance is good across the board, leading the performance characteristics to look like this:
Operation | Time Complexity |
---|---|
head | eC ~ O(1) |
tail | eC ~ O(1) |
apply | eC ~ O(1) |
update | eC ~ O(1) |
prepend | eC ~ O(1) |
append | eC ~ O(1) |
Comparatively, the head
or tail
operations will, in general, be slower than using List
type, but not by much. You shouldn’t see that as an advantage towards using List
. Taking that into account, this means that all operations will effectively take constant time, but this will depend on a few assumptions such as maximum length of a vector, or distribution of hash keys. Vector
often reigns supreme concerning memory efficiency for larger sequences. In other words, using Vector
only pays if the collection isn’t small. With collections up to a hundred elements, the overall cost might exceed the gain.
In regards to parallel algorithms, Vector
controls the parallelism much better than List
, thanks to the capacity to manage the data locality.
### Vector - Usability
So, how can we pattern match a Vector
? We can use the same extractor used for the head
and tail
operations to deconstruct other sequences (this operator is actually located on scala.collection
package), the example code looks like this:
scala> :paste
// Entering paste mode (ctrl-D to finish)
def matchAndPlusOneVector(v: Vector[Int]): Vector[Int] = v match {
case x +: xs => (x + 1) +: matchAndPlusOneVector(xs)
case _ => Vector[Int]()
}
// Exiting paste mode, now interpreting.
matchAndPlusOneVector: (v: Vector[Int])Vector[Int]
scala> matchAndPlusOneVector(Vector(1, 2, 3, 4, 5, 6))
res2: Vector[Int] = Vector(2, 3, 4, 5, 6, 7)
A few things should be mentioned:
- In the concatenation we had to add parethesis
(x + 1) +: ...
. It’s mandatory, given we would have left and right associative operators with same precedence+
and they cannot be be mixed. Vector[Int]()
would be equivalent toNil
, even we could make this comparisonVector[Int]() == Nil
, since there isn’t any constraint on the type level.
Vector - Law Abiding
As with List
in the previous section, Vector
collection has instances for both Cats and Scalaz libraries. Therefore, both instances in their functional libraries are lawful and meet the laws for Traverse
, Monoidal
, Monad
and CoflatMap
type classes.
Vector - Conclusions
It’s probably obvious that all of these arguments are pointing to a definite recommendation of breaking the habit of using List
, and starting to use the Vector
collection, as it’s a better choice in most cases.
The Big Picture
With everything in one place, this table could be helpful in determining the appropriate collection type for your project:
Feature/Operation | Seq | List | Vector |
---|---|---|---|
Immutable | ✓ | ✓ | |
Mutable(*) | ✓ | ||
head | C | C | eC |
tail | C | C | eC |
apply | L | L | eC |
update | L | L | eC |
prepend | C | C | eC |
append | L | L | eC |
insert | |||
Easy Pattern Match | ✓ | ✓ | ✓ |
Law Abiding | ✓ | ✓ | |
Java Equivalent | java.util.List |
java.util.LinkedList |
java.util.ArrayList |
(*) Mutable or potentially mutable
For the performance characteristics, each symbol used stands for:
C
- The operation takes (fast) constant time.eC
- The operation takes effectively constant time, but this might depend on some assumptions such as maximum length of a vector or distribution of hash keys.L
- The operation is linear meaning it takes time proportional to the collection size.
Remember, you can read more about Performance Characteristics here.
Some Scala Collections Pitfalls
Even List
collections are mutable
This might be confusing as this is a contradiction to what we’ve said in the previous sections, but let’s look at Paul Philip’s Gist for an explanation:
scala> :paste
// Entering paste mode (ctrl-D to finish)
import scala.collection.mutable.ListBuffer
class Ummutable[T] private (buf: ListBuffer[T]) {
def this(xs: T*) = this(ListBuffer() ++= xs)
def append(ys: T*): this.type = { buf ++= ys ; this }
val xs: List[T] = buf.toIterable match { case xs: List[T] => xs }
override def toString = s"Ummutable(xs = ${xs mkString ", "})"
}
// Exiting paste mode, now interpreting.
import scala.collection.mutable.ListBuffer
defined class Ummutable
scala> val um = new Ummutable(1, 2, 3)
um: Ummutable[Int] = Ummutable(xs = 1, 2, 3)
scala> um append (4, 5, 6)
res3: um.type = Ummutable(xs = 1, 2, 3, 4, 5, 6)
Interesting, isn’t it?
Some other well-explained Problems
All of these issues are explained in detail by Daniel Spiewak in one of his gists.
Alternatives
Scanning the Scala Standard Library, we can find a few alternatives that can be helpful depending on our needs. Here is a reference from the Scalaz library:
- IList: safe, invariant alternative to stdlib
List
. Most methods onList
have a sensible equivalent here, either on theIList
interface itself or via typeclass instances (which are the same as those defined for stdlibList
). All methods are total and stack-safe. - DList: difference lists, a data structure for
O(1)
append on lists. Based onData.DList
, a Haskell library by Don Stewart. A difference list is a function that given a list, returns the original contents of the difference list prepended at the given list. This structure supportsO(1)
append and snoc operations on lists, making it very useful for append-heavy uses, such as logging and pretty printing. - NonEmptyList: A singly-linked list that is guaranteed to be non-empty.
- ListT: ListT monad transformer, useful because Monads do not compose.
Conclusion
To safeguard your code, you should make your variables and collection immutable, preventing uncontrolled side effects.
All things considered, utilizing Vector
is the best choice unless you have specific needs where another solution is apparent. Business logic dictates that you should generally, return the most specific immutable types you can (List
or Vector
), this would be the same rule to consider for typeclasses. In addition to this, avoid both Seq
and Iterable
because it’s assumed that they’re both mutable (avoid all the interfaces if you will).
References
- Scala Documentation - https://docs.scala-lang.org/overviews/collections/introduction.html
- Scala Documentation about performance - https://docs.scala-lang.org/overviews/collections/performance-characteristics.html
- Cats Issue 277 Discussion - https://github.com/typelevel/cats/issues/277
- Scala API Docs - https://www.scala-lang.org/api/current/index.html