Variations for computing results from sequences in Scala
Introduction
A common question from students who are new to Scala is: What is the difference between using the map function on lists, using for expressions and foreach loops? One of the major sources of confusion with regard to this question is that a for expression in Scala in not the equivalent of for loops in languages like Python and Java — instead, the equivalent of for loops is foreach in Scala. This distinction highlights the importance of understanding what it means to return values versus relying on side-effects to perform certain computations. It also helps reinforce some points about fixed versus reassignable variables and immutable versus mutable data structures.
The task and its functional solution
To demonstrate this, let’s consider a simple task. Given a List of words, compute two lists: one has the lengths of each word and the second indicates whether a word starts with a capital letter or not. For example, start with the following list.
scala> val words = List("This", "is", "a", "list", "of", "English", "words", ".") words: List[java.lang.String] = List(This, is, a, list, of, English, words, .)
We can compute the two lists by mapping over the words list as follows.
scala> words.map(_.length) res0: List[Int] = List(4, 2, 1, 4, 2, 7, 5, 1) scala> words.map(_(0).isUpper) res1: List[Boolean] = List(true, false, false, false, false, true, false, false)
So, that’s it. However, let’s do this without using different calls to the map function (or multiple foreach loops, as we’ll see below). The easiest way to do this is to map each word to a tuple containing the length and the Boolean indicating whether its first character is capitalized; this produces a list of tuples, which we unzip to get a tuple of Lists.
scala> val (wlengthsMapUnzip, wcapsMapUnzip) = | words.map(word => (word.length, word(0).isUpper)).unzip wlengthsMapUnzip: List[Int] = List(4, 2, 1, 4, 2, 7, 5, 1) wcapsMapUnzip: List[Boolean] = List(true, false, false, false, false, true, false, false)
The key thing here is that the map function turns the List[String] words into a List[(Int, Boolean)] — which is to say it returns a value. We can assign that value to a variable, or use it immediately by calling unzip on it, which in turn returns a value that is a Tuple2(List[Int],List[Boolean]).
Before moving on let’s define a simple function to display the results of performing this computation, which we will do in various ways (and which all produce precisely the same results).
def display (intro: String, wlengths: List[Int], wcaps: List[Boolean]) { println(intro) println("Lengths: " + wlengths.mkString(" ")) println("Caps: " + wcaps.mkString(" ")) println }
Calling this function with the result of mapping and unzipping as above, we get the following output.
scala> display("Using map and unzip.", wlengthsMapUnzip, wcapsMapUnzip) Using map and unzip. Lengths: 4 2 1 4 2 7 5 1 Caps: true false false false false true false false
Okay, so now let’s start doing it the hard way. Rather than mapping over the original list, we’ll loop over the list with foreach, and perform a side-effect computation that builds up the two result sequences. This is the sort of thing that is typically done in non-functional languages with for loops, hence the use of foreach in Scala. We’ll explore each of these in turn.
The second variation: use reassignable, immutable Lists
We can use reassignable variables which are initialized to be empty Lists, and then prepend to them as we loop through the words list. We are thus using a variable that has the type of List, which is an immutable sequence data structure, but its value is being reassigned each time we pass through the loop.
var wlengthsReassign = List[Int]() var wcapsReassign = List[Boolean]() words.foreach { word => wlengthsReassign = word.length :: wlengthsReassign wcapsReassign = word(0).isUpper :: wcapsReassign } display("Using reassignable lists.", wlengthsReassign.reverse, wcapsReassign.reverse)
Note that we build up the lists by prepending, which means they come out of the loop in reverse order and thus must be reversed before being displayed. You can of course append to a List by creating a singleton List and concatenating the two Lists with the ::: operator.
scala> val foo = List(4,2) foo: List[Int] = List(4, 2) scala> foo ::: List(7) res0: List[Int] = List(4, 2, 7)
However, this is not recommended because it is computationally costly. Adding an element to the front (left) of a List is a constant time operation, whereas concatenating two lists requires time proportional to the length of the first list. That might not seem like a big deal until you are dealing with lists with thousands of elements, and then you’ll find that the same bit of code that prepends many times and then reverses is much faster than one which appends using the above strategy.
Third variation: use unreassignable, mutable (growable) ListBuffers
Next, we can use a ListBuffer, which is a mutable sequence data structure that also happens to support constant time append operations. We can thus declare it as a val, and then use the method append to mutate the sequence so that it has a new element at the end. So, the variables referring to the sequences are not reassignable, but their values are mutable.
import collection.mutable.ListBuffer val wlengthsBuffer = ListBuffer[Int]() val wcapsBuffer = ListBuffer[Boolean]() words.foreach { word => wlengthsBuffer.append(word.length) wcapsBuffer.append(word(0).isUpper) } display("Using mutable ListBuffer.", wlengthsBuffer.toList, wcapsBuffer.toList)
Note that we must convert the ListBuffers to Lists for the call to display in order to have the right types as arguments to that function.
Since they can efficiently grow (i.e., get longer), ListBuffers are a good choice for many problems where we need to accumulate a set of results, and especially when we don’t know how many results we will be accumulating. However, if you know the number of results you’ll be accumulating it’s probably better to use Arrays, as shown next.
Fourth variation: use unreassignable, mutable (but fixed length) Arrays
Both of the above alternatives probably look a little strange to people coming from Java. In Java, you’d be more likely to do an imperative solution that involves initializing arrays that have the same length as words and then filling in respective indices as appropriate. To do this, use Array.fill(lengthOfArray)(initialValue).
val wlengthsArray = Array.fill(words.length)(0) val wcapsArray = Array.fill(words.length)(false) words.indices.foreach { index => wlengthsArray(index) = words(index).length wcapsArray(index) = words(index)(0).isUpper } display("Using iteration and arrays.", wlengthsArray.toList, wcapsArray.toList)
We go through the indices and for each one compute the value and assign it to the appropriate index in the corresponding Array. Again, we need to convert the results to Lists before calling display. The indices method does exactly what you’d expect — it gives you the indices of the List.
scala> words.indices res2: scala.collection.immutable.Range = Range(0, 1, 2, 3, 4, 5, 6, 7)
A problem with the above foreach loop is that it requires indexing into Lists, which is generally a bad idea. Why? Because to get the i-th item from a list requires time proportional to i operations. Why? Because the implementation for obtaining an item at a particular index i involves peeling off the head of the list to get its tail, and then seeking for the i-1-th item of the tail, which requires peeling off its head and then seeking for the i-2-th item, and so on. So, if you want to get the 10000th item in a list, you have to perform 10,000 operations to get it. If the words list had 10,000 elements, you can now see that you’d perform 10,000 basic computations just on the foreach, and for each element you do 2*index operations to get the word at that index, which means doing 20,000 operations on the last index alone.
Note that indexing into Arrays is a constant time operation, so there is no problem with the left hand side of the assignments in the above loop.
You might think you can do better by first storing the word and then using it twice, e.g.
words.indices.foreach { index => val word = words(index) wlengthsArray(index) = word.length wcapsArray(index) = word(0).isUpper }
This is better, but it only saves us half the operations. Since we were perfectly happy to loop over the words themselves before, we actually shouldn’t have to do this look up — we can do better by having a reassignable counter index that allows us to set values to the correct positions in the new Arrays we are creating.
val wlengthsArray2 = Array.fill(words.length)(0) val wcapsArray2 = Array.fill(words.length)(false) var index = 0 words.foreach { word => wlengthsArray2(index) = word.length wcapsArray2(index) = word(0).isUpper index += 1 }
Since this sort of pattern is fairly common, Scala provides a handy method on sequences called zipWithIndex which returns a List of the original elements paired with their indices.
scala> words.zipWithIndex res3: List[(java.lang.String, Int)] = List((This,0), (is,1), (a,2), (list,3), (of,4), (English,5), (words,6), (.,7))
In this way, we can have the foreach loop over such pairs. It is convenient in these cases to use the pattern matching abilities in foreach loops by using the case match on pairs, as below.
val wlengthsArray3 = Array.fill(words.length)(0) val wcapsArray3 = Array.fill(words.length)(false) words.zipWithIndex.foreach { case(word,index) => wlengthsArray3(index) = word.length wcapsArray3(index) = word(0).isUpper }
It’s important to understand the cost of the operations you are using, especially in looping contexts where you are inherently doing the same basic operation multiple times.
Indexed sequences (Vectors)
It is worth pointing out that when you want an immutable sequence that allows efficient indexing, you should use Vectors.
scala> val bar = Vector(1,2,3) bar: scala.collection.immutable.Vector[Int] = Vector(1, 2, 3)
If you have a List in hand but want to index into it repeatedly, you can convert it to a Vector using toIndexedSeq.
scala> val numbers = List(4,9,9,2,3,8) numbers: List[Int] = List(4, 9, 9, 2, 3, 8) scala> numbers.toIndexedSeq res5: scala.collection.immutable.IndexedSeq[Int] = Vector(4, 9, 9, 2, 3, 8)
IndexedSeq is a supertype of sequences which are designed to be efficient for indexing, and Vector is the default “backing” implementation when you call toIndexedSeq on a List.
Of course, if you are only ever going over all the elements of a sequence in order, then Lists are likely to be preferable since they have a bit less overhead and they have some nice properties for pattern matching in match statements.
Using predefined funtions for mapping over a sequence
Another thing worth pointing out is that if you have a predefined function, you can pass that as the argument to map, which can lead to very concise code for this task. Assume you have defined a function that takes a String and produces a Tuple of its argument’s length and whether it starts with an upper-case letter.
def getLengthAndUpper = (word: String) => (word.length, word(0).isUpper)
The code for mapping over words with this function to get our desired lists is then very clean.
val (wlengthsFunction, wcapsFunction) = words.map(getLengthAndUpper).unzip
Of course, you would probably only do this if you needed that same function in other places. If not, it’s preferable to just use the anonymous function like in the first map example in this blog post. However, you can see that if you have a library of simple functions like this, you can now start writing much clearer and simpler code by reusing them when mapping over different lists.
For expressions
Notice that the previous loops were all foreach ones, whereas Java programmers and Pythonistas will be used to for loops. Scala doesn’t have for loops — it has for expressions. A common question then is: What’s the difference? What is a for expression for and why isn’t it a for loop? The difference is that an expression returns a value, so while foreach allows you to plow through a sequence and do some operation to each element, a for expression allows you to return a value for each element. Consider the following, in which we yield the square of each integer in a List[Int].
scala> val numbers = List(4,9,9,2,3,8) numbers: List[Int] = List(4, 9, 9, 2, 3, 8) scala> for (num <- numbers) yield num*num res6: List[Int] = List(16, 81, 81, 4, 9, 64)
We get a result, whereas a foreach loop just does the computation and returns nothing.
scala> numbers.foreach { num => num * num }
The key is that we yield a value for each element in the for expression. In this case, it is basically equivalent to using map. Here it is in the context of the running words example.
val (wlengthsFor, wcapsFor) = (for (word <- words) yield (word.length, word(0).isUpper)).unzip display("Using a for expression.", wlengthsFor, wcapsFor)
Having said all this, it turns out you can use a for expression as a loop, without returning any values, e.g. as follows.
scala> for (num <- numbers) { println(num*num) } 16 81 81 4 9 64
I think it is generally better to use a foreach loop for such cases so that it is clear that you are only performing side-effects, like printing, reassigning the values of var variables, or modifying mutable data structures. However, there are some cases where a for expression can be more convenient, for example when working through multiple lists and doing various filtering operations. Here’s a quick example to give a flavor of this. Given two lists, we can enumerate the cross product of all their elements
scala> val numbers = List(4,9,9,2,3,8) numbers: List[Int] = List(4, 9, 9, 2, 3, 8) scala> val letters = List('a','C','f','d','z') letters: List[Char] = List(a, C, f, d, z) scala> for (n <- numbers; l <- letters) print("(" + n + "," + l + ") ") (4,a) (4,C) (4,f) (4,d) (4,z) (9,a) (9,C) (9,f) (9,d) (9,z) (9,a) (9,C) (9,f) (9,d) (9,z) (2,a) (2,C) (2,f) (2,d) (2,z) (3,a) (3,C) (3,f) (3,d) (3,z) (8,a) (8,C) (8,f) (8,d) (8,z)
You can filter on these values as well to restrict the output to just some reduced set of elements of inter(est.
scala> for (n <- numbers; if (n>4); l <- letters) print("(" + n + "," + l + ") ") (9,a) (9,C) (9,f) (9,d) (9,z) (9,a) (9,C) (9,f) (9,d) (9,z) (8,a) (8,C) (8,f) (8,d) (8,z)
There is much more to this, but I’ll leave it here since using for expressions in this way is a rich enough topic for several blog posts in and of itself. Also, there is a detailed discussion of it in Odersky, Spoon, and Venner’s book “Programming in Scala.”
Fifth variation: use a recursive function
It’s worth pointing out one other way of building up lengths and caps lists. Recursive functions are functions which look at their input and then either return a result for a base case or compute a result and then call themself with that result. It’s pretty standard stuff that computer scientists love and which tends to get used a lot more in functional programming than in imperative programming. Here, I’ll show how to do the same task done before using recursion, but without an in-depth explanation, so either you’ll already know how to do recursion and you can see it in Scala for the same problem context as above, or you don’t know much about recursion but can use this as an example of how it is employed for a task you already understand from the vantages given above. So, in the later case, hopefully it will be useful in conjunction with other tutorials on recursion.
First, we need to define the recursive function, given below. It has three parameters: one for the list of words, one for the already computed lengths and another for the already computed caps. It returns a pair that has first the list of lengths with one additional item prepended to it and then the list of caps values with one additional item prepended to it. The items being prepended are computed from the head of the inputWords list.
def lengthCapRecursive( inputWords: List[String], lengths: List[Int], caps: List[Boolean]): (List[Int], List[Boolean]) = inputWords match { case Nil => (lengths, caps) case head :: tail => lengthCapRecursive(tail, head.length :: lengths, head(0).isUpper :: caps) }
We can call this function directly, but it is often convenient to provide a secondary function that makes the initial call to this function with empty result lists as the second and third parameters. The secondary function can then perform the reversal and return the desired computed lists.
def lengthCapRecursive(inputWords: List[String]): (List[Int], List[Boolean]) = { val (l,c) = lengthCapRecursive(words, List[Int](), List[Boolean]()) (l.reverse, c.reverse) }
Getting the result is then just a matter of calling that function with our words list.
val (wlengthsRecursive, wcapsRecursive) = lengthCapRecursive(words) display("Using a recursive function.", wlengthsRecursive, wcapsRecursive)
A slight variation on this that is slightly cleaner is to “hide” the recursive function inside the secondary function, which then effectively acts as a wrapper to the recursive function. This is often considered cleaner because the programmer can ensure that the initialization is done correctly and that the recursive function itself isn’t given malformed inputs.
def lengthCapRecurWrap(inputWords: List[String]): (List[Int], List[Boolean]) = { // This function is hidden from code that doesn't def lengthCapRecurHelp( inputWords: List[String], lengths: List[Int], caps: List[Boolean]): (List[Int], List[Boolean]) = inputWords match { case Nil => (lengths, caps) case head :: tail => lengthCapRecurHelp(tail, head.length :: lengths, head(0).isUpper :: caps) } val (l,c) = lengthCapRecursive(words, List[Int](), List[Boolean]()) (l.reverse, c.reverse) } val (wlengthsRecurWrap, wcapsRecurWrap) = lengthCapRecurWrap(words) display("Using a recursive function contained in a wrapper.", wlengthsRecurWrap, wcapsRecurWrap)
Conclusion
So, that provides an overview of different ways of obtaining the same results and some explanation of the different properties of each solution in terms of computational considerations that are likely to crop up in your code and you should be aware of.
Clearly there are many ways of getting the same thing done in Scala. This can be hard for newcomers to the language since they don’t have good intuitions about which approach is better in different circumstances, but it is quite valuable to have these options as you become more savvy and understand what the costs and benefits of using different data structures and different ways of iterating are.
All of the code from the above snippets are gathered together in the Github gist ListComputations.scala. You can save it as a file and run it as “scala ListComputations.scala“ to see the output and play around with modifications to the code.
Reference: Variations for computing results from sequences in Scala from our JCG partner Jason Baldridge at the Bcomposes blog.
@google-9d8cf9bb04c5f36a0cc9e7e64d6597ba:disqus That’s a pointer back to this post. Can you provide the link to the results?