Back

TechnologyApr 04, 2017

Mastering Scala: Folding

Micah Jones

Scala is rooted in functional programming, which emphasizes immutability: the inability of a variable or object to change its state. Immutability has many benefits to program design: thread safety, functions without side effects, and encouraging concise compartmentalization. However, it does come with a significant shift in mindset if you’re accustomed to writing imperative code with variable reassignment.

Loop structures such as “while” and “for” are particularly common in imperative languages, and they are regularly used to iteratively update the value of a variable. In fact, the classical for loop construction (e.g., for (int i = 0; i < N; i++)) requires variable mutability because it is incrementing an index for each iteration. Scala does support mutable variables, but only begrudgingly; it prefers that we write in a more functional, immutable way.

Fortunately, because Scala excels at working with collections, it provides a number of methods and syntactic features to help us perform iterative operations. We have previously discussed map and flatMap, which translate each element of a collection into a new element in a new collection. A similar method called foreach can perform an operation for each element in a list, and it is effectively similar to map except it does not return a value. Scala even includes an extremely powerful for control structure that can be used in many interesting situations. Yet these operations do share a weakness: the principle of immutability requires each iteration to operate in isolation from all the others. That is, they cannot repeatedly update an object over multiple iterations.

FOLDING

Enter folding. Folding (also called reducing or accumulating) is a mainstay of functional programming, a concise mechanism for generating a single value from a collection of data. It is, in effect, an implicitly recursive operation: We begin with a base value, which is then provided to a function along with the first element of a list, then the output is fed to another instance of that function along with the second element of the list, and so on until we reach the end of the list. This is similar to a map operation in that we are applying a function to each element in a collection, but instead of building a new collection, we are accumulating information from all of the collection’s elements in order to obtain a single, aggregated result.

For example, consider the common problem of finding the sum of all elements in a list. In an imperative language like Java, the solution might be as follows:

int sum = 0; for (int i = 0; i < list.length; i++) { sum = sum + list.get(i); }

In a purely functional setting, we can’t write code like that, because it involves variable reassignment. However, we can look at the core logic to get an idea of what we want to do: For each number in the list, we want to add it to a running total and then return the result after the final addition has occurred.

When writing functional code that iteratively updates a variable (in this case, a sum), we must use recursion. Every instance of a recursive function gets its own copy of the variable, allowing us to update it while preserving immutability. For example, in Scala code:

def sum(total:Int, numbers:List[Int]): Int = if (numbers.isEmpty) total else sum(total+numbers.head, numbers.tail) scala> sum(0, List(4,8,15,16,23,42)) res: Int = 108

This code is properly functional, but it is also bulky for such a simple operation. Fortunately, we have a shorthand option in the form of folding.

We will now use the Scala function foldLeft to collapse our summation logic into one line:

scala> val list = List(4,8,15,16,23,42) scala> list.foldLeft(0) { (sum,element) => sum+element } res: Int = 108

There are two arguments to foldLeft, separated into two parameter sequences: the base starting value of the accumulator (0 in this case), and the function to be executed for each list element. The current accumulator value is submitted as the first argument to the function, and the next element in the list is submitted as the second argument. After each function execution, the result is submitted as the accumulator value to the next iteration. Once the entire list has been traversed, the final accumulator result is returned as the result of the entire folding operation.

Alternately, we can use underscore notation to make this even shorter (each successive underscore represents the next argument to the anonymous function):

scala> list.foldLeft(0) (_+_) res: Int = 108

Here we have used foldLeft, which iterates over the collection from left to right. Alternately, there is foldRight, which iterates from right to left, and regular fold, which iterates in a non-deterministic order.

For situations in which we want to do folding, but we just want to use the first element of the list as our initial accumulator instead of providing one separately, we can use reduceLeft:

scala> list.reduceLeft(_+_) res: Int = 108

AN EXAMPLE

We will now consider a more involved real-world problem. On one of our projects, we found ourselves loading customer information from a database, in which a single customer could have multiple accounts, and each account could have multiple addresses. All of this information was in a single non-normalized table, so the same customer could have several rows in the table, one for each unique (account, address) pair. After loading the table rows into Scala, we wanted to create a unique Customer object for each customer, containing a Map from accounts to addresses.

We could solve this problem using the code below, with map and groupBy. Sample data is provided for example purposes.

scala> case class CustomerRow( // Generated from the database customerId:Int, accountId:Int, address:String) scala> case class Customer( // Desired final data type customerId:Int, accountMap:Map[Int,List[String]]) scala> val customerRows = List( CustomerRow(0,100,"32 Gotham Ln"), CustomerRow(0,100,"45 Metropolis St"), CustomerRow(0,101,"962 Star City Rt"), CustomerRow(1,200,"99 Themyscira Cir")) scala> customerRows.groupBy(_.customerId).map { case (customerId:Int,rows:List[CustomerRow]) => val accountMap = rows.groupBy(_.accountId).map { case (accountId:Int, accountRows:List[CustomerRow]) => (accountId,accountRows.map(_.address)) }.toMap Customer(customerId, accountMap) }.toList res: List[Customer] = List( Customer(1,Map(200 -> List(99 Themyscira Cir))), Customer(0,Map(101 -> List(962 Star City Rt), 100 -> List(32 Gotham Ln, 45 Metropolis St))))

The solution above works, but it is a bit messy and difficult to follow. At the outer level we are grouping rows by customerId and within that we are grouping rows by accountId to generate accountId->List[address] mappings. Each such mapping is used to generate a Customer object for the current customerId. This approach requires multiple nested mapping functions and list traversals.

Below is an alternative solution that uses foldLeft. In this approach we generate a two-level mapping of customerId->(accountId->List[address]), and then do a straightforward conversion to a list of Customer objects. In our folding logic, we begin with an empty, explicitly typed Map, and then iterate through the list of customerRows, updating the Map as we go. For each nesting level of the Map, we use getOrElse to either retrieve an existing entry or generate a new one, adding new information to that entry.

scala> (customerRows.foldLeft(Map[Int,Map[Int,List[String]]]()) { (custMap, row) => val accountMap = custMap.getOrElse(row.customerId, Map()) val accountList = accountMap.getOrElse(row.accountId, List()) custMap + (row.customerId -> (accountMap + (row.accountId -> (accountList :+ row.address)))) } map { case (customer, accountMap) => Customer(customer,accountMap) }).toList res: List[Customer] = List( Customer(0,Map(100 -> List(32 Gotham Ln, 45 Metropolis St), 101 -> List(962 Star City Rt))), Customer(1,Map(200 -> List(99 Themyscira Cir))))

Here we don’t need to do any grouping ourselves, because the Map we’re building does that for us! This allows us to get away with only a single loop structure and much more intuitive logic. In fact, the approach used here is probably similar to one we might use in a typical imperative language.

conclusion

Folding allows us to iteratively update variables without resorting to either the sin of mutability or the bulkiness of explicit recursion. Though it is a powerful tool, it does come with a learning curve. However, when properly understood, folding can be used to solve complex data analysis problems with robust, straightforward code.