Monday, July 26, 2010

Translating Code from Python and Scheme to Clojure

When coming to Clojure from another language, it takes a while before you start "thinking in Clojure". While ramping up, it helps to understand how to solve a problem in a language you're already familiar with, and then translate the code in some methodical fashion into Clojure.

This article will look at a simple function, remove-first, and look at how you would implement that function in Python and Scheme, and then how to methodically transform those implementations into Clojure. In all of these implementations, I'm going to ignore ways to write the function using shortcuts provided by the standard library, and focus on the implementations using standard iteration and/or recursive techniques. This will provide the clearest example of how the translation process works and can generalize to other types of functions.

Problem Statement: remove-first takes an item and a collection, and returns a new collection which is identical to the original, except the first instance of item (if any) has been removed from the collection. If item is not in the collection, the new collection should be identical to the original (since nothing needs to be removed).

First, let's look at the Python implementation. Python's primary collection data structure is called a "list", but this is a bit of a misnomer, because in most languages, the term "list" is used to describe some sort of linked or doubly-linked list. Python's list is nothing of the sort. Python's list allows fast (destructive) insertion and removal at the back end of the list, and fast lookup by index. In most languages, this would be called an extensible array or extensible vector (in Java, it's called an ArrayList).

The problem statement for remove-first calls for returning a new copy of the collection with the first instance of item removed. Is this really idiomatic for Python? It is certainly possible to write remove-first as a destructive function that actually modifies the original collection by removing the first instance of item. In fact, such a destructive method is built-in to the list class (list.remove(item)). But removing from the middle of a Python list is not an especially efficient operation, and Python has a culture of slices, comprehensions, and many other list operations that return fresh copies. So yes, I think it is reasonable to talk about how to write a non-destructive removal in Python.

Now remember, for the purposes of this article, we're trying to look at how to translate iterative algorithms, so it's a cheat to use built-in constructs that work around this.

So this doesn't count, because it uses the built-in destructive removal:
def removeFirst(itemToRemove, coll):
newColl = coll[:] # copies the collection
try:
newColl.remove(itemToRemove) # throws an error if item is not present
return newColl
except:
return newColl


Nor does this, although this is arguably the most idiomatic Python version of removeFirst, because it uses the built in index function which handles the iteration behind the scenes:
def removeFirst(itemToRemove, coll):
try:
i = coll.index(itemToRemove)
return coll[:i] + coll[i+1:]
except:
return coll[:]


In Python, the standard practice for writing such a function in an iterative fashion is to create a new list, and then iterate over the items of the initial list, adding the appropriate ones to the new list, and then returning the new list at the end. This pattern of setting up an accumulator, and then using a for loop to add things to the accumulator, is a common one in Python. For this particular problem, in English you might say, "I'm going to go through the items, adding them one at a time to the new collection. If I hit one that matches the item to remove, I skip it and add the rest of the items directly to the new collection."

But is it better to iterate directly over the items, or is it better to iterate over the indices and access the items through the indices? When possible, it's preferred to iterate directly over the items, but unfortunately, Python has no good way to express "the rest of the items" once you hit an item that matches the one to remove. You can only get at "the rest of the items" if you know the index in order to take a slice from there to the end. So iterating through items versus iterating through indices yield slightly different strategies for this particular problem.

If you really wanted to iterate directly over the items, you'd probably need to use a flag to track whether you've already removed the first occurrence of the item:

def removeFirst(itemToRemove, coll):
newCollection = []
alreadyRemovedItem = False
for item in coll:
if (alreadyRemovedItem):
# We've already removed the first instance of item
# so we're just in "copy" mode
newCollection.append(item)
else:
# We need to test whether the item matches itemToRemove
if (item == itemToRemove):
# don't copy this item over, but set alreadyRemovedItem flag
alreadyRemovedItem = True
else:
newCollection.append(item)
return newCollection


which could be refactored by reorganizing the if/else branches into the shorter:

def removeFirst(itemToRemove, coll):
newCollection = []
alreadyRemovedItem = False
for item in coll:
if (alreadyRemovedItem or item != itemToRemove):
newCollection.append(item)
else:
alreadyRemovedItem = True
return newCollection


Alternatively, if you iterate through the indices, then you can use a list slice to capture the notion of "the rest of the list".

def removeFirst(itemToRemove, coll):
newCollection = []
for index in range(len(coll)):
item = coll[index]
if (item == itemToRemove):
# skip this item and rather than add the rest of the items
# one by one, we can just add the rest to newCollection
# all in one step
newCollection.extend(coll[index+1:])
# We're done now
return newCollection
else:
newCollection.append(item)
return newCollection


Admittedly, this last version is a bit of a cheat by my own definition of avoiding built-ins that bypass iteration, since that's effectively what extend is doing. But I don't mind it as much since we at least had to iterate until we found the item to remove, so it sufficiently demonstrates iteration techniques.

I think either of these versions (iterating through items or iterating through indices) are good examples of common patterns that occur in Python, and are worth knowing how to translate to Clojure.

Let's begin by translating the version that iterates through items.

Clojure has a built-in datastructure that corresponds very closely to Python's lists. In Clojure, it is called a vector. Like Python lists, it allows fast access to an element by index, and allows fast insertion and removal at the back end of the collection. The main difference is that a Clojure vector, like all built-in Clojure data structures, is persistent and immutable. That means that any operation on a Clojure vector returns some sort of fresh copy -- the original remains unchanged. At first, this might sound wildly inefficient, but altered Clojure vectors can share a lot of internal structure with their source, precisely because of this guarantee of immutability, so it's actually pretty fast.

But this requires a different way of thinking about algorithms. We can't just create a new empty vector and destructively add things to it and return it, as we do with Python. (Well, of course, since Clojure sits on top of Java, you can easily use Clojure to create a Java ArrayList and use exactly the same pattern as Python, but you're here to learn "the Clojure way", right?)

Perhaps the most naive way to translate the code is to simulate a mutable vector in Clojure by wrapping a vector in some sort of mutable reference type (e.g., an atom). You'd need to do the same thing to the already-removed-flag. The @ sign is then used to look at the current contents of the mutable reference. This is very bad form for this kind of algorithm, and not particularly efficient, but it works, and has an almost exact parallel to the Python code:

; This is bad style, don't do this!!!
(defn remove-first [item-to-remove coll]
(let [new-collection (atom []),
already-removed-item (atom false)]
; doseq is just like Python's for loop
(doseq [item coll]
(if (or @already-removed-item (not= item item-to-remove))
(swap! new-collection conj item) ; just like append
(reset! already-removed-item true))) ; sets flag to true
@new-collection))



The better way to tackle this translation is to closely analyze the original, and identify any accumulator or variable that changes as you loop, and then thread those through the loop. I'll explain shortly what I mean by threading values through the loop, but there's one other issue that needs to be considered. You've already seen Clojure's doseq construct in action as a counterpart to Python's for loop, but it's only relevant for triggering destructive or side-effect-filled actions -- it's not the right tool for the job when trying to build up a persistent Clojure vector. Clojure has a for construct but it's not a general looping construct, rather, it corresponds to Python's list and generator comprehensions.

Clojure really only has one general-purpose looping construct, known as loop/recur. In time, you'll be able to see how to go directly from a Python-style for loop to a Clojure loop/recur, but initially, it's far easier to see how to go from a Python-style while loop to a Clojure loop/recur. So as an intermediary step in translating our Python code to Clojure, let's begin by rewriting the Python for loop into a while loop. Here is one way to do that rewrite:

def removeFirst(itemToRemove, coll):
newCollection = []
alreadyRemovedItem = False
collIterator = iter(coll)
while(True):
try:
item = collIterator.next()
if (alreadyRemovedItem or item != itemToRemove):
newCollection.append(item)
else:
alreadyRemovedItem = True
except:
# next() triggers an error when the end of the list is reached
# so we're done
return newCollection


The advantage of rewriting as a while loop is that it helps us identify a couple of very important things. First, it let's us see the exit condition. The exit condition occurs when you've reached the end of the list, at which point newCollection holds the answer and can be returned. Second, it makes it a tad easier to analyze which things from outside the loop are being updated while inside the loop. The assignment to item is just creating a local variable within a given iteration of the loop, so that's not really the kind of thing we're looking for. But we should take note that newCollection, initialized outside the loop, is destructively extended within the loop, and the flag alreadyRemovedItem can also change from within the loop. Furthermore, it's now quite obvious that something needs to track the iteration through coll; this is done by collIterator which is updated each time through the while loop by the call to its next() method. So collIterator, newCollection and alreadyRemovedItem are the things we're going to need to thread through our Clojure loop/recur structure.

Once these things have been identified, we're ready to tackle the Clojure translation. Iterators in Clojure work quite differently than in Python. Almost any collection in Clojure can be converted to a "seq" (short for sequence) by calling a function called, you guessed it, seq. For the moment, go ahead and think of it as an iterator. The iterator will be nil when the collection is exhausted. You call first to get the item the iterator is pointing at, and next to (non-destructively) advance the iterator.

(defn remove-first [item-to-remove coll]
; Start a loop, identifying and initializing the things that will change in the loop
(loop [coll-iterator (seq coll),
new-collection [],
already-removed-item false]
; coll-iterator is nil when you're at the end of the collection.
; Clojure treats all non-nil, non-false values as true.
(if coll-iterator
; We haven't reached the end of the collection
(let [item (first coll-iterator)]
(if (or already-removed-item (not= item item-to-remove))
; update new-collection and advance coll-iterator. We do this using recur
; to jump back to loop, rebinding coll-iterator to (next coll-iterator),
; new-collection to (conj new-collection item), and
; leaving already-removed-item unchanged
(recur (next coll-iterator) (conj new-collection item) already-removed-item)

; else, advance iterator, leave new-collection unchanged,
; and set already-removed-item to true
(recur (next coll-iterator) new-collection true)))
; We have reached the end of collection, so new-collection is the answer
new-collection)))


Note that nothing destructive is happening here; it's all mutation-free. (next coll-iterator) actually returns a new iterator object, (conj new-collection item) actually creates a new vector with item appended to the end. Clojure makes these operations cheap, and recur lets us pass the new objects back to the top of the loop and reuse the names given in the loop construct.

Now I'll let you in on a little secret. Rather than thinking of (seq coll) as returning an iterator, you can think of it as returning a linked-list-style view of the collection. Better yet, virtually all sequential-style functions in Clojure call seq implicitly, so that means, for all practical purposes, you can pretend that any Clojure collection is a linked list. Lists are able to answer three important questions: are you empty, what is your first element, and what are the rest of your elements? These correspond to empty?, first, and rest in Clojure. So just by thinking of our collection as a list, we have an extremely powerful way to iterate through it using recursion. We can rewrite the above Clojure code with this in mind:

(defn remove-first [item-to-remove coll]
(loop [coll coll, ; no explicit call to seq is needed,
; we can reuse the name coll for clarity
new-collection [],
already-removed-item false]
(if (empty? coll)
new-collection
(let [item (first coll)]
(if (or already-removed-item (not= item item-to-remove))
(recur (rest coll) (conj new-collection item) already-removed-item)
(recur (rest coll) new-collection true))))))


This concludes our mechanical translation of the iterate-through-items Python version. It may look odd to you if you've never seen this kind of looping before. Some programmers actively prefer this style of looping because it makes it abundantly clear which things are changing each time through the loop, and how, and precisely what the exit condition is and what value you exit with. In other words, it's arguably more explicit and easier to analyze a loop/recur structure than a for loop that's mucking around with mutable objects located outside the loop. There's truth to this, but I also sympathize with those who find loop/recur to be less readable. For loops nest well and allow for some pretty intricate control flow with judicious use of continue and break; complex nested for loops like that can be hard to analyze, but the equivalent loop/recur can be even worse. Nevertheless, loop/recur is the way general looping is done in Clojure, so for now, we'll just accept it along with its advantages and disadvantages and move on.

Now it's time to look at the iterate-through-indices version. Again, we begin by converting the Python for loop to a Python while loop.

def removeFirst(itemToRemove, coll):
newCollection = []
index = 0
while (True):
if (index == len(coll)):
# We're done now
return newCollection
else:
item = coll[index]
if (item == itemToRemove):
# Add the rest of the elements all at once and we're done.
newCollection.extend(coll[index+1:])
return newCollection
else:
newCollection.append(item)
index += 1


This time, we see the things that change while looping are newCollection and index. The above code can now be mechanically translated to:

(defn remove-first [item-to-remove coll]
(loop [new-collection [],
index 0]
(if (= index (count coll))
new-collection
(let [item (coll index)]
(if (= item item-to-remove)
; into is like Python's extend, subvec is like Python's slice
(into new-collection (subvec coll (inc index)))
(recur (conj new-collection item) (inc index)))))))


Notice how in Clojure code, we don't need to explicitly call return, return is implied.

In Python we had two basic strategies, iterate by items and iterate by indices, and both had an analog in Clojure. But remember, the reason why we needed two strategies in Python was that there was no way to capture the "add the rest of the items to the collection" concept in the version that iterated through items, so we were forced to choose between iterating through items and use a flag to go into "copy items until reaching the end of the list"-mode, or use indices so we could take a slice.

But Clojure offers us a way to fuse these two strategies together, because its basic iteration mechanism (the linked-list-style view) DOES make it extremely easy to work with the notion of "the rest of the items" without any index manipulation or slicing.

Fusing the two Python strategies in Clojure, we get this:

(defn remove-first [item-to-remove coll]
(loop [coll coll,
new-collection []]
(if (empty? coll)
new-collection
(let [item (first coll)]
(if (= item item-to-remove)
; Extend new-collection with rest of coll and return in one step
(into new-collection (rest coll))
(recur (rest coll) (conj new-collection item)))))))


There's one downside to this implementation, namely, we're paying a performance penalty for gradually building up this vector by extending an immutable object one item at a time, when we don't care about and will never use the intervening steps between the empty vector and the final new collection. It's not a huge penalty, but it's real. Fortunately, Clojure offers a "recipe" to convert such functions. Upon entering the loop, you initialize the thing you're building to a transient (somewhat more mutable) vector. Then you append to it using conj! rather than conj, and finally you convert it back to immutable at the end using persistent!. The result looks like this:

(defn remove-first [item-to-remove coll]
(loop [coll coll,
new-collection (transient [])]
(if (empty? coll)
(persistent! new-collection)
(let [item (first coll)]
(if (= item item-to-remove)
(into (persistent! new-collection) (rest coll))
(recur (rest coll) (conj! new-collection item)))))))


Note that the overall shape of the code has not changed, we've just added a few annotations that improve performance. But remember this step is optional, the previous version is perfectly fine for most purposes.

Although we were mainly trying to mimic the Python code, it's worth noting that Clojure code is highly polymorphic. Whereas the Python code only takes Python lists (and maybe some of the above versions would take strings), this Clojure algorithm will work on any collection (arrays, lists, lazy lists, vectors, sets, maps, strings, etc.) because all have that wonderful property of being viewable as lists. However, the returned collection is specifically a vector, no matter the input, so depending on the context, the polymorphism of the input may have limited utility.

Now it's time to move on to Scheme. We'll look at a standard Scheme implementation of remove-first, and see how to translate that into Clojure. These examples have been tested in the Racket dialect of Scheme.

In Scheme, the most basic, native collection type is the linked list. The three fundamental operations on a Scheme list are empty?, first, and rest (sound familiar?) and you can also non-destructively add an item to the front of the list with (cons item list). Adding to the back of a list is a slow operation, so the strategy of building up a new collection by adding to the back is not a particularly desirable one in Scheme. Instead, the strategy is to use recursion, essentially allowing the call stack to build the sequence of items that need to be added, eventually, to the front of an empty list. It sounds confusing, but once you have your head wrapped around recursion, it all makes perfect sense (and if you don't understand recursion, head directly to htdp.org).

In any case, a typical Scheme implementation looks like this:
(define (remove-first item-to-remove coll)
(if (empty? coll)
empty
(let ([item (first coll)])
(if (equal? item item-to-remove)
(rest coll)
(cons (first coll) (remove-first item-to-remove (rest coll)))))))


Clojure also has a list collection, and the translation is about as straightforward as it could possibly be, requiring only a couple syntactic changes:

(defn remove-first [item-to-remove coll]
(if (empty? coll)
() ; literal name for empty list
(let [item (first coll)]
(if (= item item-to-remove)
(rest coll)
(cons (first coll) (remove-first item-to-remove (rest coll)))))))



But there's a catch. Because Scheme relies on this kind of programming style, Scheme implementations are designed in such a way so that the stack has rather huge limits, basically being limited by your overall memory rather than some specific call-stack memory limitation. In other words, if the resulting list is small enough to fit in your computer's memory, than in all likelihood the call stack necessary to process it with recursion will fit as well. So call stack limitations are mostly a non-issue in Scheme.

However, Clojure is limited to Java stack limits, so this style of writing will definitely place a limit on the size of collection that can be processed by this function. Fortunately, there is a rather simple solution. Clojure offers seamless interoperation between lazy lists and regular lists. Lazy lists solve the stack problem by avoiding the recursive step, returning immediately with a list-like object that can be probed on-demand for first and rest information by the consumer. Further elements will be computed as needed, and will be driven by the consumer's looping process.

This is accomplished by wrapping a call to lazy-seq around some part of the computation. There are at least three reasonable places to place the call to lazy-seq. You can put lazy-seq around the full body of the function.
(defn remove-first [item-to-remove coll]
(lazy-seq (if (empty? coll)
()
(let [item (first coll)]
(if (= item item-to-remove)
(rest coll)
(cons (first coll) (remove-first item-to-remove (rest coll))))))))



You can place it around the cons.
(defn remove-first [item-to-remove coll]
(if (empty? coll)
()
(let [item (first coll)]
(if (= item item-to-remove)
(rest coll)
(lazy-seq (cons (first coll) (remove-first item-to-remove (rest coll))))))))


You can place it around the recursive call to remove-first.
(defn remove-first [item-to-remove coll]
(if (empty? coll)
() ; literal name for empty list
(let [item (first coll)]
(if (= item item-to-remove)
(rest coll)
(cons (first coll) (lazy-seq (remove-first item-to-remove (rest coll))))))))


Each choice results in slightly different laziness behavior, i.e., when various elements are computed, but the overall semantics of the sequence remains the same and stack overflows will be avoided. Placing the lazy-seq around the recursive function call will cause remove-first to compute the first element right away, and delay the rest. Placing the lazy-seq around the full body will prevent any computation until it is asked for by a consumer. Placing the lazy-seq around the cons results in in immediate behavior for the nil and removable-item-at-front-of-list case, and delayed behavior otherwise.

All are acceptable choices, but preferences vary. Probably placing lazy-seq around the full body is the most common style you'll see in Clojure, although I tend to place it where the laziness is actually required (like around the recursive call, or around the cons).

Converting remove-first so that it returns lazy lists definitely generates some additional overhead than a strict list. However, this overhead pays for itself if you ever end up using just part of the list, because no time is spent generating the parts you don't need. There's something very refreshing, freeing, and efficient about writing functions that find the first object with some particular property by using the strategy of taking the first item from the list of ALL objects with that particular property, knowing full well that the complete list will never be generated. Lazy lists can be used as an alternative to many traditional control structures (the above example of taking the first item from a lazy-list of objects satisfying a given description is an elegant substitute for something that would require a for-loop-break iteration in a traditional language). Generally speaking, lazy lists are more useful than strict lists, and for that reason, lazy lists are the norm in Clojure rather than the exception.

Since Clojure offers both vectors (similar to Python's lists) and lists (similar to Scheme's lists), we have seen that it is possible to convert both algorithmic styles into Clojure. With Python, the main thing that needed to be dealt with was adapting the algorithm to build an immutable, rather than a mutable, vector. This was further complicated by the fact that Clojure's general loop/recur looping construct doesn't exactly match up with Python's for loop construct and a conversion process is needed. But the final result captured the spirit of the Python code well. Coming from Scheme was easier, and the only real modification that was necessary was to output a lazy list rather than a strict list. Both versions can take any collection as an input, but one produces vectors and the other produces lazy lists. Both implementations are legitimate choices, depending on the desired usage.