Thursday, January 8, 2009

Laziness in Clojure – Traps, workarounds, and experimental hacks

The power of laziness

Consider the following snippet of Clojure code, which is similar to code found in several Clojure blogs, and code found in Clojure's contributed code base:

(def whole-numbers (iterate inc 0))

This defines whole-numbers to a lazy sequence of all the numbers 0, 1, 2, …. Laziness is a powerful concept, and it allows us to simulate an infinite sequence. As you need more numbers, more are generated.

So for example, let's say you want to find the first whole number that satisfies a given predicate function (let's call it pred). In an imperative language, you might do something like this:


i = 0
while not pred(i):
i += 1
return i


On the other hand, with laziness, you can do something like this:

(first (filter pred whole-numbers))

Gotcha!

But there is a subtle problem with the definition of whole-numbers. Do you see it? I'll give you a hint: (iterate inc 0) does in fact produce the lazy sequence of whole numbers.

The problem is that we gave a name to it. Huh? How could that make a difference? Yes, as surprising as it may seem, (first (filter pred (iterate inc 0))) works just fine, whereas (first (filter pred whole-numbers)) is likely to run slowly, and may even crash your program at runtime for certain inputs.

How is this possible? How could giving something a name break your code? That's just crazy, right?

Well, this pitfall stems from a design decision in Clojure that all lazy sequences are cached. In other words, as Clojure expands the sequence, it automatically caches the values in a linked list, so that they never have to be computed again. The next time you traverse the sequence, you are just seeing the values that were previously cached.

Caching has a number of benefits. If your sequence is very computation-intensive, and you plan to traverse it multiple times, then caching can give a huge performance boost. If your sequence represents a traversal of some sort of non-persistent data structure, caching is essential to ensure that repeat calls to first and rest always yield the same result.

But Clojure caches all lazy sequences, which creates a number of traps for the unwary. The “unnamed” version of the code works because the garbage collector collects the cached cells as it goes. (Even though it is somewhat wasteful to cache all these cells and immediately throw them away, Java's collection of short-lived garbage is very fast, and it's not as much of a performance hit as you might expect). However, when you give a name to the whole-numbers, the garbage collector can't do any collection. As you traverse the whole-numbers sequence, you'll have a huge performance hit as massive gobs of memory are allocated. And eventually, you'll go too far, run out of memory, and your program crashes.

To see this in action, go compare the following on your Clojure setup:


(nth (iterate inc 0) 20000000)
(nth whole-numbers 20000000)
;uses the above def of whole-numbers


On my machine, the first example completes in just a few seconds. The other crashes.

Sometimes cached lazy sequences are very useful, but in this case, caching just gets in the way. No one would ever want to cache the whole-numbers sequence. It's significantly faster to generate the sequence via incrementing every time you need to traverse the sequence, than it is to cache and hold the values. Furthermore, because the sequence represents an infinite sequence, there's no upper limit to the memory consumption. If you use a named version of the whole numbers in your program, there's a good chance that eventually, your program will break with certain large enough inputs.

Workarounds

Programming is all about abstracting out commonalities in code, so it's rather frustrating that you can't give a name to the whole-numbers sequence and need to type (iterate inc 0) everywhere. This is a fairly short definition, but you can imagine how with a more complex sequence, giving a name might really be essential.

Fortunately, there is a workaround. Instead of giving the whole-numbers sequence a name, you give a name to a function that knows how to produce the sequence. In other words,

(defn whole-numbers [] (iterate inc 0))

Now, you have to change all your uses of whole-numbers to a function call as well:

(nth (whole-numbers) 20000000) ;This works!

The reason this works is that every time you call this function, it produces a brand new sequence of whole-numbers, which is unnamed, so the garbage collector can collect as it goes.

So when you're writing code in Clojure, every time you create a lazy sequence, you need to ask yourself two questions:

1. What will happen to my code if the entire sequence becomes realized in memory? Is the sequence too big to be accommodated?
2. Is it cheaper to generate the items in the sequence from scratch each time, than to allocate the memory necessary to cache the items?

If the answer to either of these questions is yes, then you should avoid naming your sequence, or wrap it in a function.

If you plan to program in Clojure, you need to be aware of this pitfall, and you must know the workaround. From what I can tell from the various blogs and posts on Clojure's google group, many new users are falling into this trap and naming potentially large sequences, creating fragile code with a very real danger of failure.

Not Satisfied

There are several reasons why I find this workaround unsatisfying.

1. It imposes a signifcant cognitive burden because it not only affects the way you name a sequence (by wrapping it in a function), but it also affects every place you use the sequence. In fact, once you have a library filled with a combination of regular lazy sequences, and some of these function-wrapped lazy sequences, then every time you use one of your sequences, you have to remember which kind it is in order to call it correctly.
2. It's not perfect. Although wrapping a sequence in a function prevents any global var from pointing at the sequence, it is still possible for some function that manipulates the sequence to accidentally hold onto a reference to some portion of the sequence, causing a memory crash. For example, it was recently pointed out on the google group that when filtering a large sequence (even if the sequence is unnamed), the filter function, as it is filtering, holds onto a portion of the sequence as it scans ahead to find the next element for the filtered sequence. If the elements that pass the filtering test are spread out too far, the program crashes. Several people looked closely at the code, and couldn't figure out why it was crashing. Eventually, Rich Hickey, the designer of the language pointed out what was going on. He plans to write a more complicated version of filter in a future version of Clojure that will avoid this particular problem, but that's not really the point. The concern here is that even knowing the function-wrapping trick, cached lazy sequences represent a certain kind of danger that is difficult to isolate and understand. When several Clojure programmers have trouble finding the source of a memory crash in a one-line piece of code, you can imagine how difficult it will be on a large body of code.
3. If you know for sure in advance that you'd rather not have the sequence be cached, there is currently no way to express that in Clojure. This workaround doesn't actually suppress caching, it just makes it so the garbage collector can throw away the cached values right away. You're still incurring a (small, but measurable) performance penalty from the unnecessary caching.
4. Even if the workaround worked reliably, this is a pretty big “gotcha” for newcomers to the language.

Is there another way?

A few weeks ago, I posted about this topic on the Clojure google group, questioning the wisdom of this design decision to have all lazy sequences cache their values. I argued that it would be a cleaner design if lazy sequences did not cache their values, unless the programmer explicitly asked for the sequences to be cached. Basically, my argument revolved around two main points. First, it's easier to convert an uncached sequence to a cached sequence than vice versa. Second, if you forget to cache something that should be cached, it's merely a performance optimization problem, but if you forget to uncache something that needs to be cached, your program can crash. So, it's safer if the language defaults to uncached.

Rich Hickey, the designer of Clojure, responded by saying that he had already experimented with uncached lazy sequences, and that such a choice causes a different set of problems with performance problems and code breaking – problems which he found to be even more common than the ones I've raised here. He encouraged me to try my own experiments, and report back. The rest of this blog post goes into more details about the nature of my experiments since that thread.

Four categories of sequences

Before taking the plunge and doing some experimenting, my first step was to analyze the various use cases of lazy sequences that need to be dealt with. I found that lazy sequences fall into one of four categories:
1. A sequence where the function to generate the sequence is very fast to compute, and the function is a “pure” function in the sense that it always generates the same output. For this use case, you never want to cache, and it becomes especially important if the sequence is large.
2. A sequence that will only be traversed once. If you really know for sure in advance that you will only be traversing it once, then you're better off not caching. Again, this is especially important if the sequence is large.
3. A sequence where it is slow to compute successive elements, and you'll possibly need to do this more than once. Caching in this case is important for performance.
4. A sequence where the function to generate successive elements is not guaranteed to return the same values. This can come up with Java interop, with I/O, and other sequence views of a non-persistent data structure. Caching is essential here to impose a sane, consistent view on the data.

RH seemed especially concerned about how category 4 sequences break without caching. Now I'll be the first to admit that I haven't written a huge body of work in Clojure, but looking through my code from the past few weeks, I discovered that I didn't have any category 4 sequences in my code. First, I don't tend to deal with Java interop; I write entirely using the core Clojure data structures, which are all immutable. So any lazy sequence I write is guaranteed to be consistent, even without caching. I also don't do much with I/O. I tend to just write functions that I can use interactively from the REPL.

Now if I were going to create a lazy sequence from an ephemeral source, I would almost certainly be using one of Clojure's built-in functions that do this conversion for you, such as line-seq, resultset-seq, file-seq, iterator-seq, enumeration-seq, etc., rather than using lazy-cons directly. So as long as those functions return a cached sequence, I pretty much don't have to worry about category 4. Furthermore, RH has said that he is working on a completely new abstraction (tentatively called streams), that (as I understand it) is a better fit for I/O and other ephemeral sources than the sequence abstraction. I speculate that once he has developed this new abstraction, the concerns about category 4 sequences will largely go away. People will generally write “streams” over ephemeral sources, and then convert them to cached lazy sequences with a one-time call to stream-seq. So as long as stream-seq builds a lazy sequence, category 4 is well supported, and we can analyze the relative merits of cached vs. uncached sequences for categories 1-3 separately.

Since category 4 sequences aren't really present in my own code base, my experiments mainly revolve around trying to discover what feels natural for categories 1-3.

Experiment #1 – Totally Uncached

Since RH experimented with uncached laziness in a previous version of Clojure, the code for an uncached lazy sequence builder is right there in the existing Clojure codebase, but the corresponding macro (lazy-seq as opposed to lazy-cons), has been commented out. So for my first experiment, I wanted to look at what it would feel like to code in an environment where everything in Clojure built from lazy-cons (except the ones that represent ephemeral “streams”) is uncached.

To do this, I created a namespace called uncached, in which I copied over most of Clojure's core constructs that use lazy-cons (but not file-seq, enumerator-seq, etc.). Within this namespace, I modified lazy-cons to create an uncached LazySeq rather than LazyCons. In other words,


(defmacro lazy-cons [first-expr rest-expr]
(list 'new 'clojure.lang.LazySeq
(list `fn (list [] first-expr) (list [(gensym)] rest-expr))))


So anything I import from this namespace will build an uncached sequence. To make the experiment as extreme as possible, I then went to every bit of Clojure code I've written (which I admit isn't that much, but hey, experiments have to start somewhere), and excluded all these functions from the core, using my uncached imported functions instead.

I found that by using uncached sequences, my code felt a little zippier from a performance standpoint. I found that the vast majority of the sequences that I construct are only used for one pass. Furthermore, most of my sequences are very long (possibly infinite), and can be generated quickly, so the uncached behavior was a great default fit for me.

One interesting example from my code is a function I wrote that produces a sequence of all the permutations of a given collection. Now generating the next permutation in the sequence is not exactly a trivial operation. So caching does in fact speed things up if you're going to traverse the permutation sequence multiple times. However, what I discovered is that there's such a huge time hit from allocating the memory for caching purposes the first time through, that you'd have to traverse the permutation sequence at least 20 times to begin to make up for the time lost from caching. Even so, caching becomes completely impractical once you hit permutations of 10+ items, so I've concluded that a permutations sequence should just be uncached.

Now at one point, I applied a filter to my permutation sequence, to extract permutations that had a certain property. This filtered sequence is something that did in fact make sense to cache, provided I intended to use it more than once Fortunately, the Clojure api already includes a function called cache-seq which does exactly that. I found it very easy to get the caching behavior I wanted for this specific case – at the point where I defined the filtered sequence, I wrapped it in a call to cache-seq. Alternatively, I could have called vec on the sequence to explicitly realize the sequence.

(def fs (cache-seq (filter pred (permutations (range 10)))))

So, at least in my own code, the default of not caching sequences worked rather well. There was one instance where I needed to cache the sequence, and it was easy to accomplish that. But again, I need to admit that I've only written a small amount of Clojure code (probably no more than 2kloc). So I can't claim this proves anything. I'm providing my simple uncached library so that others can also try this very interesting experiment.

Experiment #2 – Take your pick

If we assume that defaulting to uncached isn't right for everyone, there's still the open question of what it would be like to program in a version of Clojure that offers a choice between cached and uncached versions of lazy-cons and its core sequence functions. To explore this option, I made use of the same “uncached” library, but rather than excluding the core functions and overriding them with the uncached versions, I just “require”d my uncached library so that both versions of functions were available to me. So I could call lazy-cons or uncached/lazy-cons, map or uncached/map, for or uncached/for.

One really nice aspect of Clojure's design is the way that anything that adheres to the sequence interface works just fine with all the sequence functions. So the amazing thing is that whether you choose to build a cached or uncached sequence, it makes not one bit of difference to the consumers of your sequence. So once you make your decision as to whether a sequence should be cached or uncached, you can basically just forget about it and everything works seamlessly as you pass that sequence around.

Despite that, at first it felt like a burden to have to constantly think about whether I needed a cached or uncached version of a sequence. But then again, I had already been doing similar analysis to avoid getting burned by a memory crash from caching, so really it wasn't much different than before. The main difference was that now I could really specify that I wanted something uncached, rather than using the function-wrapping workaround. Consuming the two types of sequences was now equally easy, and I got a slight performance boost as well.

I also noticed something rather interesting in the patterns of when I tended to call cached vs. uncached versions of the core functions.

For some of the functions, I was always calling the uncached versions, namely cycle, repeat, replicate, interleave, interpose, take, take-while, butlast, concat, and lazy-cat. And as I think about it further, I honestly can't think of any time you'd want a cached version of these functions. Remember that if your underlying sequences that you are operating on are cached, these functions will be equally persistent, so it's really a question of how time-consuming their operations are, and these have very little overhead. For this reason, I believe that, even if Clojure makes no other changes in its approach to laziness, it would be a simple, non-breaking, but significant improvement to change the above core functions to internally use lazy-seq as opposed to lazy-cons.

On the other hand, I found that distinct, filter, remove, and drop-while were the most likely to need to be cached.

If everything cleanly fell into the category of either definitely needing to be cached, or definitely needing to not be cached, things would be simple. Alas, that is not the case. For things like map, for, drop, and take-nth, it all totally depends on how complex the functions are (or how big the n is).

So for those functions, it is very useful to be able to choose cached or uncached. But this begs the question of what will happen when other programmers start creating sequence-producing functions. In some cases they'll be able to make an executive decision in advance as to whether the resulting sequence is cached or uncached. But what about the cases where the consumer will need to be able to make a choice. Do we expect the programmer to provide both a cached and an uncached version?

Contrast this with experiment #1, in which lazy-cons always produces uncached sequences. With such behavior, the programmer of a new sequence-producing function just uses (uncached) lazy-cons – the consumer knows it will be uncached, and can easily turn it into cached at point of naming, if necessary.

Summarizing Experiment #2, I'll say that I really liked having added control, and the ability to select cached or uncached sequences, but I just can't imagine how people will easily write libraries that provide both options.

Experiment #3 – Intelligent auto-selection

Since some of the sequences should clearly be cached, and some clearly not cached, it would be ideal if the borderline cases could be chosen intelligently by the language in ordrer to completely remove the cognitive burden of constantly having to choose. At first, I thought maybe a scheme would work in which the cached/uncached behavior of the lazy-cons depends on the nature of the thing you're consing onto. But this isn't really useful. The desired cached/uncached behavior depends more on the complexity of the delayed function. After some experimentation, I feel that it is not possible to automate the decision. So this experiment was definitely a failure.

Experiment #4 – Uncaching a cached sequence

The major problem with Experiment #2 is that it forces library writers to supply two flavors of their sequence generating functions, which is impractical. So I tried to get really clever. For this experiment, I went back to the standard lazy-cons behavior, i.e., caching by default. But then, I tried to write a macro that would suppress caching for any sequence built with lazy-cons. I did this by setting up a global *cached* var that has a root binding of true. Lazy-cons does whatever behavior the *cached* var is set to. A special uncached macro binds the var to false. Like this:


(def *cached* true)

(defmacro lazy-cons
[first-expr rest-expr]
(list 'if '*cached*
(list 'new 'clojure.lang.LazyCons
(list `fn (list [] first-expr)
(list [(gensym)] rest-expr)))
(list 'new 'clojure.lang.LazySeq
(list `fn (list [] (list 'binding ['*cached* 'false]
first-expr))
(list [(gensym)] (list 'binding ['*cached* 'false]
rest-expr))))))

(defmacro uncached [& rst]
`(binding [*cached* false]
~@rst))


This basically works, in the sense that you can say something like (uncached (map * (iterate inc 0) (iterate inc 1)))) and the uncached macro affects all the calls to lazy-cons within map and iterate, so you've forced this thing to be uncached “all the way down”. But the way my macro works, uncached sequences become extremely slow. Because bindings aren't captured by the closures, the instruction to rebind *cached* has to be threaded through the delayed closures. This noticeably hinders the performance of uncached sequences. If you flipped it around and made uncached the default, then cached sequences would suffer the performance hit. Is there a better way to write this macro? If not, I must decree this experiment to be a failure.

Conclusions

I find Clojure's current cache-by-default-with-no-option-for-uncached laziness to be unsatisfying. I genuinely hope there is a better solution, and I want to help find it. Clearly sequences generated from ephemeral stream-like entitities must always be cached. But dealing with the other types of sequences, I find that my experiments with uncached-by-default-with-option-for-cached laziness turned out to be quite pleasant. This may very well be a function of my own programming niche, so I've provided a simple uncached library so others can try to replicate this experiment with their own code. If more people report success with uncached-by-default, maybe a stronger case can be made for change.

My other experiments were less successful, although I learned quite a bit from trying them, which is why I reported on those experiments as well. Most importantly, I gained a deeper understanding of what types of functions tend to produce sequences that need to be cached and which ones tend to produce sequences that should be uncached. This suggests that, at a minimum, some of the core library functions would benefit from being changed to produce uncached sequences.

Perhaps someone else will see a way to turn one of these approaches into something workable, or provide an entirely new solution.