Tuesday, April 21, 2009

ADTs in Clojure

Today on the Clojure mailing list, a user asked:
"Is the concept of Abstract Data Types useful in Clojure?
If yes, how would you implement one?"

This is an important question because coding to interfaces and abstractions, rather than concrete data, is a key aspect to writing scalable programs.

My favorite treatment of ADTs is in Concepts, Techniques, and Models of Programming by Peter Van Roy and Seif Haridi. This book shows all the variations of ADTs: stateful and declarative (aka functional), bundled and unbundled, open and secure. I'm going to walk through a few of these variations in Clojure.

Let's consider a classic stack ADT. Since Clojure primarily emphasizes functional programming, I'm going to focus on functional implementations of this ADT, i.e., versions in which pushing and popping a stack are non-destructive and actually return a new stack.

Open unbundled



Now the whole point of an ADT is to separate the implementation from the interface, but for starters, let's implement a stack ADT using an underlying implementation of a list. In an "unbundled" implementation, the stack data is completely separate from the functions that operate on it. It is "open" because we make no attempt to hide the implementation details of the stack. We are simply going to trust the programmer to use our interface and ignore those details.


(defn stack-new [] nil)
(defn stack-push [s e] (cons e s))
(defn stack-top [s] (first s))
(defn stack-pop [s] (rest s))
(defn stack-empty? [s] (empty? s))



This open, unbundled implementation is probably the simplest way to code an ADT in Clojure. Because it is the simplest, this is probably the most common kind of coding you'll see in Clojure, and thus many would argue it is the most idiomatic. But just because it's the easiest to code doesn't mean it's the best.

The most obvious problem with an open implementation is that it is a leaky abstraction. In other words, client code can easily see that this stack is implemented as a list. And it would be all too easy to forget that this implementation is supposed to be hidden, and call a list function on our stack (e.g., count).

I have seen several comments on the clojure mailing list suggesting that it is inevitable that any ADT written in Clojure will be inherently leaky. But that is not the case...

Secure unbundled



The idea is to wrap the data structure in something that effectively locks it with a unique key, so it can only be unwrapped with the correct key. The only functions that know the key are the ones that serve as the interface for the ADT, so nothing else can inspect or tamper with the contents of the data structure. For optimum security you'd need to use cryptographic techniques, but for illustration purposes, I've simply used gensym to generate a unique key.

(let [security-key (gensym),
wrap (fn [x]
(fn [k]
(if (= security-key k) x (throw (new Exception))))),
unwrap (fn [x] (x security-key))]

(defn stack-new [] (wrap nil))
(defn stack-push [s e] (wrap (cons e (unwrap s))))
(defn stack-top [s] (first (unwrap s)))
(defn stack-pop [s] (if (stack-empty? s) nil
(wrap (rest (unwrap s)))))
(defn stack-empty? [s] (empty? (unwrap s))))


Notice how wrap and unwrap are local to these functions. If you play around with these stack functions, you'll see that the stacks you generate appear to the outside world as functions, and you need to pass it the right key to get at the innards. In this case, a gensym is relatively easy to forge, but you should get the idea. The innards are very well protected and there is essentially no way to manipulate the stack other than through the approved interface functions.

One problem with unbundled ADTs is that they lack polymorphism. If you have multiple stack implementations, each one will require its own set of interface functions with different names (or the same names in different namespaces). If you want to write a function that operates over any stack implementation, it will first require you to pass in some sort of map of the interface functions to use. Although this technique is relatively common in functional languages (especially in the ML family), it's a fairly clunky way to achieve polymorphism.

Secure bundled



The next logical step is to bundle the data along with the functions that know how to operate on it. This looks something like this:


(let [stack-object (fn stack-object [s]
(let [push (fn [e] (stack-object (cons e s))),
top (fn [] (first s)),
pop (fn [] (if (seq s) (stack-object (rest s)) nil)),
empty? (fn [] (empty? s))]
{:push push, :top top, :pop pop, :empty? empty?}))]
(defn stack-new [] (stack-object nil)))


In this implementation, a stack is an associative map of four interface functions. The actual data of the stack is hidden by lexical scoping so that only the interface functions can see it.

The big problem here is that the syntax for manipulating these bundled stacks is fairly unpleasant. For example,

(def stack1 (stack-new))
(def stack2 ((stack1 :push) 2))
(def stack2-top ((stack2 :top)))


You can improve the readability by providing interface functions that look like the unbundled version:


(defn stack-push [s e] ((s :push) e))
(defn stack-top [s] ((s :top)))
(defn stack-pop [s] ((s :pop)))
(defn stack-empty? [s] ((s :empty?)))


But there's a big difference between this and the unbundled version. Mainly, we get polymorphism. If you have two different concrete implementations of the ADT, your client code doesn't care. When you call stack-push, for example, the function looks up the correct push function for this given implementation in the bundle, i.e., (s :push) and calls that to do the pushing.

Secure bundled, another way



You may have noticed that the secure bundled approach is essentially the way that OO languages provide ADTs. Since Clojure interoperates with Java, it stands to reason that you should be able to use Java to implement your ADTs. Yes, you can.

The first step is to define your interface. You can either do this directly in Java, or use Clojure's ability to generate Java interfaces with something like this:

(gen-interface
:name adt.IStack
:methods [[push [Object] adt.IStack]
[top [] Object]
[pop [] adt.IStack]
[empty? [] Boolean]])


This code would need to be compiled with Clojure's compile function, and that requires a bit of tricky setting up of classpaths and namespaces, but it's doable. Then, the bundled ADT looks very similar to above:

(let [stack-object (fn stack-object [s]
(proxy [adt.IStack] []
(push [e] (stack-object (cons e s)))
(top [] (first s))
(pop [] (if (seq s) (rest s) nil))
(empty? [] (empty? s))))]
(defn stack-new [] (stack-object nil)))


You could also make it even more Java-like by using gen-class. The dot syntax is a bit cleaner than the map-based syntax of the previous version, for example:

(. s top)
(. s push 2)

And as before, you can clean it up even more by creating unbundled interface functions that actually call the bundled versions behind the scenes.

This is how most of the Clojure core interfaces and data structures are implemented, so in some sense, one could argue that this is the most idiomatic approach of all. However, I think many Clojurians would prefer to get away from the Java approach if there is a better way.

And bundled versions are not without their problems. The CTM book gives a great example of a collection ADT and the challenge of writing a union function on the two collections. The problem with the bundled version is that bundles must dispatch on one input, so the union function, when written from the perspective of the first collection, doesn't have access to the private parts of the second collection. A version of the ADT that can see the innards of both inputs might be considerably more efficient.

Secure unbundled, revisited



If the primary limitation of bundled ADT implementations is their single-dispatch nature, perhaps there is a way to go back to the secure unbundled version, but leverage Clojure's multimethods to gain a more sophisticated kind of polymorphism.

In this final, most sophisticated variant, I'm going to go ahead and show two concrete implementations, the list-based one we've been working with as well as a vector-based implementation, so you can see how the two implementations coexist side by side.

First, we define the interface as multimethods that dispatch on the type of stack. Notice how we don't include the constructor stack-new as part of the polymorphic interface. We'll need a separate constructor for each concrete implmentation.

(defmulti stack-push (fn [s e] (type s)))
(defmulti stack-top type)
(defmulti stack-pop type)
(defmulti stack-empty? type)


Since we're dispatching on type, we can't quite use the same wrapping representation as before (because functions can't have metadata). This time, the wrapped representation of the stack will be a map with one field (:wrapped-stack) and metadata with the appropriate type.


(let [security-key (gensym),
wrap (fn [x]
(with-meta
{:wrapped-stack
(fn [k]
(if (= security-key k) x (throw (new Exception))))}
{:type ::list-stack})),
unwrap (fn [x] ((x :wrapped-stack) security-key))]

(defn list-stack-new [] (wrap nil))
(defmethod stack-push ::list-stack [s e] (wrap (cons e (unwrap s))))
(defmethod stack-top ::list-stack [s] (first (unwrap s)))
(defmethod stack-pop ::list-stack [s] (if (stack-empty? s) nil
(wrap (rest (unwrap s)))))
(defmethod stack-empty? ::list-stack [s] (empty? (unwrap s))))


The vector-based version is almost the same:


(let [security-key (gensym),
wrap (fn [x]
(with-meta
{:wrapped-stack
(fn [k]
(if (= security-key k) x (throw (new Exception))))}
{:type ::vector-stack})),
unwrap (fn [x] ((x :wrapped-stack) security-key))]

(defn vector-stack-new [] (wrap []))
(defmethod stack-push ::vector-stack [s e] (wrap (conj (unwrap s) e)))
(defmethod stack-top ::vector-stack [s] (peek (unwrap s)))
(defmethod stack-pop ::vector-stack [s] (if (stack-empty? s) nil
(wrap (pop (unwrap s)))))
(defmethod stack-empty? ::vector-stack [s] (empty? (unwrap s))))


In fact, they are so similar, it seems clear you could write a macro to abstract out some of the commonalities.

In my mind, this is definitely the most interesting implementation of ADTs. By using multimethods, there is the potential to implement ADTs that would be rather difficult to implement efficiently in other languages. Unfortunately, it is also quite clear that this version is considerably more work to write than the naive, open unbundled version we started out with.

I would very much like to see additional syntactic support for making secure, unbundled ADTs easier to write. Something that can simplify or eliminate the need for explicit wrapping and unwrapping would be essential, and of course, a better system for generating security keys than gensym. It's not clear to me whether this support could be provided entirely by a library, or whether additional constructs in the core would be needed.

What about equality?


Another big open question in my mind is, "What happens when you try to add equality?" When you use Clojure's open structures to store your data (like maps), you get equality and hash codes for free. But how hard would it be to add equality and hash code functionality to these secure implementations of ADTs? Is it easier with some of these implementation styles than others? I haven't played around with this aspect of Clojure enough to give an answer yet. I'd love to hear comments from those who have.