Tuesday, April 21, 2009

ADTs in Clojure

Today on the Clojure mailing list, a user asked:
"Is the concept of Abstract Data Types useful in Clojure?
If yes, how would you implement one?"

This is an important question because coding to interfaces and abstractions, rather than concrete data, is a key aspect to writing scalable programs.

My favorite treatment of ADTs is in Concepts, Techniques, and Models of Programming by Peter Van Roy and Seif Haridi. This book shows all the variations of ADTs: stateful and declarative (aka functional), bundled and unbundled, open and secure. I'm going to walk through a few of these variations in Clojure.

Let's consider a classic stack ADT. Since Clojure primarily emphasizes functional programming, I'm going to focus on functional implementations of this ADT, i.e., versions in which pushing and popping a stack are non-destructive and actually return a new stack.

Open unbundled



Now the whole point of an ADT is to separate the implementation from the interface, but for starters, let's implement a stack ADT using an underlying implementation of a list. In an "unbundled" implementation, the stack data is completely separate from the functions that operate on it. It is "open" because we make no attempt to hide the implementation details of the stack. We are simply going to trust the programmer to use our interface and ignore those details.


(defn stack-new [] nil)
(defn stack-push [s e] (cons e s))
(defn stack-top [s] (first s))
(defn stack-pop [s] (rest s))
(defn stack-empty? [s] (empty? s))



This open, unbundled implementation is probably the simplest way to code an ADT in Clojure. Because it is the simplest, this is probably the most common kind of coding you'll see in Clojure, and thus many would argue it is the most idiomatic. But just because it's the easiest to code doesn't mean it's the best.

The most obvious problem with an open implementation is that it is a leaky abstraction. In other words, client code can easily see that this stack is implemented as a list. And it would be all too easy to forget that this implementation is supposed to be hidden, and call a list function on our stack (e.g., count).

I have seen several comments on the clojure mailing list suggesting that it is inevitable that any ADT written in Clojure will be inherently leaky. But that is not the case...

Secure unbundled



The idea is to wrap the data structure in something that effectively locks it with a unique key, so it can only be unwrapped with the correct key. The only functions that know the key are the ones that serve as the interface for the ADT, so nothing else can inspect or tamper with the contents of the data structure. For optimum security you'd need to use cryptographic techniques, but for illustration purposes, I've simply used gensym to generate a unique key.

(let [security-key (gensym),
wrap (fn [x]
(fn [k]
(if (= security-key k) x (throw (new Exception))))),
unwrap (fn [x] (x security-key))]

(defn stack-new [] (wrap nil))
(defn stack-push [s e] (wrap (cons e (unwrap s))))
(defn stack-top [s] (first (unwrap s)))
(defn stack-pop [s] (if (stack-empty? s) nil
(wrap (rest (unwrap s)))))
(defn stack-empty? [s] (empty? (unwrap s))))


Notice how wrap and unwrap are local to these functions. If you play around with these stack functions, you'll see that the stacks you generate appear to the outside world as functions, and you need to pass it the right key to get at the innards. In this case, a gensym is relatively easy to forge, but you should get the idea. The innards are very well protected and there is essentially no way to manipulate the stack other than through the approved interface functions.

One problem with unbundled ADTs is that they lack polymorphism. If you have multiple stack implementations, each one will require its own set of interface functions with different names (or the same names in different namespaces). If you want to write a function that operates over any stack implementation, it will first require you to pass in some sort of map of the interface functions to use. Although this technique is relatively common in functional languages (especially in the ML family), it's a fairly clunky way to achieve polymorphism.

Secure bundled



The next logical step is to bundle the data along with the functions that know how to operate on it. This looks something like this:


(let [stack-object (fn stack-object [s]
(let [push (fn [e] (stack-object (cons e s))),
top (fn [] (first s)),
pop (fn [] (if (seq s) (stack-object (rest s)) nil)),
empty? (fn [] (empty? s))]
{:push push, :top top, :pop pop, :empty? empty?}))]
(defn stack-new [] (stack-object nil)))


In this implementation, a stack is an associative map of four interface functions. The actual data of the stack is hidden by lexical scoping so that only the interface functions can see it.

The big problem here is that the syntax for manipulating these bundled stacks is fairly unpleasant. For example,

(def stack1 (stack-new))
(def stack2 ((stack1 :push) 2))
(def stack2-top ((stack2 :top)))


You can improve the readability by providing interface functions that look like the unbundled version:


(defn stack-push [s e] ((s :push) e))
(defn stack-top [s] ((s :top)))
(defn stack-pop [s] ((s :pop)))
(defn stack-empty? [s] ((s :empty?)))


But there's a big difference between this and the unbundled version. Mainly, we get polymorphism. If you have two different concrete implementations of the ADT, your client code doesn't care. When you call stack-push, for example, the function looks up the correct push function for this given implementation in the bundle, i.e., (s :push) and calls that to do the pushing.

Secure bundled, another way



You may have noticed that the secure bundled approach is essentially the way that OO languages provide ADTs. Since Clojure interoperates with Java, it stands to reason that you should be able to use Java to implement your ADTs. Yes, you can.

The first step is to define your interface. You can either do this directly in Java, or use Clojure's ability to generate Java interfaces with something like this:

(gen-interface
:name adt.IStack
:methods [[push [Object] adt.IStack]
[top [] Object]
[pop [] adt.IStack]
[empty? [] Boolean]])


This code would need to be compiled with Clojure's compile function, and that requires a bit of tricky setting up of classpaths and namespaces, but it's doable. Then, the bundled ADT looks very similar to above:

(let [stack-object (fn stack-object [s]
(proxy [adt.IStack] []
(push [e] (stack-object (cons e s)))
(top [] (first s))
(pop [] (if (seq s) (rest s) nil))
(empty? [] (empty? s))))]
(defn stack-new [] (stack-object nil)))


You could also make it even more Java-like by using gen-class. The dot syntax is a bit cleaner than the map-based syntax of the previous version, for example:

(. s top)
(. s push 2)

And as before, you can clean it up even more by creating unbundled interface functions that actually call the bundled versions behind the scenes.

This is how most of the Clojure core interfaces and data structures are implemented, so in some sense, one could argue that this is the most idiomatic approach of all. However, I think many Clojurians would prefer to get away from the Java approach if there is a better way.

And bundled versions are not without their problems. The CTM book gives a great example of a collection ADT and the challenge of writing a union function on the two collections. The problem with the bundled version is that bundles must dispatch on one input, so the union function, when written from the perspective of the first collection, doesn't have access to the private parts of the second collection. A version of the ADT that can see the innards of both inputs might be considerably more efficient.

Secure unbundled, revisited



If the primary limitation of bundled ADT implementations is their single-dispatch nature, perhaps there is a way to go back to the secure unbundled version, but leverage Clojure's multimethods to gain a more sophisticated kind of polymorphism.

In this final, most sophisticated variant, I'm going to go ahead and show two concrete implementations, the list-based one we've been working with as well as a vector-based implementation, so you can see how the two implementations coexist side by side.

First, we define the interface as multimethods that dispatch on the type of stack. Notice how we don't include the constructor stack-new as part of the polymorphic interface. We'll need a separate constructor for each concrete implmentation.

(defmulti stack-push (fn [s e] (type s)))
(defmulti stack-top type)
(defmulti stack-pop type)
(defmulti stack-empty? type)


Since we're dispatching on type, we can't quite use the same wrapping representation as before (because functions can't have metadata). This time, the wrapped representation of the stack will be a map with one field (:wrapped-stack) and metadata with the appropriate type.


(let [security-key (gensym),
wrap (fn [x]
(with-meta
{:wrapped-stack
(fn [k]
(if (= security-key k) x (throw (new Exception))))}
{:type ::list-stack})),
unwrap (fn [x] ((x :wrapped-stack) security-key))]

(defn list-stack-new [] (wrap nil))
(defmethod stack-push ::list-stack [s e] (wrap (cons e (unwrap s))))
(defmethod stack-top ::list-stack [s] (first (unwrap s)))
(defmethod stack-pop ::list-stack [s] (if (stack-empty? s) nil
(wrap (rest (unwrap s)))))
(defmethod stack-empty? ::list-stack [s] (empty? (unwrap s))))


The vector-based version is almost the same:


(let [security-key (gensym),
wrap (fn [x]
(with-meta
{:wrapped-stack
(fn [k]
(if (= security-key k) x (throw (new Exception))))}
{:type ::vector-stack})),
unwrap (fn [x] ((x :wrapped-stack) security-key))]

(defn vector-stack-new [] (wrap []))
(defmethod stack-push ::vector-stack [s e] (wrap (conj (unwrap s) e)))
(defmethod stack-top ::vector-stack [s] (peek (unwrap s)))
(defmethod stack-pop ::vector-stack [s] (if (stack-empty? s) nil
(wrap (pop (unwrap s)))))
(defmethod stack-empty? ::vector-stack [s] (empty? (unwrap s))))


In fact, they are so similar, it seems clear you could write a macro to abstract out some of the commonalities.

In my mind, this is definitely the most interesting implementation of ADTs. By using multimethods, there is the potential to implement ADTs that would be rather difficult to implement efficiently in other languages. Unfortunately, it is also quite clear that this version is considerably more work to write than the naive, open unbundled version we started out with.

I would very much like to see additional syntactic support for making secure, unbundled ADTs easier to write. Something that can simplify or eliminate the need for explicit wrapping and unwrapping would be essential, and of course, a better system for generating security keys than gensym. It's not clear to me whether this support could be provided entirely by a library, or whether additional constructs in the core would be needed.

What about equality?


Another big open question in my mind is, "What happens when you try to add equality?" When you use Clojure's open structures to store your data (like maps), you get equality and hash codes for free. But how hard would it be to add equality and hash code functionality to these secure implementations of ADTs? Is it easier with some of these implementation styles than others? I haven't played around with this aspect of Clojure enough to give an answer yet. I'd love to hear comments from those who have.

6 comments:

  1. Thank you for the insightful link to the CTM book. I didn't know its existence, and already bought it on amazon!

    At the end of the blog, you say "what about equality" ?
    I'll also add "what about persistence", since the addition of closures to your data does not allow for an easy way to marshall/unmarshall your data anymore (it would be great if an anonymous function could marshall/unmarshalled based on the macro-expanded code that compiled itself :-).

    ReplyDelete
  2. Nice post, the idea of using a closure with an access key is definitely interesting, though in practice I'd expect to see Clojure/Lisp programmers adopt the open rather than the secure approach. One reason is performance, the other is equality testing, as you note.

    I noticed that the implementation I would choose for stack is absent from your list: I'd use an open data structure (list or vector), but with a multimethod interface for polymorphism. In fact, the multimethod interface could well be shared between an open and a secure implementation.

    Of course your list is not complete, there are many more solutions. Perhaps an interesting one is a semi-secure stack, in which the basic structure (say a list) is wrapped up in a map under a key that is generated "at random" (gensym, for example). In this case, the structure would be open and visible to anyone (with the side benefit of simple equality testing), but it would require some trickery to circumvent the access functions, to the point that one could safely assume that it wouldn't happen by accident. For additional protection, add more "random" keys with nonsense values to the map.

    Back to the question of equality. My very first attempt to define algebraic data structures used closures. I abandoned it for several reasons, one of them being the lack of equality testing. But one approach I had considered at the time could be of use in situations like the one you describe: define a class (using gen-class) that has one variable holding a data item and that implements the comparison interface by delegation to this variable. Secure-yet-comparable data types could then be implemented by subclassing and providing accessors to the data item.

    ReplyDelete
  3. Here's a concrete implementation of the semi-secure stack that I proposed. It implements the multimethod interface given in the post:

    (let [security-key (keyword (str (gensym)))]

    (deftype ::my-stack my-stack
    (fn [l] {security-key l})
    (fn [s] (security-key s)))

    (defn my-stack-new [] (my-stack '()))

    (defmethod stack-push ::my-stack [s e] (my-stack (cons e (security-key s))))
    (defmethod stack-top ::my-stack [s] (first (security-key s)))
    (defmethod stack-pop ::my-stack [s] (if (stack-empty? s) nil
    (my-stack (rest (security-key s)))))
    (defmethod stack-empty? ::my-stack [s] (empty? (security-key s))))

    In use it looks like a data type of its own:

    (def some-stack (-> (my-stack-new) (stack-push '1)
    (stack-push '2)
    (stack-push '3)))

    some-stack
    -> (my-stack 3 2 1)

    However, the internal representation is just hidden, not locked:

    (with-meta some-stack {})
    -> {:G__377 (3 2 1)}

    ReplyDelete
  4. lpetit: Although general marshaling and unmarshaling will not automatically apply to secure ADTs, it should be easy enough to add explicit marshaling and unmarshaling functions to the set of interface functions that have access to the security key. I hope you enjoy CTM! I think it's a fantastic book.

    khinsen: You are right that this list of implementations is by no means exhaustive. Applying multimethods to either the open or your semi-secure variant is a very reasonable approach. I don't think either of those approaches solves the "equality question" if the desired behavior is for two different stacks (perhaps implemented completely differently) to be considered equal if they are logically equivalent based on their contents.

    ReplyDelete
  5. If you want an equality test that is different from the one for Clojure's built-in data structures, I think there is no other way than to define your own test functions. clojure.contrib.generic.comparison provides multimethods for that purpose that you can implement for whatever data type you wish.

    However, you stil cannot use this with data represented by closures, because at the moment Clojure does not permit metadata on functions (but see issue 90 on Clojure's issue tracker).

    ReplyDelete
  6. How about using a proxy in the back-end? That way you could wrap state in a closure but also be compatible with Clojure's equality mechanism, that relies on Java's ==. If defining multi-methods could also modify a proxy definition (and probably the interface or abstract class it is implementing) then maybe you could implement methods as you do today, but get the added benefits of clean equality comparisons, meta-data and better Java interop. Ideally it would work something like this. You define a type and a set of methods that operate on a map for "objects" of this type. The objects would actually be proxies though, and the dispatch function would unwrap the map inside to hand it to the methods. The proxy would hide the internal representation, and it would also implement IMeta and IObj for meta-data support so that it can play well with the built-in type hierarchy mechanisms. It would be a pretty big bonus if ADTs defined in Clojure could easily be used by Java or other JVM based languages.

    ReplyDelete