Reductionem ad finem

Table of Contents

1 What are reducers?

There are two concepts that underpin reducers:

  1. Collections are defined solely as something that can be reduced.
  2. On demand (lazy) computation via transformation and composition of intermediate functions, not transformation of intermediate values.

#1 is pretty straight forward: the only thing required for something to be considered a “collection” in the reducers library sense is it has to implement the clojure.core.protocols/CollReduce protocol. Because the only requirement is that collections be reducible, the reducers library is built on only manipulating collections via reduce.

#2 is where most explanations tend to bog down, so instead I will just suggest you read what rhickey wrote: Anatomy of a Reducer. The gist is the combinators in the reducers library like r/map take a reducible collection as input and produce a reducible collection as output, and the operations you mapped, filtered, etc are only run when the reducible collection is actually reduced. Reducing has the effect of forcing or demanding concrete values from the reducible collections where as r/map and r/filter can pass the buck “I won't do something now, but when you need it, sure”.

After that you might ask “Given that lazy seqs already exist, what is the point of another library for lazy/on demand collection processing?” well read on…

2 Why use reducers?

2.1 Reducers make data parallelism easy.

The “on demand” or “lazy” formulation of reducers is distinct from the laziness offered by lazy seqs. Lazy seqs are fundamentally a linear list, where the rest of the list might actually be a thunk that computes the rest on demand. To take the 256th value of a lazy seq you have to start at the beginning and walk over 256 elements. This linear structure makes lazy seqs unsuitable as a basis for data parallelism.

The reduce operation as defined by the reducers library makes no promises about the order in which elements will be reduced, so for example if you have a tree shaped data structure, you could reduce each sub-tree to a single value, then reduce the list of sub-tree results in to a single value. Sub-tree reduction in this manner is a good match for parallelization via the fork/join library that ships with recent JVMs.

In order to avoid surprise the reducers library doesn't run reduce in parallel, but provides a new function fold which will reduce a reducible collection in parallel if possible.

Due to the reducers library providing similar combinators to what clojure.core provides you can swap in data parallel operations for linear operations without changing the general shape of the code in many places:

(def numbers (vec (range 1e6)))

;; runs in sequence
(->> numbers
     (map inc)
     (filter even?)
     (reduce +))

;; runs in parallel
(->> numbers
     (r/map inc)
     (r/filter even?)
     (r/fold +))

The combinators in the reducers library also have some auto currying support, which means you can compose them in ways that would be cumbersome using the seq combinators.

That is neat, and if you watch this talk by rhickey you can tell he thinks it is “so cool”, but I have trouble getting excited about it. At my day job we have what is basically a large distributed ETL system and it is hard to match the granular data parallelism you get from reducers with the less granular concurrency in that system. A distributed reducers style library built on something like fork/join via message queues or something might be more our speed…

So let me tell you why I really care about reducers.

2.2 Reducers put producers in control.

Resources can be represented as collections, for example a collection of lines in a file, or a collection of rows in database.

These collections that are backed by resources are often constructed incrementally (lazily) to avoid pulling more information than required. This incremental construction means resources for these collections need to exist outside the scope of the creation of the collection.

The problem is the life time of collections are managed in memory by the GC, but leaving the management of resources like file descriptors to the GC is not a good idea.

When trying to represent a resource in Clojure a lazy seq is often used. Lazy seqs do not provide a good way to manage the life cycle of resources that the lazy seq is backed by (files, databases, etc). There are two basic approaches:

  1. Attach any resource cleanup actions to some dynamic scope and finish using the lazy seq before you exit that scope.
  2. Construct the lazy seq so forcing the last element of the seq also preforms any cleanup.

Both of these approaches have downsides, for example having to consume the entire lazy seq to perform cleanup if you just want the first 10 elements is inefficient and/or having to liter your code with dynamic scopes to manage resources while ensuring that a given resource is never used outside of its scope is a hassle.

Because collections in the reducers sense can have custom implementations of reduce, the collections have control over the environment in which the reduce takes place.

The same inversion of control mechanism that allows for reducers to have custom parallel reduction strategies for a given collection also provide a nice way to control a resource's life cycle, while still representing the resource as an “on demand” computable collection.

2.2.1 Examples

  • ByteSource
(deftype ByteSource [source]
    [coll f]
    (p/coll-reduce coll f (f)))
    [coll f init]
    (with-open [x (io/input-stream source)]
      (loop [n (.read x)
             v init]
        (if (= -1 n)
          (let [r (f v n)]
            (if (reduced? r)
              (recur (.read x) r))))))))

;; user=> (count (reduce conj [] (ByteSource. "/etc/passwd")))
;; 5086
;; user=>

ByteSource provides a reducible source of bytes. When you reduce it the bytes are read out of a an InputStream, and when the reduction completes the InputStream is safely closed. To consumers of ByteSource it looks like any other reducible thing and they don't have to care about ensuring the InputStream is closed after use.

  • query-source
(defn query-source [con query]
    (coll-reduce [coll f]
      (sql/with-connection con
        (sql/with-query-results results
          (reduce f (first results) (rest results)))))
    (coll-reduce [coll f init]
      (sql/with-connection con
        (sql/with-query-results results
          (reduce f init results))))))

The function query-source uses to provide a reducible SQL query result. Again a consumer doesn't need to know this reducible collection is tied to an external resource that needs to be managed.

  • rfor
(defmacro rfor [bindings body]
  (let [bindings (partition-all 2 bindings)
        lbinding (last bindings)
        bindings (butlast bindings)]
     (fn [form [n vs]]
       `(r/mapcat (fn [~n] ~form) ~vs))
     `(r/map (fn [~(first lbinding)] ~body) ~(last lbinding))
     (reverse bindings))))

The macro rfor is a very simplified (it is missing lots of features) version of Clojure's for macro that works on reducible collections instead of seqs. This works because for is a seq comprehension built on the seq operations map and mapcat, and the reducible library provides r/map and r/mapcat for reducible collections.

(->> (rfor [byte (ByteSource. "/etc/passwd")
            row (query-source ... ...)]
       [byte row])
     (r/take 10)
     (into []))

Not a care in the world about resource handling, beautiful.

3 How is this not a win-win?

The reducers library is still relatively new, so there are definitely rough edges.

For one because they are new and don't have a large history of use and re-use there aren't a nice set of best practices around them. There also aren't a lot of helper macros or functions for building your own resource backed reducible collections.

The terminology for concepts in the reducers library also still takes me a while to puzzle out, and I am not sure if I have it right.

  • “reducers” are the combinators (r/map, r/filter)
  • reducible collections are just “reducibles”
  • the functions you pass to the combinators are “reducing functions”

Speaking of combinators, the reducers library is still growing and doesn't yet include all the same combinators that exist for seqs. Some nice seq based macros like for and doseq also don't exist yet. And I don't think combining the parallel worlds of seqs and reducers in one place will ever be what you would call pretty.

Finally the fact that reducers do not cache results from the first traversal like lazy seqs, but instead recompute results every time is just different enough to trip you up at least once.

4 Conventions for namespace names

  • p as in p/CollReduce refers to clojure.core.protocols
  • r as in r/map refers to clojure.core.reducers
  • sql as in sql/with-connection refers to

5 Some Related bits of Clojure history

There have been a few branches of work rhickey has done on Clojure overlap in design space with reducers:

  • Streams are basically non-caching lazy seqs. The way the combinators are constructed reminds me a lot of reducers. Streams are also old at this point, they were an experiment back when Clojure was using SVN for version control and don't seem to have been touched since. It is still possible something interesting could result, but at this point it seems unlikely.
  • Scopes are a formalized system of dynamic scopes you can attach cleanup actions to. They are around the same age as streams, and in fact some initial work on scopes was done on the streams branch. There definitely seems to be more continued interest in scopes than in streams.

6 Etc.

7 Thanks

  • To Joe Gallo for suggesting I write this up
  • To Paul Stadig for pointing out that I have lost the ability to manually balance parens
  • To past and present coworkers at Sonian, Inc. who have given critiques

Date: 2013-10-10T23:19:25Z


Org version 7.8.11 with Emacs version 24

Validate XHTML 1.0