Reductionem ad finem
Table of Contents
1 What are reducers?
There are two concepts that underpin reducers:
- Collections are defined solely as something that can be reduced.
- On demand (lazy) computation via transformation and composition of intermediate functions, not transformation of intermediate values.
#1 is pretty straight forward: the only thing required for something
to be considered a “collection” in the reducers library sense is it
has to implement the clojure.core.protocols/CollReduce
protocol. Because the only requirement is that collections be
reducible, the reducers library is built on only manipulating
collections via reduce.
#2 is where most explanations tend to bog down, so instead I will
just suggest you read what rhickey wrote: Anatomy of a Reducer. The
gist is the combinators in the reducers library like r/map
take a
reducible collection as input and produce a reducible collection as
output, and the operations you mapped, filtered, etc are only run
when the reducible collection is actually reduced. Reducing has the
effect of forcing or demanding concrete values from the reducible
collections where as r/map
and r/filter
can pass the buck “I
won't do something now, but when you need it, sure”.
After that you might ask “Given that lazy seqs already exist, what is the point of another library for lazy/on demand collection processing?” well read on…
2 Why use reducers?
2.1 Reducers make data parallelism easy.
The “on demand” or “lazy” formulation of reducers is distinct from
the laziness offered by lazy seqs. Lazy seqs are fundamentally a
linear list, where the rest
of the list might actually be a thunk that
computes the rest
on demand. To take the 256th value of a lazy
seq you have to start at the beginning and walk over 256
elements. This linear structure makes lazy seqs unsuitable as a
basis for data parallelism.
The reduce
operation as defined by the reducers library makes no
promises about the order in which elements will be reduced, so for
example if you have a tree shaped data structure, you could reduce
each sub-tree to a single value, then reduce the list of sub-tree
results in to a single value. Sub-tree reduction in this manner
is a good match for parallelization via the fork/join library that
ships with recent JVMs.
In order to avoid surprise the reducers library doesn't run reduce
in parallel, but provides a new function fold
which will reduce a
reducible collection in parallel if possible.
Due to the reducers library providing similar combinators to what
clojure.core
provides you can swap in data parallel operations for
linear operations without changing the general shape of the code in
many places:
(def numbers (vec (range 1e6))) ;; runs in sequence (->> numbers (map inc) (filter even?) (reduce +)) ;; runs in parallel (->> numbers (r/map inc) (r/filter even?) (r/fold +))
The combinators in the reducers library also have some auto currying support, which means you can compose them in ways that would be cumbersome using the seq combinators.
That is neat, and if you watch this talk by rhickey you can tell he thinks it is “so cool”, but I have trouble getting excited about it. At my day job we have what is basically a large distributed ETL system and it is hard to match the granular data parallelism you get from reducers with the less granular concurrency in that system. A distributed reducers style library built on something like fork/join via message queues or something might be more our speed…
So let me tell you why I really care about reducers.
2.2 Reducers put producers in control.
Resources can be represented as collections, for example a collection of lines in a file, or a collection of rows in database.
These collections that are backed by resources are often constructed incrementally (lazily) to avoid pulling more information than required. This incremental construction means resources for these collections need to exist outside the scope of the creation of the collection.
The problem is the life time of collections are managed in memory by the GC, but leaving the management of resources like file descriptors to the GC is not a good idea.
When trying to represent a resource in Clojure a lazy seq is often used. Lazy seqs do not provide a good way to manage the life cycle of resources that the lazy seq is backed by (files, databases, etc). There are two basic approaches:
- Attach any resource cleanup actions to some dynamic scope and finish using the lazy seq before you exit that scope.
- Construct the lazy seq so forcing the last element of the seq also preforms any cleanup.
Both of these approaches have downsides, for example having to consume the entire lazy seq to perform cleanup if you just want the first 10 elements is inefficient and/or having to liter your code with dynamic scopes to manage resources while ensuring that a given resource is never used outside of its scope is a hassle.
Because collections in the reducers sense can have custom implementations of reduce, the collections have control over the environment in which the reduce takes place.
The same inversion of control mechanism that allows for reducers to have custom parallel reduction strategies for a given collection also provide a nice way to control a resource's life cycle, while still representing the resource as an “on demand” computable collection.
2.2.1 Examples
- ByteSource
(deftype ByteSource [source] p/CollReduce (p/coll-reduce [coll f] (p/coll-reduce coll f (f))) (p/coll-reduce [coll f init] (with-open [x (io/input-stream source)] (loop [n (.read x) v init] (if (= -1 n) v (let [r (f v n)] (if (reduced? r) @r (recur (.read x) r)))))))) ;; user=> (count (reduce conj [] (ByteSource. "/etc/passwd"))) ;; 5086 ;; user=>
ByteSource provides a reducible source of bytes. When you reduce it the bytes are read out of a an InputStream, and when the reduction completes the InputStream is safely closed. To consumers of ByteSource it looks like any other reducible thing and they don't have to care about ensuring the InputStream is closed after use.
- query-source
(defn query-source [con query] (reify p/CollReduce (coll-reduce [coll f] (sql/with-connection con (sql/with-query-results results query (reduce f (first results) (rest results))))) (coll-reduce [coll f init] (sql/with-connection con (sql/with-query-results results query (reduce f init results))))))
The function query-source
uses clojure.java.jdbc
to provide a
reducible SQL query result. Again a consumer doesn't need to know
this reducible collection is tied to an external resource that
needs to be managed.
- rfor
(defmacro rfor [bindings body] (let [bindings (partition-all 2 bindings) lbinding (last bindings) bindings (butlast bindings)] (reduce (fn [form [n vs]] `(r/mapcat (fn [~n] ~form) ~vs)) `(r/map (fn [~(first lbinding)] ~body) ~(last lbinding)) (reverse bindings))))
The macro rfor
is a very simplified (it is missing lots of
features) version of Clojure's for
macro that works on reducible
collections instead of seqs. This works because for
is a seq
comprehension built on the seq operations map
and mapcat
, and
the reducible library provides r/map
and r/mapcat
for
reducible collections.
(->> (rfor [byte (ByteSource. "/etc/passwd") row (query-source ... ...)] [byte row]) (r/take 10) (into []))
Not a care in the world about resource handling, beautiful.
3 How is this not a win-win?
The reducers library is still relatively new, so there are definitely rough edges.
For one because they are new and don't have a large history of use and re-use there aren't a nice set of best practices around them. There also aren't a lot of helper macros or functions for building your own resource backed reducible collections.
The terminology for concepts in the reducers library also still takes me a while to puzzle out, and I am not sure if I have it right.
- “reducers” are the combinators (
r/map
,r/filter
) - reducible collections are just “reducibles”
- the functions you pass to the combinators are “reducing functions”
Speaking of combinators, the reducers library is still growing and
doesn't yet include all the same combinators that exist for
seqs. Some nice seq based macros like for
and doseq
also don't
exist yet. And I don't think combining the parallel worlds of seqs
and reducers in one place will ever be what you would call pretty.
Finally the fact that reducers do not cache results from the first traversal like lazy seqs, but instead recompute results every time is just different enough to trip you up at least once.
4 Conventions for namespace names
p
as inp/CollReduce
refers toclojure.core.protocols
r
as inr/map
refers toclojure.core.reducers
sql
as insql/with-connection
refers toclojure.java.jdbc
5 Some Related bits of Clojure history
There have been a few branches of work rhickey has done on Clojure overlap in design space with reducers:
- Streams are basically non-caching lazy seqs. The way the combinators are constructed reminds me a lot of reducers. Streams are also old at this point, they were an experiment back when Clojure was using SVN for version control and don't seem to have been touched since. It is still possible something interesting could result, but at this point it seems unlikely.
- Scopes are a formalized system of dynamic scopes you can attach cleanup actions to. They are around the same age as streams, and in fact some initial work on scopes was done on the streams branch. There definitely seems to be more continued interest in scopes than in streams.
6 Etc.
- Reducers a Library and Model for Collection Processing
- Reducers (video)
- Organizing Functional Code for Parallel Execution or, foldl and foldr Considered Slightly Harmful
- “Reductionem ad finem” is french for “Remember Your Mother”
- The
p/CollReduce
protocol is arguably not part of the reducers library, it was actually introduced a fair bit before the reducers library was added. On the other hand it definitely enables the rest of the reducers architecture and something had to come first and the reducers library has shined a bit of a spot light on to it.
7 Thanks
- To Joe Gallo for suggesting I write this up
- To Paul Stadig for pointing out that I have lost the ability to manually balance parens
- To past and present coworkers at Sonian, Inc. who have given critiques