As part of a larger experiment, I recently played around a bit with CouchDB. Soon I had to run some map/reduce queries and realised that CouchDB handles these in a way that is significantly different from what many developers like me expect.
For those of us who have done simple map/reduce work with JavaScript, Ruby, Elixir, Python, etc, the model used by CouchDB can be surprising and confusing. There's a lot of documentation out there, but while figuring it out I felt I could use a visual explanation rather than walls of text.
Here's my attempt at providing such a visual explanation. I hope that this guide will help those who are as confused as I used to be.
Visits per day, month or year
The documents in this example will be from a hypothetical web analytics database. To illustrate the map/reduce process, I'll create a view that will allow us to query the total number of visits to the site, grouped by day, month or year. The data looks like this:
id | document |
---|---|
1 | { "uri": "/foo", "userAgent": "Edge", "visitedAt": "2019-05-22T12:44:32Z" } |
2 | { "uri": "/bar", "userAgent": "Firefox", "visitedAt": "2019-05-30T08:05:07Z" } |
3 | { "uri": "/foo", "userAgent": "Safari", "visitedAt": "2019-06-03T14:12:45Z" } |
4 | { "uri": "/bar", "userAgent": "Firefox", "visitedAt": "2019-06-11T15:23:17Z" } |
5 | { "uri": "/bar", "userAgent": "Firefox", "visitedAt": "2019-06-11T01:55:41Z" } |
When thinking of map/reduce, we may think of a two-stage process. However, when thinking in terms of CouchDB I prefer to think of it as four steps.
1. Map the necessary detail required for the query
The process starts with the map step. When mapping on CouchDB, we emit key/value pairs:
- The keys will be used to group and filter results. In this case, I'll use the
visitedAt
timestamp, as I want to group by day, month or year. - The values will be later combined to create the final results. In this case I
want to count visits, and each document represents a single visit. Therefore,
for each input record I'll emit a value of
1
, a literal number one signifying a single visit.
The concept of "grouping" is governed by the emitted keys, and needs a bit
more of clarification. They keys won't just be the literal visitedAt
strings.
Instead I will split those date strings into [year, month, day]
arrays,
which will allow me to group results by three different granularity levels:
day, month, or year.
This is a way to implement the map step for this query:
This map step can be visually represented as follows:
The result is a list of visits, each represented by a date and each counting as a single (1) visit. Dates are represented as a three-element array, which will become useful in the next step.
2. Group intermediate results by key
Before going into the reduce step, CouchDB sorts the results of the map step into groups.
If your query included the group_level
parameter, then this is used at this
stage. It will dictate what portion of the keys (the ones generated in the
map step) will be used as grouping key. For example:
group_level=3
: group by the three items in the key (year, month and day).group_level=2
: group by the first two items in the key (year and month).group_level=1
: group by the first item in the key (the year only).group_level=0
: group all rows together into a single group (their date won't matter). This is the default.
This would be an example of grouping with group_level=2
:
However, this is not entirely what happens. At this stage, it's important to understand the distributed nature of CouchDB, as this is not abstracted from us. We will have to accomodate for it during the map/reduce process.
At the grouping step, the groups will be formed independently in each node
of the CouchDB cluster. Again with an example of group_level=2
, and assuming
that your cluster has two nodes, the grouping will look more like this:
This separation of results into nodes will be come important later, in the re-reduce stage.
3. Reduce within each DB node
At this point, we have groups of "equivalent" records. Equivalent because
each record represents a single hit, and each group only contains hits on the
same time segment (day, month or year). This is similar to what GROUP BY
would get us in a SQL query.
Also similarly to GROUP BY
, now we have to consolidate each group into a
single result. In this case, these new results will represent the time
segment of the group and the count of records in the group.
In more common flavours of map/reduce, the reduce function is called once for each record. In CouchDB however, it receives groups of inputs. Specifically, the reduce function appears to take two arguments:
- A list of ids and keys from the map results.
- A list of values from the map results.
Therefore, a reduce function could be written like this:
The keys
argument won't be used in this case. The values
argument will just
be a list of ones [1, 1, 1, ..., 1]
, and I want its length (or the sum, which
is the same thing in this case).
After this step, our results will be reduced, but only within each node of the CouchDB cluster:
The process is not complete. There's still one step to go.
4. Re-reduce into the final results
The "sibling" results across the nodes now have to be consolidated into the final results for the whole database. CouchDB calls this step "re-reduce".
For the re-reduce step, CouchDB calls the reduce function again, but with different arguments:
This time there are no keys (the first argument is null), the values (second
argument) are results from the reduce step, and there's a third argument. I'm
calling this new argument isRereduce
, and its value will be false
during
the reduce step and true
during re-reduce.
So it turns out that the previous reduce function was not correct, because it needed to serve two different steps. We'll have to use the boolean third argument to tell which of two behaviours to implement:
A keen eye may realise that this function could be simpler: we could just do
sum
in both cases. However this would work only because of the specific
example we are working with here. For the sake of explaining the difference
between reduce and re-reduce, I have preferred to do it this way instead.
Final notes
The row-counting in this example is very common for a reduce (and
re-reduce). For this reason, CouchDB provides a built-in function to perform
it, called _count
. We can set the reduce
property of the CouchDB view
to the string value "_count"
instead of providing the full function, and
the result will be the same.
This is what the design document can look like now, using this shortcut:
Something that can trip you up: make sure not to name your functions! Provide
your map and reduce functions as anonymous functions. In other words: start them
with function(...) {
and not with function map(...)
. Otherwise, you'll get
an error like this one:
1Compilation of the map function in the 'entries_by_date' view failed: Expression does not eval to a function.
In order to carry out my research, I created a small project with a Docker image running CouchDB, as well as some Ruby scripts to populate and query it. You can find them at https://gitlab.com/pablobm/couchdb-research, along with instructions on how to run it yourself.