Map/Reduce with CouchDB: a visual primer @ pablobm

18 July 2019 by Pablo Brasero Moreno

Map/Reduce with CouchDB: a visual primer

#couchdb

As part of a larger experiment, I recently played around a bit with CouchDB. Soon I had to run some map/reduce queries and realised that CouchDB handles these in a way that is significantly different from what many developers like me expect.

For those of us who have done simple map/reduce work with JavaScript, Ruby, Elixir, Python, etc, the model used by CouchDB can be surprising and confusing. There's a lot of documentation out there, but while figuring it out I felt I could use a visual explanation rather than walls of text.

Here's my attempt at providing such a visual explanation. I hope that this guide will help those who are as confused as I used to be.

Visits per day, month or year

The documents in this example will be from a hypothetical web analytics database. To illustrate the map/reduce process, I'll create a view that will allow us to query the total number of visits to the site, grouped by day, month or year. The data looks like this:

id	document
1	{ "uri": "/foo", "userAgent": "Edge", "visitedAt": "2019-05-22T12:44:32Z" }
2	{ "uri": "/bar", "userAgent": "Firefox", "visitedAt": "2019-05-30T08:05:07Z" }
3	{ "uri": "/foo", "userAgent": "Safari", "visitedAt": "2019-06-03T14:12:45Z" }
4	{ "uri": "/bar", "userAgent": "Firefox", "visitedAt": "2019-06-11T15:23:17Z" }
5	{ "uri": "/bar", "userAgent": "Firefox", "visitedAt": "2019-06-11T01:55:41Z" }

When thinking of map/reduce, we may think of a two-stage process. However, when thinking in terms of CouchDB I prefer to think of it as four steps.

1. Map the necessary detail required for the query

The process starts with the map step. When mapping on CouchDB, we emit key/value pairs:

The keys will be used to group and filter results. In this case, I'll use the visitedAt timestamp, as I want to group by day, month or year.
The values will be later combined to create the final results. In this case I want to count visits, and each document represents a single visit. Therefore, for each input record I'll emit a value of 1, a literal number one signifying a single visit.

The concept of "grouping" is governed by the emitted keys, and needs a bit more of clarification. They keys won't just be the literal visitedAt strings. Instead I will split those date strings into [year, month, day] arrays, which will allow me to group results by three different granularity levels: day, month, or year.

This is a way to implement the map step for this query:

1// The map function
2function (document) {
3  const [date, time] = document.visitedAt.split("T");
4  const [year, month, day] = date.split("-");
5  emit([year, month, day], 1);
6}

This map step can be visually represented as follows:

Map each document into a three items: its original id, a key as array of date components, and the integer value 1

The result is a list of visits, each represented by a date and each counting as a single (1) visit. Dates are represented as a three-element array, which will become useful in the next step.

2. Group intermediate results by key

Before going into the reduce step, CouchDB sorts the results of the map step into groups.

If your query included the group_level parameter, then this is used at this stage. It will dictate what portion of the keys (the ones generated in the map step) will be used as grouping key. For example:

group_level=3: group by the three items in the key (year, month and day).
group_level=2: group by the first two items in the key (year and month).
group_level=1: group by the first item in the key (the year only).
group_level=0: group all rows together into a single group (their date won't matter). This is the default.

This would be an example of grouping with group_level=2:

Results from the map step, grouped by month (group level 2)

However, this is not entirely what happens. At this stage, it's important to understand the distributed nature of CouchDB, as this is not abstracted from us. We will have to accomodate for it during the map/reduce process.

At the grouping step, the groups will be formed independently in each node of the CouchDB cluster. Again with an example of group_level=2, and assuming that your cluster has two nodes, the grouping will look more like this:

Results from the map step, grouped by cluster, then by month (group level 2)

This separation of results into nodes will be come important later, in the re-reduce stage.

3. Reduce within each DB node

At this point, we have groups of "equivalent" records. Equivalent because each record represents a single hit, and each group only contains hits on the same time segment (day, month or year). This is similar to what GROUP BY would get us in a SQL query.

Also similarly to GROUP BY, now we have to consolidate each group into a single result. In this case, these new results will represent the time segment of the group and the count of records in the group.

In more common flavours of map/reduce, the reduce function is called once for each record. In CouchDB however, it receives groups of inputs. Specifically, the reduce function appears to take two arguments:

A list of ids and keys from the map results.
A list of values from the map results.

First argument is a list of id/key pairs. Second argument is a list of values.

Therefore, a reduce function could be written like this:

1// First attempt at reduce
2function(keys, values) {
3  return values.length;
4}

The keys argument won't be used in this case. The values argument will just be a list of ones [1, 1, 1, ..., 1], and I want its length (or the sum, which is the same thing in this case).

After this step, our results will be reduced, but only within each node of the CouchDB cluster:

The process is not complete. There's still one step to go.

4. Re-reduce into the final results

The "sibling" results across the nodes now have to be consolidated into the final results for the whole database. CouchDB calls this step "re-reduce".

Groups re-reduced across clusters, leading to the final results

For the re-reduce step, CouchDB calls the reduce function again, but with different arguments:

First argument is null; second is a list of values from the same group; third is a boolean true.

This time there are no keys (the first argument is null), the values (second argument) are results from the reduce step, and there's a third argument. I'm calling this new argument isRereduce, and its value will be false during the reduce step and true during re-reduce.

So it turns out that the previous reduce function was not correct, because it needed to serve two different steps. We'll have to use the boolean third argument to tell which of two behaviours to implement:

 1// Reduce and re-reduce
 2function(_keys_, values, isRereduce) {
 3  if (isRereduce) {
 4    // `sum` is provided by CouchDB. It adds up all values in an array
 5    return sum(values);
 6  } else {
 7    return values.length;
 8  }
 9}

A keen eye may realise that this function could be simpler: we could just do sum in both cases. However this would work only because of the specific example we are working with here. For the sake of explaining the difference between reduce and re-reduce, I have preferred to do it this way instead.

Final notes

The row-counting in this example is very common for a reduce (and re-reduce). For this reason, CouchDB provides a built-in function to perform it, called _count. We can set the reduce property of the CouchDB view to the string value "_count" instead of providing the full function, and the result will be the same.

This is what the design document can look like now, using this shortcut:

{
  "views": {
    "entries_by_date": {
      "map": "function (document) { const date = document.date.split(\"T\")[0].split(\"-\"); emit(date, 1); }",
      "reduce": "_count"
    }
  }
}

Something that can trip you up: make sure not to name your functions! Provide your map and reduce functions as anonymous functions. In other words: start them with function(...) { and not with function map(...). Otherwise, you'll get an error like this one:

1Compilation of the map function in the 'entries_by_date' view failed: Expression does not eval to a function.

In order to carry out my research, I created a small project with a Docker image running CouchDB, as well as some Ruby scripts to populate and query it. You can find them at https://gitlab.com/pablobm/couchdb-research, along with instructions on how to run it yourself.