When I joined Quipper and the School PH team in September last year, the team was knee-deep in developing a new feature for Quipper School LINK. The feature would allow school administrators to request and download a per-month report detailing their school’s usage of the Quipper platform.

As you can imagine, consolidating a month’s worth of data is no small feat – especially for a growing platform like ours. We wrote a class that would be in charge of querying all the data we needed, but it kept timing out for schools having a medium to large amount of data.

This all led me down a slippery slope, from what I thought was a small issue, to what became a weeks-long endeavor, of chipping away at this error I began to dread more and more:

Error stack trace: Timed out waiting on socket read.

This post is the first in a two-part series describing a few techniques I learned in that time to speed up the queries (and other operations) that caused our timeout woes at School PH.

Don’t `map` through a query result, use `FastQuery#values`!

::UserCourse.where(user_id: @result.user_id).map(&:course_id)

You might be familiar with the pattern above. You need all the values for Field X on a query for records of Model Y fulfilling Criteria Z. In the example above (which is an actual snippet from our codebase, by the way), we want the course_ids of all UserCourses having a particular user_id.

But doing it that way is slow because you’re querying a lot of excess data, which may eventually cause the whole operation to time out. The problem is well-expressed by an old adage in software engineering:

You wanted a banana but what you got was a gorilla holding the banana and the entire jungle.

Though those words were spoken in a different context, they still apply here. We wanted only one field (our banana), so why on earth are we also having to retrieve the 20-something other fields of that record?

We can do better with FastQuery, which is a wrapper class over the query engine of our codebase, Plucky.

Some of Plucky’s methods can be exposed by appending .fast to queries. One of those methods is values, which allows us to query a single field across records. So, we can rewrite the above query as:

::UserCourse.where(user_id: @result.user_id).fast.values(&:course_id)

The benefit of writing it this way is that Plucky:

“[Constructs] query proxy objects that only retrieve data from Mongo when needed. This allows a query to be composed of several conditions before being evaluated.”

Our query above is composed of two conditions: the first being the criteria for user_id, and the second the specifier for the piece of data we’re ultimately interested in, which in this case is course_ids.

We’re no longer querying all UserCourses with a particular user_id, and then throwing away the excess data to arrive at just the course_ids. We’re making a query for just the course_ids of all UserCourses having that particular user_id. We wanted just the banana and we’re getting just that – nothing more, nothing less!

I ran some tests on the above queries with the bmbm method of Ruby’s Benchmark class, which lists user, system, and total CPU time, as well as real-world time, when comparing operations. I used Benchmark::bmbm because it also takes into account any discrepancies caused by memory allocation and garbage collection by running tests twice.

All data presented are the numbers after its rehearsal trials. Below is my last run across three tests (time in seconds):

                          user     system      total        real
Array#map             0.008104   0.002983   0.011087 (  0.013191)
FastQuery#values      0.001266   0.000977   0.002243 (  0.004284)

On average, FastQuery#values was 5 times faster than an Array#map in total CPU time.

FastQuery can give us much better performance when we need just a single field from our query results. And remember, this all goes back to our needing to retrieve only that single field per record from the database, as opposed to all its fields!

Therefore, if you find yourself instinctively writing .map after your queries, pause for a second and remember that the same operation can be accomplished much faster with FastQuery#values instead!

If you need distinct values, write your own Plucky query!

While FastQuery also has a distinct method, unfortunately, it doesn’t leverage Plucky’s native distinct querying capabilities. Peeking into the source code, we can see that FastQuery#distinct is really just a wrapper over the values method discussed earlier, chained with an Array#uniq:

class FastQuery
  # ...

  def values(field)
    # ...
  end

  # Find a list of unique values from a query.
  # It is similar to Plucky's distinct method, but it performs a normal Mongo query, not a distinct query.
  # This makes it slightly slower, however with the benefit of returning sorted results.
  # @return [Array]
  def distinct(field)
    values(field).uniq
  end

  # ...
end

This is problematic because some of the data processing is being done on the Ruby-side of things. In general, databases are better suited for this kind of work because they are able to take advantage of their indexes, caching, and other optimizations.

The comment written by the code author, in fact, acknowledges that FastQuery’s implementation will indeed be slower than a native distinct Plucky query.

Fortunately, we can get around this by writing our own Plucky query, without having to interface with FastQuery:

Plucky::Query.new(@model.collection)

With that, you will have direct access to any of Plucky::Query’s methods, including its own distinct method. Below is a sample query (again from our codebase) and how we could rewrite it for direct Plucky::Query#distinct access:

# FastQuery#distinct
::Conversation.where(participant_ids: user.id).fast.distinct(:_id)

# Plucky::Query#distinct
query = Plucky::Query.new(::Conversation.collection)
query.where(participant_ids: user.id).distinct(:_id)

Admittedly, it is a bit more verbose, but the performance gains are, again, significant:

                             user     system      total        real
FastQuery#distinct       0.002311   0.000020   0.002331 (  0.004042)
Plucky::Query#distinct   0.000692   0.000000   0.000692 (  0.002462)

Above is my last run in three trials for the queries above, but the difference was consistent throughout. For total CPU time, Plucky::Query was over 4 times faster than FastQuery on average.

If your queries are taking too long to resolve – and the slightly longer code doesn’t bother you – consider interfacing with Plucky::Query directly. The results speak for themselves!

Cache the result of expensive computations

Pop quiz! True or False: In Ruby, immediately re-running an expensive method after it had just been invoked will take almost no time.

True: The runtime will cache the initial result.
False: The entire method will run again.

One of the reasons School PH’s usage report took so long to generate was that we weren’t able to cache the results of some of our database queries. We hid some of our database calls behind helper methods and called them repeatedly with the same inputs! This meant we were needlessly pinging our database for the same data over and over again.

And no, the Ruby runtime did not save us – so the answer to the above is false. The results of the helpers were not being cached. But we did end up caching them ourselves using a Hash:

def active_students(end_date)
  @active_students ||= Hash.new do |h, key|
    h[key] = OrganizationMembership.where(# ...)
  end

  @active_students[end_date]
end

The input that was being used for our query, in this case end_date, serves as the key for each query’s result in the Hash. So, for say an end_date of Date.new(2019, 4, 16), if the result of the query is empty ([]), subsequent calls to active_students with Date.new(2019, 4, 16) as input will no longer have to query the database.

It will instead find the key in our Hash and return the already-computed value. It may not seem like much in this example, but imagine if you had a query that took between 10 and 15 seconds to resolve. If you had a loop that called that same query over and over without caching, you could be in trouble!

I hope you were able to pick up on something new from these tips! If your timeouts haven’t yet resolved, worry not! Stay tuned for Part II of this series where we’ll tackle indexes and parallel processing to put them to rest for good!

As we like to say at the Q: tara for now! 🐟

Don’t map through a query result, use FastQuery#values!

If you need distinct values, write your own Plucky query!

Cache the result of expensive computations

Don’t `map` through a query result, use `FastQuery#values`!