When I joined Quipper and the School PH team in September last year, the team was knee-deep in developing a new feature for Quipper School LINK. The feature would allow school administrators to request and download a per-month report detailing their school’s usage of the Quipper platform.
As you can imagine, consolidating a month’s worth of data is no small feat – especially for a growing platform like ours. We wrote a class in charge of querying all the data we needed, but it kept timing out for schools with medium to large amounts of data.
This all led me down a slippery slope: what I thought was a small issue turned into a weeks-long endeavor of chipping away at an error I began to dread more and more:
```
Error stack trace: Timed out waiting on socket read.
```
This post is the first in a two-part series describing a few techniques I learned in that time to speed up the queries (and other operations) that caused our timeout woes at School PH.
Don’t `map` through a query result – use `FastQuery#values`
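The pattern in question looks something like the sketch below. A hedged illustration: `UserCourse`, `user`, and the extra fields are illustrative stand-ins for the actual codebase snippet, and the runnable part simulates the query result with plain hashes.

```ruby
# The Mongo-backed pattern would look roughly like (illustrative):
#   ::UserCourse.where(user_id: user.id).map(&:course_id)
#
# Simulated here with plain hashes standing in for full documents:
user_courses = [
  { course_id: 101, user_id: 1, progress: 0.4, enrolled_at: '2019-04-01' },
  { course_id: 102, user_id: 1, progress: 0.9, enrolled_at: '2019-04-02' }
]

# Every full record is fetched, only to keep a single field:
course_ids = user_courses.map { |uc| uc[:course_id] }
# course_ids => [101, 102]
```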
You might be familiar with the pattern above. You need all the values for Field X on a query for records of Model Y fulfilling Criteria Z. In the example above (which is an actual snippet from our codebase, by the way), we want the `course_id`s of all `UserCourse`s having a particular `user_id`.
But doing it that way is slow because you’re querying a lot of excess data, which may eventually cause the whole operation to time out. The problem is well-expressed by an old adage in software engineering:
You wanted a banana but what you got was a gorilla holding the banana and the entire jungle.
Though those words were spoken in a different context, they still apply here. We wanted only one field (our banana), so why on earth are we also having to retrieve the 20-something other fields of that record?
We can do better with `FastQuery`, which is a wrapper class over the query engine of our codebase, `Plucky`. `Plucky`’s methods can be exposed by appending `.fast` to queries. One of those methods is `values`, which allows us to query a single field across records. So, we can rewrite the above query as:
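A hedged sketch of the rewrite, with the actual call shown in a comment (the `.fast.values` API is assumed from the description above), plus a plain-Ruby simulation of what a projected result looks like:

```ruby
# The rewritten query would look roughly like (illustrative):
#   ::UserCourse.where(user_id: user.id).fast.values(:course_id)
#
# With values, the projection happens on the Mongo side: each returned
# document carries only the requested field, simulated here as:
projected = [{ course_id: 101 }, { course_id: 102 }]
course_ids = projected.map { |doc| doc[:course_id] }
# course_ids => [101, 102]
```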
The benefit of writing it this way is that Plucky:
“[Constructs] query proxy objects that only retrieve data from Mongo when needed. This allows a query to be composed of several conditions before being evaluated.”
Our query above is composed of two conditions: the first being the criteria for `user_id`, and the second the specifier for the piece of data we’re ultimately interested in, which in this case is `course_id`.
We’re no longer querying all `UserCourse`s with a particular `user_id` and then throwing away the excess data to arrive at just the `course_id`s. We’re making a query for just the `course_id`s of all `UserCourse`s having that particular `user_id`. We wanted just the banana, and we’re getting just that – nothing more, nothing less!
I ran some tests on the above queries with the `bmbm` method of Ruby’s `Benchmark` module, which lists user, system, and total CPU time, as well as real-world time, when comparing operations. I used `Benchmark.bmbm` because it runs every test twice – a rehearsal and a real run – which helps account for discrepancies caused by memory allocation and garbage collection. All data presented are the numbers after the rehearsal trials. Below is my last run across three tests (time in seconds):
```
                   user      system    total     real
Array#map          0.008104  0.002983  0.011087  (0.013191)
FastQuery#values   0.001266  0.000977  0.002243  (0.004284)
```
`FastQuery#values` was about 5 times faster than `Array#map` in total CPU time.
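For reference, here is the general shape of a `Benchmark.bmbm` run in plain Ruby – this is not the original query benchmark, just placeholder workloads showing the rehearsal-then-measure structure:

```ruby
require 'benchmark'

data = (1..50_000).to_a

# bmbm prints a "Rehearsal" section first, then the real measurements,
# and returns one Benchmark::Tms object per report from the real run.
results = Benchmark.bmbm do |x|
  x.report('map')    { data.map { |n| n * 2 } }
  x.report('select') { data.select(&:even?) }
end
```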
FastQuery can give us much better performance when we need just a single field from our query results. And remember, this all goes back to our needing to retrieve only that single field per record from the database, as opposed to all its fields!
Therefore, if you find yourself instinctively writing `.map` after your queries, pause for a second and remember that the same operation can be accomplished much faster with `FastQuery#values`.
If you need distinct values, write your own Plucky query!
`FastQuery` also has a `distinct` method; unfortunately, it doesn’t leverage Mongo’s native `distinct` querying capabilities. Peeking into the source code, we can see that `FastQuery#distinct` is really just a wrapper over the `values` method discussed earlier, chained with an `Array#uniq`:
```ruby
class FastQuery
  # ...

  def values(field)
    # ...
  end

  # Find a list of unique values from a query.
  # It is similar to Plucky's distinct method, but it performs a normal
  # Mongo query, not a distinct query. This makes it slightly slower,
  # however with the benefit of returning sorted results.
  # @return [Array]
  def distinct(field)
    values(field).uniq
  end

  # ...
end
```
This is problematic because some of the data processing is being done on the Ruby-side of things. In general, databases are better suited for this kind of work because they are able to take advantage of their indexes, caching, and other optimizations.
The comment written by the code author, in fact, acknowledges that `FastQuery`’s implementation will indeed be slower than a native `distinct` query.
Fortunately, we can get around this by writing our own `Plucky` query, without having to interface with `FastQuery`. With that, you will have direct access to any of `Plucky::Query`’s methods, including its own `distinct` method. Below is a sample query (again from our codebase) and how we could rewrite it for direct `Plucky::Query` access:
```ruby
# FastQuery#distinct
::Conversation.where(participant_ids: user.id).fast.distinct(:_id)

# Plucky::Query#distinct
query = Plucky::Query.new(::Conversation.collection)
query.where(participant_ids: user.id).distinct(:_id)
```
Admittedly, it is a bit more verbose, but the performance gains are, again, significant:
```
                        user      system    total     real
FastQuery#distinct      0.002311  0.000020  0.002331  (0.004042)
Plucky::Query#distinct  0.000692  0.000000  0.000692  (0.002462)
```
Above is my last run in three trials for the queries above, but the difference was consistent throughout. For total CPU time, `Plucky::Query` was over 4 times faster than `FastQuery` on average.
If your queries are taking too long to resolve – and the slightly longer code doesn’t bother you – consider interfacing with `Plucky::Query` directly. The results speak for themselves!
Cache the result of expensive computations
Pop quiz! True or False: in Ruby, re-running an expensive method immediately after it has just finished will take almost no time.
- True: The runtime will cache the initial result.
- False: The entire method will run again.
One of the reasons School PH’s usage report took so long to generate was that we weren’t able to cache the results of some of our database queries. We hid some of our database calls behind helper methods and called them repeatedly with the same inputs! This meant we were needlessly pinging our database for the same data over and over again.
And no, the Ruby runtime did not save us – so the answer to the above is false. The results of the helpers were not being cached. But we did end up caching them ourselves, using a `Hash`:
```ruby
def active_students(end_date)
  @active_students ||= Hash.new do |h, key|
    h[key] = OrganizationMembership.where(
      # ...
    )
  end
  @active_students[end_date]
end
```
The input that was being used for our query, in this case `end_date`, serves as the key for each query’s result in the `Hash`. So, say we call the helper with `Date.new(2019, 4, 16)`: even if the result of the query is empty (`[]`), subsequent calls to `active_students` with `Date.new(2019, 4, 16)` as input will no longer have to query the database. They will instead find the key in our `Hash` and return the already-computed value. It may not seem like much in this example, but imagine if you had a query that took between 10 and 15 seconds to resolve. If you had a loop that called that same query over and over without caching, you could be in trouble!
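The caching behavior can be demonstrated in isolation. Below, a call counter stands in for the expensive database query (the class and names are illustrative); the second call with the same key hits the `Hash` instead of recomputing:

```ruby
require 'date'

class Report
  def initialize
    @query_count = 0
  end

  attr_reader :query_count

  # Same shape as the helper above: a Hash with a default block
  # memoizes one result per distinct input.
  def active_students(end_date)
    @active_students ||= Hash.new do |h, key|
      @query_count += 1 # stand-in for the expensive DB query
      h[key] = []       # even an empty result gets cached
    end
    @active_students[end_date]
  end
end

report = Report.new
report.active_students(Date.new(2019, 4, 16))  # computes: query_count is 1
report.active_students(Date.new(2019, 4, 16))  # cached:   query_count still 1
report.active_students(Date.new(2019, 5, 16))  # new key:  query_count is 2
```

Note that, unlike a bare `@result ||= query`, the `Hash` caches a separate result per input, and caches empty results too.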
I hope you were able to pick up on something new from these tips! If your timeouts haven’t yet resolved, worry not! Stay tuned for Part II of this series where we’ll tackle indexes and parallel processing to put them to rest for good!
As we like to say at the Q: tara for now! 🐟