Top-k queries in Cassandra: An embedded mapreduce approach

by | Mar 13, 2015 | Developer | 0 comments

Stratio has just added top-k queries support to its Lucene based implementation of the Cassandra’s secondary indexes. This implementation was originally designed to allow embedded full-text and multivariable search in Apache Cassandra. The previous release included an ad-hoc mechanism to perform distributed relevance queries based on the Lucene’s scoring algorithm. The current release generalizes this mechanism to allow several types of top-k queries.

What is a Top-k query?

Top-k queries are those that get the k best results for a query according to some user-defined criteria, assigning a score to each result. Examples of use can be to retrieve the k last tweets of a given user, the k best scores of a gaming leaderboard or the k most relevant news in a newspaper for a given full-text search.

Lucene provides top-k queries out of the box. It uses its index structure to avoid full data scan. However its usage is limited to a single machine, so the challenge here was to make it work in the distributed environment of Cassandra. In the Cassandra’s standard secondary indexes implementation the coordinator node iterates sequentially over all the nodes in the cluster collecting the locally indexed data until the number of requested results is satisfied. It can be seen somehow as a sequential mapreduce algorithm where the map phase is querying locally indexed data and the reduce phase is just adding collected data. Nodes are travelled sequentially because it allows us to stop the iteration when enough results have been collected.


However, top-k queries -and many other- must visit all the nodes in the cluster because the most relevant results can be in any of these. This collect process can be done in parallel to reduce latency. Also, the partial node results must be combined with a specific logic in the coordinator node, not just collected. We have modified the secondary indexes query mechanism to provide a more general, reusable, optionally parallel, mapreduce system. So, our Lucene-based top-k queries are just a particular use case of its abstract mechanism.

How to perform Top-k queries

First, we have added a customizable combine(command, rows) method to the secondary index interface which is called by the coordinator to properly merge the partial results according to a custom logic defined in the concrete index implementation. Second, we have added to the 2i interface a requiresFullScan(command) method to specify when a full parallel cluster scan must be performed. Top-k queries needs this full-scan because the most relevant results can be in any node in the cluster whereas the embedded 2i approach does not need to traverse all the nodes. Finally, we have modified the Cassandra’s storage proxy to parallelize the node querying when full cluster scan is required. Thus, we get not only a way to perform top-k queries over our index, but we have a general way to run simple mapreduce algorithms inside Cassandra.

As an example, we can perform relevance queries such as retrieving the 100 most relevant tweets containing the phrase “big data”:

And we can retrieve the 20 most recent tweets containing the phrase “big data”:

The modified version of Apache Cassandra with the Lucene-based secondary indexes is publically available in GitHub under the Apache License, v2.