Inverted index with Redis

Redis is a NoSQL database of key-value type. When we say this, it comes to our head the fact that being key value only serves for Cache-type implementations, but fortunately it is not.

We will see how to create an inverted index that is used for full-text type searches.

You could read this post in Spanish here.

But first of all, what is an inverted index? It is an index that is kept in a separate file - for this reason, the inverted index is also

known as inverted file - and it can refer to more than one source.

To give a practical example, in the Domain Driven Design book, we find the word continuous on page 15 when it talks about continuous learning and on page 341 when it talks about continuous integration. While in the Continuous Delivery book is on page 36. An inverted index would be a document where only the references of the word continuous would be maintained by importance order. For example, it can be believed that if someone searches for the word continuous almost surely is looking for continuous integrations, so the Continuous Delivery book will serve more, then we'll put the second reference of the Domain Driven Design book and finally the reference that talks about continuous learning.

Document Order Chapter Page Paragraph Position
Domain Driven Design 3 1 15 4 1
Domain Driven Design 2 14 341 1 1
Continuous Delivery 1 2 36 4 5

The aim is to have much faster searches regarding full-text search type, and against it will take much longer to add a document because all words have to be analyzed, words that do not have semantically value (a, al, If, etc.) must be discarded and new files must be created for new words or indexes must be updated if they already exist.

How do we do it with Redis?

For each word we create an ordered list being the word to search the key of the list. In the ordered list we add the position of the ranking that will occupy and a reference or alias of the position where it is.

ZADD continuous 3 refDDD1
ZADD continuous 2 refDDD2
ZADD continuous 1 refCD1

Note: I am using the Redis commands directly so that you can play it with the redis-cli.exe that is in the installation directory of Redis itself. I do not want to limit the example to a platform and a single driver.

The result after processing the entries for continuous, integration and learning would be this

To make an autocomplete type search, we add the necessary n-gram. As an example, I will do it with the first reference

    ZADD c 1 refCD1
    ZADD co 1 refCD1
    ZADD con 1 refCD1
    ZADD cont 1 refCD1
    ZADD conti 1 refCD1
    ZADD contin 1 refCD1
    ZADD continu 1 refCD1
    ZADD continuo 1 refCD1
    ZADD continuou 1 refCD1
    ZADD continuous 1 refCD1

    ZADD c 2 refDD2
    ZADD co 2 refDD2
    ZADD con 2 refDD2
    ....

Then each alias is placed as the key of a hash structure with the other values that we are interested on, which are on the search, such as a document, chapter, and page.

    HMSET refDDD1 document “Domain Driven Design” chapter 1 page 15 paragraph 4 position 1
    HMSET refDDD2 document “Domain Driven Design” chapter 14 page 341 paragraph 1 position 1
    HMSET refCD1 document “Continuous Delivery” chapter 2 page 36 paragraph 4 v5

And now let's search!

When we have all the words with their n-grams, we have to apply the search algorithm. Let's take the example someone writes "con" as a keyword.

    zrange con 0 4

With this command, we will get the first five words (the start and end position are included) from the sorted list which has the "con" key. As it is an ordered list the result will be:

    refCD1 
    refDDD2
    refDDD1

We take these values ??and look for each value as a hash to get the name of the document

    HGET refCD1 document

The result of this last command will be

    Continuous Delivery

Performance

According to the Redis documentation, the zrange command is an operation with a cost O (log (N) + M) cost where N is the number of records that has the ordered list (ie our inverse index) and M the number of items that we want to return.

While the HGET command has an O (1) complexity, that is, it has the same cost, no matter how large the hash.

If in the example we had to return not only the name of the document but also the title, paragraph, etc ... we would have to use the command HGETALL, this command has a cost of O (N), that is to say, it increases the more properties Of the document we saved.

If what you like are numbers of an application in production, I tell you ... for a ranking of 150,000 positions, with an average of 16 words per position that included the name of the product and synonymous words, a ZRANGE is made that takes 30 Records and then 30 HGET searches to obtain the product title, that is 31 searches, and it takes 4 milliseconds in an i7 to 2.3 Ghz and it does not take more than 400 Mb of RAM.