I think this is a terrible idea, even though I like Redis and use it in my projects. Why is it a bad idea? Because Redis keeps everything in memory, and scaling it up gets quite expensive on very large datasets. Even when Redis VM comes out (which will eliminate the all-data-in-memory requirement), it's a bad idea because Redis isn't really optimized for full-text search...
What's better? Use a tool that's optimized for the job. I would recommend looking at Sphinx, which is quite amazing and can handle indexing billions of words on a single server without using much CPU or memory. Plus Sphinx has tons of features, such as geo-based search, fuzzy search, boolean search, full support for Unicode etc. etc.
If you're going to look at Sphinx, you'd do yourself a disservice by not also looking at Solr.
Solr has effectively the same feature set, including geospatial and multilingual searching, but it has better relevance, and generally is faster at returning query results (although Sphinx is faster at doing full reindexes).
Also, unlike Sphinx, Solr doesn't glue you to MySQL. Or to any SQL database; use it with Cassandra, Voldemort, text files, quantum storage in the galactic hive-mind, whatever.
(edit: They've added PostgreSQL support, and raw XML support, since I last installed and tested Sphinx about a year ago)
Plus, unlike Sphinx, you don't need to reindex when you add new records, because Solr can seamlessly merge indexes on-the-fly.
I know I sound like a fanboy here, but I spent a lot of time evaluating the two of them for our product, and Sphinx just didn't fit the bill for a large number of reasons. It's a good solution if you just need fulltext indexing in a MySQL database, but if you want to move beyond that, have a look at Solr.
Yes, it's a nice little demonstration of what you could perhaps call (not unkindly) naive full-text search. It misses quite a few features that a more mature solution would give you and that you might want quite badly. Relevance is an obvious example: this doesn't allow you to order results by how frequently words occur, which is almost certainly what the user wants.
You could probably extend this to include that kind of thing with some clever hacking, but as it stands it's still a long way from competing with something like Sphinx. Still, it's a nice little demo of using Redis and taking advantage of its specific strengths, such as first-class set support.
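The relevance ordering mentioned above could plausibly be layered on top of a Redis-style index by keeping per-word occurrence counts instead of plain membership (in Redis terms, a sorted set updated with ZINCRBY rather than a plain set). A minimal pure-Python sketch of that idea, with Counters standing in for Redis sorted sets (the words and document ids are made up for illustration):

```python
from collections import Counter, defaultdict

# Toy relevance layer: instead of a plain set of document ids per word,
# keep a Counter of id -> occurrence count (what a Redis sorted set
# maintained with ZINCRBY would give you), then rank matches by the
# summed counts across the query terms.
index = defaultdict(Counter)

def add_document(doc_id, text):
    for word in text.lower().split():
        index[word][doc_id] += 1

def search(*words):
    # Intersect the posting lists, then order by total term frequency.
    postings = [index[w] for w in words]
    common = set.intersection(*(set(p) for p in postings))
    return sorted(common, key=lambda d: -sum(p[d] for p in postings))

add_document(1, "redis redis search demo")
add_document(2, "redis search engine search search")
print(search("redis", "search"))  # doc 2 ranks first on term frequency
```

This is of course still a long way from the relevance models a mature engine uses (no inverse document frequency, no field weighting), but it shows the shape of the extension.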
Yep, this solution is certainly quite a way off something with the capabilities of Sphinx. In our case we didn't need any scoring of results, but it could be partly achieved by computing the Levenshtein distance between the search term and the matched word. However, counting the number of occurrences (or applying other weightings) would require a bit more thought.
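For reference, the Levenshtein-distance scoring suggested here is straightforward to sketch: the distance is the minimum number of single-character edits between two strings, and matched words can be ranked by their distance to the search term. A hedged, self-contained version (the example words are illustrative only):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance: the minimum number of
    # single-character insertions, deletions and substitutions needed
    # to turn string a into string b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Rank matched words by closeness to the search term (lower is better).
matches = ["ready", "redis", "radish"]
ranked = sorted(matches, key=lambda w: levenshtein("redis", w))
print(ranked)  # exact match first, then near misses
```

Note this only scores how closely a word resembles the query, not how often it occurs in the document, which is the harder part mentioned above.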
A larger, Redis-backed search engine would certainly be an interesting project though...
Yes it certainly would. I have to admit that it's crossed my mind more than once. The set capabilities could lend themselves very well to some high-performance faceting functionality, so you might end up with something between Lucene and Solr. There is definitely a use case for something like that - Sphinx does excel on large data sets, but there are a lot of people who know that they'll only ever need to search a few hundred megabytes of text anyway. You wouldn't want to use that for a scaling startup perhaps, but the world is larger than that...
I certainly agree that a purpose-built solution could be better if you needed the features you mention, but that was not our use-case. This provided an elegant solution which would avoid us having to duplicate all our item metadata. Also, given our schedule it seemed clear that Redis VM would be available by the time we needed it.
It is true that Redis is not optimised specifically for full-text search. However, it is optimised for serving data structures very quickly which is exactly what we needed.
In Sphinx you don't store metadata - you just store an inverted index mapping words to ids. And I bet that setting up Sphinx would have been much easier than rolling your own solution ;-)
Btw. like I mentioned, Sphinx does support fuzzy search (and boolean search), so I don't really know what you gain by rolling your own solution in Redis, other than worse scalability and a crippled full-text search.
Sphinx is an excellent option for full-text search. The post didn't go into depth on our particular use case, but this full-text indexing is just a small part of a larger filtering system. Having the index in Redis meant that we could execute the entirety of the query (which may factor in users, tags, modification dates etc) in Redis via union/inter/diff operations.
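The composed query described here - intersecting and unioning index keys for words, tags, users and so on - maps directly onto Redis SINTER / SUNION / SDIFF over set keys. A minimal sketch of the idea, using plain Python sets to mirror the Redis set semantics (the key names and item ids are hypothetical):

```python
# In Redis each of these would be a set key such as "word:redis",
# "word:search", "tag:open" or "user:alice", combined server-side with
# SINTER / SUNION / SDIFF; Python's set operators mirror those commands.
word_redis  = {1, 2, 3, 5}   # items whose text contains "redis"
word_search = {2, 3, 4, 5}   # items whose text contains "search"
tag_open    = {1, 2, 5}      # items carrying the "open" tag
user_alice  = {5, 6}         # items assigned to a hypothetical user

# "Open items matching both words, excluding alice's" as one composed query:
result = (word_redis & word_search & tag_open) - user_alice
print(sorted(result))  # -> [2]
```

The appeal is that the full-text terms are just more set keys alongside the tag and user filters, so one round trip to Redis can evaluate the whole filter expression.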
Could we have learnt and used another technology? Sure, and we may still do so. For now this solution works well for us.
Should other people use this technique? Like always, it depends on the situation.
I have been looking into this area myself recently and this is great. My dataset is in the 300-500 item range, and including a full search engine seemed like overkill, especially when I am already using Redis.
I will probably go with Sphinx for its other capabilities, but this sounds better than the other Redis search implementations that are out there.
Redis rocks, but using it as a search engine does not rock at all IMO. For one thing - if the item changes then you have to update all the keys that hold references to that item.
EDIT: it's fine if your items are very small or do not change often
Yes, there is a caveat: if some words are removed from the database, you will need to reindex for that to be taken into account. However, in the case of bug tracking (which is largely what we are doing), data tends to be added rather than removed. Plus, re-indexing happens on a per-project basis, not a global basis.
Also, for our use case it is not necessarily a bad thing that removed words may cause an item to be shown in the results. After all, just because a paragraph was removed from a bug description doesn't necessarily mean it is invalid search fodder (bearing in mind that, in our case, full-text search is just one way of filtering).
Wouldn't you have to update any index you have whenever the content changes anyway (or lazily at some point thereafter)?
Updating the index should be as simple as calculating the old metaphones, removing them from the index, and then reinserting the key based on the metaphones of the new content (MULTI/EXEC is perfect for this).
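That update flow can be sketched in a few lines. This is a hypothetical illustration, not the post's actual code: a dict of sets stands in for the Redis keys, the two loops would sit inside a single MULTI/EXEC transaction in Redis (SREM for the old keys, SADD for the new), and `keys_for` is a crude placeholder where the real index would compute metaphone codes:

```python
from collections import defaultdict

index = defaultdict(set)  # stand-in for Redis set keys

def keys_for(text):
    # Placeholder for the metaphone step: the real index stores
    # phonetic codes rather than raw lowercased words.
    return {w.lower() for w in text.split()}

def update_item(item_id, old_text, new_text):
    # Inside Redis this whole function would be one MULTI/EXEC block.
    for key in keys_for(old_text):
        index[key].discard(item_id)   # SREM key item_id
    for key in keys_for(new_text):
        index[key].add(item_id)       # SADD key item_id

update_item(7, "", "crash on login page")
update_item(7, "crash on login page", "crash on signup page")
print(sorted(index["login"]), sorted(index["signup"]))  # -> [] [7]
```

Wrapping the removes and re-adds in one transaction means concurrent searches never see the item half-indexed.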
It's also worth noting that the techniques shown could also be used for tagging, which would involve a lot less indexing in general (and search relevancy wouldn't be an issue either).
Wow! Lots of ifs, maybes, projections and conjectures. It would be nice to see someone give a concrete example when they give an opinion. Seriously, if you disagree and it is based on provable data, please share. Lloyd Moore.
Agreed.
And to all the people who keep harping on with recommendations of existing FTS engines - how about providing some benchmarks for real-time search speeds? That's one area where Lucene (and Solr, which uses it) comes a bit unstuck: if you have a fast-changing dataset, it's hard to keep up with storing and indexing it, and that's where the Redis approach, with its high-speed writes, wins hands down. Of course you could use Lucene's RAMDirectory and then write code to persist it back to disk, but then you're just reinventing the Redis wheel.
And from what I have read of Sphinx, it likely has the same issues. Plus it seems to be aimed only at the LAMP crowd - not very friendly if you are not already using PHP+MySQL.