Auto-Complete Search

I was recently asked during a technical interview to design a Google-esque autocomplete search system. I designed a sensible solution but started to falter when it came to optimising and scaling, so I've created a short writeup below covering some potential improvements.


Simple in-browser autocomplete

The demo above is a simple demonstration of the use of a Trie for text storage. As new characters are typed the Trie is traversed and the first five child branches are returned as suggestions. As new queries are made they are added to the Trie.

A more robust and scalable implementation

When building a practical solution there are a number of additional considerations that should be made. This solution could involve requests being made to a server storing a more complex version of the trie in the demonstration.

How long are searches stored?

Storing searches for less time or placing more of a weighting on recent searches would allow for better autocompletion of searches for recent events.

Personalisation

A smaller Trie could be stored for the user and the results compared with the server response. This could allow for better autocompletion of frequent searchers for that specific user.

Localisation

Different Tries could be stored for different regions which would allow for better autocompletion of local events or places.

Node values

For a more complex solution simply returning the first five bracnhes is not enough. To remedy this, each node could also store a score the five best paths from it. When new nodes are added the scores and corresponding top 5s would be recalculated - up to the root.

Read/write conflicts

This could easily lead to conflicts with reading and writing, although accuracy and consistency aren't top priorities could counteract some of the difficultues by using sampling. Instead of adding every new search we only add a smaller fraction, the specific fraction would depend on volume.

Scalability

At even moderate scales this solution would quickly exceed the memory of a single machine and so would have to be distributed. Some amount of replication would also ensure availability in the event one machine crashed.