Client side site search for Jekyll, Programming by Coincidence
The type of fuzzy search I have implemented will provide the following matches for the query string “cold pa”:
- ColdFusion Package management
- Scaffolding with sporadic spacing
If all the letters in your search term are found in the item to be searched, and in the order you have typed them, they will appear in results, regardless of exact word matches. You may find that you get results that you really weren’t searching for, but with good result scoring, the correct results should rise to the top.
In Sublime Text 2, it is used everywhere and is particular useful in finding files buried in directories. For example, if I want to find
./cfstatic/util/Base.cfc, I can type: ‘statutibas’ and my Base.cfc file will be found, and most likely be the first result. It may not look particularly intuitive (statutibas?!), but once you get the hang of it, it feels like a natural extension of your brain.
A little regex
The first step in solving the problem was to produce a regular expression from the user input. The expression is simple. All we want to do is match all of your input characters, in the order that you type them, anywhere in the string. For example, if the input is ‘cold pa’, we generate the expression:
Next, we want to be able to show the user what has been matched by highlighting the result. For the expression above, we need to generate a corresponding pattern with which to replace the matched string with a string that makes all the matched letters bold (I believe that using the b tag rather than strong is correct in this case, please correct me if I’m wrong!):
Scoring the results
I am scoring results by how narrowly they match the input. This is expressed as:
matchedSubstring.length - userInput.length
For the input, ‘cold pa’, “Cold pathology” would score 0 and “Scaffolding with sporadic spacing” would score 14. The closer to zero, the better the match.
Before getting in to the code, there is a small complication to consider here: a single string may have multiple matches to the regular expression. We are most interested in the most “narrow” match. My solution to this is a little brutal I think - I attempt the regex on decreasing substrings of the user input, breaking when no match is found.
My Levenshtein mistake, Programming by coincidence
As stated in my introduction, I made a classic ‘Programming by coincidence’ mistake. Before the scoring solution above, I was using the Levenshtein Distance string metric for scoring search results. This fancy sounding metric provides a score based on the number of ‘single character edits required to change one word into another’.
It was not until I looked to fully understand the algorithm, in order that I might write about it in this blog post, that I realised my mistake. Using it in this case was having no more effect than ordering the results by the length of the page or post title; a fancy waste of CPU cycles!
Generating my search collection