Improving our search capabilities

When searching our digital resources site, you may find that being specific in your search terms doesn’t get you the right results. Take this screenshot for example. We want to find any datasets containing age 37. It is a very specific number that someone might want to search for, and yet here we get no results.

This is something we have been looking into and are trying to fix.

One of the ways we have tried to provide this search capability is by adding something we have taken to calling the hidden search field. This field is hidden from our users and contains all the extra search terms that are not already picked up by things like our keywords and titles.

By adding this field to each dataset, we can index it as part of our search API, so that when you search for one of these obscure terms, like the age 37 example from before, it actually returns the relevant datasets!

Unfortunately this process has hit a roadblock. The indexing capabilities of the standard DKAN search API seem to be limited to 1000 characters per field. If a field goes over that 1000 character limit, it doesn’t get indexed at all.

One possible solution to this issue was to remove all the unwanted terms from the field. This included stop words (the words that search engines like Google ignore, such as a, and, the), keywords already indexed from other parts of the dataset, and special characters such as lines, slashes and brackets.
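To make that cleanup step more concrete, here is a rough Python sketch of the idea. It is only an illustration and not the code running on our site: the function name, the tiny stop word list and the sample text are all made up for this example, and a real stop word list would be much longer.

```python
import re

# Illustrative stop word list (an assumption for this sketch);
# a real list would contain many more words.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "for"}


def clean_hidden_field(text, already_indexed):
    """Return a shortened version of the hidden search text.

    `already_indexed` is a list of keywords that are picked up
    from other parts of the dataset (titles, keywords and so on).
    """
    # Lowercase the text and replace special characters such as
    # slashes and brackets with spaces (letters, digits and
    # underscores are kept).
    text = re.sub(r"[^\w\s]", " ", text.lower())

    # Words we do not want to keep: stop words plus anything that
    # is already indexed elsewhere.
    skip = STOP_WORDS | {word.lower() for word in already_indexed}

    kept = [word for word in text.split() if word not in skip]
    return " ".join(kept)


if __name__ == "__main__":
    raw = "Age (37) and the occupation/industry of residents"
    print(clean_hidden_field(raw, ["occupation"]))
```

Running this on the made-up sample text leaves only the terms that are not stop words and not already indexed, which is the kind of shortening we were hoping would bring fields under the limit.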

Removing all these things did lower the character counts for most of the test sample of datasets we were using. For some, however, the length was still way over 1000 characters. One example describing occupation was over 2000 characters even after shortening.
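A quick way to see which datasets are still too long is to compare each hidden field against the limit. The snippet below is a minimal sketch of that check, assuming the hidden fields are available as a simple mapping of dataset name to text. The names and lengths in the sample are invented for illustration and are not our real data.

```python
# Character limit we observed when indexing (see above).
INDEX_LIMIT = 1000


def over_limit(fields, limit=INDEX_LIMIT):
    """Return {dataset name: character count} for fields longer than the limit."""
    return {
        name: len(text)
        for name, text in fields.items()
        if len(text) > limit
    }


if __name__ == "__main__":
    # Made-up sample: placeholder strings standing in for hidden field text.
    sample = {
        "age": "x" * 640,
        "occupation": "x" * 2100,
    }
    print(over_limit(sample))
```

Anything this check reports would still fail to index, however much cleaning has already been applied.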

That is where things are for the moment. One of our plans is to try using an alternative search API known as Solr. Work on this is still ongoing, and you can expect a further blog post on any progress.

