Feeding a monster – Solving a problem with bulk data importing

For some time now the UK Data Service instance of DKAN has been populated manually, one dataset at a time. That changed when we discovered a module called Feeds, which can bulk upload a set of datasets in one go.

The Feeds module accepts several kinds of input, but we configured it to take a CSV file and use the information it contains to bulk create a collection of datasets. This CSV file holds all the data needed to create each dataset, from the title and topics to the information used to create the maps.
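To make the idea of "one CSV row per dataset" concrete, here is a minimal Python sketch of what such a file might look like and how its rows break down. The column names (title, topics, resource_url), the semicolon separator for topics, and the example.org URLs are all assumptions made up for illustration; they are not the headers or data used in our actual importer.

```python
import csv
import io

# Hypothetical CSV layout: the column names and rows below are invented
# for illustration and are not the real importer's headers or data.
SAMPLE_CSV = """title,topics,resource_url
Example dataset A,population;housing,https://files.example.org/a.csv
Example dataset B,employment,https://files.example.org/b.csv
"""

def read_datasets(csv_text):
    """Return one dict per dataset row, splitting the topics column into a list."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed convention: multiple topics are separated by semicolons.
        row["topics"] = [t.strip() for t in row["topics"].split(";") if t.strip()]
        rows.append(row)
    return rows

datasets = read_datasets(SAMPLE_CSV)
for dataset in datasets:
    print(dataset["title"], dataset["topics"], dataset["resource_url"])
```

Each column here plays the role of a source that the importer would map onto a dataset field.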

Unfortunately, we have discovered a flaw in this plan when it comes to resources. While the DKAN software can store data internally, some of our files are much larger than it can handle. For this reason, it was decided early in the implementation that files would be stored on a separate file server.

For manual input of these resources this is no problem. We simply add each file as a remote file, linking to it by URL. The resource then gets the correct styling and gives access to any available previews of the data, such as the CSV previews.

[Figure: Example of the resources]

When it comes to the Feeds importer, however, there has been a problem. To fill in the information a resource requires, we map each value in the CSV to a field. For example, the first value in the CSV may be the title, which maps to the title field. For the files we have three options.

The first uploads the file to the DKAN database which, as explained earlier, we cannot do.

The second option is linking to the data via a URL. This seems like it would be the correct option, but it only provides a direct link to the file. None of the data can be previewed, and if a user attempts to preview it the file immediately starts downloading instead, which is not ideal.

That leaves us with the third option: linking to a remote file, which is what we use for manual resource creation. It doesn’t work the same way through Feeds, though. For some reason, this option tries to upload the file to the DKAN database just like the first option, which it really shouldn’t do.

So, the question now is how do we fix this problem? It’s certainly not an easy fix, and after much trial and error it can look as though little progress has been made. When you take a step back, though, quite a bit has been worked out.

For one, this issue began as a simple 500 error on the site. For those who don’t know, this sort of error tells you very little beyond the fact that something went wrong on the server. We began trying different combinations of the fields we were passing to the bulk importer and narrowed the failure down to the resource file itself.
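We did this narrowing down by hand, but the same idea can be expressed as a bisection. The Python sketch below is not the procedure we followed and is not part of Feeds or DKAN; it assumes a hypothetical import_fails callback that runs a trial import on a subset of items (fields or rows) and reports whether it produced an error, and it assumes a single item is responsible.

```python
def find_failing_item(items, import_fails):
    """Bisect items to find the single one that makes a trial import fail.

    import_fails is a hypothetical callback: it takes a list of items and
    returns True if importing that subset produces an error. The search
    assumes exactly one culprit, and that any subset containing it fails.
    """
    if not import_fails(items):
        return None  # nothing in this batch triggers the error
    candidates = list(items)
    while len(candidates) > 1:
        half = len(candidates) // 2
        first = candidates[:half]
        # Keep whichever half still reproduces the failure.
        candidates = first if import_fails(first) else candidates[half:]
    return candidates[0]

# Stand-in for a real trial import: here the "resource file" field is the culprit.
fields = ["title", "topics", "map data", "resource file", "description"]
culprit = find_failing_item(fields, lambda subset: "resource file" in subset)
print(culprit)
```

Halving the candidates each time means a batch of dozens of fields or rows needs only a handful of trial imports rather than one per item.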

After more trial and error we discovered that the error was related to the size of the file we were trying to add. This initially didn’t make much sense: these files are stored off site and just linked in, so why would file size cause any problems? However, once you consider that the files are being uploaded to the DKAN database rather than just being linked to, it suddenly makes perfect sense, especially if you know the file size limits for uploads.
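One way to spot which remote files are likely to hit the limit before attempting an import is to ask the file server for each file's size. The following Python sketch does that with a HEAD request. The limit value is an assumption for illustration only (the real figure depends on the server's configuration and is not stated here), the function names are our own, and the check relies on the file server reporting a Content-Length header.

```python
import urllib.request

# Assumed limit for illustration only; the real value depends on how the
# server is configured and is not given in this post.
UPLOAD_LIMIT_BYTES = 64 * 1024 * 1024

def remote_size(url, opener=urllib.request.urlopen):
    """Return the Content-Length reported for url via a HEAD request, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with opener(request) as response:
        length = response.headers.get("Content-Length")
    return int(length) if length is not None else None

def too_large(size_bytes, limit_bytes=UPLOAD_LIMIT_BYTES):
    """True when a known size exceeds the limit; unknown sizes are not flagged."""
    return size_bytes is not None and size_bytes > limit_bytes
```

Calling too_large(remote_size(url)) for each resource URL in the CSV would give a list of files to hold back from a bulk import. The opener parameter exists only so the network call can be swapped out when testing.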

Unfortunately, that is about as far as we have got towards finding a solution. We know where the problem is, so that’s half the fight, but the issue persists. So where do we go from here? There are two avenues we are pursuing.

The first is to continue what we have been doing: testing the Feeds module by trial and error as much as we can to see whether there is another way to get these resources bulk uploaded.

Our second option is one that is always recommended in these sorts of situations: asking for help. We have reached out to the developers of both DKAN and the Feeds module with our problem. Hopefully somebody there knows how to fix it, and if not, they should at least be able to point us in the right direction. Who knows? This might be a bug that they need to fix. It certainly seems that other users of the software have had similar issues with this module.
