A problem that plagues any holder of large amounts of data is getting it all uploaded to a new site, or updated when something is slightly off. Such is the problem we have been trying to tackle as of late.
Initially the idea was to bulk upload or update these datasets using the DKAN API, driven by some code (probably in PHP) to access the site. This proved rather fiddly and prone to breaking even in the small tests we ran while developing it. Surely a large data platform such as DKAN would have an easier way to bulk upload a bunch of datasets in one go?
Some research later and we discovered a built-in module for DKAN known as Feeds. The idea behind it is that it provides tools to build importers for importing data into your site. Exactly what we needed!
There are a few ways to set up an importer, but we decided the simplest is to upload a CSV file containing all the relevant information about the datasets.
Each importer can be set up to do slightly different things. For example, the bulk upload dataset importer shown above connects to our dataset content type, with each column in the CSV mapping to one of the fields a dataset usually takes; it then creates new dataset entries from the information it imports. By comparison, the bulk update dataset importer does the same thing, but instead of creating new dataset entries it updates any existing dataset whose node matches the one provided.
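As a rough sketch, a bulk upload CSV might look something like the fragment below. The column headings here are purely illustrative; the real ones are whatever your importer's field mappings define.

```csv
title,description,keywords,licence,publisher
Air Quality Monitoring,Hourly NO2 readings from city sensors,air quality,Open Government Licence,Environment Team
Road Traffic Counts,Annual traffic counts by junction,transport,Open Government Licence,Highways Team
```

Each row becomes one new dataset node, with each cell landing in the mapped field.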
One limitation of the base Feeds module is that it cannot handle fields that take multiple entries, such as keywords: it can add one entry but not several. To get around this there is another module you can add to DKAN known as Feeds Tamper, which lets us modify data in various ways before it gets saved. For our purposes here we only need one of its functions, Explode.
Explode lets us split a string using a specified separator, which in this case is a +. Adding this to the keywords mapping of the importer allows us to put in multiple keywords (or phrases; it has no trouble with spaces) simply by adding plus symbols between each one.
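The effect of the Explode step can be sketched in a few lines of Python: one keywords cell is split on the separator into a list of values for the multi-value field. This is only an illustration of the behaviour, not DKAN's actual code.

```python
def explode(cell: str, separator: str = "+") -> list[str]:
    """Split one CSV cell into multiple field values, like Feeds Tamper's Explode."""
    # Strip surrounding whitespace so "open data + health" works as well as "open data+health".
    return [part.strip() for part in cell.split(separator) if part.strip()]

keywords = explode("open data+health+air quality")
print(keywords)  # ['open data', 'health', 'air quality']
```

Note that multi-word phrases like "air quality" survive intact, since only the + acts as a separator.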
Now, while this solution works rather well, it still has a few hiccups we are looking to address. First of all, when we try to set a group to connect to each dataset it simply does not work, so the group has to be assigned manually afterwards.
We also discovered that while the licence is set correctly on import, if you then go and edit the dataset manually it gets reset to no licence. A minor annoyance, but something to look out for.
Finally, there is the problem of the geographic coverage location and granularity fields. For some inexplicable reason these two separate data fields are linked in some way, meaning that if you try to set both of them in one import they both end up with the same value. While annoying, there is a temporary fix: import the datasets with only one of those fields, then after importing run an update with the other field. It's not exactly the best situation, but it works for our purposes.
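The two-pass workaround amounts to splitting the spreadsheet before importing: the first pass carries everything except one of the conflicting fields, and the second pass carries only the node identifier plus that remaining field. The column names below ("nid", "geographic_coverage", "granularity") are assumptions for illustration, not DKAN's actual field names.

```python
def split_for_two_pass(rows: list[dict], key: str = "nid") -> tuple[list[dict], list[dict]]:
    """Split rows into two imports so the linked fields are never set together.

    First pass: every column except granularity (creates/updates the datasets).
    Second pass: just the node id and granularity (an update-only import).
    """
    first_pass = [{k: v for k, v in row.items() if k != "granularity"} for row in rows]
    second_pass = [{key: row[key], "granularity": row["granularity"]} for row in rows]
    return first_pass, second_pass

rows = [{"nid": "101", "geographic_coverage": "Leeds", "granularity": "Ward"}]
first, second = split_for_two_pass(rows)
print(first)   # [{'nid': '101', 'geographic_coverage': 'Leeds'}]
print(second)  # [{'nid': '101', 'granularity': 'Ward'}]
```

Each half would then be saved as its own CSV and run through the upload importer and the update importer respectively.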
Hopefully with more testing and research these issues can be ironed out, but even so the Feeds module seems to be a great way to get lots of data onto a DKAN-based site, and it is certainly a lot faster than inputting all this information manually.