Dataset Similarities

Introduction

When browsing datasets, users may want to know which other datasets are similar to a particular one. The Dataset Similarities service fingerprints each dataset by hashing a combination of its title and description with the TLSH algorithm. One file is generated per catalogue, and each dataset's URI is written to it together with the corresponding hash value.

An incoming dataset URI is looked up in the file of its parent catalogue. The hash value found there is then compared with the hashes of other datasets, which allows the most similar datasets to be retrieved.
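
The lookup and comparison step can be sketched roughly as follows. This is a minimal illustration, not the service's actual implementation: the tab-separated file layout, the function names, and the `example.org` URIs are assumptions made for the example. The distance function is passed in as a parameter, and `char_distance` is only a stand-in so the sketch runs without extra dependencies. With real TLSH hashes, a TLSH distance function would take its place.

```python
from typing import Callable, Dict, List, Tuple


def parse_catalogue_file(text: str) -> Dict[str, str]:
    """Parse lines of '<dataset URI><TAB><hash>' into a dict.

    The tab-separated layout is an assumption made for this sketch.
    """
    entries: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        uri, digest = line.split("\t", 1)
        entries[uri] = digest
    return entries


def char_distance(a: str, b: str) -> int:
    """Stand-in distance: count of differing characters plus length gap.

    This is NOT the TLSH distance; it only keeps the sketch self-contained.
    """
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def most_similar(
    uri: str,
    entries: Dict[str, str],
    distance: Callable[[str, str], int],
    limit: int = 3,
) -> List[Tuple[str, int]]:
    """Return up to `limit` (URI, score) pairs, closest first."""
    target = entries[uri]  # raises KeyError if the URI is not in the file
    scored = [
        (distance(target, digest), other)
        for other, digest in entries.items()
        if other != uri  # never report the dataset as similar to itself
    ]
    scored.sort()  # smallest distance first, ties broken by URI
    return [(other, score) for score, other in scored[:limit]]


# Hypothetical catalogue file content with shortened placeholder hashes.
sample = (
    "https://example.org/d/1\tAAAA\n"
    "https://example.org/d/2\tAAAB\n"
    "https://example.org/d/3\tBBBB\n"
)
entries = parse_catalogue_file(sample)
result = most_similar("https://example.org/d/1", entries, char_distance, limit=2)
```

A lower score means a closer match, which mirrors how TLSH distances are read: a distance of 0 indicates identical fingerprints.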

API

The service is not a pipe module. Rehashing of datasets can be triggered manually or via a cron job. Similar datasets can be retrieved via a dedicated endpoint.

Key Technologies