Data Catalogs Done Right#
piveau is an open source metadata catalog solution. It is highly scalable and covers the essential life cycle of your metadata: harvesting, storage and quality assurance.
piveau was designed and developed around Semantic Web technologies, the W3C standard DCAT and the European standard for Open Data, DCAT-AP. It closes the gap between formal metadata specifications and their application in production. piveau puts a strong emphasis on Open Data and is a leading solution for public administrations and non-profit organizations to publish interoperable and flexible metadata catalogs.
Background#
Datasets#
It is customary in data management to divide data into individual chunks, so-called datasets. A dataset holds data about a certain topic. This could be, for example, the demographic development of a country over a certain period of time, or the number of people who used a city's public transportation system during the last few months. A dataset contains two things:
- information about the data itself ("metadata"), such as the time the dataset was created or changed, a title and a description
- distributions, which contain the actual data and are mostly provided as XLS, CSV or other file formats
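The two parts of a dataset can be sketched as a small data model. This is only an illustration: the class names, field names and example URLs below are assumptions made for this sketch and are not taken from the piveau codebase.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Distribution:
    """One downloadable representation of the data (hypothetical model)."""
    access_url: str
    file_format: str  # e.g. "CSV" or "XLS"


@dataclass
class Dataset:
    """Metadata plus the distributions that carry the actual data."""
    title: str
    description: str
    created: date
    modified: date
    distributions: list[Distribution] = field(default_factory=list)


# Example values are invented for illustration only.
ridership = Dataset(
    title="Public transport ridership",
    description="Monthly passenger counts for a city's public transport system.",
    created=date(2024, 1, 15),
    modified=date(2024, 6, 1),
    distributions=[
        Distribution("https://example.org/data/ridership.csv", "CSV"),
        Distribution("https://example.org/data/ridership.xls", "XLS"),
    ],
)

# The metadata describes the dataset; the distributions point to the files.
formats = [d.file_format for d in ridership.distributions]
print(ridership.title, formats)
```

The same dataset can be offered in several formats, which is why distributions are modelled as a list rather than a single field.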
DCAT-AP#
One of the most widely adopted standards for describing datasets is DCAT, together with its extension, the DCAT Application Profile for data portals in Europe (DCAT-AP). The latter adds metadata fields and mandatory property ranges, making it suitable for use with Open Data management platforms.
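To give an impression of what a DCAT description looks like, the following sketch builds a JSON-LD document for a dataset with one distribution. It uses a handful of terms from the DCAT and Dublin Core vocabularies; the helper function `build_dataset` and the example URIs are assumptions for illustration, and the result is not a complete or validated DCAT-AP record.

```python
import json

# Namespace prefixes of the DCAT and Dublin Core Terms vocabularies.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def build_dataset(uri, title, description, access_urls):
    """Return a JSON-LD dict describing a dataset (hypothetical helper)."""
    return {
        "@context": CONTEXT,
        "@id": uri,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        # Each distribution points to where the actual data can be accessed.
        "dcat:distribution": [
            {"@type": "dcat:Distribution", "dcat:accessURL": {"@id": url}}
            for url in access_urls
        ],
    }


# Example URIs are placeholders, not real resources.
doc = build_dataset(
    uri="https://example.org/dataset/ridership",
    title="Public transport ridership",
    description="Monthly passenger counts for a city's public transport system.",
    access_urls=["https://example.org/data/ridership.csv"],
)

print(json.dumps(doc, indent=2))
```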
piveau Components#
piveau is based on a microservice architecture and a custom pipeline system, which allows features to be composed flexibly and scaled independently.
piveau hub#
Hub is the central component for storing and registering the data. Its persistence layer consists of a Virtuoso triplestore as the principal database, Elasticsearch as the indexing server and MongoDB for storing binary files.
piveau consus#
Consus is responsible for the data acquisition from various sources and data providers. This includes scheduling, transformation and harmonization.
piveau metrics#
Metrics is responsible for creating and maintaining comprehensive quality information and feeding it back to the Hub.
piveau pipeline (PPL)#
The piveau pipeline can be imagined as a data processing chain that is described by a plain JSON document with a list of segments. These segments correspond to steps that are performed by the piveau services. Every segment includes at least meta-information, targeting the respective service and defining the consecutive service(s). The entire descriptor is passed from service to service as state information. You can learn more about the piveau pipeline in our developer guide.
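The idea of a descriptor that travels from service to service can be sketched as follows. Note that this is a simplified illustration and not the actual piveau pipeline schema: the field names (`meta`, `service`, `next`, `processed`, `config`), the service names and the `run` function are all assumptions, and the services are simulated in-process rather than called over the network.

```python
import copy
import json

# Hypothetical descriptor: a plain JSON-compatible document with segments.
descriptor = {
    "name": "example-pipe",
    "segments": [
        {
            "meta": {"service": "importer", "next": ["transformer"], "processed": False},
            "config": {"source": "https://example.org/catalog"},
        },
        {
            "meta": {"service": "transformer", "next": ["exporter"], "processed": False},
            "config": {},
        },
        {
            "meta": {"service": "exporter", "next": [], "processed": False},
            "config": {},
        },
    ],
}


def run(descriptor, start):
    """Walk the chain from `start`, marking each segment as processed.

    Returns the updated descriptor (the state handed from service to
    service) and the order in which the services were visited.
    """
    state = copy.deepcopy(descriptor)  # leave the original untouched
    by_service = {s["meta"]["service"]: s for s in state["segments"]}
    visited = []
    queue = [start]
    while queue:
        name = queue.pop(0)
        segment = by_service[name]
        if segment["meta"]["processed"]:
            continue
        # A real service would do its work here using segment["config"].
        segment["meta"]["processed"] = True
        visited.append(name)
        # The segment itself names the consecutive service(s).
        queue.extend(segment["meta"]["next"])
    return state, visited


final_state, visited = run(descriptor, "importer")
print(visited)
print(json.dumps(final_state["segments"][0]["meta"]))
```

Because every segment carries its own meta-information, a service only needs the descriptor it receives to know what to do and where to forward the result.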
How is piveau used?#
The piveau codebase is licensed under Apache 2.0 and can be found in our central GitLab repository.