
Harvesting - piveau consus#

Consus is an extract, transform and load (ETL) like framework.

When you need to fetch data or metadata from a source, Consus provides a highly performant and highly scalable solution based on microservices and container technology.


Concept#

The basic concept of Consus is that of a Pipe.

Technically speaking, a pipe is an orchestration of several modules, where each module represents one step of processing data. An example Pipe is a harvester, where data processing modules are chained together, usually into an importer, a transformer, and an exporter.

Pipe
A pipe is the chaining of data processing Pipe Segments. A pipe can be in two different states, definition and instance. Before a pipe can be executed, the definition needs to be "instantiated". Starting the pipe means passing an instance to the first segment.
Pipe Descriptor
JSON or YAML description of a Pipe.
Pipe Definition
The semantic content of a Pipe Descriptor. It usually contains some metadata about the Pipe, the chaining information of one or more Pipe Segments, and their configuration. It lacks concrete connection details, run-specific execution information, and usually any payload.
Pipe Instance
To execute a Pipe, the Pipe Definition must be instantiated as a pipe instance. Technically speaking, an instance is the Pipe Definition enriched with the actual addresses of the segment implementations (Pipe Modules), execution information such as a run id and start time, an optional Pipe Payload, and optionally run-specific segment configurations. You can then start the pipe by passing the instance to the first segment.
Pipe Segment
A description of a single module, program, or entity that can be part of a Pipe.
Pipe Payload
Data embedded in a Pipe Instance.
Pipe Module
An entity that implements a Pipe Segment.
Pipe Run
The execution of a Pipe. To start a run, pass a Pipe Instance to the first Pipe Module.

In other words, a pipe must first be defined, then instantiated and finally this instance can be started.
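The define, instantiate, and start flow can be sketched in a few lines of Python. This is an illustration only, not the Consus implementation; the `endpoints` mapping (segment name to module address) and the `instantiate` function are assumptions made for the example.

```python
import uuid
from datetime import datetime, timezone

def instantiate(definition, endpoints):
    """Turn a pipe definition into a pipe instance (illustrative sketch).

    `endpoints` maps each segment name to the address of a module
    implementing that segment -- an assumption for this example.
    """
    instance = {
        "header": dict(definition["header"]),
        "body": {"segments": []},
    }
    # Add run-specific execution information: a run id and a start time.
    instance["header"]["id"] = str(uuid.uuid4())
    instance["header"]["startTime"] = (
        datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    )
    for segment in definition["body"]["segments"]:
        seg = {
            "header": {**segment["header"], "processed": False},
            "body": dict(segment["body"]),
        }
        # Resolve the segment to the concrete module that implements it.
        seg["body"]["endpoint"] = {
            "address": endpoints[segment["header"]["name"]]
        }
        instance["body"]["segments"].append(seg)
    return instance
```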

Let's have a look at a minimal pipe definition. To define a pipe we use a pipe descriptor, either in JSON format or, more user-friendly, in YAML format.

Minimum Pipe Definition

{
  "header": {
    "name": "minimum-pipe",
    "version": "2.0.0",
    "transport": "payload"
  },
  "body": {
    "segments": [
      {
        "header": {
          "name": "one-and-only",
          "segmentNumber": 1
        },
        "body": {
        }
      }
    ]
  }
}

The same definition in YAML:

header:
  name: minimum-pipe
  version: '2.0.0'
  transport: payload
body:
  segments:
    - header:
        name: one-and-only
        segmentNumber: 1
      body: {}

A corresponding pipe instance.

An example Pipe Instance

{
  "header": {
    "id": "1b1adbac-3867-4704-bd8d-9ade20d1f24b",
    "name": "minimum-pipe",
    "version": "2.0.0",
    "transport": "payload",
    "startTime": "2020-06-29T12:30:00Z"
  },
  "body": {
    "segments": [
      {
        "header": {
          "name": "one-and-only",
          "segmentNumber": 1,
          "processed": false
        },
        "body": {
          "endpoint": {
            "address": "http://example.com:8080/pipe"
          }
        }
      }
    ]
  }
}

The same instance in YAML:

header:
  id: 1b1adbac-3867-4704-bd8d-9ade20d1f24b
  name: minimum-pipe
  version: '2.0.0'
  transport: payload
  startTime: 2020-06-29T12:30:00Z
body:
  segments:
    - header:
        name: one-and-only
        segmentNumber: 1
        processed: false
      body:
        endpoint:
          address: http://example.com:8080/pipe
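Assuming each Pipe Module accepts a pipe instance as JSON via HTTP POST at the address in `endpoint.address` (as the example above suggests; the actual module API may differ), starting a run could be sketched like this:

```python
import json
import urllib.request

def start_pipe(instance):
    """Build the request that passes the instance to the first segment.

    Sketch only: assumes modules accept the instance as a JSON POST body
    at the address given in the segment's `endpoint.address`.
    """
    # The first segment is the one with the lowest segmentNumber.
    first = min(instance["body"]["segments"],
                key=lambda s: s["header"]["segmentNumber"])
    address = first["body"]["endpoint"]["address"]
    request = urllib.request.Request(
        address,
        data=json.dumps(instance).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return request  # send with urllib.request.urlopen(request)
```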

Installation#

A minimum Consus installation consists of the following parts:

  1. At least one Pipe Module
  2. The Scheduler
  3. At least one Pipe Descriptor

Optionally, you can connect the modules to an Elastic Stack instance for monitoring purposes, and use the piveau-consus-monitoring-ui component as a convenient frontend.

Pipe Modules#

Importer
piveau-consus-importing-rdf Import metadata from an RDF source
piveau-consus-importing-ckan Import metadata from CKAN
piveau-consus-importing-oaipmh Import metadata via the OAI-PMH protocol
piveau-consus-importing-sparql Import metadata from a SPARQL endpoint
piveau-consus-importing-socrata Import metadata from Socrata
piveau-consus-importing-udata Import metadata from uData
Transformer
piveau-consus-transforming-js Transform data or metadata with JavaScript
piveau-consus-transforming-xslt Transform data or metadata with XSLT
Exporter
piveau-consus-exporting-hub Export metadata to the piveau hub

The Scheduler#

Providing a Pipe#

Pipe definitions can be provided in two ways: either from a Git repository or from the file system.
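As an illustration of the file-system option, a hypothetical loader might scan a directory for JSON pipe descriptors and index them by name. The directory layout and the use of the header name as key are assumptions made for this sketch, not the scheduler's actual behavior.

```python
import json
from pathlib import Path

def load_descriptors(directory):
    """Collect pipe definitions from JSON pipe descriptors in a directory.

    Hypothetical loader for illustration: file naming and the
    name-based index are assumptions, not scheduler configuration.
    """
    pipes = {}
    for path in Path(directory).glob("*.json"):
        definition = json.loads(path.read_text(encoding="utf-8"))
        # Index each definition by the pipe name from its header.
        pipes[definition["header"]["name"]] = definition
    return pipes
```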