StatDCAT-AP: A common metadata layer for making statistical data more visible,
open and linkable
Makx Dekkers ([email protected]) 1,
Chris Nelson ([email protected]) 2,
Marco Pellegrino ([email protected]) 3,
Nikolaos Loutas ([email protected]) 4,
Norbert Hohn ([email protected]) 5,
Vasilios Peristeras ([email protected]) 6
Keywords: linked open data, data portals, catalogue, statistics, DCAT-AP, RDF, SDMX.
1. INTRODUCTION
Open data portals have been established throughout Europe. Because statistical data is of
great interest, for example for decision-making and research purposes, the catalogues of
open data portals include numerous statistical datasets. The
StatDCAT Application Profile (StatDCAT-AP) is an extension of the DCAT-AP open
data standard and supports the integration of the descriptive metadata of statistical
datasets in the catalogues of open data portals, hence improving the visibility and
discoverability of statistical datasets.
The StatDCAT-AP has been developed by a working group co-chaired by Eurostat and
the EU Publications Office, within the framework of the ISA2 Programme of the European
Commission. The creation process of the new specification was open, transparent, visible
to the public, and involved the main stakeholders to reach consensus in an open
collaboration. This collaborative work took place in a wider context, both on the
European level with the Directive on the re-use of Public Sector Information, and on the
global level with the G8 Open Data Charter. At the same time, it applied and exploited
the technical standards developed by W3C towards a globally interoperable environment
of Linked Open Data.
Building upon these two pillars, on one hand subscribing to the organisational goals to
open up public data for reuse, and on the other hand applying the emerging technologies
that facilitate linking data together, StatDCAT-AP aims to improve the opportunities for
discovery and reuse of statistical data from the wide audience using open government
data portals. In this context, the use of transformation mechanisms allows organisations
using existing statistical standards for data and metadata exchange, such as SDMX, to
align their standard with StatDCAT-AP much more easily.
After a period of public review in summer and autumn 2016, StatDCAT-AP version 1 was
published at the end of 2016 and has been endorsed by the EU Member States in the
context of the ISA2 Programme.
1 AMI Consult Sàrl, BP 1028, 1010 Luxembourg, Luxembourg
2 Metadata Technology Ltd, 46 Bridge St, Godalming GU7 1HL, UK
3 European Commission, Eurostat, Joseph Bech building 5, Rue Alphonse Weicker, Luxembourg 2721, Luxembourg
4 PwC EU Services, Woluwedal 18, Sint-Stevens-Woluwe, 1932, Belgium
5 European Union Publications Office, Rue Mercier 2, Luxembourg 2985, Luxembourg
6 International Hellenic University, School of Science and Technology, 14 km Thessaloniki-Moudanion, Thermi, 57001, Greece
2. BUILDING ON THE DCAT-AP
The DCAT-AP is a specification based on W3C's Data Catalog Vocabulary (DCAT)
for describing public sector datasets in Europe [1]. The development of DCAT-AP was a
joint initiative of DG CONNECT, the EU Publications Office and the ISA Programme.
The specification was elaborated by a multi-disciplinary Working Group with
representatives from 16 European Member States, European Institutions and the US.
The DCAT-AP data model includes the following main entities:
• The Catalogue: this represents a collection of Datasets. It is defined in the DCAT
  Recommendation as “a curated collection of metadata about datasets”.
• The Catalogue Record: DCAT defines this as “a record in a data catalog,
  describing a single dataset”. The Catalogue Record enables statements about the
  description of a Dataset rather than about the Dataset itself.
• The Dataset: this represents the published information. It is defined as “a
  collection of data, published or curated by a single agent, and available for access
  or download in one or more formats”.
• The Distribution: this, according to DCAT, “represents a specific available form
  of a dataset. Each dataset might be available in different forms, and these forms
  might represent different formats of the dataset or different endpoints. Examples
  of distributions include a downloadable CSV file, an API or an RSS feed”.
Figure 1 - DCAT main entities
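As a rough illustration of how the four entities relate, the sketch below builds one instance of each as a JSON-LD style Python dictionary. The class and property names (dcat:Catalog, dcat:CatalogRecord, dcat:Dataset, dcat:Distribution, foaf:primaryTopic) come from the DCAT vocabulary. The base URI, identifiers, titles and dates are invented for the example, and the plain-string value for dct:format is a simplification.

```python
import json

# Published namespace URIs for DCAT, Dublin Core Terms and FOAF.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

BASE = "http://example.org/"  # hypothetical base URI, for illustration only


def build_catalogue():
    """Return a JSON-LD style dict holding one instance of each main entity."""
    # Distribution: one concrete, downloadable form of the dataset.
    distribution = {
        "@id": BASE + "distribution/unemployment-csv",
        "@type": "dcat:Distribution",
        "dcat:accessURL": {"@id": BASE + "files/unemployment.csv"},
        "dct:format": "CSV",  # simplified: a literal instead of a format resource
    }
    # Dataset: the published information itself.
    dataset = {
        "@id": BASE + "dataset/unemployment",
        "@type": "dcat:Dataset",
        "dct:title": "Unemployment rate by region",
        "dct:description": "Illustrative dataset description.",
        "dcat:distribution": [distribution],
    }
    # Catalogue Record: statements about the description, not about the data.
    record = {
        "@id": BASE + "record/unemployment",
        "@type": "dcat:CatalogRecord",
        "foaf:primaryTopic": {"@id": dataset["@id"]},
        "dct:modified": "2016-12-01",
    }
    # Catalogue: the curated collection that ties everything together.
    return {
        "@context": CONTEXT,
        "@id": BASE + "catalogue",
        "@type": "dcat:Catalog",
        "dct:title": "Example open data catalogue",
        "dcat:dataset": [dataset],
        "dcat:record": [record],
    }


catalogue = build_catalogue()
print(json.dumps(catalogue, indent=2))
```

Note that the record points at the dataset through foaf:primaryTopic, which is what allows a portal to state when a description was last modified without saying anything about the data itself.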
The basic use case of DCAT-AP is to make public sector data easier to find across
borders and sectors by enabling a search for datasets across data portals. This
cross-portal search involves several actors: metadata brokers exchange the descriptions
of datasets that data providers have created on one or more data portals. Two enabling
conditions underlie this metadata flow. First, the data portals maintain a data catalogue
comprising a collection of datasets and make the descriptive metadata of those datasets
freely available. Second, to maximise interoperability, these descriptions should adhere
to the DCAT-AP specification. Given these two conditions, a metadata broker can harvest
catalogues of metadata from data portals and deliver the descriptive metadata in a
validated and harmonised manner to data consumers.
This is shown in Figure 2.
Figure 2 - DCAT-AP basic use case: enable a search for datasets across various data portals
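The harvesting flow described above can be sketched roughly as follows. This is a toy, in-memory model and not the behaviour of any actual broker: the function names harvest and search are hypothetical, portals are represented as plain lists of dataset descriptions, and validation is reduced to checking for an identifier, dct:title and dct:description, which this sketch assumes to be the required dataset properties.

```python
# Properties this sketch assumes every dataset description must carry.
REQUIRED = ("dct:title", "dct:description")


def harvest(portals):
    """Merge dataset descriptions from several portals into one index.

    `portals` maps a portal name to a list of dataset description dicts.
    Returns (accepted, rejected): `accepted` maps dataset URI to its
    description plus the portals it was found on; `rejected` lists
    (portal, URI, missing properties) for descriptions failing validation.
    """
    accepted, rejected = {}, []
    for portal_name, datasets in portals.items():
        for ds in datasets:
            missing = [prop for prop in REQUIRED if not ds.get(prop)]
            if "@id" not in ds or missing:
                rejected.append((portal_name, ds.get("@id"), missing))
                continue
            # Harmonise: keep the first description seen for a URI and
            # record every portal that lists the same dataset.
            entry = accepted.setdefault(ds["@id"], {**ds, "sources": []})
            entry["sources"].append(portal_name)
    return accepted, rejected


def search(index, term):
    """Case-insensitive title search across the harvested index."""
    term = term.lower()
    return sorted(uri for uri, ds in index.items()
                  if term in ds["dct:title"].lower())


# Invented sample data standing in for two portals.
GDP = {"@id": "http://example.org/dataset/gdp",
       "dct:title": "GDP per capita",
       "dct:description": "Illustrative description."}
PORTALS = {
    "portal-a": [GDP, {"@id": "http://example.org/dataset/no-title",
                       "dct:description": "Title is missing."}],
    "portal-b": [GDP, {"@id": "http://example.org/dataset/unemployment",
                       "dct:title": "Unemployment rate",
                       "dct:description": "Illustrative description."}],
}

index, problems = harvest(PORTALS)
print(search(index, "gdp"))
print(problems)
```

A real broker would fetch catalogues over the network and validate against the full application profile. The point here is only that shared descriptions make deduplication and cross-portal search straightforward.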
The data model of version 1.1 of DCAT-AP is available at
https://joinup.ec.europa.eu/system/files/project/dcat-ap_version_1.1.pdf. The full version
of the application profile is posted on Joinup, the collaborative platform of the European
Commission funded by the ISA2 Programme⁷.
3. WHAT IS THE STATDCAT-AP
The StatDCAT Application Profile is an extension of the DCAT-AP, whose purpose is to
achieve better integration of the descriptive metadata of statistical datasets in the
catalogues of open data portals, hence improving the visibility and discoverability of
statistical datasets. It is a common layer for the exchange of statistical metadata for a
wide range of dataset types. This creates the opportunity for professional communities to
hook onto the emerging landscape of interoperable portals by aligning with the common
exchange format.
StatDCAT-AP defines a small number of additions to the DCAT-AP model that are
particularly relevant for statistical datasets. Because many statistical datasets
are of interest to general data portals and their users, it is likely that, by
recognising and exposing the additions to DCAT-AP proposed by StatDCAT-AP, general
data portals will be able to provide enhanced services for collections of statistical data.
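To make the idea of an extension concrete, the following sketch adds statistical properties to an ordinary DCAT-AP style dataset description and shows that a portal unaware of them can simply ignore them. The property names stat:dimension, stat:attribute and dqv:hasQualityAnnotation are those used by StatDCAT-AP. The helper functions, the example URIs and the use of oa:hasBody to point at a quality report are assumptions made for illustration.

```python
def add_statistical_metadata(dataset, dimensions, attributes, quality_report=None):
    """Return a copy of `dataset` extended with StatDCAT-AP style properties.

    `dimensions` and `attributes` are lists of URIs identifying the
    dimensions and attributes of the statistical dataset; `quality_report`
    is an optional URI of a document describing its quality.
    """
    extended = dict(dataset)  # shallow copy, the input is left untouched
    extended["stat:dimension"] = [{"@id": uri} for uri in dimensions]
    extended["stat:attribute"] = [{"@id": uri} for uri in attributes]
    if quality_report is not None:
        extended["dqv:hasQualityAnnotation"] = {
            "@type": "dqv:QualityAnnotation",
            "oa:hasBody": {"@id": quality_report},  # assumed linking property
        }
    return extended


def generic_view(dataset):
    """What a portal that ignores the stat: additions would still see."""
    return {key: value for key, value in dataset.items()
            if not key.startswith("stat:")}


# Invented example description and URIs.
plain = {
    "@id": "http://example.org/dataset/unemployment",
    "@type": "dcat:Dataset",
    "dct:title": "Unemployment rate by region",
    "dct:description": "Illustrative dataset description.",
}
statistical = add_statistical_metadata(
    plain,
    dimensions=["http://example.org/dimension/time",
                "http://example.org/dimension/region"],
    attributes=["http://example.org/attribute/observation-status"],
    quality_report="http://example.org/quality/unemployment",
)
print(sorted(statistical))
```

Because the additions sit alongside the existing properties rather than replacing them, a description remains usable by a general portal while giving statistics-aware portals more to work with.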
The additions to the DCAT-AP address a number of requirements for the description of
statistical datasets, as listed below:
• Attributes and Dimensions:
  o stat:attribute: Attributes enable specification of the decimals, any scaling
    factors and metadata such as the status of the observation (e.g. estimated,
    provisional).
  o stat:dimension: Examples of dimensions include the time to which the
    observation applies, or a geographic region which the observation covers.
• Quality:
  o dqv:hasQualityAnnotation: A statement related to the quality of the Dataset,
    including a rating, quality certificate or feedback that can be associated with
    datasets or distributions.
• Visualisation:
  o dct:type (http://purl.org/dc/terms/type): This property is the nature or genre
    of the resource. The property is to be used to indicate the type of a
    Distribution, in particular when the Distribution is a visualisation.
• Other extensions, such as an expression of the number of data series or the unit of
  measurement.
7 https://joinup.ec.europa.eu/asset/dcat_application_profile/asset_release/dcat-ap-v11.
4. THE FUTURE: HOW TO PRODUCE STATDCAT-AP METADATA
StatDCAT-AP focuses on metadata elements that contribute to data discovery,
encouraging the use of common controlled vocabularies and the re-use of metadata from
existing repositories.
In the recent past, seven international organisations that produce and coordinate
the dissemination and sharing of statistical data, including Eurostat, defined and adopted
the SDMX standard for data and metadata exchange, which is now an ISO standard
(ISO 17369). By harmonising the metadata descriptions provided by SDMX (e.g. data
structures, standard code lists, quality descriptions and methodology) with open data
standards, the two worlds become better connected, ultimately improving the
discoverability of statistical datasets.
Therefore, StatDCAT-AP also includes a section describing the mapping of StatDCAT-AP
to the SDMX Information Model. This is achieved by means of schematic diagrams
of the SDMX Information Model and through a worked example where the SDMX-ML
content is mapped to the classes and properties of DCAT-AP.
Figure 3 - StatDCAT-AP Model mapped to SDMX Model Classes
The intent of this mapping is twofold:
1. It enables those organisations that are using SDMX to know which metadata
   structures to use in order to create StatDCAT-AP directly from existing SDMX
   metadata repositories (such as an SDMX Registry).
2. It enables organisations that wish to use SDMX structural metadata as the format
   for a Transformation Mechanism to know which SDMX element or attribute
   maps to which StatDCAT-AP class or property.
A dissemination chain based on SDMX data descriptions is also able to produce
StatDCAT-AP descriptions through a simple transformation.
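A transformation of this kind might look roughly like the sketch below. It is not the mapping defined in the StatDCAT-AP specification. The input is a simplified dictionary standing in for an SDMX dataflow and its data structure definition (real SDMX-ML would first have to be parsed), and the field names, the base URI and the choice of target properties are assumptions made for illustration.

```python
def sdmx_to_statdcat(dataflow, base_uri="http://example.org/"):
    """Derive a StatDCAT-AP style dataset description from SDMX-like metadata.

    `dataflow` is a simplified stand-in for an SDMX dataflow: it carries an
    id, a maintenance agency, a name, an optional description, and a
    `structure` listing dimension and attribute concept ids.
    """
    structure = dataflow["structure"]

    def concept_ref(kind, concept_id):
        # Hypothetical URI pattern for dimension and attribute concepts.
        return {"@id": f"{base_uri}{kind}/{concept_id}"}

    return {
        "@id": f"{base_uri}dataset/{dataflow['agency']}/{dataflow['id']}",
        "@type": "dcat:Dataset",
        "dct:identifier": dataflow["id"],
        "dct:title": dataflow["name"],
        # Fall back to the name when no description is available.
        "dct:description": dataflow.get("description", dataflow["name"]),
        "stat:dimension": [concept_ref("dimension", c)
                           for c in structure["dimensions"]],
        "stat:attribute": [concept_ref("attribute", c)
                           for c in structure["attributes"]],
    }


# Invented sample metadata; the ids follow common SDMX naming habits only.
SAMPLE_DATAFLOW = {
    "id": "EXAMPLE_FLOW",
    "agency": "EXAMPLE_AGENCY",
    "name": "Example annual indicator",
    "structure": {
        "dimensions": ["FREQ", "GEO", "TIME_PERIOD"],
        "attributes": ["OBS_STATUS", "UNIT_MULT"],
    },
}

description = sdmx_to_statdcat(SAMPLE_DATAFLOW)
print(description["@id"])
```

Since the function only reads structural metadata, the same step could run for every dataflow in a registry, which is what makes producing catalogue descriptions from an existing SDMX dissemination chain inexpensive.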
The StatDCAT-AP specification contains more technical documentation about these
aspects, which are relevant for organisations using SDMX infrastructures. SDMX is one
of the main standards currently in use in the statistics field, which explains the focus on
the SDMX mappings. Nevertheless, we expect more transformations to become
available in the future, as the architecture of the StatDCAT-AP transformation
mechanism could easily be used for DDI or CSV transformations. Some examples and
pilot implementations are expected to be produced and documented in the near future.
The development of StatDCAT-AP was conducted in a transparent, publicly visible and
interactive manner. It was driven by the StatDCAT-AP working group and the involvement
of the main stakeholders, who worked towards consensus in an open collaboration. The same
open group remains responsible for the maintenance and future revisions of the
specification under the process set up and led by the ISA2 Programme.
5. REFERENCES
[1] Fadi Maali, Richard Cyganiak, Vassilios Peristeras, Enabling Interoperability of
Government Data Catalogues, Lecture Notes in Computer Science, Vol. 6228, pp.
339-350, Springer, 2010
[2] European Commission. ISA – Interoperability Solutions for European Public
Administrations. http://ec.europa.eu/isa/about-isa
[3] European Commission. ISA – DCAT Application Profile for data portals in Europe.
https://joinup.ec.europa.eu/asset/dcat_application_profile/home
[4] StatDCAT-AP: https://joinup.ec.europa.eu/asset/stat_dcat_application_profile/home
[5] SDMX: https://sdmx.org
[6] DIGICOM (European Statistical System's project for Digital communication, User
analytics and Innovative products): http://ec.europa.eu/eurostat/web/ess/digicom
[7] EU Open Data Portal: http://data.europa.eu/euodp
[8] European Data Portal: https://www.europeandataportal.eu