Version 11.1 by adavison on 2020/08/05 12:20

== Introduction ==

Computational provenance is a record of every step in a computational scientific workflow: the code that was run, the input data, the computational environment (hardware, operating system, compiler versions, library versions, etc.), the person who performed each step, and the output data.
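As an illustration, the elements listed above can be gathered into a single record per workflow step. The following is a minimal sketch, not the schema used in the Knowledge Graph: the class names ##ProvenanceRecord## and ##Environment##, all field names, and the example values are assumptions made for this example.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Environment:
    """Computational environment in which a step ran (illustrative fields)."""
    hardware: str
    operating_system: str
    library_versions: dict = field(default_factory=dict)


@dataclass
class ProvenanceRecord:
    """One workflow step: what ran, on what, where, by whom, producing what."""
    code: str                  # e.g. repository URL plus commit identifier
    inputs: list               # identifiers of the input data
    environment: Environment   # hardware, OS, library versions
    person: str                # who performed the step
    outputs: list              # identifiers of the output data


# Hypothetical example values
record = ProvenanceRecord(
    code="https://example.org/repo.git@abc1234",
    inputs=["raw_recording.dat"],
    environment=Environment(
        hardware="x86_64",
        operating_system="Linux",
        library_versions={"numpy": "1.19.1"},
    ),
    person="Example User",
    outputs=["spike_counts.csv"],
)

# asdict() converts nested dataclasses too, so the record serializes directly
print(json.dumps(asdict(record), indent=2))
```

A flat, serializable record like this is easy to store or transmit; linking records into a graph is what the PROV data model adds.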

Capturing computational provenance facilitates:

* reproducibility of results
* management and tracking of workflows/projects by the scientists/engineers involved
* evaluation/review by other scientists and engineers

== Standards ==

The [[W3C PROV standard>>https://www.w3.org/TR/2013/NOTE-prov-overview-20130430/||rel="noopener noreferrer" target="_blank"]] provides a data model and related tools for provenance interchange on the web. The following diagram shows the three base classes of the PROV data model: Entity, Activity, and Agent. These three classes form the basis for the representation of provenance in the EBRAINS Knowledge Graph (KG): every node in the KG has a type that is a subclass of one of these base classes.

[[image:starting-points.svg||alt="The three Starting Point classes of the W3C PROV ontology and the properties that relate them."]]
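The idea that every node type derives from one of the three base classes can be sketched with ordinary class inheritance. Only ##Entity##, ##Activity##, and ##Agent## come from PROV; the subclasses ##DataFile##, ##Analysis##, and ##Person## and the helper ##prov_base_class## are hypothetical names chosen for illustration, not actual KG types.

```python
class Entity:
    """PROV Entity: a thing, such as a data file or a model."""


class Activity:
    """PROV Activity: something that occurs over time and acts on entities."""


class Agent:
    """PROV Agent: someone or something that bears responsibility."""


# Hypothetical node types, for illustration only
class DataFile(Entity): ...
class Analysis(Activity): ...
class Person(Agent): ...


BASE_CLASSES = (Entity, Activity, Agent)


def prov_base_class(node_type):
    """Return the PROV base class that a node type derives from."""
    for base in BASE_CLASSES:
        if issubclass(node_type, base):
            return base
    raise TypeError(f"{node_type.__name__} is not a PROV type")


for node_type in (DataFile, Analysis, Person):
    print(node_type.__name__, "->", prov_base_class(node_type).__name__)
# DataFile -> Entity
# Analysis -> Activity
# Person -> Agent
```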

== Storage of provenance in the Knowledge Graph ==

We present here the current schemas for representing (a) data analysis and (b) simulations in the Knowledge Graph. These schemas will need to be extended to cover neurorobotics simulations, and a more explicit representation of pipelines/workflows (the chaining together of multiple analysis/simulation stages) will probably also be needed.

[[image:Workflow provenance in the EBRAINS KG.svg||alt="KG schema for data analysis"]][[image:Workflow provenance in the EBRAINS KG-2.png||alt="KG schema for data analysis"]]

[[image:Workflow provenance in the EBRAINS KG.png||alt="KG schema for simulation"]]

Note that the diagrams do not show Agents; the person who launched each analysis or simulation activity is linked to the activity with a ##wasAssociatedWith## connection.
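To show how the ##wasAssociatedWith## connection fits with the other PROV properties, the sketch below stores provenance as (subject, property, object) triples and follows them from an output back to the person responsible. The property names ##used##, ##wasGeneratedBy##, and ##wasAssociatedWith## are taken from PROV; the node identifiers and the helpers ##objects## and ##who_produced## are assumptions for this example and do not reflect the actual KG schemas.

```python
# Each triple is (subject, PROV property, object).
# All node identifiers are hypothetical.
triples = [
    ("analysis-001", "used", "raw_recording.dat"),
    ("spike_counts.csv", "wasGeneratedBy", "analysis-001"),
    ("analysis-001", "wasAssociatedWith", "person:example-user"),
]


def objects(subject, prop):
    """Return every object linked to `subject` through property `prop`."""
    return [o for s, p, o in triples if s == subject and p == prop]


def who_produced(entity):
    """Follow wasGeneratedBy, then wasAssociatedWith, to find the agents."""
    agents = []
    for activity in objects(entity, "wasGeneratedBy"):
        agents.extend(objects(activity, "wasAssociatedWith"))
    return agents


print(who_produced("spike_counts.csv"))  # ['person:example-user']
```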

(% class="box warningmessage" %)
(((
TODO: insert or link to the detailed schemas for each type
)))

== Tools for automated capture of provenance ==

Issues to discuss:

* provenance capture on different systems:
** HPC systems
** neuromorphic systems
** Jupyter notebooks
** users' own computers
* prospective (pre-emptive) vs run-time provenance capture
* capture of metadata vs capture of artefacts

== Communication between computer systems and the KG ==

Two issues arise:

(i) fine-grained provenance information may need to be obtained on compute nodes, which may not have network access;

(ii) failures of provenance upload should not cause the workflows to fail.

A solution that addresses both issues could be to write provenance records to a local cache and synchronize them with the KG later.
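The cache-and-synchronize idea could look roughly like the sketch below. It assumes records are JSON-serializable dictionaries and that the cache is a directory on a file system visible to the compute node. The names ##cache_record## and ##synchronize## are hypothetical, as is the ##upload## callable, which stands in for whatever client eventually sends a record to the KG. Concurrent access to the cache is not handled.

```python
import json
import tempfile
import uuid
from pathlib import Path


def cache_record(record, cache_dir):
    """Write a provenance record to the local cache. Needs no network access."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{uuid.uuid4().hex}.json"
    path.write_text(json.dumps(record))
    return path


def synchronize(cache_dir, upload):
    """Try to upload every cached record; keep failures for a later attempt.

    `upload` is a hypothetical callable that sends one record to the KG
    and raises an exception on failure.
    """
    uploaded, pending = 0, 0
    for path in sorted(Path(cache_dir).glob("*.json")):
        try:
            upload(json.loads(path.read_text()))
        except Exception:
            pending += 1   # leave the file in place; never propagate the error
        else:
            path.unlink()  # remove only after a successful upload
            uploaded += 1
    return uploaded, pending


def offline_upload(record):
    """Simulate a compute node without network access."""
    raise ConnectionError("no network on compute node")


with tempfile.TemporaryDirectory() as tmp:
    cache_record({"activity": "analysis-001"}, tmp)
    print(synchronize(tmp, offline_upload))       # (0, 1): record kept
    print(synchronize(tmp, lambda record: None))  # (1, 0): record uploaded
```

Because ##synchronize## catches upload errors and returns counts instead of raising, a workflow that calls it cannot fail on account of provenance upload, and it can equally be run later from a node that does have network access.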

== User interfaces for browsing, visualizing, and searching provenance information ==

(% class="box infomessage" %)
(((
DISCUSSION NEEDED: should visualization of provenance information be integrated into the KG Search UI, developed as a separate app, or both?
)))