Wiki source code of Provenance of simulation and data analysis workflows
Last modified by adavison on 2022/05/23 22:24
| author | version | line-number | content |
|---|---|---|---|
| |
4.2 | 1 | == Introduction == |
| |
1.1 | 2 | |
| |
7.1 | 3 | Computational provenance is a record of all the steps in a computational scientific workflow, including the code that was run, the input data, the computational environment (hardware, OS, compiler versions, library versions, etc.), the person who performed each step, and the output data. |
| |
1.1 | 4 | |
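As a rough illustration of the "computational environment" part of such a record, the following sketch collects a few of these items using only the Python standard library. The function names (##capture_environment##, ##current_user##) and the dictionary keys are assumptions made for this example, not part of any EBRAINS schema.

```python
import getpass
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def current_user():
    """Best-effort lookup of the person running the step (None if unknown)."""
    try:
        return getpass.getuser()
    except Exception:
        return None


def capture_environment(packages=()):
    """Return a dict describing the current computational environment.

    `packages` lists the distribution names whose versions should be recorded.
    The key names are illustrative only.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": current_user(),
        "hostname": platform.node(),
        "machine": platform.machine(),
        "os": platform.platform(),
        "python_version": platform.python_version(),
        "python_compiler": platform.python_compiler(),
        "command": list(sys.argv),
        "library_versions": versions,
    }
```

Calling ##capture_environment(["numpy"])## at the start of a script would give a snapshot that can be stored alongside references to the input and output data.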
| |
4.2 | 5 | Capturing computational provenance facilitates: |
| |
1.1 | 6 | |
| |
4.2 | 7 | * reproducibility of results |
| 8 | * management and tracking of workflows/projects by the scientists/engineers involved | ||
| 9 | * evaluation/review by other scientists and engineers | ||
| |
1.1 | 10 | |
| |
4.2 | 11 | == Standards == |
| |
1.1 | 12 | |
| |
7.2 | 13 | The [[W3C PROV standard>>https://www.w3.org/TR/2013/NOTE-prov-overview-20130430/||rel="noopener noreferrer" target="_blank"]] provides a data model and related tools for provenance interchange on the web. The following diagram shows the three base classes of the PROV data model: Entity, Activity, and Agent. These three classes form the basis for the representation of provenance in the EBRAINS Knowledge Graph: every node in the KG has a type which is a subclass of one of these base classes. |
| |
1.1 | 14 | |
| |
7.1 | 15 | [[image:starting-points.svg||alt="The three Starting Point classes of the W3C PROV ontology and the properties that relate them."]] |
| |
1.1 | 16 | |
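To make the three base classes more concrete, here is a minimal sketch in plain Python. It models only three PROV relations (##used##, ##wasGeneratedBy##, ##wasAssociatedWith##); the class layout, the ##inputs_of## helper, and the example names are assumptions for illustration and do not reflect the actual KG implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Agent:
    """Something that bears responsibility for an activity, e.g. a person."""
    name: str


@dataclass
class Entity:
    """A thing such as a data file, a script, or a model description."""
    name: str
    was_generated_by: Optional["Activity"] = None  # prov:wasGeneratedBy


@dataclass
class Activity:
    """Something that happens over time and acts on entities."""
    name: str
    used: List[Entity] = field(default_factory=list)  # prov:used
    was_associated_with: Optional[Agent] = None  # prov:wasAssociatedWith


def inputs_of(entity):
    """Names of the entities used by the activity that generated `entity`."""
    activity = entity.was_generated_by
    return [e.name for e in activity.used] if activity else []


# Hypothetical example: one analysis step performed by one person
alice = Agent("Alice")
raw = Entity("recording.dat")
analysis = Activity("spike sorting", used=[raw], was_associated_with=alice)
spikes = Entity("spikes.h5", was_generated_by=analysis)
```

Following ##was_generated_by## and ##used## links repeatedly walks back through the chain of steps that produced a given result.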
| |
4.2 | 17 | == Storage of provenance in the Knowledge Graph == |
| 18 | |||
| |
7.2 | 19 | We present here the current schemas for representing (a) data analysis and (b) simulations in the Knowledge Graph. These schemas will need to be extended to cover neurorobotics simulations, and a more explicit representation of pipelines/workflows (the chaining together of multiple analysis/simulation stages) will probably also be needed. |
| |
4.2 | 20 | |
| |
12.1 | 21 | [[image:Workflow provenance in the EBRAINS KG-2.png||alt="KG schema for data analysis"]] |
| |
10.2 | 22 | |
| 23 | [[image:Workflow provenance in the EBRAINS KG.png||alt="KG schema for simulation"]] | ||
| 24 | |||
| 25 | Note that the diagrams do not show Agents: the person who launched each analysis/simulation activity is linked to the activity with a ##wasAssociatedWith## connection. | ||
| 26 | |||
| |
10.3 | 27 | (% class="box warningmessage" %) |
| 28 | ((( | ||
| 29 | TODO: insert or link to the detailed schemas for each type | ||
| 30 | ))) | ||
| 31 | |||
| |
4.2 | 32 | == Tools for automated capture of provenance == |
| 33 | |||
| |
10.5 | 34 | |
| 35 | Issues to discuss: | ||
| 36 | |||
| |
4.2 | 37 | * on different systems: |
| 38 | ** HPC systems | ||
| 39 | ** neuromorphic systems | ||
| 40 | ** Jupyter notebooks | ||
| 41 | ** users' own computers | ||
| 42 | * prospective/pre-emptive vs run-time provenance capture | ||
| 43 | * capture of metadata vs capture of artefacts | ||
| 44 | |||
| 45 | == Communication between computer systems and the KG == | ||
| 46 | |||
| |
10.4 | 47 | Two issues arise: |
| |
4.2 | 48 | |
| |
10.4 | 49 | (i) fine-grained provenance information may need to be obtained on compute nodes, which may not have network access; |
| 50 | |||
| 51 | (ii) failures of provenance upload should not cause the workflows to fail. | ||
| 52 | |||
| 53 | A single solution to both issues could perhaps be a local cache that is synchronized with the KG later. | ||
| 54 | |||
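One possible shape for the local-cache idea is sketched below, assuming records are plain JSON-serializable dictionaries written to a directory that the compute node can reach. The ##ProvenanceCache## class and the ##upload## callable are hypothetical; the actual call used to send a record to the KG is not specified here and would be supplied by the caller.

```python
import json
import uuid
from pathlib import Path


class ProvenanceCache:
    """File-based cache of provenance records awaiting upload (a sketch)."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def record(self, info):
        """Store one record locally; this needs no network access."""
        path = self.directory / f"{uuid.uuid4().hex}.json"
        path.write_text(json.dumps(info))
        return path

    def pending(self):
        """Paths of the records that have not been uploaded yet."""
        return sorted(self.directory.glob("*.json"))

    def sync(self, upload):
        """Try to upload each cached record, keeping those that fail.

        `upload` is any callable taking one record (a dict). Exceptions it
        raises are swallowed so that a failed upload never stops the workflow.
        Returns the number of records uploaded successfully.
        """
        uploaded = 0
        for path in self.pending():
            try:
                upload(json.loads(path.read_text()))
            except Exception:
                continue  # leave the record in the cache for a later retry
            path.unlink()
            uploaded += 1
        return uploaded
```

With this design, ##record()## would be called on the compute node during the run, and ##sync()## would be called later from a node with network access. Records that fail to upload stay in the cache until a later attempt succeeds.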
| |
4.2 | 55 | == User interfaces for browsing, visualizing, and searching provenance information == |
| 56 | |||
| |
10.4 | 57 | (% class="box infomessage" %) |
| 58 | ((( | ||
| 59 | DISCUSSION NEEDED: integrate visualization of provenance information into the KG Search UI, and/or develop a separate app? | ||
| 60 | ))) |