Best practices regarding data integrity and authenticity when archiving (hosting) a resource in our CLARIN centre:
Data acceptance for ingestion
- Data provided by the data depositor is only accepted for ingestion ...
- Only non-proprietary, text-based data formats are accepted (as is usual in this field); this facilitates long-term readability and preservation of the data.
The integrity of the data is safeguarded by checksums (MD5) in Fedora. The Fedora Commons backend also provides a version control mechanism. CMDI metadata (according to ISO 24622-1) are represented as a data stream within Fedora Digital Objects, and as such they can be version-controlled like all other object data.
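The checksum-based integrity check can be illustrated with a minimal sketch. It is not Fedora's own implementation: the function names `md5_of_file` and `verify_checksum` are hypothetical, and the sketch only assumes that an MD5 digest recorded at ingestion is later compared against a freshly computed one.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_md5):
    """Compare a freshly computed digest with the one stored at ingestion."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A mismatch indicates that the stored bytes differ from those that were ingested. MD5 is suitable for detecting accidental corruption, but it is not collision-resistant against deliberate tampering.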
Integrity and quality checks of the data and the metadata are carried out semi-automatically: for XML metadata, well-formedness and validity are checked automatically, and the descriptions are also reviewed manually to confirm that they actually make sense. The object data are tested for syntactic correctness where possible, depending on the data type and format. To support the creation of valid CMDI metadata, a front-end is used that draws on the components and profiles stored in the Component Registry. The CMDI creation workflow is described on a wiki page.
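The automatic part of the metadata check can be sketched as follows. This is an illustration only and covers well-formedness, not validity: validating a CMDI record against its profile schema would require a schema-aware validator, which is outside this sketch. The function name `is_well_formed` and the sample records are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if `xml_text` parses as XML, else (False, reason)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The parser reports the position of the first syntax error.
        return False, str(err)
    return True, None


ok, reason = is_well_formed("<CMD><Header/></CMD>")
broken, broken_reason = is_well_formed("<CMD><Header></CMD>")
```

Here `ok` is `True`, while `broken` is `False` because the `Header` element is never closed. Whether a well-formed description actually makes sense still has to be judged manually.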
- The data provided by the data depositor are considered to be fixed and immutable. In case of format transformations (by the data depositor), the modified version of the data is treated like a new submission, with a link to the previous version. Changes to primary data will always result in a new data stream or digital object and, accordingly, a newly registered and associated persistent identifier (PID).
- In contrast, changes to metadata will generally not result in a new PID being registered. However, we make use of the built-in Fedora-internal versioning mechanism in order to keep track of changes to the CMDI metadata files. Hence, the respective changes can still be traced, and old versions remain accessible, at least in principle.
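The two update rules above can be summarised in a small toy model. It does not use the Fedora API or a real PID service: the class `ArchivedObject`, the helper `register_pid`, and the `example-pid-N` identifiers are all hypothetical and serve only to show which kind of change leads to a new identifier.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional

_pid_counter = count(1)


def register_pid():
    """Hypothetical stand-in for registering a persistent identifier."""
    return f"example-pid-{next(_pid_counter)}"


@dataclass
class ArchivedObject:
    data: bytes                      # primary data, treated as immutable
    metadata_versions: List[str]     # every CMDI version, oldest first
    pid: str = field(default_factory=register_pid)
    previous_pid: Optional[str] = None

    def update_metadata(self, new_metadata):
        """Metadata change: keep the PID and append a new version."""
        self.metadata_versions.append(new_metadata)
        return self

    def update_data(self, new_data):
        """Data change: create a new object with a new PID, linked to the old one."""
        return ArchivedObject(
            data=new_data,
            metadata_versions=[self.metadata_versions[-1]],
            previous_pid=self.pid,
        )
```

In this sketch, calling `update_metadata` leaves `pid` unchanged and keeps earlier metadata versions in the list, whereas `update_data` returns a separate object whose `previous_pid` points back to the original, which itself remains untouched.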