The International Standard Content Code (ISCC) – why libraries, archives and museums should use it


Today, libraries, archives and museums collect, create and distribute a wide range of digital content. For some time now, they have been ensuring that this content can be identified, referenced and permanently retrieved using persistent identifiers (PIDs). Identifiers such as DOI, URN, Handle or PURL are often used here.

These systems are established and widely accepted. They have become an essential part of modern scientific communication, as they make research results citable and research data referenceable, among other things. They all have in common that there is a central authority that monitors the allocation of the respective persistent identifiers (to a greater or lesser extent) and also provides and operates the infrastructure for resolving them to the specific digital object. As a result, these identifiers also enjoy a high level of institutional trust.

At the end of May 2024, ISO 24138:2024, the International Standard Content Code (ISCC), was adopted as an international standard, adding a new member to the family of identifiers. In contrast to the established identifiers, which in practice identify an intellectual work as a product, e.g. a publisher's publication, the ISCC refers to the media file itself. The driving force behind the identifier is the foundation of the same name (ISCC Foundation), which has counted Lambert Heller (TIB), among others, among the members of its Advisory Board since 2020.

Broad implications of the new identifier for digital content

The Foundation sees the benefits of the new identifier primarily in the creative industries. In reality, however, the new identifier will have a far wider impact on any organisation that wants to provide content on the internet that is worthy of protection, verifiable and of demonstrable integrity in the age of deepfakes and machine learning. Let's take a look at why the standardisation of this new identifier should raise more than an interested eyebrow at GLAM institutions.

Structure of an ISCC Code

Firstly, the characteristics and structure of an ISCC code must be considered. The special feature compared to persistent identifiers (PIDs) is that the code is not generated and managed by a central authority, but is derived from the media file itself. The media file can then be identified (in whole or in part) by the code generated in this way. The ISCC code can be generated with minimal cost and effort wherever a digital object is stored – be it the collection management system of a museum, an institutional repository or the digitised collections of an archive. In principle, it does not matter whether the digital object in question is open access, restricted access or otherwise rights-protected.

If you look at a generated ISCC identifier, you will see a hash value like ISCC:KUAG5LUBVP23N3DOHCHWIYGXVN7ZS. This code is made up of four blocks, each of which represents a (partial) identifier of the object and can therefore be used individually. The strength, however, comes from their interaction. The blocks are:

  1. Meta-Code: This ISCC code block is a similarity hash calculated from the object's embedded metadata, such as author name and work title. This is not a detailed description of the object with descriptive metadata according to library standards. In principle, however, the metadata used can be as detailed as desired and embedded as a data URI. Other identifiers such as DOI or PPN can also be used when generating the similarity hash for this block of the ISCC code.
  2. Content-Code: This code block is a similarity hash calculated from the textual, visual or audible content components. Depending on whether the content is text, audio or images, these are, for example, the readable character strings contained in text documents or the visible pixels in images. The same image, once as a JPEG and once as a PNG, will always have the same Content-Code. If the image has been minimally modified, it will have a similar Content-Code.
  3. Data-Code: This code block contains a similarity hash generated from the raw binary stream of media files.
  4. Instance-Code: Unlike the other three blocks, this is not a similarity hash but an exact cryptographic hash that confirms the data integrity of the object. (We have been using MD5 checksums for a similar purpose for decades, e.g. in the object details of many media or document repositories.)
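The special role of the Instance-Code can be illustrated with a short sketch in plain Python (standard library only; this uses SHA-256 as a stand-in and is not the actual ISCC algorithm): a cryptographic hash changes completely as soon as even a single byte of the input changes, which makes it ideal as an integrity check but useless for measuring similarity.

```python
import hashlib

data_a = b"The quick brown fox jumps over the lazy dog."
data_b = b"The quick brown fox jumps over the lazy dog!"  # one byte changed

# A cryptographic hash (the idea behind the Instance-Code) reacts to the
# smallest change with a completely different value.
hash_a = hashlib.sha256(data_a).hexdigest()
hash_b = hashlib.sha256(data_b).hexdigest()

print(hash_a)
print(hash_b)
print(hash_a == hash_b)  # False
```

This is exactly why the Instance-Code alone can confirm bit-identical copies, while the three similarity hashes are needed to recognise near-identical ones.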

An example of similarity hashing with ISCC in action

What does it mean that Meta-Code, Content-Code and Data-Code are vector-based similarity hashes? While with other hash functions even small changes in the original object lead to a completely different hash value, here (small) changes in the metadata, content or raw data lead to similar hash values, whose distances from one another can be calculated in vector space (see also the concept of fuzzy hashing). It is therefore possible to determine not only whether two objects are identical, but also how similar they are – and this at the different levels of the individual codes. This type of application is not new; research on it has been going on for decades. What is new, and important for our context, is that a workable specification has been agreed internationally as a standard for an identifier. Only this opens up areas of application (briefly touched on below) that existing proprietary implementations of this technology have not achieved.
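A toy similarity hash makes this behaviour tangible. The following sketch (standard-library Python, deliberately much simpler than the actual ISCC algorithms) lets every word of a text vote on the bits of a 64-bit hash, so that texts sharing most of their words end up with hashes that are close in Hamming distance:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Toy similarity hash: every word votes on each bit of the result."""
    votes = [0] * bits
    for word in text.lower().split():
        h = int.from_bytes(hashlib.sha256(word.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits, i.e. the distance between two hashes."""
    return bin(a ^ b).count("1")

original = ("in the proceedings the court found that the defendant "
            "john doe had violated the terms of the licence agreement")
redacted = ("in the proceedings the court found that the defendant "
            "xxxx xxx had violated the terms of the licence agreement")
unrelated = "a completely different text about migratory birds in northern europe"

d_similar = hamming(simhash(original), simhash(redacted))
d_unrelated = hamming(simhash(original), simhash(unrelated))
print(d_similar, d_unrelated)  # d_similar is much smaller than d_unrelated
```

Unchanged input yields a distance of zero, a small edit a small distance, and an unrelated text a large one – precisely the property a cryptographic hash must not have.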

You can try this out for your own objects in the ISCC Generator. As an example, let's look at the text of a court judgement in PDF form, which is available once in the original (Figure 1) and once with a redacted name (Figure 2).

Figure 1
Figure 2

An ISCC is calculated for both documents:

ISCC:KACRKNK43HDXGP5SQNC5ZBTRWBTTCNP55HKLA4WGOC37YOWDWOWROVY (Original)
ISCC:KACRKNK43HDXGP5SQPD5ZBTRWBTTCDORSKBDWDUDORS2VVA7HGQNMGA (redacted)

The similarities between the two ISCCs can already be seen in the comparison of the ISCC metadata (Figure 3).

Figure 3

The differences can then be seen by comparing each code from top to bottom (Figure 4).

Figure 4

Both files have the value “Untitled” in the PDF metadata field “Title”, so the Meta-Codes match. The Content-Codes show a very high similarity of almost 94 %, as the redaction in the second document is only minor. The similarity of the Data-Codes, on the other hand, is very low at 9 %, because the new PDF created after the redaction was generated with completely different parameters. The Instance-Codes, finally, are completely different: even the smallest change to a file produces an entirely different cryptographic hash.

A special case: ISCC codes in signed statements on the origin and status of media objects

While existing persistent identifiers can assign a name to any object – e.g. a journal, a research organisation or a person – ISCC identifiers only make sense in relation to a specific digital object or its delimitable components. For example, an ISCC identifier could be calculated for a journal article or for each image of that article. This makes it clear that the ISCC is a pure object identifier that is not intended for resolving, i.e. accessing, the object in question. Rather, it complements the family of established persistent identifiers such as DOI or URN with aspects such as similarity and authenticity of content.

Assuming an additional digital certificate infrastructure (which is not as readily available as the ISCC codes themselves), the ISCC standard could help to verify the authenticity of content. Todd Carpenter, executive director of the US standards organisation NISO, points out that the industry consortium Coalition for Content Provenance and Authenticity (C2PA) has standardised the ISCC as one of several options for creating a “soft binding” via central registries: it links media files that contain provenance data and other important metadata according to the C2PA standard with copies of those files that are shared on the web without this embedded metadata and can still be recognised thanks to the “soft binding”. AI applications for image processing and generation could also generate and use such information, for example to tag AI-generated content.

Another, similar application of ISCC codes would be, for example, a machine-readable statement from an archive claiming the public-domain status of a particular work under German copyright law as of a certain date. Dark archives, for which it is too laborious to verify whether a particular work has already entered the public domain, could find out which of their own works have already been marked as public domain through an automated comparison with, say, Wikimedia Commons or another archive – provided that the other party (in this example, Wikimedia Commons) has already recorded its holdings using ISCC and made the resulting codes available in its search systems.
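In its simplest form, such an automated comparison reduces to a set intersection, assuming both institutions publish plain lists of ISCC codes and exact matches are sufficient (all codes and labels below are invented for illustration; similarity matching would instead compare the decoded units by Hamming distance):

```python
# Invented example codes - in practice these would come from the two
# institutions' search systems or harvesting interfaces.
commons_public_domain = {
    "ISCC:KACT4EBWK27737D2AYCJRAL5Z36GAAAAAAAAAAAAAAAAAAAAAAAA",
    "ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HAAAAAAAAAAAAAAAAAAAAAAAA",
}

dark_archive_holdings = {
    "ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HAAAAAAAAAAAAAAAAAAAAAAAA",
    "ISCC:KACWN77F727NXEDJDNDZ6L7P6TLDAAAAAAAAAAAAAAAAAAAAAAAA",
}

# Works in the dark archive that another institution has already
# marked as public domain (exact match on the full ISCC code).
already_public = dark_archive_holdings & commons_public_domain
print(already_public)
```

The decentralised point is that neither side needs access to the other's files: exchanging the codes alone is enough.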

Discussing potential applications of ISCC for Libraries, Archives and Museums

The ISCC now makes it possible to compare or link digital objects in a decentralised way, i.e. no matter where the objects are stored, regardless of the context in which they exist. For example, the ISCC can be used to quickly determine similarities or even equivalence between two different objects with different DOIs. If ISCCs were stored at the DOI registrar, this could even be done directly there.

The strength of the ISCC in the context of scholarly publishing and cultural heritage lies not least in its ability to automatically identify a definable work retrospectively. As already mentioned, this possibility also exists for works that are not or only partially openly accessible. This is also in line with the logic of the FAIR principles, according to which we strive for the best possible discoverability and accessibility, even if we cannot provide general open access in every case.

In this way, works registered under different identifiers in the same or different reference systems can be marked as identical or similar. This allows, for example, a clustered view of the results of a search or database query, or the enrichment of existing citation records with additional metadata available elsewhere for the same work. Proprietary similarity hashes, such as those used for many years by database providers like Google Scholar, already ensure a significantly better clustering of identical works – but without the resulting benefit being passed on to the use cases of other actors.

However, it must be emphasised that although the ISCC uses similar mechanisms, it is not a system for automatic content recognition (ACR), such as the upload filters of large web platforms. The latter use large proprietary databases full of granular digital fingerprints to compare content. For example, they can detect a copyrighted song that can be heard for a few seconds in the background noise of an uploaded video.

Registries for ISCCs at central libraries such as the TIB, Europeana or institutions such as the Institut für Deutsche Sprache (IDS Mannheim) are also conceivable, in order to be able to prove the provenance of original digital content. Not only could research information infrastructures use ISCCs to help themselves and each other improve their services, but there are also conceivable use cases for a wide range of other actors. For example, machine learning models could use these ISCC codes to register their training data, thereby achieving a higher level of trust and reliability.

The source code required to generate ISCC codes is available as open source, so any institution can integrate it into its repositories, document servers or research data archives. Of course, creating ISCCs for local content, as well as building and maintaining the infrastructure to automatically generate and distribute these codes, is a lot of work today. Until this becomes a standard component of the relevant software systems, it will require development and data management resources with an appropriate budget. However, this is outweighed by the many potential benefits we have outlined in this article.

We would like to thank Sebastian Posth (ISCC Foundation) for his thorough review of the first draft of this blog post.

Librarian. 🤓
Head of Open Science Lab at TIB.
Follow me at https://openbiblio.social/@Lambo

Gerrit Gragert

... works at the Staatsbibliothek zu Berlin, where he heads the department „IT-Services für die Digitale Bibliothek“ (IT Services for the Digital Library).