Data Modelling Workshop, 23-24 May 2022

By Sarah Middle, Duncan Hay and Alex Butterworth

Last month, the three of us met in Cambridge for a two-day workshop. Our aims were to model the core, meticulously cleaned data from the legacy SIMON (Scientific Instrument Makers, Observations and Notes) database using an event-based structure and, in doing so, to get to grips with Arches, which will serve as the data management software for the next phase of the project.

Arches is an open-source web platform for managing heritage data, created and maintained by the Getty Conservation Institute and the World Monuments Fund. Although originally designed for archaeological data, Arches affords the ‘Tools of Knowledge’ project several potential benefits: it provides a useful interface for managing and exploring heritage data of the type we are working with, presents a fully featured (and potentially extensible) API that will allow us to create custom visualisations of the data, and has a wide community of institutional users worldwide. Of most immediate importance, it provides a means of modelling data with reference to a formal ontology, separately from but in direct relationship to the database design. This offers us a lot more flexibility than our current approach, in which modelling is tightly coupled with the database schema, so that any alteration to the data model, and the database migration that follows from it, is an involved process.

To model our data, we are primarily using classes and properties from the CIDOC Conceptual Reference Model (CIDOC CRM), although we expect to integrate a range of other vocabularies and ontologies, both existing and novel, as we proceed. CIDOC CRM is an event-based model, developed for use in a cultural heritage context, which allows us to model entities (such as people and objects) in terms of events that have occurred throughout their lifespans. For example, rather than simply stating a person’s birth date, CIDOC CRM connects that person to a birth event, which is in turn connected to other pieces of information, such as place and time. Using a model like this means that we can draw out the detail in the SIMON dataset, in its full richness, and represent more nuanced links between the actors described within it, even before its enhancement with complementary datasets. However, as the original SIMON database has a very different structure, representing it accurately in this new event-based data model will involve an extended and painstaking process of drafting, reflection and review.

Our starting point was to create a new ‘resource model’ (the name Arches gives to the main entities/nodes in a database) for a Person, assigning them a surname and given name using the CIDOC CRM class of Appellation (E41). We also wanted to keep the original SIMON identification number for all the person data contained in the database, so added this as an Identifier (E42). To connect these to Person (E21), we used the property ‘is identified by’ (P1). Helpfully, Arches validates your data model against any external resources you are using, so the options available are restricted to a choice of classes and properties that result in a valid CIDOC CRM structure.
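As a rough illustration of the structure described above, here is a minimal Python sketch of a Person node linked to its names and SIMON number. It is not how Arches stores or exposes resource models: the `Node` class, the `build_person` helper and the sample values (`SIMON-0001`, `Ann Example`) are all hypothetical, invented purely to show the shape of the graph. Only the CIDOC CRM class and property names come from the model itself.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node in the sketch: a CIDOC CRM class plus a value or label."""
    crm_class: str
    value: str
    edges: list = field(default_factory=list)  # list of (property, Node) pairs

    def link(self, crm_property: str, target: "Node") -> "Node":
        """Attach a target node via a CRM property and return self."""
        self.edges.append((crm_property, target))
        return self


def build_person(simon_id: str, surname: str, given_name: str) -> Node:
    """Hypothetical helper: Person (E21) identified by two names and an ID."""
    person = Node("E21 Person", f"{given_name} {surname}")
    for crm_class, value in (
        ("E41 Appellation", surname),
        ("E41 Appellation", given_name),
        ("E42 Identifier", simon_id),  # the original SIMON number
    ):
        person.link("P1 is identified by", Node(crm_class, value))
    return person


# Sample values are invented for illustration only.
person = build_person("SIMON-0001", "Example", "Ann")
for crm_property, target in person.edges:
    print(f"{person.crm_class} -> {crm_property} -> {target.crm_class}: {target.value}")
```

Every link from the Person uses the same property (P1), and only the class of the target node distinguishes a name from an identifier.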

Birth (E67) and Death (E69) were a bit trickier, partly because these classes are themselves events with dates attached to them, and particularly because our prior work with the dates in SIMON has revealed them to be quite complex. While some dates are known, others are only estimated in relation to other events. In many cases, too, we have only a narrow date range when someone was known for certain to be alive (or worked, or ‘flourished’), from which it must be inferred that they were born before the beginning of the range and died after the end. Luckily, CIDOC CRM includes useful properties, such as ‘occurs before’ (P120) and ‘occurs after’ (P120i), which allow us to capture these nuances in our data model without giving the user a false impression of certainty.
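To make the relative-dating idea concrete, the sketch below turns an attested date range into two ordering statements rather than invented birth and death years. It assumes the logical reading that someone attested throughout a range was born before it began and died after it ended. The function name, the identifier scheme and the sample years are hypothetical, and the output is a plain list of tuples, not an Arches or RDF serialisation.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, CRM property, object)


def life_event_constraints(person_id: str, first_year: int, last_year: int) -> List[Triple]:
    """Express birth and death only relative to an attested period.

    No birth or death year is asserted; the events are simply ordered
    against the period in which the person is known to have been alive.
    """
    if first_year > last_year:
        raise ValueError("first_year must not be later than last_year")
    attested = f"{person_id}/attested_{first_year}-{last_year}"
    return [
        (f"{person_id}/birth", "P120 occurs before", attested),
        (f"{person_id}/death", "P120i occurs after", attested),
    ]


# Hypothetical person, attested between 1750 and 1760.
for triple in life_event_constraints("person/1", 1750, 1760):
    print(triple)
```

Keeping the constraint relative means a later, better-sourced date can be added without contradicting anything already recorded.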

The presence of various markers in the SIMON dataset to indicate confidence levels in some of the dates requires even more nuanced modelling. The main markers are ‘c.’ (an abbreviation for ‘circa’, indicating that the date is fuzzy, i.e. the event happened at about this time) and ‘?’ (indicating uncertainty that the date is correct). In our data model, we added Confidence as a measurable Dimension (E54), which can either be classified as one of two types (E55), ‘fuzzy’ or ‘uncertain’, or be assigned a probabilistic value (E60), which will be useful when we come to add computationally generated statistical data later on. We have also provided for further information or explanation to be appended using a confidence note (E62). Representing levels of (un)certainty for Humanities data poses many challenges and there is currently no standard method of doing so, since the requirements depend so much on the constraints and methods of individual projects. We have extensively researched the background to work in this area, and draw upon aspects of it. Our Confidence model is a provisional attempt at a solution, which currently works for our purposes but will remain under review as our data modelling progresses. We will post an expanded consideration of this work, in relation to both its conceptualisation and the project-specific work that it enables, as it develops.
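The sketch below shows one way the two markers might be read off a date string and mapped onto the parts of the Confidence model. It rests on assumptions that SIMON itself may not satisfy: that ‘c.’ is always a prefix, that ‘?’ is always a suffix, and that the remainder is a plain four-digit year. The `Confidence` class and `parse_date` function are hypothetical names, not part of Arches or of the project's code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Confidence:
    """Hypothetical mirror of the Confidence dimension (E54)."""
    types: Tuple[str, ...] = ()    # E55 Type: 'fuzzy' and/or 'uncertain'
    value: Optional[float] = None  # E60 Number: probability, if ever computed
    note: Optional[str] = None     # E62 String: free-text explanation


def parse_date(raw: str) -> Tuple[int, Confidence]:
    """Split an assumed 'c.1750?' style string into a year and its markers."""
    text = raw.strip()
    types = []
    if text.endswith("?"):        # '?' = uncertain that the date is correct
        types.append("uncertain")
        text = text[:-1].strip()
    if text.startswith("c."):     # 'c.' = circa, i.e. a fuzzy date
        types.append("fuzzy")
        text = text[2:].strip()
    return int(text), Confidence(types=tuple(types))


# Invented sample strings covering each combination of markers.
for sample in ("1750", "c.1750", "1750?", "c. 1750?"):
    print(sample, "->", parse_date(sample))
```

The probabilistic value and the note are left empty here, since the source data carries only the markers. They are the slots that later statistical work or editorial comment would fill.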

By the end of the second day, our familiarity with CIDOC CRM had increased and we had become more adept at working with Arches. In some areas, such as our initial attempts at modelling organisations and apprenticeships, the rigour of the process exposed conceptual challenges, leaving us with more questions than answers, which we will work through with the broader project team. Both the approach and the platform look promising, however, and we look forward to further populating our data model over the coming weeks and months, with updates on that process to follow.