First off, thanks to John for letting me post here. I do hope to be back as the work described below progresses.
In May and June of this year I spent just over two weeks initiating the online publication of the Kenchreai Archaeological Archive (KAA). This is an initiative of the American Excavations at Kenchreai, which conducts its work via a permit from the Greek Ministry of Culture and under the auspices of the American School of Classical Studies at Athens. Joseph Rife of Vanderbilt University is the director and I’m grateful for his permission to publish the results of this collaborative project. The focus of my work is the written and visual documentation of the excavations carried out at Kenchreai in the 1960s by the University of Chicago. These records are now in the Isthmia Museum, along with the objects that the project saved.
Cutting to the chase after the above preliminaries, I am modeling KAA using the Resource Description Framework (RDF) in combination with the principles of Linked Open Data (LOD). Before offering an introduction to RDF, I’ll say that I’m using it because RDF gives me a simple and robust structure for describing the highly variable archaeological information that I am discovering in the extant records.
Superbrief intro to RDF
What is RDF? The simplistic answer is that it’s a W3C standard for encoding information, one with a formal description at http://www.w3.org/TR/rdf-concepts/ . More usefully, RDF has at its core the concept of a triple. A triple is a three-part statement consisting of:
- A subject: what you are talking about.
- A predicate: the type of information that you’re saying about the subject.
- An object: the value – meaning content – of that information.
Informally, the phrases:
- “Sebastian Heath”
- “is a”
- “human”
Could be understood as the RDF triple:
- Subject: me (“Sebastian Heath”)
- Predicate: assertion of nature (“is a”)
- Object: human being (“human”)
Or:
- Augustus
- held the office
- Roman Emperor
The last example becomes more interesting when we replace the words with web addresses:
I’ve now used a set of publicly available web addresses to construct a triple about an historical individual. In doing so I’ve merged the ideas of Linked Open Data (LOD) into this discussion of RDF.
LOD is a set of best practices that suggests using URIs – here meaning well-constructed and stable web addresses – to identify publicly accessible resources. It further suggests that the information available at those addresses should be machine readable.
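As a minimal sketch of the triple idea, the two informal examples above can be written down as plain data. The `Triple` type and its field names are illustrative assumptions, not part of RDF or KAA, and plain words stand in where a Linked Open Data version would use URIs.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement (hypothetical helper type)."""
    subject: str    # what you are talking about
    predicate: str  # the kind of information being stated
    obj: str        # the value of that information


# The two informal examples from the text, kept as plain words.
statements = [
    Triple("Sebastian Heath", "is a", "human"),
    Triple("Augustus", "held the office", "Roman Emperor"),
]

for s in statements:
    print(f"{s.subject} -- {s.predicate} --> {s.obj}")
```

Every statement has the same three-slot shape, whatever it is about, which is what makes a list of triples easy to extend.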
I’m now a long way from Kenchreai. Before getting back on track, here are some links to information about RDF and LOD that readers may find useful:
RDF and LOD at Kenchreai
OK, back to Kenchreai. Upon my first diving into the 1960s notebooks, I was pleased to see that the project was very organized about creating identifiers for the archaeological phenomena they were encountering. “Archaeological phenomena” is my fancy term for things like “trench”, “layer”, “object”, “sherds found together”, etc. I happily note that I am at the early stages of understanding how these ideas were manifested at the site in the 1960s, so what follows here is highly preliminary and subject to change.
Here is an example, given as a series of steps:
- The Area E notebook for August 4, 1963 reads in part, “At level 0.90 m we started putting all sherds in [box E121].” You can see an image of that page via http://kenchreai.github.io/kaa/notebook-page-e-i-038 . Click thru on the image for more detail.
- From box E121, a sherd was pulled and sent in to be inventoried.
- When inventoried that sherd was assigned the ID “KE 670”. You can see the relevant entry in the inventory book via http://kenchreai.github.io/kaa/KE0670 .
- As was the practice at Kenchreai (and at other American excavations in Greece), KE 670 was also assigned a “subject number”, in this case “P 176”. This indicated that it was the 176th piece of pottery inventoried.
I could go on, but I hope it’s clear that there are a lot of RDF triples implied in the above narrative. Taking KE 670 as our starting point, some of them are:
- KE0670
- type
- Inventory number

- KE0670
- is part of
- box-e0121

- KE0670
- is the same as
- P0176
If you’d like to see everything KAA is saying about KE0670 (to use the fully padded version), go to http://kenchreai.github.io/kaa/KE0670 . But please understand that that’s the temporary location of KAA. I’ll report its permanent address when that’s available.
You may have noticed that the above pseudo-RDF makes reference to other Kenchreai identifiers. One of them is box-e0121, which KE0670 is said to be part of. KAA in turn says the following about box-e0121:
- box-e0121
- type
- excavation-box

- box-e0121
- is part of
- trench-e-ii-x-1
And again in turn:
- trench-e-ii-x-1
- is part of
- area-e
And yes, there are URIs for those identifiers: http://kenchreai.github.io/kaa/box-e0121 and http://kenchreai.github.io/kaa/trench-e-ii-x-1 .
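The pseudo-RDF above can be sketched as a flat list of tuples. This is an illustration only: the predicate strings are the informal labels from the text rather than real RDF predicates, and `describe` is a hypothetical helper, not something KAA provides.

```python
# (subject, predicate, object) tuples copied from the pseudo-RDF above.
# Predicates are the informal labels used in the text.
TRIPLES = [
    ("KE0670", "type", "Inventory number"),
    ("KE0670", "is part of", "box-e0121"),
    ("KE0670", "is the same as", "P0176"),
    ("box-e0121", "type", "excavation-box"),
    ("box-e0121", "is part of", "trench-e-ii-x-1"),
    ("trench-e-ii-x-1", "is part of", "area-e"),
]


def describe(identifier):
    """Return every (predicate, object) pair stated about one identifier."""
    return [(p, o) for s, p, o in TRIPLES if s == identifier]


for predicate, obj in describe("box-e0121"):
    print("box-e0121", predicate, obj)
```

Because the object of one triple (box-e0121) is the subject of another, looking up each identifier in turn walks from the sherd to its box, trench, and area.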
I can summarize the above by saying that: KAA is establishing web-based equivalents of the Kenchreai identifiers and is using RDF triples to indicate relationships between those identifiers.
The above means that I think I have a single conceptual structure, the RDF triple, that I can use to represent all information inherent in the materials – written, visual, and physical (meaning the objects themselves) – now stored in the Isthmia Museum. As implied above, I’m at the stage of putting these ideas to the test.
An additional point: because I’ve worked to choose sensible strings of characters for each ID, it’s easy to turn them into URIs. You’ve seen some already; here are some more examples:
Note that on each of those web-pages, you can click on the identifiers it references to see what KAA says about those. Furthermore, and this is important, you can scroll to the bottom of each page and click the “as rdf” link to get a machine readable representation of the data represented at that web address.
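To illustrate how an identifier can become a URI, here is a minimal sketch. It assumes that the numeric part of an inventory or subject number is zero-padded to four digits, which is inferred only from the examples KE0670 and P0176, and it uses the temporary base address given earlier. The function names `padded_id` and `kaa_uri` are hypothetical.

```python
import re

# Temporary base address given in the post.
KAA_BASE = "http://kenchreai.github.io/kaa/"


def padded_id(label, width=4):
    """Turn a label such as 'KE 670' into 'KE0670'.

    Assumption: the number is zero-padded to four digits, as in the
    examples KE0670 and P0176. The real rule may differ.
    """
    match = re.fullmatch(r"([A-Za-z]+)\s*(\d+)", label.strip())
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    prefix, number = match.groups()
    return f"{prefix}{int(number):0{width}d}"


def kaa_uri(identifier):
    """Append an identifier to the base address to form its URI."""
    return KAA_BASE + identifier


print(padded_id("P 176"))
print(kaa_uri(padded_id("KE 670")))
print(kaa_uri("box-e0121"))
```

Keeping the ID itself as the last path segment means the notebook label, the padded identifier, and the web address stay recognizably the same thing.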
Querying the Kenchreai RDF
Because KAA is a list of triples, it’s easy to make that list available. For now see:
for two versions of all current triples.
I’m not going to go into the details of the format of those files other than saying that they’re “raw rdf”. I’ve made them available as a convenience for readers, but also so that I can demonstrate a simple query into this dataset.
“Query” is just another fancy term for “extract useful information from some data.” Part of the RDF suite of technologies is a language for describing such queries. It’s called SPARQL and its details go beyond the scope of this post.
My use case for now is finding all inventory numbers that are said to be part of “Area E”, which is just the designation for a part of the site where the project excavated. This is an interesting problem because, as you may have noticed, KAA does not explicitly say that an inventory number is part of a particular area. But I do plan to assert that inventory numbers are part of excavation boxes (when they are), that excavation boxes are part of trenches, and that trenches are part of areas.
The “is part of” relationship is indicated by the RDF predicate “dc:isPartOf”. “dc” stands for “Dublin Core”, which is a set of “core” terms for describing data. That is, I have a standardized way of expressing the logical relationship that one KAA identifier is part of another. Read more about “dc:isPartOf” at http://dublincore.org/documents/dcmi-terms/#terms-isPartOf .
So here’s an example of a SPARQL query that finds all inventory numbers that are part of “kaa:area-e”:
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX kaa: <http://kenchreai.github.io/kaa/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?kenchreai_id
FROM <https://dl.dropboxusercontent.com/u/17002562/kenchreai.rdf>
WHERE {
  ?kenchreai_id ?p kaa:inventory-number .
  ?kenchreai_id dc:isPartOf+ kaa:area-e .
}
The super important part is the ‘+’ symbol after ‘dc:isPartOf’ on the last line. That will cause a SPARQL query engine to follow all dc:isPartOf predicates to see if an identifier is said to be part of ‘kaa:area-e’. Cleverly, this builds on the pseudo-RDF for KE0670 that I presented above.
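What the ‘+’ does can be sketched in a few lines of plain Python. This is not how a SPARQL engine is implemented; it is a small illustration, over triples taken from the narrative above, of following “is part of” links one or more times. The function names are hypothetical.

```python
# Triples taken from the narrative; names follow the SPARQL query.
TRIPLES = [
    ("KE0670", "type", "inventory-number"),
    ("KE0670", "isPartOf", "box-e0121"),
    ("box-e0121", "type", "excavation-box"),
    ("box-e0121", "isPartOf", "trench-e-ii-x-1"),
    ("trench-e-ii-x-1", "isPartOf", "area-e"),
]


def containers(identifier):
    """Everything reachable by following one or more isPartOf links."""
    found = set()
    frontier = [identifier]
    while frontier:
        current = frontier.pop()
        for s, p, o in TRIPLES:
            if s == current and p == "isPartOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def inventory_numbers_in(area):
    """Identifiers tied to inventory-number that sit inside the area."""
    # Like '?p' in the query, any predicate pointing at inventory-number counts.
    return sorted(
        s for s, p, o in TRIPLES
        if o == "inventory-number" and area in containers(s)
    )


print(inventory_numbers_in("area-e"))  # ['KE0670']
```

No triple says directly that KE0670 is part of area-e; the answer comes from chaining box, trench, and area, which is the work the ‘+’ asks the query engine to do.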
As a convenience, I have set up a link to the SPARQL query-engine (ok, “endpoint” for those in the know) at http://sparql.org/sparql.html to run that query and return readable results. That link is http://sparql.org/sparql?query=PREFIX+dc%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0D%0APREFIX+kaa%3A+%3Chttp%3A%2F%2Fkenchreai.github.io%2Fkaa%2F%3E%0D%0APREFIX+rdf%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E%0D%0ASELECT++%3Fkenchreai_id%0D%0AFROM+%3Chttps%3A%2F%2Fdl.dropboxusercontent.com%2Fu%2F17002562%2Fkenchreai.rdf%3E%0D%0AWHERE+%7B%0D%0A++%3Fkenchreai_id+%3Fp+kaa%3Ainventory-number+.%0D%0A++%3Fkenchreai_id+dc%3AisPartOf%2B+kaa%3Aarea-e+.%0D%0A+%7D+ORDER+BY+%3Fs&default-graph-uri=&output=xml&stylesheet=%2Fxml-to-html.xsl
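A long link like that one does not need to be typed by hand. As a sketch, the standard library can build a similar URL from the query text. The parameter names (`query`, `default-graph-uri`, `output`, `stylesheet`) are read off the link above. The snippet only constructs the address and does not contact the endpoint, and its line breaks are encoded slightly differently from the link above (%0A rather than %0D%0A).

```python
from urllib.parse import urlencode

# Endpoint address as it appears in the link above.
ENDPOINT = "http://sparql.org/sparql"

QUERY = """PREFIX dc: <http://purl.org/dc/terms/>
PREFIX kaa: <http://kenchreai.github.io/kaa/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?kenchreai_id
FROM <https://dl.dropboxusercontent.com/u/17002562/kenchreai.rdf>
WHERE {
  ?kenchreai_id ?p kaa:inventory-number .
  ?kenchreai_id dc:isPartOf+ kaa:area-e .
}"""

# Parameter names and values copied from the link above.
params = {
    "query": QUERY,
    "default-graph-uri": "",
    "output": "xml",
    "stylesheet": "/xml-to-html.xsl",
}

# urlencode percent-escapes each value and turns spaces into '+'.
url = ENDPOINT + "?" + urlencode(params)
print(url)
```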
Yes, it’s true; I have only done the data entry to show that KE0670 is part of Area E. But when I’ve done more work, you’ll be able to get the entire list.
Conclusion
So, yeah, I think it’s cool that I have a simple structure to encode all data types in a format that can be queried using standards-based third-party tools. And I’m really glad that I can do this without having to define a separate table for each data type. That’s what I would have to do had I chosen a relational model for KAA. So perhaps the major takeaway from this post is that RDF represents one way to overcome the shortcomings of the relational databases that are so prevalent on archaeological projects today. I doubt that everyone will share that opinion, since this discussion has been too brief to justify the conclusion. But I’ll return to this space when progress warrants it and look forward to an ongoing exchange on the ideas I’ve floated here.