Bill Caraher in his New Archaeology of the Mediterranean World blog has mentioned this blog frequently. Lately he has pondered what to do about his participation in a larger field project in the Princeton Polis Expedition. More specifically, he is trying to decide how much his small study group should invest in its own data structure, which may or may not be compatible with the data set of the project as a whole.

“I have been diligently reading John Walrodt’s Paperless Archaeology over the past few weeks. This blog documents in detail how a project implemented their digital workflow. From what I have seen so far, the tools that they developed and deployed served to facilitate their ongoing, in the field, research (although I am sure that there are provisions for archiving the data in a responsible way).”

He is correct, of course, in that this blog is currently focused on ongoing field research. There are things to say about data repositories, but I haven’t gotten there yet. My main focus on PARP:PS is data collection and immediate consumption and analysis of that data for preliminary publication.

It might help to realize that there are two different things at play: datasets and databases. I create databases to manage my datasets. The datasets might vary slightly from project to project and region to region, but are fairly interoperable. The variation that does exist is primarily a reflection of the variation that you get in survey/excavation/finds processing techniques from project to project. I can export the contents of any of the databases that I manage into a few dozen or so tables (PARP:PS currently has 35 tables) that can be key-linked using any database tool available.
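To make the dataset/database distinction concrete, here is a minimal sketch in Python of what "key-linked" exported tables look like once they are loaded into another tool. The table names, columns, and sample rows are invented for illustration and are not the actual PARP:PS schema; SQLite stands in for "any database tool available."

```python
import sqlite3


def build_example_db():
    """Load two hypothetical exported tables into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE su (
            su_id  INTEGER PRIMARY KEY,  -- stratigraphic unit number
            trench TEXT NOT NULL,
            phase  INTEGER NOT NULL
        );
        CREATE TABLE finds (
            find_id  INTEGER PRIMARY KEY,
            su_id    INTEGER NOT NULL REFERENCES su(su_id),  -- the linking key
            material TEXT NOT NULL
        );
        """
    )
    # Invented sample rows, not real project data.
    conn.executemany(
        "INSERT INTO su VALUES (?, ?, ?)",
        [(1001, "A", 1), (1002, "A", 2), (2001, "B", 1)],
    )
    conn.executemany(
        "INSERT INTO finds VALUES (?, ?, ?)",
        [(1, 1001, "ceramic"), (2, 1001, "bone"), (3, 1002, "glass"), (4, 2001, "ceramic")],
    )
    conn.commit()
    return conn


def finds_with_context(conn):
    """Join each find to its stratigraphic unit through the shared su_id key."""
    query = """
        SELECT f.find_id, f.material, s.su_id, s.trench, s.phase
        FROM finds AS f
        JOIN su AS s ON s.su_id = f.su_id
        ORDER BY f.find_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in finds_with_context(build_example_db()):
        print(row)
```

The point of the sketch is only that the data survives outside the original software: once the tables share keys, any tool that can join on them can rebuild the relationships.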

But the database itself in its current FileMaker form does much more than store the data. We use it to view and summarize the data in a number of different ways. We can view all material from a single SU (Stratigraphic Unit), from a single phase (what Bill refers to as “level” in the Princeton Polis project), or from a whole trench. Since we also have defined rooms and properties, we can add those to the database and view everything in those contexts as well. The database imports the images and creates the find numbers that we use to avoid data entry mistakes, and has validation routines to verify that hand-entered data is correct.
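The kinds of things the database does beyond storage can be sketched in a few lines of Python. This is not how the FileMaker database is built; the record fields, the find-number format (trench-SU-sequence), and the validation rules below are all assumptions made up for the example.

```python
from collections import defaultdict

# Invented sample records; the find-number format is hypothetical.
FINDS = [
    {"find_no": "A-1001-001", "su": 1001, "phase": 1, "trench": "A"},
    {"find_no": "A-1001-002", "su": 1001, "phase": 1, "trench": "A"},
    {"find_no": "A-1002-001", "su": 1002, "phase": 2, "trench": "A"},
    {"find_no": "B-2001-001", "su": 2001, "phase": 1, "trench": "B"},
]


def group_by(records, field):
    """Collect find numbers under each value of one field (SU, phase, or trench)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record["find_no"])
    return dict(groups)


def next_find_number(trench, su, existing):
    """Generate the next find number for an SU so nobody has to type it by hand."""
    prefix = f"{trench}-{su}-"
    used = [int(n[len(prefix):]) for n in existing if n.startswith(prefix)]
    return f"{prefix}{max(used, default=0) + 1:03d}"


def validate(record, known_sus):
    """Return a list of problems found in a hand-entered record (empty if none)."""
    errors = []
    if record["su"] not in known_sus:
        errors.append(f"unknown SU {record['su']}")
    if record["phase"] < 1:
        errors.append("phase must be 1 or greater")
    return errors


if __name__ == "__main__":
    print(group_by(FINDS, "su"))
    print(group_by(FINDS, "trench"))
    print(next_find_number("A", 1001, [f["find_no"] for f in FINDS]))
    print(validate({"find_no": "?", "su": 9999, "phase": 0, "trench": "A"}, {1001, 1002, 2001}))
```

The same records are regrouped by SU, phase, or trench without being stored three times, and the generated numbers and validation checks are where data entry mistakes get caught.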

To put it another way: while my database might not be useful for many projects other than PARP:PS, I am pretty certain that my dataset will be useful for comparison to other projects at Pompeii or in Italy in general.

Getting back to Bill’s dilemma. His two options, as he expressed them, are to “develop a data structure best suited to answer our immediate research questions…On the other hand, I could imagine a data structure (undoubtedly more complex) best suited to preparing the Polis data for some form of digital publication (or at least archival storage). Few projects in the Eastern Mediterranean with a Byzantine focus have made their data publicly available. In this regard, the Polis data could be an important step toward making stratigraphic, typological, and chronological data from the Byzantine period available in digital form. At the same time, the two Early Christian churches represent just one part of a much larger and more complex site. Taking the time to produce a thorough and well-structured dataset could be a fool’s errand if it ends up being incompatible with other work ongoing at the site or finds very few comparable datasets elsewhere in the region.”

From my perspective this is an easy decision to make, although I haven’t seen any of the data. The more attention you give to the structure of the data set, the better you will be able to use distinct parts of that data set for analysis. It is my experience in fieldwork that inelegant solutions (Bill’s term) exist because no one has spent the time and energy to produce a more elegant one. Sometimes a project is looking for someone like Bill to show them the benefits of a better designed data structure. I have been on both sides of that conversation: I have created data structures that have been adopted by the larger project, and I have incorporated data structures that others want to use into my own. I find that the key is to prove the benefits of the new solution; once you do, the conclusion becomes obvious. The issue of comparable datasets outside of the project isn’t, of course, Bill’s problem. If he is the first one to make such a tool available, and he publishes his data structure, it is up to those following him to try to make their data comparable.

Much of this is to say that I am shifting to posting about databases for the next few weeks. I will post a clone of the PARP:PS database in its current form (it is, of course, unfinished) as well as the files necessary to put the database on an iPad using FMTouch. The database is a complex beast and I don’t expect people to understand it right away, but I do plan on a series of posts explaining parts of the database in more detail.
