Friday, January 14

Thoughts on contracts and other documents

I have started working on markup for contracts to enable database functionality. I spent a great deal of time (but not really) worrying about the fact that XML works on tree structures. Because one generally wants networks (of which trees are a subset), I was worried that there are circumstances in which XML is insufficient, and so I played around with RDF for a while. XML applications are much better developed, while RDF is theoretically more generalizable... I couldn't make a decision.

Then I realized that for historians, we only need trees. For ego-centric networks, cross-relationships between branches are secondary to the relationship to the trunk, which for us means the document. I may need to think on this for a few days (and welcome commentary from my non-existent readership), but I think that we should ultimately be starting with the original documents and marking our way up. Under this paradigm, relations between entities mentioned in the document are functions of the document itself. This is a bit difficult to wrap my mind around, but I think it works like this:
  • For entities mentioned in the head or foot of a document, their relationship is to the document (and to some degree, the implied readership). For example, the "Posted by Ian Miller" at the bottom of this blog post identifies my role, my relationship to the post (writer). In the context of most Chinese documents, this will be in a format like: "著 邰小宝" or "author: Tai Xiaobao".
  • For entities mentioned in the document itself, their relationship is to the setting of the document unless otherwise specified. For example, in an American newspaper, "President Obama" is enough of an indication because in this context he is well known, whereas people from less immediate contexts are specified in more detail, as "German Chancellor Angela Merkel". In the case of historical documents of unclear provenance, this gives us some indication of where they come from. In a contract I am working on, for example, the sellers are identified as "长房林协、次房林必录," or "Lin Xie, primary household; Lin Bilu, secondary household." In other words, Xie is from the eldest son's branch of the family (whether in this generation or a former one), and Bilu is from the second son's branch. This implies that the context of this document is the Lin family or lineage, as the relationship given is to the family. Likewise, if I were to refer to Obama as "eldest son Barack," that would tend to imply that I am referring to him in the context of his family, whereas "President Obama" is in the political context.
  • In some cases, entities are given a context that is not relative to the document. My referring to Obama as "eldest son Barack" seems a bit strange, but I might refer to him as "Ann Dunham's eldest son Barack," which specifies him relative to another context, that of his mother Ann. For the purposes of tree-structured markup, however, this does not make the relationship external to the document; in fact, the relationship of Ann and Barack is itself a branch of the document by the fact of its mention.
Ultimately, many applications of the data contained in these documents require us to pull these exterior relationships out of their context within the document. We may ultimately want to put them into a general network graph that cannot be fully specified by a tree. However, to historians, the source of the data about the relationship is as important as the data itself. To my mind, this means that we must first encode the relationship as part of the structure of the document.
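As a minimal sketch of what encoding a relationship as part of the document's structure could look like, the snippet below marks up the two sellers from the Lin contract and reads them back with Python's standard xml.etree.ElementTree. The element and attribute names (contract, body, party, role, branch, id) and the identifier "lin-contract" are assumptions made up for illustration, not an existing schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: every tag and attribute name here is invented
# for illustration and does not follow any published schema.
CONTRACT_XML = """
<contract id="lin-contract">
  <body>
    <party role="seller" branch="长房">林协</party>
    <party role="seller" branch="次房">林必录</party>
  </body>
</contract>
"""


def relations(xml_text):
    """Return (source document, entity, role, context) for each party.

    The document id travels with every relationship, so the source of
    the claim is never separated from the claim itself.
    """
    root = ET.fromstring(xml_text)
    doc_id = root.get("id")
    return [
        (doc_id, party.text, party.get("role"), party.get("branch"))
        for party in root.iter("party")
    ]


for row in relations(CONTRACT_XML):
    print(row)
```

Each extracted row still names the document it came from, which is the point: the relationship can be pulled out for other uses without losing its provenance.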

It is the nature of most (all?) documents that they can be encoded as trees. This was actually my epiphany today as I was working on coming up with a generalized tree structure for contracts: most documents are essentially the same as contracts; they:
  • Have headers specifying the context of the document (authorship, date...)
  • Specify entities by relationship to the context described by the document. First mentions of entities almost always specify this relationship. In contracts this is often written like so: "The vendor, Spacely Sprockets Incorporated". In other documents (e.g. news articles): "Angela Merkel, Chancellor of Germany".
  • Subsequent references to entities generally do not specify this relationship in full. In contracts, the format for these references is often defined at their first mention: "hereafter referred to as 'vendor' or 'Spacely Inc.'", a convention that is not generally followed in other kinds of documents. Nevertheless, in second and third occurrences, you will usually see "Chancellor Merkel" (Germany no longer being specified), or even just "Merkel."
  • Events and terms are generally broken into relatively discrete chunks to reduce the number of cross-references.
  • Former agreements, quotes or prior events are set apart by quotation marks, paragraph breaks or (as in classical Chinese) set phrases or characters.
This means that I should make my document framework as general as possible. It also means that if I do a good job making a general framework, it can be used to analyze a wide range of documents.
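One way such a general framework might start, sketched in Python under my own assumptions: a document carries a header plus a list of entities, each recorded with its full first mention, its relationship to the document's context, and the shorter forms used later. The class and field names (Document, Entity, header, relation, aliases, resolve) are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str        # full name as given at first mention
    relation: str    # relationship to the document's context
    aliases: list = field(default_factory=list)  # later, shorter references


@dataclass
class Document:
    header: dict     # context of the document: authorship, date, ...
    entities: list = field(default_factory=list)

    def resolve(self, mention):
        """Return the entity a mention refers to, or None if unknown."""
        for entity in self.entities:
            if mention == entity.name or mention in entity.aliases:
                return entity
        return None


# Illustrative data taken from the news-article example above.
article = Document(
    header={"type": "news article"},
    entities=[
        Entity(
            "Angela Merkel",
            "Chancellor of Germany",
            aliases=["Chancellor Merkel", "Merkel"],
        )
    ],
)

print(article.resolve("Merkel").relation)  # Chancellor of Germany
```

The same two classes would hold a contract just as well: "Spacely Sprockets Incorporated" with the relation "vendor" and the aliases "vendor" and "Spacely Inc."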

It strikes me that this is the most exciting thing about doing history, or at least the type of history that I want to do. Documents generally come to us in linear format, demanding to be read from front to back. As anyone who has ever diagrammed a sentence or looked at guidelines for formatting cover letters knows, documents have underlying tree structures (this is especially clear in the blank contract provided in the Fujian Provincial Statutes).


The epiphany of the day is that a general network graph can be built up from the individual trees (i.e. documents). To me, this is the real work of history: finding the trees in texts, building the trees into graphs, and then finding the logic within those graphs. This should not just be an exercise in pasting narratives together, but one of seeing the broader shape of things!
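A rough sketch of that trees-to-graph step, again in Python with the standard library only. Both toy documents, their markup vocabulary (doc, relation, entity, type, id), the relation labels and the party "Buyer X" are invented for illustration; only the pairing of Lin Xie and Lin Bilu as sellers comes from the contract discussed above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two toy documents. The second contract and "Buyer X" are made up
# so that one entity appears in more than one tree.
DOCS = [
    """<doc id="contract-1">
         <relation type="co-seller">
           <entity>Lin Xie</entity><entity>Lin Bilu</entity>
         </relation>
       </doc>""",
    """<doc id="contract-2">
         <relation type="sale">
           <entity>Lin Xie</entity><entity>Buyer X</entity>
         </relation>
       </doc>""",
]


def build_graph(xml_docs):
    """Merge per-document trees into one undirected graph.

    Each edge is stored as (neighbour, relation type, source document),
    so every link in the graph still points back to its tree.
    """
    graph = defaultdict(list)
    for text in xml_docs:
        root = ET.fromstring(text)
        for rel in root.iter("relation"):
            a, b = [e.text for e in rel.findall("entity")]
            label = (rel.get("type"), root.get("id"))
            graph[a].append((b,) + label)
            graph[b].append((a,) + label)
    return dict(graph)


graph = build_graph(DOCS)
for neighbour, rel_type, source in graph["Lin Xie"]:
    print(neighbour, rel_type, source)
```

Lin Xie ends up with edges drawn from two different documents, which no single tree could express, yet each edge keeps the id of the document that asserts it.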