Monday, November 14

Unsupervised Learning and Macro-history

It has been difficult to maintain a blog with interesting posts separate from the main thread of my research. The everyday work of grading, reading, and writing academic work has placed severe limits on my time and brain-space, so I have let this blog fall by the wayside for quite some time. I have recently decided to use this as a space to record and open some of my academic work to outside thought and criticism. A later post will address the reasoning behind this decision in somewhat more depth; for the time being, suffice it to say that this is a way to spur me to keep writing. Below is the first of these new posts.
I will be presenting a paper at the upcoming Social Science History Association meeting in Boston. This paper is a first attempt at using topic models (specifically a model called Latent Dirichlet Allocation) to analyze large-scale historical and textual trends. Specifically, I will be analyzing a partial version of the Qing Veritable Records (Qing shilu 清實錄), from the Yongzheng reign through the end of the dynasty. The following is a draft of my introduction:


The promise and failure of digital humanities

The past ten years have witnessed the meteoric rise of the so-called digital humanities. The term has become a buzzword in academic circles, the topic of much contemplation for some, dismissed as a passing fad by others. Perhaps this is due in part to confusion over what exactly it is supposed to mean. As far as I am concerned, the digital humanities are the result of two related trends: the large-scale digitization of textual corpora, and the realization that internet technologies could be adapted to research on these corpora. The internet is, after all, a particularly large and complexly organized body of text.

As such, much of the early work in humanities computing has centered on tasks that existing internet technologies solve quite well: searching and tagging. In essentially every discipline and sub-discipline in the academic world, there now exist large textual databases with increasingly powerful search capabilities. At the same time, several large consortia (e.g. the Text Encoding Initiative) have worked to develop humanities-specific standards for tagging textual and meta-textual data. As with the internet, the more tagged and interlinked a text, the more responsive it becomes to search technology.

These are both positive trends that have made certain types of research much easier. For example, it was possible for me to write a master's thesis on yoghurt in historical China in less than a year. Previously, given the rather obscure nature of this topic, it would have been necessary to read massive amounts of text looking for occasional references to dairy; with search technology, it was simply a matter of finding a few keywords and reading the limited sections of these texts that actually referred to the topic of interest. In short, research has become faster and more directed thanks to the reduced overhead of finding references to topics of interest.

Nonetheless, these technologies have not qualitatively changed the type of research that gets done. Before search capabilities, researchers focused primarily on topics that were of general interest, and frequently on phenomena that had already generated large amounts of scholarship. For very early history, we have limited numbers of extant texts, and these texts have generally been studied comprehensively. But for later periods of history, the tendency to focus on the "interesting" and the "already studied" has left a large body of texts that remain essentially unread, the "great unread."

With the advent of search technology, it has become even easier to focus on particular topics of interest, as opposed to those dictated by the coverage of the corpus, and easier to find topics like "yoghurt" that are not a central concern in most historical Chinese texts. At the same time, the rise of computerized tagging has made well-researched areas even more amenable to further research. Arguably, all that the digital humanities have done thus far is to turn the "great unread" into "the great untagged." In fact, they may have narrowed the focus of research as younger generations become less willing and less able to work on texts that have not been digitized.

To a large degree, though, the most interesting and most heavily researched corpora are the first ones to be digitized, so this does not really change much.

Macro-history has traditionally been constrained to a slightly different subset of the received corpora. To deal with very large areas or broad sweeps of time, macro-historians have generally been limited to a few approaches:
  • Reliance on curatorial and bibliographic work and secondary scholarship to cover some of the vast realm of primary sources. 
  • Focus on data that is clearly numerical in nature, allowing the comparatively easy application of statistical techniques.
  • Application of social science theory and categories of analysis to provide a framework for organizing and simplifying the vast array of information.
In other words, macro-history has tended to focus on things with numbers arranged into clear categories, frequently in cases where much of the data collection has been handled by previous generations of scholars.

It is worth noting that the statistical orientation of macro-historical work has made it more amenable to computerization from a much earlier date. Arguably, branches of social and economic history have been in the "digital humanities" since the 1970s or earlier.

Network analysis has been another growth area in social history for the past decade. Social network analysis also had roots in sociology in the 1970s, but it was not until the growth of the internet that physicists and computer scientists got interested in the field and its computational capabilities began to take off. The resurgence in network theoretical approaches to historical research has been one area in which the digital turn has led to a recognizable, qualitative shift in the type of research being done. Nonetheless, it has also remained constrained to that familiar subset of textual data - well-ordered, well-researched sources relating primarily to "people of note."

While there are exceptions to these general observations, it is clear that the digitization of historical research has not resulted in major shifts in the type of research being done, but rather in the extent of that research. It is a truism that as long as research questions are researcher-driven, the areas of analysis will largely be constrained to areas of obvious interest to those researchers at the outset of their projects: notable individuals, events, and topics of longstanding interest to the historiography. There is, however, a sizable body of text into which this research never ventures. What is more, increasing amounts of this text have been digitized, opening it up to less research-intensive methods of exploration. The question that remains is how to take research into this largely uncharted textual realm.

Unsupervised learning and the "great unread"

To me, the answer is to take the direction of the research at least partially out of the hands of the researcher. Another branch of digital technology that has developed in recent years has focused on a different type of machine learning. Search technology is driven in part by what is called supervised learning; the better search algorithms improve their results by taking into account human responses to their earlier searches. But increasingly, fields like computational linguistics have delved into unsupervised learning, building algorithms to analyze text based on the text itself, without human intervention at the analytical stage. These algorithms reveal aspects of the underlying structure of texts that are not as explicitly constrained by researchers' conceptions of what the texts should look like. Such techniques should allow us to delve into the great unresearched realms of digitized corpora.

I hope to show that these unsupervised statistical techniques allow us to look at some of the more diffuse phenomena in history. We already have well-developed methods to look into the lives of famous people, important events, and the easily quantifiable aspects of everyday life (demographics, economics). But aside from the somewhat quixotic attempts of Braudel and his ilk, few have looked into the everyday constants of the unquantifiable past. By undertaking a computerized reading of a large corpus (the Veritable Records of the Qing Dynasty), this project is able to quantify textual phenomena in a way that is not based on researcher assumptions. In doing so, I am able to examine five types of temporal change:
  1. Event-based change - the major events and turning points of history
  2. Generational change - in this case, the impact of individual rulers on their historical context
  3. Cyclical change - the repetition of phenomena on a circular basis, in particular the yearly cycle of seasons.
  4. Secular change - gradual, monotonic changes over the course of a long period
  5. Historical constants - phenomena that were a constant part of life in the past, but may or may not be different from the realities of modern life.
I will argue that the last three or four of these categories are poorly addressed by researcher-driven methods, except in cases where readily quantifiable data are available. In assessing cyclical or long-term change in social phenomena, we are generally constrained to the anecdotal observations of historical actors. Historical constants are even worse off: our informants tend not to directly address the givens of their existence, so these can only be inferred from the background noise of their writing, and then imprecisely at best.

In this method, rather than describing a priori the important categories of analysis, I assume only that there exist such categories. I then use a topic model to determine these categories based on the organization of the corpus itself. A topic model is a statistical model of how documents are constructed from a set of words and topics. At root, it is no different than the implicit model that researchers have about their texts - we assume that the authors of these texts were writing about certain topics, and chose to do so in a certain way. The substitution of a topic model simply implies a unified application of statistical assumptions across the entire corpus, rather than the selective application of unquantified assumptions across a subset of the texts.

The specifics of this topic model (Latent Dirichlet Allocation) and the software implementation (MALLET) that I used will not be described here. The model can be summarized as follows:
  • "Topics" are considered to be a distribution across all Words in the Corpus, but are generally described by giving a list of the most common words in that topic.
  • "Documents" are considered to be a distribution across all Topics in the Corpus, but are generally described by giving the proportions of the two or three most common Topics.
  • The generative model assumes that a Document is created by "choosing" Topics in a set proportion, and then "choosing" words from those topics randomly, according to the Topic proportions in the Document and the Word proportions in the Topic.
  • The topic model is reverse engineered statistically based on the number of Topics and the observed occurrence of Words within Documents in the Corpus. The researcher specifies only four inputs to the model:
    • Number of Topics
    • Proportion of Topics in the Corpus (but not in individual documents)
    • The Corpus itself, organized into Documents
    • Word delimitation
  • This model was run with 40, 60 and 80 Topics. 60 Topics gave the best combination of specificity (fine-grained Topics) and accuracy (fewest "garbage" Topics and "garbage" Words) and was used in this analysis.
  • All Topics were assumed to occur in equal proportions in the Corpus.
  • The Corpus used is the Veritable Records for the last eight reign periods of the Qing Dynasty, covering the period from 1723 to 1911. It is a posthumously assembled collection of court records for the reign of each emperor of the Qing Dynasty. It is generally well-ordered into Documents (consisting of messages to and from the emperor and records of court events) organized in chronological order with "date stamps" (i.e. the date is specified for the first Document of each day).
  • Words are considered to consist of single Chinese characters.
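MALLET estimates LDA with collapsed Gibbs sampling. As a rough illustration of how such a model is "reverse engineered" from nothing but the observed Words in Documents, the following is a minimal, toy Gibbs sampler in Python. It is a sketch of the general technique under the assumptions listed above, not the actual MALLET implementation; the function name and the hyperparameter defaults (alpha, beta) are my own.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = {w for doc in docs for w in doc}
    V = len(vocab)
    nzw = defaultdict(int)        # (topic, word) counts
    ndz = defaultdict(int)        # (document, topic) counts
    nz = [0] * num_topics         # total tokens assigned to each topic
    # Start from a random topic assignment for every token.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            nzw[t, w] += 1; ndz[d, t] += 1; nz[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this token's current assignment from the counts...
                t = z[d][i]
                nzw[t, w] -= 1; ndz[d, t] -= 1; nz[t] -= 1
                # ...and resample its topic from the full conditional.
                weights = [(ndz[d, k] + alpha) * (nzw[k, w] + beta)
                           / (nz[k] + V * beta) for k in range(num_topics)]
                t = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = t
                nzw[t, w] += 1; ndz[d, t] += 1; nz[t] += 1
    return z, nzw, ndz   # assignments, topic-word counts, doc-topic counts
```

Each Topic is then read off as the most frequent words in `nzw` for that topic, and each Document's Topic proportions from `ndz`, exactly as in the bullet points above.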
It should be noted that there are several assumptions made in using this model that may not be representative of the way documents were actually written and compiled in the corpus:
  • Word order is not considered within documents (this is a "bag of words" model). Obviously word order does matter in the actual writing of documents, but this model assumes that it is not important in determining the topics used in a document - in other words, it assumes that grammar may contribute to meaning, but not to topicality.
  • Documents are analyzed synchronically by the topic model (i.e. time is not integrated into the statistical model); diachronic analysis is reimposed on the Documents after Topics have been generated. Yet the documents were written over a long span of time, and topic usage may have changed over that span. In fact, topic drift is revealed by ex post facto analysis of the results, but it is not integrated into the model.
  • Document length is not considered. This is potentially problematic for the statistical conclusions reached on particularly short documents, which are primarily records of court events and tend to cluster in certain topics. A solution will be addressed in future research.
  • Regional variation is not incorporated into the topic model. Certain geographic Topics are identified by the model, but are not incorporated into my analysis of the results.
  • Section headers were not removed from the corpus before analysis. These are, however, identified by the model: three topics occur almost exclusively, and in high proportion, in section headers.
  • Written Chinese in the late imperial period did not actually consist of single-character words; single- and multiple-character words were both common. Single characters are analyzed because there are no good word-delimiting algorithms for classical (or modern) Chinese. As with document length and section headers, certain types of word formation are actually reverse engineered by the topic analysis, despite not being incorporated into the model.
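To make the single-character Word delimitation and the "bag of words" assumption concrete, here is a minimal Python sketch. The function names are hypothetical, and the Unicode range used is the CJK Unified Ideographs block (U+4E00 to U+9FFF), which filters out punctuation such as 。 and whitespace.

```python
from collections import Counter

def tokenize_chars(text):
    """Treat every CJK character as a single-character Word,
    dropping punctuation, numbers, and whitespace."""
    return [ch for ch in text if '\u4e00' <= ch <= '\u9fff']

def bag_of_words(text):
    """The order-free Document representation the topic model sees:
    only character counts survive; grammar and word order do not."""
    return Counter(tokenize_chars(text))
```

Two documents with the same characters in a different order produce identical bags, which is precisely the sense in which grammar "may contribute to meaning, but not to topicality."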
Clearly there are some problematic assumptions in this document model. As I have noted above, the topic analysis proves capable of reverse engineering certain structures that are not part of the model: regionality, temporality, problems in data cleaning, and so on. However, one especially problematic assumption remains: the analysis does not account for words, grammar, or other intratextual structures. It is very powerful at accounting for corpus-level, intertextual phenomena, exactly the type of analysis needed to probe the dark areas of the textual map. But it is not a substitute for other approaches to understanding the more proximate realms of meaning of individual texts within the corpus.
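The reimposition of chronology on the model's synchronic output, noted among the assumptions above, can be sketched as follows. This is a hypothetical helper, assuming each Document carries a per-topic proportion (from the fitted model) and a year recovered from its date stamp; the function name is my own.

```python
from collections import defaultdict

def topic_series(doc_topics, doc_years, topic):
    """Average one topic's per-document proportion by year,
    turning synchronic model output into a diachronic series."""
    totals, counts = defaultdict(float), defaultdict(int)
    for props, year in zip(doc_topics, doc_years):
        totals[year] += props.get(topic, 0.0)
        counts[year] += 1
    return {y: totals[y] / counts[y] for y in sorted(totals)}
```

Plotting such a series over 1723-1911 is what exposes event-based spikes, seasonal cycles, secular drift, and the flat baselines of historical constants.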
