Tuesday, November 22

Fixed Ontologies and Authority Creep

I've been having a recurring conversation recently about what humanists can offer to the world at large. How can we get out of the ivory tower and offer something that goes beyond a community of specialists? Some academics complain that policy-makers and the like don't listen to the humanist perspective. In large part, this is due to the way we cloister ourselves in disciplinary echo chambers, writing papers that are incomprehensible without years of specialist training. Ironically, moves toward interdisciplinary and digital collaboration have only hardened the divisions between areas of specialty (at least according to comments by Katy Borner at IPAM last summer) - it is now easier than ever to collaborate with people who do exactly the same things you do. In any case, my answer has been that humanists need to engage actively with the outside world. Not to wait for them to come to us, but to go to them.

What do advanced scholars of the humanities have to offer? We understand media, interpretation, and realms of meaning better than anyone else. We put in the time to learn languages (often dead ones), theories (often obscure ones) and develop a strong sense of context. Advanced scholarship in the sciences and technology involves intensive training in a particular paradigm. The power of this is the ability to think in depth and in detail within a particular ontology. The weakness is it builds in a type of blindness to the limitations of this particular disciplinary brand of knowledge.

Take, for example, the medical profession. Doctors are highly trained, regulated and institutionalized. As a result, they are very good at certain things. However, there is a sort of mission creep - or perhaps "authority creep" - that comes with this level of expertise. Biomedical doctors are very good at treating acute injuries and many kinds of (bacterial) infections; they are generally not good at treating pain or other (viral) infections. But because viral infections look a lot like bacterial infections, and pain often accompanies injury, these too are assigned to doctors' area of expertise. Frequently, for no particularly good reason, products will instruct you to "consult your doctor before using." This itself is the result of another profession's "authority creep" - that of lawyers.

But the big problem is not that doctors are bad at treating things like pain, obesity and the common cold. The problem is that most doctors do not recognize or admit that they are bad at them. This is a problem of operating within a fixed ontology without seeing its limits. Doctors are trained to look for a chain of causal factors that can be identified either in the laboratory or through double-blind experiments. This is very good at revealing certain types of causalities. We manifest symptoms of a cold when we are infected by certain viruses. At some level, the cold is "caused" by the virus. Therefore, the best solution is assumed to be preventing or curing viral infections. At some level, this understanding is useful. Things like hand-washing and other means of preventing exposure help us avoid infection by understanding the vector of transmission. But the overwhelming focus on the proximate vector of infection keeps us from recognizing other causal factors - stress, weather change, and so on - and from taking preventative or curative action at this level.

Obesity is an even stronger example of medical authority creep that distracts from more meaningful causes, and perhaps more meaningful solutions. Doctors deal with obesity because it is itself a physical ailment, and it leads to many more serious physical ailments - cardiovascular disease, diabetes, even cancer - that are more clearly within the medical realm (although these are generally things that bio-medicine treats decently - not badly, but not particularly well). The thing is, even as medical knowledge advances, people keep getting fatter.

This is because of the mechanistic view of the body that is omnipresent in bio-medicine. When you break it into individual mechanisms, it is clear why we get fat. We get fat because we eat food that is not used for energy or released as waste. Doctors have ever more fine-grained understanding of the mechanisms by which we put on weight. Ultimately, it reduces to some pretty simple points - if you eat too much - particularly of certain things like sugar (and maybe fat), and you don't exercise enough, you get fat. Thousands of diets have been based around this fundamental realization - instructing people to eat less of certain things, or less at certain times, or just less in general. And these diets have failure rates of essentially 100%. The proximate causes of obesity do not reveal the right way to treat it.

The thing is, obesity has other causal mechanisms - ones that are not easily incorporated into the biomedical ontology. Obesity has a strong spatial dimension and a strong social dimension. Some people have suggested these factors should be taken into account in fighting obesity, but most of the medical community remains focused on changing diet and exercise, or on using drugs to target certain metabolic pathways. These approaches fail, and are fundamentally wrongheaded.

The specifics of this are somewhat of a side point (albeit an important one). It actually looks like things like zoning laws may be more important in fighting obesity than diet-and-exercise plans. This does not sit well with most doctors, because there is little room in their ontology of disease for spatial and social factors. This is one of the revelations that humanists and social scientists have to offer the world - a recognition of the limits of usefulness of disciplinary paradigms. Americans are getting shorter, generally a reliable indicator of falling health and living standards. This has happened before, at other times when we were ostensibly progressing (such as the industrial revolution). If "better" understandings of obesity still lead to more of it, that seems like an occasion to reexamine what exactly "better" understanding means. And examining the boundaries of understanding is a job for humanists.

Monday, November 21

SSHA 2011

I presented at SSHA 2011 this weekend, where I was on a panel with Daniel Little, among others. Lots of interesting stuff on stature for reconstructing living standards in the past, social networks, etc. Having posted the introduction to my paper, I thought it worth presenting a bit more here.

Here are some highlights from my slides from SSHA 2011. I argue that digital humanities, centered thus far on search and tagging capabilities, has not made a qualitative shift in the type of historical research being done.
Unsupervised topic modeling, because it does not reproduce the researcher's bias for "interesting" topics (major events, famous people) or for topics that are easily addressed by existing historiography, has much potential for investigating the "great unread" - the large portion of the corpus that is essentially unstudied.
As an example, I show that "bandits" (zei 贼) are hard to pick out of Chinese texts based on keyword searching. Single-character keywords are too general (matching everything from embezzling ministers to thieves to rebels), and multi-character keywords are too specific (they miss more than 70% of the occurrences of 贼).
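This mismatch can be illustrated with a toy sketch. The sample sentences and the compound keyword 贼匪 below are hypothetical, chosen only to show why a single character over-matches while a compound under-matches:

```python
# Hypothetical mini-corpus: three sentences using 贼 in different senses.
docs = [
    "贼匪劫掠村庄",  # banditry: "bandits raided the village"
    "逆贼谋反",      # rebellion: "traitorous rebels plotted revolt"
    "贪官如贼",      # corruption: "corrupt officials are like thieves"
]

# The single-character keyword matches every sense of 贼...
single_hits = sum(doc.count("贼") for doc in docs)
# ...while a multi-character compound matches only one usage.
compound_hits = sum(doc.count("贼匪") for doc in docs)

print(single_hits, compound_hits)  # prints: 3 1
```

On a real corpus the same logic holds: the broad keyword buries banditry among unrelated hits, while narrower compounds miss most of the relevant passages.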

To approach this type of diffuse and linguistically amorphous phenomenon, topic modeling works better. My 60-topic LDA pulled out five topics related to bandits, which divide nicely among the different phenomena associated with the keyword - corruption, law and order, trials, rebellion, and counter-rebellion.

These topics not only allow for better searching of the corpus, but also show different temporal characteristics. The law and order Topics are highly seasonal (peaking in the early fall), while rebellion is not.



Rebellion is also highly driven by specific major events (for example, the Taiping Rebellion), while the law and order Topics are more endemic at a low level (with some cyclical and apparently random variation as well).
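The seasonality comparison amounts to a simple aggregation: average a Topic's proportion across documents by calendar month. The months and proportions below are invented for illustration; the real analysis would use the doc-topic output of the model:

```python
from collections import defaultdict

# Invented (month, topic-proportion) pairs for a "law and order" style Topic.
doc_topics = [(1, 0.02), (2, 0.03), (8, 0.11), (9, 0.14), (9, 0.12)]

by_month = defaultdict(list)
for month, prop in doc_topics:
    by_month[month].append(prop)

# Mean proportion per month; a seasonal Topic peaks in particular months.
monthly_mean = {m: sum(v) / len(v) for m, v in sorted(by_month.items())}
print(monthly_mean)  # here the early-autumn months (8, 9) stand out
```

A non-seasonal Topic like rebellion would show a roughly flat monthly profile under the same aggregation, punctuated by event-driven spikes.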
I also presented several other Topics that don't relate to banditry. Ritual allows us to easily pick out the reign periods - all three ritual topics spike when there is a new emperor. We can also clearly see a drop-off in the Imperial Family topic when Qianlong's mother dies.


Another surprising result was the ability of the topic model to find phonetic characters commonly used for names. In Chinese, there are some characters that are used almost exclusively in multi-character compounds for spelling foreign names (e.g. 尔克阿巴图布啦勒萨); the association of these characters was discovered in the corpus and assigned to two topics. The program also found common Chinese surnames (张陈李吴王刘朱周, etc.) and assigned them to a topic, and found other characters that occur largely in given names or names of temples and the like (文玉麟祥福荣, etc.) and assigned them to a second. Without incorporating the idea of names or multi-character words into the model, it was nonetheless able to find many instances of these and group them together. If we graph a composite of the Han names against a composite of the non-Han names, several things jump out: the prevalence of non-Han names during Qianlong's western campaigns, and a secular decline in non-Han names alongside an increase in Han names (or perhaps an event-driven shift toward the importance of Han officials at the time of the Taiping Rebellion, a concept well-supported in the historiography).

Any comments would be much appreciated, as I'm trying to turn this into a more formal article as we speak.

Monday, November 14

Unsupervised Learning and Macro-history

It has been difficult to maintain a blog with interesting posts separate from the main thread of my research. The everyday work of grading, reading and writing academic work has placed severe limits on my time and brain-space, so I have let this blog fall by the wayside for quite some time. I have recently decided to use this as a space to record and open some of my academic work to outside thought and criticism. A later post will address the reasoning behind this decision in somewhat more depth; for the time being, suffice it to say that this is a way to spur me to keep writing. Below is the first of these new posts.
I will be presenting a paper at the upcoming Social Science History Association meeting in Boston. This paper is a first attempt at using topic models (specifically a model called Latent Dirichlet Allocation) to analyze large-scale historical and textual trends. Specifically, I will be analyzing a partial version of the Qing Veritable Records (Qing shilu 清實錄), from the Yongzheng reign through the end of the dynasty. The following is a draft of my introduction:


The promise and failure of digital humanities

The past ten years have witnessed the meteoric rise of the so-called digital humanities. This has become a buzzword in academic circles, the topic of much contemplation among some, and largely ignored as a passing fad by others. Perhaps this is due in part to confusion over what exactly this term is supposed to mean. As far as I am concerned, the digital humanities are the result of two related trends - the large-scale digitization of textual corpora, and the realization that internet technologies could be adapted to research on these corpora. The internet is, after all, a particularly large and complexly organized body of text.

As such, much of the early work in humanities computing has centered on tasks that existing internet technologies solve quite well - searching and tagging. In essentially every discipline and sub-discipline in the academic world, there now exist large textual databases with increasingly powerful search capabilities. At the same time, several large consortia (cf. the TEI) have worked to develop humanities-specific standards for tagging textual and meta-textual data. As with the internet, the more tagged and interlinked a text, the more responsive it becomes to search technology.

These are both positive trends that have made certain types of research much easier. For example, it was possible for me to write a master's thesis on yoghurt in historical China in less than a year. Previously, given the rather obscure nature of this topic, it would have been necessary to read massive amounts of text looking for occasional references to dairy; with search technology, it was simply a matter of finding a few keywords and reading the limited sections of these texts that actually referred to the topic of interest. In short, research has become faster and more directed thanks to the reduced overhead of finding references to topics of interest.

Nonetheless, these technologies have not qualitatively changed the type of research that gets done. Before search capabilities, researchers focused primarily on topics that were of general interest, and frequently on phenomena that had already generated large amounts of scholarship. For very early history, we have limited numbers of extant texts, and these texts have generally been studied comprehensively. But for later periods of history, the tendency to focus on the "interesting" and the "already studied" has left a large body of texts that remain essentially unread - the "great unread":

With the advent of search technology, it has become even easier to focus on the particular topics of interest, as opposed to those dictated by the coverage of the corpus - easier to find topics like "yoghurt" that are not a central concern in most historical Chinese texts. At the same time, the rise of computerized tagging has made well-researched areas even more amenable to further research. Arguably, all that the digital humanities have done thus far is to turn the "great unread" into "the great untagged." In fact, it may have narrowed the focus of research as younger generations become less willing and less able to work on texts that have not been digitized:

To a large degree, though, the most interesting and most heavily researched corpora are the first to be digitized, so this does not really change much.

Macro-history has historically been constrained to a slightly different subset of the received corpora. To deal with very large areas or broad sweeps of time, macro-historians have generally been limited to a few approaches:
  • Reliance on curatorial and bibliographic work and secondary scholarship to cover some of the vast realm of primary sources. 
  • Focus on data that is clearly numerical in nature, allowing the comparatively easy application of statistical technique.
  • Application of social science theory and categories of analysis to provide a framework for organizing and simplifying the vast array of information.
In other words, macro-history has tended to focus on things with numbers arranged into clear categories, frequently in cases where much of the data collection has been handled by previous generations of scholars:

It is worth noting that the statistical orientation of macro-historical work has made it more amenable to computerization from a much earlier date. Arguably, branches of social and economic history have been in the "digital humanities" since the 1970s or earlier.

Network analysis has been another growth area in social history for the past decade. Social network analysis also had roots in sociology in the 1970s, but it was not until the growth of the internet that physicists and computer scientists got interested in the field and its computational capabilities began to take off. The resurgence in network theoretical approaches to historical research has been one area in which the digital turn has led to a recognizable, qualitative shift in the type of research being done. Nonetheless, it has also remained constrained to that familiar subset of textual data - well-ordered, well-researched sources relating primarily to "people of note."

While there are exceptions to these general observations, it is clear that the digitization of historical research has resulted not in major shifts in the type of research being done, but rather in the extent of that research. It is a truism that as long as research questions are researcher-driven, the areas of analysis will largely be constrained to areas of obvious interest to those researchers at the outset of their projects - notable individuals, events, and topics of longstanding interest to the historiography. There is, however, a sizable body of text into which this research never ventures. What is more, increasing amounts of this text have been digitized, opening it up to less research-intensive methods of exploration. The question that remains is how to take research into this largely uncharted textual realm.

Unsupervised learning and the "great unread"

To me, the answer is to take the direction of the research at least partially out of the hands of the researcher. Another branch of digital technology that has developed in recent years has focused on a different type of machine learning. Search technology is driven in part by what is called supervised learning; the better search algorithms improve their results by taking into account human responses to their earlier searches. But increasingly, fields like computational linguistics have delved into unsupervised learning, building algorithms to analyze text based on the text itself, without human intervention at the analytical stage. These algorithms reveal aspects of the underlying structure of texts that are not as explicitly constrained by researchers' conceptions of what the texts should look like. These techniques should allow us to delve into the great unread realms of digitized corpora.

I hope to show that these unsupervised statistical techniques allow us to look at some of the more diffuse phenomena in history. We already have well-developed methods to look into the lives of famous people, important events and the easily-quantifiable aspects of everyday life (demographics, economics). But aside from the somewhat quixotic attempts of Braudel and his ilk, few have looked into the everyday constants of the unquantifiable past. By undertaking a computerized reading of a large corpus (the Veritable Records of the Qing Dynasty), this project is able to quantify textual phenomena in a way that is not based on researcher assumptions. In doing so, I am able to look at the progress of five types of temporal change:
  1. Event-based change - the major events and turning points of history
  2. Generational change - in this case, the impact of individual rulers on their historical context
  3. Cyclical change - the repetition of phenomena on a circular basis, in particular the yearly cycle of seasons.
  4. Secular change - gradual, monotonic changes over the course of a long period
  5. Historical constants - phenomena that were a constant part of life in the past, but may or may not be different from the realities of modern life.
I will argue that the last three or four of these categories are poorly addressed by researcher-driven methods, except in cases where readily quantifiable data is available. In assessing cyclical or long-term change in social phenomena, we are generally constrained to the anecdotal observations of historical actors. Historical constants are even worse: our informants tend not to directly address the givens of their existence; these can only be inferred from the background noise of their writing, and then imprecisely at best.

In this method, rather than describing a priori the important categories of analysis, I assume only that there exist such categories. I then use a topic model to determine these categories based on the organization of the corpus itself. A topic model is a statistical model of how documents are constructed from a set of words and topics. At root, it is no different than the implicit model that researchers have about their texts - we assume that the authors of these texts were writing about certain topics, and chose to do so in a certain way. The substitution of a topic model simply implies a unified application of statistical assumptions across the entire corpus, rather than the selective application of unquantified assumptions across a subset of the texts.

The specifics of this topic model (Latent Dirichlet Allocation) and the software implementation (MALLET) that I used will not be described here. The model can be summarized as follows:
  • "Topics" are considered to be a distribution across all Words in the Corpus, but are generally described by giving a list of the most common words in that topic.
  • "Documents" are considered to be a distribution across all Topics in the corpus, but are generally described by giving the proportions of the two or three most common Topics.
  • The generative model assumes that a Document is created by "choosing" Topics in a set proportion, and then "choosing" words from those topics randomly, according to the Topic proportions in the Document and the Word proportions in the Topic.
  • The topic model is reverse engineered statistically based on the number of topics and the observed occurrence of Words within Documents within the Corpus. The researcher specifies only four inputs to the model:
    • Number of Topics
    • Proportion of Topics in the Corpus (but not in individual documents)
    • The Corpus itself, organized into Documents
    • Word delimitation
  • This model was run with 40, 60 and 80 Topics. 60 Topics gave the best combination of specificity (fine-grained Topics) and accuracy (fewest "garbage" Topics and "garbage" Words) and was used in this analysis.
  • All Topics were assumed to occur in equal proportions in the Corpus.
  • The Corpus used is the Veritable Records for the last eight reign periods of the Qing Dynasty, covering the period from 1723 to 1911. It is a posthumously-assembled collection of court records for the reign of each emperor of the Qing Dynasty. It is generally well-ordered into Documents (consisting of messages to and from the emperor and records of court events), organized in chronological order with "date stamps" (i.e. the date is specified for the first Document of each day).
  • Words are considered to consist of single Chinese characters.
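The generative story summarized in the bullets above can be sketched in a few lines. This is a toy illustration only - three hand-set Topics over a six-character vocabulary, not the 60-topic model actually estimated from the corpus:

```python
import random

random.seed(0)  # for reproducibility of the toy example

vocab = list("贼兵民田水火")

# Each Topic is a distribution across the whole vocabulary (hand-set here;
# in the real model these are estimated from the corpus).
topics = [
    [0.50, 0.30, 0.10, 0.05, 0.025, 0.025],  # a "banditry"-flavored topic
    [0.05, 0.05, 0.30, 0.40, 0.10, 0.10],    # an "agriculture"-flavored topic
    [0.10, 0.10, 0.10, 0.10, 0.30, 0.30],    # a "disaster"-flavored topic
]

def generate_document(theta, length=20):
    """Generate one Document: for each position, "choose" a Topic according
    to the Document's topic proportions theta, then "choose" a Word (here,
    a single character) according to that Topic's word proportions."""
    chars = []
    for _ in range(length):
        z = random.choices(range(len(topics)), weights=theta)[0]
        chars.append(random.choices(vocab, weights=topics[z])[0])
    return "".join(chars)

# A Document that is 80% the banditry topic:
print(generate_document([0.8, 0.1, 0.1]))
```

Estimation runs this story in reverse: given only the observed Documents, the model infers the Topics and per-Document topic proportions that most plausibly generated them.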
It should be noted that there are several assumptions made in using this model that may not be representative of the way documents were actually written and compiled in the corpus:
  • Word order is not considered within documents (this is a "bag of words" model). Obviously word order does matter in the actual writing of documents, but this model assumes that it is not important in determining the topics used in a document - in other words, it assumes that grammar may contribute to meaning, but not to topicality.
  • Documents are analyzed synchronically by the topic model (i.e. time is not integrated into the statistical model). Diachronic analysis is reimposed on the Documents after Topics have been generated. Again, the documents were written over a long period of time, and topic usage may have changed over that period. In fact, topic drift is revealed by ex post facto analysis of the results, but it is not integrated into the model.
  • Document length is not considered. This is potentially problematic for the statistical conclusions reached on particularly short documents. These short documents are primarily records of court events, and they tend to cluster in certain topics. A solution will be addressed in future research.
  • Regional variation is not incorporated into the topic model. Certain geographic Topics are identified by the model, but are not incorporated into my analysis of the results.
  • Section headers were not removed from the corpus before analysis. These headers were, however, identified by the model: three topics occur only in section headers, and occur in high proportion there.
  • Written Chinese in the late imperial period did not actually consist of single-character words; single- and multiple-character words were both common. Single characters are analyzed because there are no good word-delimiting algorithms for classical (or modern) Chinese. As with document length and section headers, certain types of word formation are actually reverse engineered by the topic analysis, despite the fact that they are not incorporated into the model.
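The single-character Word delimitation amounts to a very simple tokenizer. The CJK code-point range and the sample passage below are illustrative assumptions, not part of the actual preprocessing pipeline:

```python
import re

def tokenize(text):
    """Split a passage into single-character Words, keeping only CJK
    ideographs and dropping punctuation and whitespace."""
    return re.findall(r"[\u4e00-\u9fff]", text)

print(tokenize("贼匪劫掠，官兵进剿。"))
# prints: ['贼', '匪', '劫', '掠', '官', '兵', '进', '剿']
```

Any multi-character structure (compounds, names, titles) is thus invisible to the model's inputs, which is what makes its recovery of name-character clusters notable.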
Clearly there are some problematic assumptions in this document model. As I have noted above, the topic analysis proves capable of reverse engineering certain structures that are not part of the model - regionality, temporality, problems in data cleaning, etc. However, one especially problematic assumption is made in this analysis - it does not account for words, grammar or other intratextual structures. This analysis is very powerful at accounting for many corpus-level, intertextual phenomena - exactly the type of analysis needed to probe the dark areas of the textual map. But it is not a substitute for other approaches to understanding the more proximate realms of meaning of individual texts within the corpus.