Monday, November 21

SSHA 2011

I presented at SSHA 2011 this weekend, where I was on a panel with Daniel Little, among others. There was lots of interesting work on using stature to reconstruct past living standards, on social networks, and more. Having posted the introduction to my paper, I thought it worth presenting a bit more here.

Here are some highlights from my slides from SSHA 2011. I argue that digital humanities, centered thus far on search and tagging capabilities, has not made a qualitative shift in the type of historical research being done.
Unsupervised topic modeling, because it does not reproduce the researcher's bias for "interesting" topics (major events, famous people) or for topics that are easily addressed by existing historiography, has much potential for investigating the "great unread": the large portion of the corpus that is essentially unstudied.
As an example, I show that "bandits" (zei 贼) are hard to pick out of Chinese texts based on keyword searching. Single-character keywords are too general (they match everything from embezzling ministers to thieves to rebels), and multi-character keywords are too specific (they miss more than 70% of the occurrences of 贼).
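
To make the "too specific" measurement concrete, here is a minimal sketch of how one could count the share of occurrences of a single character that a list of multi-character keywords fails to reach. The function name `missed_fraction`, the sample strings, and the keyword list are all invented for illustration; this is not the code or the data behind the 70% figure.

```python
def missed_fraction(texts, char, compounds):
    """Share of occurrences of `char` not inside any matched compound."""
    total = 0
    covered = 0
    for text in texts:
        covered_positions = set()
        for compound in compounds:
            start = text.find(compound)
            while start != -1:
                # Record every position of `char` inside this match.
                for offset, ch in enumerate(compound):
                    if ch == char:
                        covered_positions.add(start + offset)
                start = text.find(compound, start + 1)
        total += text.count(char)
        covered += len(covered_positions)
    return (total - covered) / total if total else 0.0


# Toy strings (invented): four occurrences of 贼, one inside the keyword 盗贼.
toy_texts = ["盗贼入城", "贼臣误国", "官兵剿贼", "贼"]
toy_compounds = ["盗贼"]
toy_missed = missed_fraction(toy_texts, "贼", toy_compounds)
```

On a real corpus the interesting part is choosing the compound list; the count itself is this simple.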

To approach this type of diffuse and linguistically amorphous phenomenon, topic modeling works better. My 60-topic LDA model pulled out five topics related to bandits, which divide neatly among the different phenomena associated with the keyword: corruption, law and order, trials, rebels, and counter-rebellion.
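
For readers who have not seen LDA from the inside, the following is a rough sketch of a collapsed Gibbs sampler that treats every character as a token. It is a teaching toy under several assumptions: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the two-topic setting, and the sample documents are all made up here, and it is not the software or the settings used for the 60-topic model.

```python
import random


def lda_gibbs(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Character-level LDA via collapsed Gibbs sampling (toy scale only)."""
    rng = random.Random(seed)
    vocab = sorted({ch for doc in docs for ch in doc})
    index = {ch: i for i, ch in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic for every character token.
    for d, doc in enumerate(docs):
        z_doc = []
        for ch in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][index[ch]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, ch in enumerate(doc):
                w = index[ch]
                z = assignments[d][n]
                # Remove the token, then resample its topic given the rest.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return vocab, doc_topic, topic_word


def top_characters(vocab, topic_word, topic, n=5):
    """The n characters with the highest counts in one topic."""
    ranked = sorted(range(len(vocab)), key=lambda i: -topic_word[topic][i])
    return [vocab[i] for i in ranked[:n]]


# Invented toy documents; real input would be whole documents from the corpus.
toy_docs = ["盗贼入城劫掠", "官兵剿贼安民", "祭天祭祖大典", "皇帝祭天行礼"]
vocab, doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2)
```

Calling `top_characters(vocab, topic_word, 0)` lists the characters that dominate a topic, which is roughly how one would go about labeling a topic as "law and order" or "rebellion" by eye.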

These topics not only allow for better searching of the corpus, but also show different temporal characteristics. Law-and-order topics are highly seasonal (peaking in the early fall), while rebellion is not.
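
One simple way to look for this kind of seasonality is to average a topic's weight by calendar month across all years. The sketch below assumes each document already carries a date and a weight for the topic in question; the function name `monthly_profile` and the sample values are invented, and it assumes the dates have already been converted to a solar calendar, which the original records may not use.

```python
from collections import defaultdict
from datetime import date


def monthly_profile(dated_weights):
    """Mean topic weight per calendar month, pooled over all years."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, weight in dated_weights:
        sums[day.month] += weight
        counts[day.month] += 1
    return {month: sums[month] / counts[month] for month in sorted(sums)}


# Invented (date, topic weight) pairs, not values from the corpus.
toy_weights = [
    (date(1750, 9, 1), 0.5),
    (date(1751, 9, 10), 0.25),
    (date(1750, 3, 5), 0.125),
]
toy_profile = monthly_profile(toy_weights)
```

A seasonal topic shows a clear peak in this profile, while an event-driven topic looks flat by month and spiky by year.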



Rebellion is also highly driven by specific major events (for example, the Taiping Rebellion), while the law-and-order topics are more endemic at a low level (with some cyclical and apparently random variation as well).

I also presented several other topics that don't relate to banditry. Ritual allows us to easily pick out the reign periods: all three ritual topics spike when there is a new emperor. We can also clearly see a drop-off in the Imperial Family topic when Qianlong's mother dies:


Another surprising result was the ability of the topic model to find phonetic characters commonly used for names. In Chinese, there are some characters that are almost exclusively used in multi-character compounds for spelling foreign names (e.g. 尔克阿巴图布啦勒萨, etc.); the association of these characters was discovered in the corpus and assigned to two topics. The program also found common Chinese surnames (张陈李吴王刘朱周, etc.) and assigned them to a topic, and found other characters that occur largely in given names or names of temples and the like (文玉麟祥福荣, etc.) and assigned them to a second. Without incorporating the idea of names or multi-character words into the model, it was nonetheless able to find many instances of these and group them together.

If we graph a composite of the Han names versus a composite of the non-Han names, several things jump out: the prevalence of non-Han names during Qianlong's western campaigns, and a secular decline in non-Han names alongside an increase in Han names (or perhaps an event-driven switch to the importance of Han officials at the time of the Taiping Rebellion, a concept well supported in the historiography).
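
A composite here just means adding up the weights of the relevant topics for each time period. As a small sketch, assuming yearly topic weights are stored as one list per year: the function name `composite_series`, the topic indices, and the numbers below are placeholders made up for illustration, not results from the model.

```python
def composite_series(yearly_topic_weights, topic_ids):
    """Sum the weights of the chosen topics for each year, in year order."""
    return {
        year: sum(weights[t] for t in topic_ids)
        for year, weights in sorted(yearly_topic_weights.items())
    }


# Invented weights for four hypothetical topics:
# indices 0 and 1 stand in for the Han-name topics, 2 and 3 for the non-Han ones.
toy_yearly = {
    1760: [0.25, 0.125, 0.5, 0.125],
    1860: [0.5, 0.25, 0.125, 0.125],
}
han_composite = composite_series(toy_yearly, [0, 1])
non_han_composite = composite_series(toy_yearly, [2, 3])
```

Plotting the two resulting series against each other gives the kind of comparison described above.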

Any comments would be much appreciated, as I'm trying to turn this into a more formal article as we speak.
