Here are some highlights from my slides from SSHA 2011. I argue that digital humanities, centered thus far on search and tagging capabilities, has not made a qualitative shift in the type of historical research being done.
Unsupervised topic modeling does not reproduce the researcher's bias for "interesting" topics (major events, famous people) or for topics that are easily addressed by existing historiography. It therefore has much potential for investigating the "great unread" - the large portion of the corpus that is essentially unstudied.
As an example, I show that "bandits" (zei 贼) are hard to pick out of Chinese texts based on keyword searching. Single-character keywords are too general (everything from embezzling ministers to thieves to rebels),
and multi-character keywords are too specific (they miss more than 70% of the occurrences of 贼).
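The coverage problem can be made concrete with a small count. The following is a minimal sketch, not the procedure used for the talk: the function name `compound_coverage`, the two compounds, and the four sample phrases are all illustrative assumptions, and the resulting share says nothing about the real corpus.

```python
def compound_coverage(texts, char, compounds):
    """Return (covered, total) for occurrences of `char` in `texts`.

    An occurrence is "covered" if it falls inside any of the
    multi-character `compounds` at that position in the text.
    """
    # For each compound, record every offset at which `char` sits inside it.
    offsets = [(c, i) for c in compounds for i, ch in enumerate(c) if ch == char]
    covered = total = 0
    for text in texts:
        for pos, ch in enumerate(text):
            if ch != char:
                continue
            total += 1
            # Does some compound line up with the text around this position?
            if any(pos >= off and text.startswith(c, pos - off) for c, off in offsets):
                covered += 1
    return covered, total


# Toy phrases and keywords, assumed for illustration only.
sample_texts = ["盗贼甚多", "贼匪滋事", "该贼逃逸", "国贼当诛"]
sample_compounds = ["盗贼", "贼匪"]

covered, total = compound_coverage(sample_texts, "贼", sample_compounds)
missed_share = 1 - covered / total
print(f"{covered} of {total} occurrences matched; {missed_share:.0%} missed")
```

Run against a real corpus with a realistic keyword list, the same count would give the share of 贼 occurrences that multi-character searching misses.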
To approach this type of diffuse and linguistically amorphous phenomenon, topic modeling works better. My 60-topic LDA pulled out 5 topics related to bandits, which divide neatly among the different phenomena associated with the keyword - corruption, law and order, trials, rebels, and counter-rebellion.
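One simple way to locate keyword-related topics in a fitted model is to scan each topic's highest-weighted words. The sketch below assumes the model's output has already been exported as a mapping from topic id to word weights; the function name `topics_featuring`, the toy topics, and their weights are hypothetical, and this is not necessarily how the five bandit topics were identified.

```python
def topics_featuring(topics, char, top_n=10):
    """Return ids of topics whose `top_n` heaviest words contain `char`."""
    hits = []
    for topic_id, word_weights in topics.items():
        # Rank this topic's words by weight, heaviest first.
        top_words = sorted(word_weights, key=word_weights.get, reverse=True)[:top_n]
        if any(char in word for word in top_words):
            hits.append(topic_id)
    return hits


# Toy topic-word weights, assumed for illustration only.
toy_topics = {
    0: {"贼": 0.05, "拿获": 0.04, "审": 0.03},
    1: {"祭": 0.06, "礼": 0.05, "太后": 0.02},
    2: {"剿": 0.05, "贼匪": 0.04, "官兵": 0.03},
}

bandit_topics = topics_featuring(toy_topics, "贼")
print(bandit_topics)
```

Because the match is on the character rather than on a fixed compound, topics built around different usages of 贼 are all returned and can then be read and labeled by hand.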
These topics not only allow for better searching of the corpus, but also show different temporal characteristics. The law and order topics are highly seasonal (peaking in the early fall), while rebellion is not.
Rebellion is also strongly driven by specific major events (for example, the Taiping Rebellion), while the law and order topics are endemic at a low level (with some cyclical and apparently random variation as well).
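A seasonal pattern of this kind can be checked by averaging a topic's weight by calendar month across all years. This is a minimal sketch under stated assumptions: each document is taken to have a date already converted to a month number and a single weight for the topic, and the names `monthly_profile` and `peak_month` as well as the toy values are hypothetical, not results from the corpus.

```python
from collections import defaultdict
from datetime import date


def monthly_profile(dated_weights):
    """Mean topic weight per calendar month, pooled across all years."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, weight in dated_weights:
        sums[day.month] += weight
        counts[day.month] += 1
    return {month: sums[month] / counts[month] for month in sorted(sums)}


def peak_month(profile):
    """Calendar month with the highest mean weight."""
    return max(profile, key=profile.get)


# Toy (date, topic weight) pairs, assumed for illustration only.
law_and_order = [
    (date(1750, 3, 1), 0.02),
    (date(1750, 9, 1), 0.08),
    (date(1751, 3, 1), 0.02),
    (date(1751, 9, 1), 0.10),
]

profile = monthly_profile(law_and_order)
print(profile, peak_month(profile))
```

A seasonal topic shows a pronounced peak in this profile, whereas an event-driven topic should look roughly flat by month and spike only when plotted against the year.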
I also presented several other topics that don't relate to banditry. The ritual topics allow us to easily pick out the reign periods - all three of them spike when there is a new emperor. We can also clearly see a drop-off in the Imperial Family topic when Qianlong's mother dies.
Any comments would be much appreciated, as I'm trying to turn this into a more formal article as we speak.