Monday, November 21

SSHA 2011

I presented at SSHA 2011 this weekend, where I was on a panel with Daniel Little, among others. Lots of interesting stuff on stature for reconstructing living standards in the past, social networks, etc. Having posted the introduction to my paper, I thought it worth presenting a bit more here.

Here are some highlights from my slides from SSHA 2011. I argue that the digital humanities, centered thus far on search and tagging capabilities, have not made a qualitative shift in the type of historical research being done.
Unsupervised topic modeling, because it does not reproduce the researcher's bias toward "interesting" topics (major events, famous people) or toward topics that are easily addressed by existing historiography, has much potential for investigating the "great unread" - the large portion of the corpus that remains essentially unstudied.
As an example, I show that "bandits" (zei 贼) are hard to pick out of Chinese texts by keyword searching. Single-character keywords are too general (everything from embezzling ministers to thieves to rebels), and multi-character keywords are too specific (they miss more than 70% of the occurrences of 贼).
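For the curious, the 70% figure is just a coverage ratio - the share of occurrences of the single character that fall inside known compounds. Here is a minimal sketch of that computation, assuming the corpus is loaded as a list of plain-text strings; the compound list below is an illustrative stand-in, not my actual keyword list.

```python
# Rough sketch of the coverage figure: what share of occurrences of the
# single character 贼 fall inside known multi-character compounds?
# The compound list is illustrative only, not the actual search list.
COMPOUNDS = ["贼匪", "贼犯", "逆贼", "盗贼"]

def compound_coverage(documents, char="贼", compounds=COMPOUNDS):
    """Fraction of `char` occurrences that appear inside a compound."""
    total = sum(doc.count(char) for doc in documents)
    covered = sum(doc.count(c) for doc in documents for c in compounds)
    return covered / total if total else 0.0

# A result around 0.3 would mean the compounds miss ~70% of the hits.
```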

To approach this type of diffuse and linguistically amorphous phenomenon, topic modeling works better. My 60-topic LDA pulled out five topics related to bandits that divide nicely among the different phenomena associated with the keyword - corruption, law and order, trials, rebels, and counter-rebellion.
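As a taste of how these topics can be put to work, here is a minimal sketch of ranking documents by their combined bandit-topic weight. It assumes a documents-by-topics proportion matrix (e.g. parsed from the file MALLET writes with --output-doc-topics); the five topic indices are hypothetical placeholders, not the actual topic numbers from my run.

```python
import numpy as np

# doc_topics: (n_docs, 60) array of per-document topic proportions,
# e.g. parsed from MALLET's --output-doc-topics output file.
BANDIT_TOPICS = [7, 19, 23, 41, 55]  # hypothetical indices for the 5 topics

def rank_by_banditry(doc_topics, topics=BANDIT_TOPICS, k=20):
    """Indices of the k documents most dominated by the bandit topics."""
    scores = doc_topics[:, topics].sum(axis=1)
    return np.argsort(scores)[::-1][:k]
```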

These topics not only allow for better searching of the corpus, but also show different temporal characteristics. The law and order Topics are highly seasonal (peaking in the early fall), while rebellion is not.
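The seasonality claim can be checked by averaging a topic's proportion by month. A sketch, assuming per-document (lunar) months recovered from the corpus's date stamps:

```python
import numpy as np

def monthly_profile(doc_topics, months, topic):
    """Mean proportion of one topic for each (lunar) month 1..12.

    `months` holds a per-document month number recovered from the
    corpus's date stamps. A seasonal law and order Topic should peak
    around the early-fall months; a rebellion Topic should not.
    """
    months = np.asarray(months)
    return np.array([doc_topics[months == m, topic].mean()
                     for m in range(1, 13)])
```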



Rebellion is also highly driven by specific major events (for example, the Taiping Rebellion), while the law and order Topics are more endemic at a low level (with some cyclical and apparently random variation as well).
I also presented several other Topics that don't relate to banditry. Ritual allows us to easily pick out the reign periods - all three ritual topics spike when there is a new emperor. We can also clearly see a drop-off in the Imperial Family topic when Qianlong's mother dies:


Another surprising result was the ability of the topic model to find phonetic characters commonly used for names. In Chinese, there are some characters that are almost exclusively used in multi-character compounds for spelling foreign names (e.g. 尔克阿巴图布啦勒萨 etc.); the model discovered the association of these characters in the corpus and assigned them to two topics. The program also found common Chinese surnames (张陈李吴王刘朱周 etc.) and assigned them to a topic, and found other characters that occur largely in given names or names of temples and the like (文玉麟祥福荣, etc.) and assigned them to a second. Without incorporating the idea of names or multi-character words into the model, it was nonetheless able to find many instances of these and group them together. If we graph a composite of the Han names versus a composite of the non-Han names, several things jump out: the prevalence of non-Han names during Qianlong's western campaigns, and a secular decline in non-Han and increase in Han names (or perhaps, an event-driven switch to the importance of Han officials at the time of the Taiping Rebellion, a concept well-supported in the historiography).
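The compositing behind these graphs is simple. A sketch, again assuming a doc-topic matrix plus per-document years; the topic indices are placeholders for the name topics the model found:

```python
import numpy as np

HAN_NAME_TOPICS = [12, 33]     # placeholder indices for the Han name topics
NON_HAN_NAME_TOPICS = [8, 29]  # placeholder indices for the phonetic topics

def yearly_composite(doc_topics, years, topics):
    """Mean combined weight of several topics, per year.

    Years with no documents come out as nan. Comparing the two
    composites shows the Qianlong-era spike in non-Han names and the
    later shift toward Han names.
    """
    years = np.asarray(years)
    span = np.arange(years.min(), years.max() + 1)
    combined = doc_topics[:, topics].sum(axis=1)
    return span, np.array([combined[years == y].mean() for y in span])
```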

Any comments would be much appreciated, as I'm trying to turn this into a more formal article as we speak.

Monday, November 14

Unsupervised Learning and Macro-history

It has been difficult to maintain a blog with interesting posts separate from the main thread of my research. The everyday work of grading, reading and writing academic work has placed severe limits on my time and brain-space, so I have let this blog fall by the wayside for quite some time. I have recently decided that I will use this as a space to record and open some of my academic work to outside thought and criticism. A later post will address the reasoning behind this decision in somewhat more depth; for the time being, suffice it to say that this is a way to spur me to keep writing. Below is the first of these new posts.
I will be presenting a paper at the upcoming Social Science History Association meeting in Boston. This paper is a first attempt at using topic models (specifically a model called Latent Dirichlet Allocation) to analyze large-scale historical and textual trends. Specifically, I will be analyzing a partial version of the Qing Veritable Records (Qing shilu 清實錄), from the Yongzheng reign through the end of the dynasty. The following is a draft of my introduction:


The promise and failure of digital humanities

The past ten years have witnessed the meteoric rise of the so-called digital humanities. This has become a buzzword in academic circles, the topic of much contemplation among some, and largely ignored as a passing fad by others. Perhaps this is due in part to confusion over what exactly this term is supposed to mean. As far as I am concerned, the digital humanities are the result of two related trends - the large-scale digitization of textual corpora, and the realization that internet technologies could be adapted to research on these corpora. The internet is, after all, a particularly large and complexly organized body of text.

As such, much of the early work in humanities computing has centered on tasks that existing internet technologies solve quite well - searching and tagging. In essentially every discipline and sub-discipline in the academic world, there now exist large textual databases with increasingly powerful search capabilities. At the same time, several large consortia (e.g. the TEI) have worked to develop humanities-specific standards for tagging textual and meta-textual data. As with the internet, the more tagged and interlinked a text, the more responsive it becomes to search technology.

These are both positive trends that have made certain types of research much easier. For example, it was possible for me to write a master's thesis on yoghurt in historical China in less than a year. Previously, given the rather obscure nature of this topic, it would have been necessary to read massive amounts of text looking for occasional references to dairy; with search technology, it was simply a matter of finding a few keywords and reading the limited sections of these texts that actually referred to the topic of interest. In short, research has become faster and more directed thanks to the reduced overhead of finding references to topics of interest.

Nonetheless, these technologies have not qualitatively changed the type of research that gets done. Before search capabilities, researchers focused primarily on topics that were of general interest, and frequently on phenomena that had already generated large amounts of scholarship. For very early history, we have limited numbers of extant texts, and these texts have generally been studied comprehensively. But for later periods of history, the tendency to focus on the "interesting" and the "already studied" has left a large body of texts that remain essentially unread - the "great unread":

With the advent of search technology, it has become even easier to focus on the particular topics of interest, as opposed to those dictated by the coverage of the corpus - easier to find topics like "yoghurt" that are not a central concern in most historical Chinese texts. At the same time, the rise of computerized tagging has made well-researched areas even more amenable to further research. Arguably, all that the digital humanities have done thus far is to turn the "great unread" into "the great untagged." In fact, it may have narrowed the focus of research as younger generations become less willing and less able to work on texts that have not been digitized:

To a large degree, though, the most interesting and most heavily researched corpora are the first to be digitized, so this does not really change much.

Macro-history has historically been constrained to a slightly different subset of the received corpora. To deal with very large areas or broad sweeps of time, macro-historians have generally been limited to a few approaches:
  • Reliance on curatorial and bibliographic work and secondary scholarship to cover some of the vast realm of primary sources. 
  • Focus on data that is clearly numerical in nature, allowing the comparatively easy application of statistical techniques.
  • Application of social science theory and categories of analysis to provide a framework for organizing and simplifying the vast array of information.
In other words, macro-history has tended to focus on things with numbers arranged into clear categories, frequently in cases where much of the data collection has been handled by previous generations of scholars:

It is worth noting that the statistical orientation of macro-historical work has made it more amenable to computerization from a much earlier date. Arguably, branches of social and economic history have been in the "digital humanities" since the 1970s or earlier.

Network analysis has been another growth area in social history for the past decade. Social network analysis also had roots in sociology in the 1970s, but it was not until the growth of the internet that physicists and computer scientists got interested in the field and its computational capabilities began to take off. The resurgence in network theoretical approaches to historical research has been one area in which the digital turn has led to a recognizable, qualitative shift in the type of research being done. Nonetheless, it has also remained constrained to that familiar subset of textual data - well-ordered, well-researched sources relating primarily to "people of note."

While there are exceptions to these general observations, it is clear that the digitization of historical research has not resulted in major shifts in the type of research being done, but rather in the extent of that research. It is a truism that as long as research questions are researcher-driven, the areas of analysis will largely be constrained to areas of obvious interest to those researchers at the outset of their projects - notable individuals, events, and topics of longstanding interest to the historiography. There is, however, a sizable body of text into which this research never ventures. What is more, increasing amounts of this text have been digitized, opening it up to less research-intensive methods of exploration. The question that remains is how to take research into this largely uncharted textual realm.

Unsupervised learning and the "great unread"

To me, the answer is to take the direction of the research at least partially out of the hands of the researcher. Another branch of digital technology that has developed in recent years has focused on a different type of machine learning. Search technology is driven in part by what is called supervised learning; the better search algorithms improve their results by taking into account human responses to their earlier searches. But increasingly, fields like computational linguistics have delved into unsupervised learning, building algorithms to analyze text based on the text itself, without human intervention at the analytical stage. These algorithms reveal aspects of the underlying structure of texts that are not as explicitly constrained by researchers' conceptions of what the texts should look like. These techniques should allow us to delve into that great unread realm of digitized corpora:

I hope to show that these unsupervised statistical techniques allow us to look at some of the more diffuse phenomena in history. We already have well-developed methods to look into the lives of famous people, important events and the easily-quantifiable aspects of everyday life (demographics, economics). But aside from the somewhat quixotic attempts of Braudel and his ilk, few have looked into the everyday constants of the unquantifiable past. By undertaking a computerized reading of a large corpus (the Veritable Records of the Qing Dynasty), this project is able to quantify textual phenomena in a way that is not based on researcher assumptions. In doing so, I am able to look at the progress of five types of temporal change:
  1. Event-based change - the major events and turning points of history
  2. Generational change - in this case, the impact of individual rulers on their historical context
  3. Cyclical change - the repetition of phenomena on a circular basis, in particular the yearly cycle of seasons.
  4. Secular change - gradual, monotonic changes over the course of a long period
  5. Historical constants - phenomena that were a constant part of life in the past, but may or may not be different from the realities of modern life.
I will argue that the last three or four of these categories are poorly addressed by researcher-driven methods, except in cases where readily quantifiable data is available. In assessing cyclical or long-term change in social phenomena, we are generally constrained to the anecdotal observations of historical actors. Historical constants are even worse - our informants tend not to directly address the givens of their existence; these can only be inferred from the background noise of their writing, and then imprecisely at best.
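For readers who want something concrete: one standard way to pull apart the event-based, cyclical and secular components of a topic's time series is a classical seasonal decomposition. This is a sketch using an off-the-shelf tool (statsmodels), not the method used in the paper itself, and it assumes a monthly series of one topic's mean proportion has already been built from the doc-topic output:

```python
from statsmodels.tsa.seasonal import seasonal_decompose

# `series` is a monthly time series of one topic's mean proportion.
# With period=12 the decomposition separates a secular trend, a yearly
# cycle, and a residual in which event-driven spikes (the Taiping
# Rebellion, say) stand out against the historical-constant baseline.
result = seasonal_decompose(series, model="additive", period=12)
trend, seasonal, residual = result.trend, result.seasonal, result.resid
```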

In this method, rather than describing a priori the important categories of analysis, I assume only that there exist such categories. I then use a topic model to determine these categories based on the organization of the corpus itself. A topic model is a statistical model of how documents are constructed from a set of words and topics. At root, it is no different from the implicit model that researchers have about their texts - we assume that the authors of these texts were writing about certain topics, and chose to do so in a certain way. The substitution of a topic model simply implies a unified application of statistical assumptions across the entire corpus, rather than the selective application of unquantified assumptions across a subset of the texts.

The specifics of this topic model (Latent Dirichlet Allocation) and the software implementation (MALLET) that I used will not be described here. The model can be summarized as follows:
  • "Topics" are considered to be a distribution accross all Words in the Corpus, but are generally described by giving a list of the most common words in that topic.
  • "Documents" are considered to be a distribution accross all Topics in the corpus, but are generally described by giving the proportions of the two or three most common Topics.
  • The generative model assumes that a Document is created by "choosing" Topics in a set proportion, and then "choosing" words from those topics randomly, according to the Topic proportions in the Document and the Word proportions in the Topic.
  • The topic model is reverse engineered statistically based on the number of topics, and the observed occurrence of Words within Documents within the Corpus. The researcher specifies only four inputs to the model:
    • Number of Topics
    • Proportion of Topics in the Corpus (but not in individual documents)
    • The Corpus itself, organized into Documents
    • Word delimitation
  • This model was run with 40, 60 and 80 Topics. 60 Topics gave the best combination of specificity (fine-grained Topics) and accuracy (fewest "garbage" Topics and "garbage" Words) and was used in this analysis.
  • All Topics were assumed to occur in equal proportions in the Corpus.
  • The Corpus used is the Veritable Records for the last eight reign periods of the Qing Dynasty, covering the period from 1723 to 1911. It is a posthumously-assembled collection of court records for the reign of each emperor of the Qing Dynasty. It is generally well-ordered into Documents (consisting of messages to and from the emperor and records of court events) arranged in chronological order with "date stamps" (i.e. the date is specified for the first Document of each day).
  • Words are considered to consist of single Chinese characters (see the sketch immediately below).
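Here, as promised in the last bullet, is a minimal sketch of that setup. The actual model was run in MALLET; I use gensim below only because the whole pipeline fits in a few lines, and raw_docs is a placeholder for the corpus already split into Documents:

```python
from gensim import corpora
from gensim.models import LdaModel

# Word delimitation: split each Document into single Chinese characters,
# per the assumption above. raw_docs stands in for the loaded corpus.
texts = [[ch for ch in doc if not ch.isspace()] for doc in raw_docs]

dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

# 60 Topics; a symmetric alpha encodes the assumption that all Topics
# occur in equal proportions across the Corpus.
lda = LdaModel(bow, num_topics=60, id2word=dictionary,
               alpha="symmetric", passes=10)

# Describe each Topic by its most common words, as above.
for topic_id, words in lda.show_topics(num_topics=60, num_words=10,
                                       formatted=False):
    print(topic_id, [w for w, _ in words])
```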
It should be noted that there are several assumptions made in using this model that may not be representative of the way documents were actually written and compiled in the corpus:
  • Word order is not considered within documents (this is a "bag of words" model). Obviously word order does matter in the actual writing of documents, but this model assumes that it is not important in determining the topics used in a document - in other words, it assumes that grammar may contribute to meaning, but not to topicality.
  • Documents are analyzed synchronically by the topic model (i.e. time is not integrated into the statistical model). Diachronic analysis is reimposed on the Documents after Topics have been generated. Again, the documents were written over a long period of time, and topic usage may have changed over that span. In fact, topic drift is revealed by ex post facto analysis of the results, but it is not integrated into the model.
  • Document length is not considered. This is potentially problematic for the statistical conclusions reached on particularly short documents. These short documents were primarily records of court events and tend to cluster in certain topics. A solution will be addressed in future research.
  • Regional variation is not incorporated into the topic model. Certain geographic Topics are identified by the model, but are not incorporated into my analysis of the results.
  • Section headers were not removed from the corpus before analysis. These too were identified by the model: three topics occur almost exclusively in section headers, and in high proportion there.
  • Written Chinese in the late imperial period did not actually consist of single-character words; single- and multi-character words were both common. Single characters are analyzed because there are no good word-delimiting algorithms for classical (or modern) Chinese. As with document length and section headers, certain types of word formation are effectively reverse engineered by the topic analysis, despite the fact that they are not incorporated into the model.
Clearly there are some problematic assumptions in this document model. As I have noted above, the topic analysis proves capable of reverse engineering certain structures that are not part of the model - regionality, temporality, problems in data cleaning, etc. However, one especially problematic assumption is made in this analysis - it does not account for words, grammar or other intratextual structures. This analysis is very powerful at accounting for many corpus-level, intertextual phenomena - exactly the type of analysis needed to probe the dark areas of the textual map. But it is not a substitute for other approaches to understanding the more proximate realms of meaning of individual texts within the corpus.

Monday, September 19

Tangentially related thoughts - cyborg simulations

This is only slightly related to the discussion on nomadic academics of the past two posts. It is related in the sense that it is about alternative ways of relating to data and data-generation. There has been a decent amount of work on virtual economies and what they can tell us about RL economies. More recently, there has been a breakthrough in AIDS research (hat tip @javiercha) coming out of a tinker-toy type game. What these situations have in common is the combination of computer simulation with human interaction. The simulation is not complete without BOTH elements. This strikes me as an especially powerful way of doing certain types of research, something that goes beyond crowdsourcing, beyond complex systems models (like Conway's Game of Life). This is also different from gamification (which is bullshit, btw), or Jane McGonigal's games-to-change-the-world stuff. This is games to study the world.

This is cyborg simulation (half-human, half-computer). The promise, to me, is to combine the things that computers do well (crunching numbers, remembering things, setting limits) and the things that humans do well (some types of pattern recognition, teamwork, "creativity"). Wow. Think about that!

Friday, September 16

The Nomadic Academic, Part 2



I've gotten some good responses on my first post, and a nice link from my old classmate Graham Webster on his blog infopolitics. He poses a very important question, one that I hope to address in more detail in a later post. Much as I may use food and energy as extended metaphors for knowledge or information, the nomadic academic does not live on books alone. How then, Graham asks, are academics to buy real food and real gas while they are off in the metaphoric steppe herding diffuse information? His answer, in short, is to combine academic and non-academic (a.k.a. "real world") employment. You should go read it in full. But how much useful work will this produce? And can it compete with the work produced by those fully inside the academy?

As I said, I hope to tackle these questions at more length in a later post. But I have spent the week thinking about something else, so first this:

Part 2: The Secondary Products Revolution

In the previous post, I qualified nomadism as a strategy of mobility. Real-world nomadism as an economic-ecological adaptation required the horse and certain horse technologies (bit, stirrups) to allow people to cover more ground and control more animals than they could on foot. By analogy, certain computer technologies (full-text searching, translation software) allow the pioneer generations of nomadic academics to cover more ground than their precursors. It used to be the work of a career to read a major corpus (say, six or eight Dynastic Histories) looking for specific topics or terms (yoghurt, for example). Nowadays, it is possible for a Young Turk who barely reads Chinese to write a master's thesis on yoghurt covering Sima Qian to the Northern Wei and the Southern Song (that's 14 centuries, for those keeping track at home). I know this because I wrote such a paper.

Horseback riding is only one of several critical technologies on the way to nomadism. Superior coverage of diffuse information is a new productive paradigm, for sure. It enables quicker and easier access to a broader base of information than we would otherwise have. However, the end product is the same. To extend the metaphor, the old way to write a book with broad coverage was to expend a huge number of person-hours gathering in all your sheep to sell them at market. The new alternative is to ride a horse (i.e. a search engine), allowing fewer people to expend fewer hours bringing those same sheep to market. This still does not allow the academic nomad to range that far.

So what additional technologies are needed? In the neolithic, there was a critical second step in the domestication of animals, one of major importance to the development of nomadic pastoralism. This was the so-called Secondary Products Revolution, Andrew Sherratt's idea about the importance of non-meat animal products like milk, wool and labor. The importance of this to pastoralists was the ability to glean small profits from their animals during the long and expensive process of raising them to maturity. Eventually, many animals came to be raised primarily for their secondary products.

The importance of this to nomads should be obvious. Nomads use animals to gather the sparse produce of marginal zones. If they were to eat entirely from the meat of their animals or by exchanging animals for other food and goods, pure pastoralists would need truly enormous numbers of animals to survive. These animals in turn would require even more enormous pastures. The feasibility of raising large numbers of animals exclusively for meat by specialists (i.e. people who didn't also farm) is almost certainly limited to the modern CAFO, which, by the way, is a density-based (not mobility-based) strategy.

Secondary products, however, allow nomads to survive off their herds for long periods of time without reducing their numbers, because they primarily consume milk and blood. In most modern nomadic societies, milk and blood products - combined with grain foods obtained by trade - are the staples. Meat is a special-occasion food. This was almost certainly the case historically as well.

Essentially, secondary products give a way of storing animal-based food energy. Meat cannot be stored without salting or refrigeration. Grain can easily be stored, but not easily transported. Storing meat on the hoof is not a great option because you have to keep feeding it. But once you can get milk from that hoofed meat while it is in storage, it becomes a desirable option. More importantly, these milk-producing meat-storage units can move around. They're called cows.

In other words the secondary products revolution gave herders an energy storage option that was, in many ways, superior to grain. It was still impossible to store large numbers of livestock in one place (because they would eat all the grass), so grain remained the preferred "density" strategy. But it made specialized herding a plausible "mobility" strategy.

So how does this metaphor extend to academia? I would argue that standard research is a lot like growing grain - you sit in one place and plant a bunch of seeds (read other people's books), water them (attend seminar) and then you harvest them (write your own book). Other people can then take your seeds (book) and plant it in their field to grow some books of their own. Search-based research methods are more like hunting or herding - rather than working a known patch of soil, you cast a wider net (ok, it's also kinda like fishing) and pull in sparse resources from a large area. Here is where the metaphor breaks down however, because in our transitional generation we have been writing books. That's kinda like turning our meat (or fish) into grain. Or something.

The metaphor has broken, but I hope the point is made. If we, as mobility-choosing academics, want to range wider areas, it doesn't make a huge amount of sense for us to keep writing books. That's like killing all our sheep to sell at the farmer's market. Our mobility strategy has only taken us a small distance beyond the pale. What we need are secondary products. So what do milk-cows and yoghurt look like in the realm of information?

In short, we need technologies that continue to produce as long as they are fed, that give us years of milk instead of days of steak. This means developing databases that integrate with the material they describe. It means machine-learning. It means transferable publishing standards for methods as well as content. It means network methods and complex systems. Again, the food analogy is entirely worthless here because the idea is of recursive rather than iterative methods. It means any type of research that produces semi-autonomous systems of generating questions (that will generate answers when fed), rather than one-off conclusions.

This made sense in my head and a mess on the screen. Tell me what you think.

Friday, September 9

The Nomadic Academic, Part 1

Part 1: Choosing Mobility

At its largest (and smallest) scale, history (and the end of prehistory) is the story of competition for power and resources. Individuals, families, states and empires have competed with each other for ten thousand years and more. Much of that story is about the different strategies employed in this competition: hunting vs. gathering, herding vs. farming, raiding vs. trading... These strategies represent different ways of accumulating resources, and thereby power.

Early on, the critical resource was food - enough to feed the individual and the small group. Families with better strategies for their time and place had more food, and were more likely to thrive. At the beginning of history, the neolithic revolution - domestication of plants and animals - had made food relatively plentiful; people (and to a lesser degree large animals like horses) became the critical resource. States that could control more people (and horses) could win wars, build pyramids and leave records for us to read. Arguably, this continued until the advent of modernity and the arrival of fossil fuels. Coal and oil outweigh human power by enormous factors, and have changed global dynamics - powerful states were now the ones that could control fossil fuel, and later nuclear power.

In each of these periods, there were many different strategies for accumulating food, people and oil. However, we can generally divide these into mobile strategies and sedentary strategies. Hunting requires covering more ground than gathering, and is riskier, but the reward is a big bunch of good food at once. Farming focuses on extracting as many nutrients from a limited space as possible; herding makes up for sparse natural wealth by ranging animals across greater areas. The escalation of these strategies - nomadism and irrigation - differentiates them further. Nomads, by covering even more ground, can survive in areas of even sparser biomass. Irrigation, as well as fertilizer, pesticides and other farming inputs, enables even larger crops out of a small area. There are intermediary strategies - hybrid farming-herding, swidden farming and so on - but these represent the two basic models: stay in one place to concentrate returns, or move faster and further to cover more ground.

The empires that formed on the basis of these divergent strategies looked vastly different. On the one hand, places like China built huge population densities and stored tons of grain. They built walls to keep people in as much as to keep raiders out. They ramped up investment in agriculture, and built standing armies. On the other hand, places like Mongolia remained sparse. People, sheep and horses ranged across large areas. Armies were generally temporary and made up for their inferior numbers with superior mobility. With a horseback army, leaders like Genghis Khan could take on ground-bound forces more than ten times their size, simply by avoiding fighting them all at once. Mobility vs. density. The balance of power between these strategies went back and forth.

Having covered the basic concept of mobility, let's return to the argument that inorganic power sources have changed the game in modernity - that human power is no longer the determinant of strong states. I don't buy that premise on the grounds that food and labor are still clearly important. There are certainly states that are powerful on the basis of their energy resources - Saudi Arabia, Canada... But most of the big powers in the world are populous - China, the US... More importantly, control of money and information has clearly become a major determinant of world power: England is a financial power, Germany a technological power...

So what have been the financial and technological strategies of these powers? For much of the so-called "Information Age" (which I would argue has origins far earlier than the computer), the sedentary strategy has dominated. Learning was kept in books in libraries, and money was kept in banks. This is where computers and the internet came in. Just as nomadism came after farming, the mobility strategy of information followed the sedentary one. Information has become diffuse, and we have begun to domesticate horses that allow us to range across it.

The earliest nomad-farmer battles of the information age came from the horse-breeders - the computer scientists. Hackers were the Scythians, the earliest barbarian raiders. Linus Torvalds and Richard Stallman are our Attila and Tamerlane. Open-source software distributes the labor for its creation across hundreds of volunteers, just as nomadic armies were made up of thousands of part-time warriors / full-time herders. Wikipedia does similar things (imperfectly) with general knowledge. In perfect irony, open-source hardware is now growing to include technology for building farm equipment. The increasing treatment of financial instruments as pure information has caused all kinds of chaos, and it's hard to tell who the barbarians are...

Anyway, this is all a lead-in to what I hope will be an extended meditation. Academia is a largely feudal/bureaucratic institution built to house and control the distribution of knowledge. When the Universities of Bologna, Oxford, Salamanca and Paris, and for that matter the Hanlin Academy and the Library of Alexandria were founded, hand-copied books were the best we had. Knowledge beyond what a single person could create and remember had to be housed in sedentary locations, and it made sense to collect it. But for scholars, the printing press was our wheel, movable type our chariot, the computer our saddle and the internet our stirrup. So I ask, in the middle of our nomadic revolution, what would it look like for an academic to choose mobility?


Reading List
Deleuze and Guattari, A Thousand Plateaus
Owen Lattimore, Inner Asian Frontiers of China
David Christian, "Inner Eurasia as a Unit of World History"