DIGITAL: Block on Topic Modeling in Common-place (from 2006)

Sharon Block, “Doing More with Digitization: An Introduction to Topic Modeling of Early American Sources,” Common-Place 06, no. 02 (January 2006), http://www.common-place-archives.org/vol-06/no-02/tales/.

“In the 1990s, for a research project on colonial sexual coercion, I read hundreds of microfilmed early American newspapers for references to rape trials. Partway through my research, Accessible Archives released CD-ROMs of the Pennsylvania Gazette that were fully text searchable. I remember waiting eagerly for the processing of each new Gazette folio so that a simple keyword search could generate a list (printed on a dot-matrix printer) of all occurrences of the word rape in that folio’s years.

“A mere decade later, I am disappointed when an article or source is not available at the stroke of a few keys. Despite justified concerns over this exploding technology (Who will have access to these documents? Does the digitization of select documents re-privilege certain kinds of history?), historical-document digitization has enormously expanded research capabilities. What used to require months of searching can now be accomplished in an afternoon.

“Yet having this material available electronically and fully searchable has created some new problems. Already we are seeing the limitations of keyword searching. In any given set of documents, some keywords are too broad in their meaning, some are too narrow, and others have too many different meanings. The results of keyword searches are quite often incomplete or full of “noise,” irrelevant results that make it hard to find what you are looking for. For searching to be effective, access needs to be supplemented by analysis.

“One promising way to move beyond keyword searching is to program computers not only to find words in these huge document collections but also to analyze documents by grouping them in subject classifications. This would provide a comprehensive indexing system that would far surpass human-indexing capabilities, but, even better, it would give scholars a complete picture of how their particular subject related to other subjects in that collection of documents: How much relative print space did the colonists give to discussions of Indians, of crime, or of politics? How did the entire contents of an eighteenth-century newspaper change over time?”
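
To make the "relative print space" question concrete, here is a minimal sketch of the final tallying step, assuming each article has already been assigned a subject label (by a topic model or by any other classifier). The records, subject labels, word counts, and the `print_space_share` helper are all hypothetical illustrations, not data or code from Block's article.

```python
from collections import defaultdict

# Hypothetical input: one (year, subject label, word count) tuple per article.
# The labels stand in for whatever a topic model or classifier would assign.
articles = [
    (1730, "crime", 200),
    (1730, "politics", 600),
    (1730, "Indians", 200),
    (1760, "crime", 100),
    (1760, "politics", 700),
    (1760, "Indians", 200),
]


def print_space_share(records):
    """Return {year: {subject: share of that year's total words}}."""
    totals = defaultdict(int)
    by_subject = defaultdict(lambda: defaultdict(int))
    for year, subject, words in records:
        totals[year] += words
        by_subject[year][subject] += words
    # Divide each subject's word count by the year's total word count.
    return {
        year: {subject: words / totals[year] for subject, words in subjects.items()}
        for year, subjects in by_subject.items()
    }


shares = print_space_share(articles)
for year in sorted(shares):
    print(year, shares[year])
```

Comparing the shares across years is one simple way to see how a paper's contents shifted over time. A real topic model would usually give each document a mix of subjects rather than a single label, so this single-label tally is a simplification.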

Read the rest: Common-place: Tales from the Vault
