Last week, I finished up a small project funded by a 2012 ACH Microgrant: visualizing the flow of DH knowledge as captured by citation networks and other measures from Digital Humanities Quarterly (DHQ). You can view some of the products of the project here, read more about the initial proposal here, or read about how to make DH visualizations in Part Two of this post.
This project proposed a visualization of DHQ's citation network with an eye toward identifying the flow and reuse of key texts in the DH community. Previous DH citation network research focused on exploring which journals include DH work, not on which individual DH works are cited within journals. This microgrant allowed me to conduct preliminary explorations into making the existing DHQ citation content easier to analyze through visualizations. It also let me make some recommendations for future DHQ visualization work: improving metadata (an area that's also being more thoroughly tackled by this NEH-funded project on improving DH bibliographic data), creating a streamlined workflow for dynamically updating visualizations of DHQ content, and embedding visualizations on the DHQ site.
The work supported by this grant is mostly visible on the webpages at digitalliterature.net/viewDHQ, which trace my project of visualizing DHQ through five visualizations. These visualizations outline several ways of seeing an interesting dataset: one with much potential to reflect meaningful issues of knowledge reuse and representation within the DH community, but one that also presents some issues in terms of its relatively small size (110 articles), limited interconnectedness of citations, and metadata markup.
One of the visualizations produced during the project: who, when included in an article's bibliography, gets cited by name a lot in the text of that article?
All five visualizations relied on the full (as of April 2012) set of 110 DHQ articles, identified throughout by their XML IDs (e.g. dhq-000009) and scraped from the DHQ website by the code magic of MITH Assistant Director Travis Brown. The first three visualizations explore DHQ’s citation network through
1. providing an overview of the dataset
2. exploring the frequency of “namechecking” in DHQ (if a work is cited by a DHQ article, how many times is that work cited by name in the main text of the article?)
3. identifying the DHQ articles with the most interlinkages (how many other works cite it or are cited by it?)
The dataset for these first three visualizations identifies 2,473 linked pairings between the 110 DHQ articles and each of the works cited within them. To add richness to this data, each link carries a "weight" recording how many times the cited work was mentioned within the article citing it (e.g. the dhq-000105 article mentions a 2010 article by Cohen twice, once in the bibliography and once in the main article text, so that link receives a weight of 2).
369 of these unique citations appear only once in their DHQ article’s XML file (i.e. the work was mentioned in the bibliography but not in the main text, at least not with the correct XML ID tag). Most of the citations (2,458 of them) were mentioned between 1 and 9 times across the main text and bibliography, with only 15 cited works receiving ten or more mentions in an article citing them; dhq-000093 (Molly Gage’s “Winesburg, Ohio: A Modernist Kluge”) was the extreme case, citing Sherwood Anderson’s Winesburg, Ohio 44 times (suggesting it might be interesting to compare citation "namechecking" between more traditional literary articles and newer DH work).
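For anyone who wants to see how link weights like these can be tallied, here is a minimal Python sketch. It is not the code used for the project: it assumes the mentions have already been pulled out of the DHQ XML as (article ID, cited work ID) pairs, and the cited work IDs and the `weighted_edges` helper are made up for illustration.

```python
from collections import Counter


def weighted_edges(mentions):
    """Collapse (article_id, cited_work_id) mention records into weighted links.

    Each key is one article-to-cited-work link; its value is how many times
    the cited work is mentioned in that article (bibliography entry included).
    """
    return Counter(mentions)


# Hypothetical mention records; the cited-work IDs are invented for this example.
mentions = [
    ("dhq-000105", "cohen2010"),    # bibliography entry
    ("dhq-000105", "cohen2010"),    # one mention in the main text
    ("dhq-000105", "example2009"),  # bibliography entry only
]

edges = weighted_edges(mentions)

# Links with weight 1 were listed in the bibliography but never tagged in the text.
bibliography_only = [pair for pair, weight in edges.items() if weight == 1]

# Links with ten or more mentions are the heavy "namechecking" cases.
heavily_namechecked = [pair for pair, weight in edges.items() if weight >= 10]
```

Counting this way keeps each article-to-work pairing as a single link, so the weight can later be mapped to something visual such as line thickness.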
The last two visualizations focus on the similarity among DHQ articles by
5. identifying those DHQ articles identified by the topic modeling process as having the highest degree of document similarity
These two visualizations used the 110 DHQ articles only (i.e. not the cited works as well, as in the first three images); the dataset used for these visualizations was produced by Travis Brown using the MALLET topic modeling package. This dataset identifies 5,996 linkages among DHQ articles (one for each possible pairing of two DHQ articles). Edge weights here represent degrees of document similarity among the DHQ articles; these weights record symmetrized KL-divergence, which measures the distance between the probability distributions for a pair of articles, such that the smallest divergence signifies the articles with the closest topics. Check out Visualizations 4 and 5 for some thoughts on these images.
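To make the symmetrized KL-divergence weights a little more concrete, here is a small Python sketch of the arithmetic. It rests on several assumptions: each article is represented by a list of topic proportions that sum to 1 with no zero entries, the "symmetrized" form is the sum of the two one-directional divergences (some tools use the average instead), and the three topic distributions are invented. The actual dataset came from MALLET and may have been computed differently.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """KL(P || Q) for two distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetrized_kl(p, q):
    """Assumed symmetrization: KL(P || Q) + KL(Q || P)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Invented topic proportions for three articles over three topics.
topics = {
    "article_a": [0.7, 0.2, 0.1],
    "article_b": [0.6, 0.3, 0.1],
    "article_c": [0.1, 0.2, 0.7],
}

# One weighted edge per unordered pair of articles.
similarity_edges = {
    (a, b): symmetrized_kl(topics[a], topics[b])
    for a, b in combinations(sorted(topics), 2)
}

# The pair with the smallest divergence has the most similar topic mix.
closest_pair = min(similarity_edges, key=similarity_edges.get)
```

Note that, unlike the citation weights, a smaller number here means a stronger connection, which is worth keeping in mind when mapping these weights onto edge thickness or color.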
Not sure what an edge weight is? Want to see what a basic visualization dataset looks like? Read Part Two of this post for some beginner tips on creating DH visualizations with Gephi.
This post was a Digital Humanities Now Editor's Choice article on July 12, 2012.