July 11, 2012 by Amanda Visconti

View DHQ: Getting Started with Gephi for DH Vis (Part II of II)

View the site related to this post: digitalliterature.net/viewDHQ. Huge thanks to the ACH for funding this work!

In the first part of this blog post, I discussed the results of a small visualization project funded by a 2012 ACH Microgrant: visualizing the flow of DH knowledge as captured by citation networks and other measures from Digital Humanities Quarterly (DHQ). You can view some of the products of the project here, read more about the initial proposal here, or jump back to Part One of this post to read more about the project's outcome. Now, I'll discuss some of the technical details of creating DH visualizations...

Getting Started with Gephi

Beta. Gephi is free, open-source visualization software for Mac, Windows, and Linux, available from Gephi.org. Its beta status (currently 0.8.1-beta) meant I periodically ran into small bugs (usually fixed by deleting the user library and reinstalling); the largest beta-style issue was that visualization images weren't exporting as editable files (e.g. SVGs), so I couldn't add final touches to the images in design software like Adobe Illustrator.

Basic Terminology. Outside these minor issues, I found it quite easy to start making basic visualizations with Gephi. As I began working with various datasets, I found my workflow fell into two parts:

  1. Thinking about what I'm trying to display--either specifics ("I want to show how many steps it takes to connect Jean Valjean to any other character in Les Miserables"), or a general sense of what kind of interesting things I'm hoping to see pop out ("How is knowledge being cited and reused in DHQ? What are some of the most frequently cited writings in DH?")--and structuring my dataset appropriately. The questions to ask yourself: what do you want to show, and what data must you record to model that idea?

    DHQ Stack Issue. We worried that DHQ's 110 articles would show little interconnectedness, so we planned on adding additional measures to the visualizations.

  2. Importing the dataset into Gephi and working with the data through the visual interface--either experimentally ("Let's try some different layouts and filters to see if anything interesting pops up!") or with a predetermined purpose ("I want to determine the five most frequently cited works in DHQ's citation set, then display how all the other articles and citations are webbed among these key five nodes").

Gephi can handle complex data--e.g. a dynamic stream of Twitter hashtag results--but a dataset can also be a ridiculously simple thing, so this article will assume we're all working with some pretty basic sets of data. Some quick terminology might help:

  • Think of a visualization as a bunch of objects connected by links--circles and lines, at the simplest level. The objects are called nodes and the links are called edges. Each visualization dataset will model various nodes and the way they are connected via edges.
  • Edges can be given weights, which are quantitative measures that tell you something about a given edge between two nodes (e.g. how many times a pair of characters meet). Nodes have measures of their own; one is degree, which is the number of edges connecting to a node, representing how interconnected that node is.
  • Visualizations can be directed or undirected; directed edges only point one way (from node A -> node B but not also B -> A), while undirected edges point both ways. A dataset that records each and every character interaction in a book might choose to use directed edges to connect character nodes if it were recording interactions such as thinking about or overhearing someone, while a similar dataset that only recorded interactions that are dialogues could be undirected, since the interaction modeled by the edge goes both ways.
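To make these terms concrete, here is a minimal Python sketch (plain Python, not Gephi's own code) that stores a handful of weighted edges and counts degree both ways. The rows and weights are invented for illustration, and the function names `undirected_degree` and `directed_degree` are my own.

```python
from collections import Counter

# Each edge is (source, target, weight). These rows are invented examples.
edges = [
    ("Bloom", "Molly", 3),
    ("Bloom", "Fr. Conmee", 1),
    ("Molly", "Fr. Conmee", 2),
    ("Bloom", "Stephen", 1),
]

def undirected_degree(edges):
    """Count every edge once for each of its two endpoints."""
    degree = Counter()
    for source, target, _weight in edges:
        degree[source] += 1
        degree[target] += 1
    return degree

def directed_degree(edges):
    """Return (in_degree, out_degree): edges arriving at and leaving each node."""
    in_degree, out_degree = Counter(), Counter()
    for source, target, _weight in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree

degree = undirected_degree(edges)
in_degree, out_degree = directed_degree(edges)
print(degree["Bloom"], in_degree["Bloom"], out_degree["Bloom"])
```

In the undirected reading Bloom simply has three connections; in the directed reading all three of those edges leave Bloom and none arrive, which is the kind of difference the directed/undirected choice makes visible.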

My First Sony Dataset

For the visualizations with DHQ and with Ulysses I've worked on recently, my datasets have largely been two- or three-column spreadsheets showing the relationship between a pair of items, plus sometimes a weight (i.e. some quantified measure of that relationship, such as the number of times a pair of characters meet). For the Ulysses "Wandering Rocks" visualization, we simply had one column for the "source" (the node for a person leading the interaction) and one column for the "target" (the focus of that interaction). Thus, one row in a general Ulysses interaction dataset might have a source cell saying "Bloom" and a target cell saying "Molly", to record a social interaction where the character Bloom spoke to Molly. Make sure that any nodes (your characters, articles, objects, etc.) that are mentioned multiple times within a dataset are always identified with the exact same spelling, so that Gephi can recognize each mention as referring to the same thing (e.g. we always referred to one character as Fr. Conmee, never as Father Conmee).
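If your spreadsheet was typed by hand, a quick cleanup pass can catch inconsistent spellings before import. Below is a rough sketch of one way to do that in Python; the `ALIASES` table, the `canonical` helper, and the sample rows are all hypothetical, and you would fill the table with whatever variants actually appear in your own data.

```python
# Hypothetical alias table: map each variant spelling to one canonical label.
ALIASES = {
    "Father Conmee": "Fr. Conmee",
}

def canonical(name):
    """Collapse stray whitespace, then swap known variants for the canonical label."""
    cleaned = " ".join(name.split())
    return ALIASES.get(cleaned, cleaned)

# Invented (source, target) rows with the kinds of slips that split one node into two.
rows = [
    ("Bloom", "Molly"),
    ("Father Conmee", "Bloom"),
    ("Fr. Conmee ", "Molly"),  # note the trailing space
]

clean_rows = [(canonical(source), canonical(target)) for source, target in rows]

# Listing the distinct node names is an easy way to eyeball leftover duplicates.
nodes = sorted({name for row in clean_rows for name in row})
print(nodes)
```

Without the cleanup, these three rows would produce four nodes instead of three, with Conmee's connections divided between two spellings.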

Ulysses Wandering Rocks Dataset Snapshot. A snapshot of part of the super-basic dataset spreadsheet for the "Wandering Rocks" visualization.

For the first three DHQ visualizations, I had "source", "target", and "weight" columns, where each row recorded data about one DHQ article (the source), one of the items in its bibliography (the target), and a weight for that edge (we decided to weight edges by the number of times an item in an article's bibliography was explicitly cited in the main text of that article).
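As an illustration of that three-column layout, here is a small Python sketch that counts in-text mentions and writes them out as CSV. This is not the script used for the DHQ project; the article and work identifiers are placeholders, `to_edge_csv` is a made-up helper, and I'm assuming the header names Source, Target, and Weight to mirror the columns described above (check Gephi's import dialog for what your version expects).

```python
import csv
import io
from collections import Counter

# Placeholder data: one (citing article, cited work) pair per in-text mention.
mentions = [
    ("article-A", "work-X"),
    ("article-A", "work-X"),
    ("article-A", "work-Y"),
    ("article-B", "work-X"),
]

# Edge weight = number of times the source explicitly cites the target.
weights = Counter(mentions)

def to_edge_csv(weights):
    """Return CSV text with a header row and one row per (source, target) edge."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])  # assumed header names
    for (source, target), weight in sorted(weights.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()

edge_csv = to_edge_csv(weights)
print(edge_csv)
```

Writing to a string first makes the result easy to inspect; to produce a file for import, write `edge_csv` to a `.csv` file of your choosing.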

Once you've set up a dataset, you're ready to check out your data in Gephi (there are other ways to get data in, but for a simple dataset, importing a spreadsheet as a CSV is probably the easiest). Fire up the program and note the three main tabs at the top of the interface:

  • Overview is where you'll work with your dataset in visual form, trying out different layouts, filtering for specific data, customizing colors and labels, etc.
  • Data Laboratory is where you can view a spreadsheet version of your dataset; when viewing a visualization image in the Overview, for example, you might want to right-click and select "View in Data Laboratory" to see the specifics of a given node. The Data Laboratory is also where you'll want to head to import your dataset ("Import Spreadsheet"), following these instructions if you need a little extra help. Once you've imported your dataset successfully, you can jump back to the Overview to start playing with your data.
  • Preview is the pane for seeing how your final visualization looks to the world, allowing you to make final customizations (e.g. background and label coloring) and choose how to export your visualization (as noted above, .PNG seems to be the current best option for Mac users).

What's Next?

Circular visualization of Les Miserables character interconnections. A quick visualization of character relations in the Gephi Les Miserables dataset.

The best way to learn Gephi is to import a dataset with which you're familiar; if you don't currently have one of your own, try opening whichever of Gephi's example datasets is most familiar to you (perhaps it's the connections between characters in Victor Hugo's novel Les Miserables, or a network of nearly all (?) the Marvel superheroes). I'll cover a few of the many buttons and widgets below, but the best way to learn is probably to

  • begin with Gephi's three basic tutorials,
  • check out Elijah Meeks and Molly Wilson’s beautiful work with visualizing DH work at Stanford for an aspirational example,
  • and start trying out buttons in the Overview pane to see what they do (hover your cursor over most buttons for a helpful explanation of what each one does).

Gephi maintains a forum and a wiki that probably hold answers to most of your basic questions; also check out the list of Gephi plugins, which add functionalities such as embedding zoomable visualizations in your web page and pulling data from Twitter.

Three Features to Explore

Once you've got your dataset imported and you're on the Overview pane:

  • Play around with layouts (bottom-left of page; select a layout from the drop-down menu, then hit "run" to begin the layout algorithm and "stop" when desired). Options include a circular layout (arranges nodes alphabetically around the circle's edges, as with the Les Miserables visualization above), the force-directed layout (which takes into account the tightness of interlinkages among nodes), expansion (increases space among nodes for better readability), and the node adjust and noverlap layouts (which also help your labels be readable).
  • Use filters to highlight specific parts of your dataset (upper-right corner). For example, you can filter out all but the most interconnected nodes by double-clicking Filters > Topology > Degree Range, then using the slider that appears at the bottom-right to choose the range of degrees for which nodes should be displayed (you can also click the number at either end of the range and type in a number, if the slider isn't fine enough for your needs), then clicking "filter" to see the results. For example, we could use this filter on the Les Mis dataset to only show characters who interact with at least three other characters by setting the minimum degree to 3.
  • Adjust edge, node, and label size and coloring using the ranking options in the upper-left. For example, for the DHQ dataset (which weights edges by how many times a target citation is mentioned in the source node article), choose the edges tab underneath "ranking", then choose the parameter "weight". Use the tiny palette icon (upper-right of the pane, but below the drop-down menu) to choose a color scheme, the min/max fields to limit the sizes possible, and the other small icons (upper-right of the pane, above the drop-down menu) to choose whether to affect edge color, label color, or node size. Click apply to see your changes.
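For readers who like to see the logic behind a button, the degree-range idea from the filters bullet can be sketched in a few lines of Python. This is only an approximation of the concept, not Gephi's implementation: the function `degree_range_filter` is hypothetical, the edge list is invented toy data rather than the real Les Miserables dataset, and degree is computed once on the full, undirected graph.

```python
from collections import Counter

def degree_range_filter(edges, min_degree, max_degree=None):
    """Keep nodes whose degree lies in [min_degree, max_degree] and the edges among them."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    keep = {
        node
        for node, count in degree.items()
        if count >= min_degree and (max_degree is None or count <= max_degree)
    }
    kept_edges = [(s, t) for s, t in edges if s in keep and t in keep]
    return keep, kept_edges

# Invented toy interactions, for illustration only.
edges = [
    ("Valjean", "Javert"),
    ("Valjean", "Cosette"),
    ("Valjean", "Marius"),
    ("Cosette", "Marius"),
    ("Javert", "Marius"),
    ("Cosette", "Fantine"),
]

# "At least three connections" corresponds to a minimum degree of 3.
nodes, kept_edges = degree_range_filter(edges, min_degree=3)
print(sorted(nodes))
```

Nodes that fall outside the range disappear along with every edge attached to them, which is why a filtered view can look much sparser than the slider value alone suggests.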