August 26, 2013 by Amanda Visconti

BitCurator: Digital Forensics in the Archive

I've just started work on the BitCurator project at MITH; to give you a sense of what I'm doing, here are my tweets explaining, defining, and relating this information science work to humanists (skip down for more on digital forensics ethics if you've already seen the tweets!):

.@Bitcurator: building an inexpensive, easy-to-use digital forensics toolkit for info professionals working w/born-digital collections.

— Amanda Visconti (@Literature_Geek) August 21, 2013

Bundle OS tools, design GUI & documentation for libraries/archives, add forensics to the digital archive workflow w/eye toward public access

What is digital forensics? Recovery & investigation of digital files & environment, e.g. emails & DOCs, hidden/erased/dated files, metadata

What are born-digital materials? Anything generated digitally, e.g.:

Emails btw author & editor, a laptop where a book was drafted, digital poem drafts, browser history from an author's surfing for inspiration

Q. Taking off my info hat & donning my literature one, what's interesting about digital forensics + born-digital collections for humanists?

A1. Context for authors' writing. As w/physical environs of an author, her digital environment: wallpaper, mp3s played while writing, emails

A2. Regaining access to creative or scholarly work "lost" on defunct types of digital storage like floppies and zip drives.

Read more about @Bitcurator at bitcurator.net, e.g. @pwolsen's [nicely illustrated] post on creating a digital forensics workstation

I'm working on web design/dev, usability, documentation, and community-building for the project.

Ethics for Digital Forensic Archival Work

I've been reading up on digital forensics in the archive, and one of the most interesting topics to me is how digital archivists explain the full possibilities of digital forensics work to potential donors. When an author brings an archive the laptop she used to draft a famous novel, how do archivists go about concisely but ethically explaining what kind of knowledge future researchers could potentially extract from her data? (I'm fully confident in the ethics of archive professionals; I'm just wondering how such risks are concisely explained and contrasted against the potential future benefits of handing over a lot of an author's personal data. What do donor agreements for digital gifts look like? I did some research!)

The possibilities for working with all the data on an author's laptop fascinate me. With the author's drafts alongside her emails, music playlists, calendar schedules, and everything else people keep on their computers, you could potentially draw out a detailed picture of that author's writing practices (did she write only in the middle of the night? how did email arguments with publishers affect the amount of creative work she was getting done?), or track down just when some crucial idea was first typed and saved (while listening to a certain album? after visiting certain pages on Wikipedia?).

At the same time, all this possibility for recreating the context of authorial processes feels a lot creepier when it concerns people alive today, not some Victorian author whose letters, date book, and diary were recovered from an attic. Best practice holds that developing a relationship of trust with the donor and thoroughly documenting their wishes in a written donor agreement can go a long way toward helping future archivists evaluate whether new data analysis techniques would be in keeping with the donor's privacy needs. Digital Forensics and Born-Digital Content in Cultural Heritage Collections, a white paper by Matthew Kirschenbaum, Richard Ovenden, and Gabriela Redwine, explores this ethical issue as well as others. In this paper, Cal Lee provides an impressive list of hidden forms of data lurking in the Microsoft Office suite alone:

Forms of data that may be hidden from users include information about the application used to create a document; authors, user names, organizational affiliations, and author history; comments; custom properties; database queries; embedded objects (OLE); Fast Save—that is, change history appended to the end of a file, rather than applied to the body of a document; the GUID (Globally Unique Identifier) for the originating computer; hidden cells, slides, and text (purposely hidden but then possibly forgotten); Outlook (e-mail) properties and routing slips; path information, including audio and video paths, author history, linked objects, printers, hyperlinks, and included fields; presentation notes; printer driver information; RSID (Revision Save ID), which differentiates changes from different editing sessions; tracked changes (added to PowerPoint and Excel in Office XP) versions; Visual Basic code, including macros and viruses (and the identity of the code’s creators); Web server information; and white text (on white background). (List provided by Cal Lee. Pages 45-46 in Digital Forensics and Born-Digital Content in Cultural Heritage Collections, a white paper by Matthew Kirschenbaum, Richard Ovenden, and Gabriela Redwine)
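To make a few items on that list concrete, here is a minimal Python sketch (mine, not from the white paper and not part of the BitCurator toolkit) that pulls author names and revision details out of a Word document. It assumes the newer .docx format, which is a zip package that keeps this metadata in a file called docProps/core.xml; older .doc files store it differently and aren't handled here. The function name and field labels are my own inventions for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Illustrative label -> element inside core.xml (labels are hypothetical).
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
    "revision": "cp:revision",
}


def read_core_properties(docx):
    """Return core metadata from a .docx given as a path or file object."""
    with zipfile.ZipFile(docx) as package:
        try:
            raw = package.read("docProps/core.xml")
        except KeyError:
            # The package carries no core-properties part at all.
            return {}
    root = ET.fromstring(raw)
    found = {}
    for label, path in FIELDS.items():
        element = root.find(path, NS)
        # Skip properties that are absent or empty in this document.
        if element is not None and element.text:
            found[label] = element.text
    return found
```

Calling `read_core_properties("draft.docx")` on a hypothetical draft would return a small dictionary of whichever of these properties the file happens to carry. This only scratches the surface of Lee's list, but it shows how little effort some "hidden" data takes to read.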

While it's important to get across the full extent of risk when handing over storage devices to research archives, it's also important to show the benefits of providing seemingly unimportant contextual data to future researchers. Digital Forensics mentions that Emory uses the phrase "data analysis" instead of "digital forensics", as becoming the subject of a forensics investigation sounds much less inviting than having the context of one's authorial process recreated and explored.

I looked over the donor agreements MITH created for Deena Larsen's and Bill Bly's gifts to our collection of pioneering e-lit and digital writing context; Digital Forensics also links to some online examples of donor agreements, including:

  • The Paradigm Project (archives from major UK political parties)
  • OpenDOAR (the Directory of Open Access Repositories), which offers a nice checkbox questionnaire to help you generate your own digital materials donor agreement
  • The Variable Media Questionnaire, an environment for recording descriptive information about born-digital materials, including interviews with creators that allow a case-by-case record of what authors identify as the significant properties of their creative digital work

This last project takes an interesting stance on digital preservation, helping archivists record whether and how creators want their work to be emulated or otherwise recreated in the future: "The variable media paradigm also asks creators to choose the most appropriate strategy for dealing with the inevitable slippage that results from translating to new mediums: storage (mothballing a PC), emulation (playing Pong on your laptop), migration (putting Super-8 on DVD), or reinterpretation (Hamlet in a chat room)."