Overview

Identification of file formats is crucial to any preservation strategy. Unfortunately, there are no universal standards for identification. The primary techniques used are suffix analysis (e.g. .doc, .ppt, ...) and "magic number" analysis, which checks for characteristic byte sequences at the start of a file. Even once a file type is identified, there may be context-specific issues to evaluate; for example, office documents may use fonts that are not embedded in the document.
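As a rough illustration of the two techniques, the following Python sketch compares a suffix-based guess with a magic-number guess for the same file. The function names, the labels, and the tiny signature table are assumptions made for this example; real identification tools rely on much larger signature databases.

```python
import os

# Assumed, deliberately small tables for illustration only.
# Leading byte signatures ("magic numbers") for a few common containers.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc/.xls/.ppt container
    (b"PK\x03\x04", "zip"),                         # also used by .docx/.xlsx/.pptx
]

# What the file name suffix claims the file is.
SUFFIX_HINTS = {
    ".pdf": "pdf",
    ".doc": "ole2",
    ".xls": "ole2",
    ".ppt": "ole2",
    ".zip": "zip",
}


def guess_by_suffix(path):
    """Return a label based only on the file name suffix, or None."""
    ext = os.path.splitext(path)[1].lower()
    return SUFFIX_HINTS.get(ext)


def guess_by_magic(path):
    """Return a label based only on the first bytes of the file, or None."""
    with open(path, "rb") as f:
        header = f.read(16)
    for signature, label in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return label
    return None


def identify(path):
    """Run both techniques and report whether they agree."""
    by_suffix = guess_by_suffix(path)
    by_magic = guess_by_magic(path)
    return {
        "suffix": by_suffix,
        "magic": by_magic,
        "agree": by_suffix is not None and by_suffix == by_magic,
    }
```

A disagreement between the two guesses (for example, a file named report.doc whose first bytes are %PDF-) is a signal that the suffix cannot be trusted on its own. Note that a container signature such as OLE2 or ZIP only narrows the possibilities; it does not say which application format is stored inside.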

Format Information

Short Assignment

Each student should study one common format (suggestions: WordPerfect, Lotus 1-2-3, ArcView, Excel, PowerPoint, PDF, ...) and answer the following questions:

  • What information is publicly available about the format?
  • Can files in this format be reliably identified?
  • What differences exist between versions?
  • What embedded or linked resources need to be considered in preserving files in this format?
  • What open-source libraries and conversion tools are available?
  • Are there free viewers available for this format?

We will need to negotiate format choices to ensure broad coverage.

You should prepare a 10-minute presentation on your findings.

Readings

magic number format