Overview
Identification of file formats is crucial to any preservation strategy. Unfortunately, there are no universal standards for identification. The primary techniques used are suffix analysis (e.g. .doc, .ppt, ...) and "magic number" analysis, which inspects the characteristic byte sequence found at the start of many file types. Even once a file type is identified, there may be context-specific issues to evaluate; for example, office documents may use fonts that are not embedded in the document.
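The two techniques can be compared with a minimal sketch. The function names and the small lookup tables below are illustrative assumptions, not part of any particular tool; real identifiers such as file or DROID rely on far larger signature sets and also look at offsets other than the start of the file.

```python
import os

# Illustrative, deliberately tiny tables (assumed for this sketch only).
SUFFIX_TYPES = {
    ".doc": "Microsoft Word",
    ".ppt": "Microsoft PowerPoint",
    ".pdf": "PDF",
    ".png": "PNG image",
}

# (leading bytes, format name) pairs; each signature appears at offset 0.
MAGIC_NUMBERS = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify_by_suffix(filename):
    """Guess the format from the file name's extension, or return None."""
    ext = os.path.splitext(filename)[1].lower()
    return SUFFIX_TYPES.get(ext)


def identify_by_magic(data):
    """Guess the format from the leading bytes of the content, or return None."""
    for signature, name in MAGIC_NUMBERS:
        if data.startswith(signature):
            return name
    return None


def identify_file(path):
    """Return (suffix guess, magic guess) for a file on disk."""
    with open(path, "rb") as handle:
        header = handle.read(16)  # long enough for every signature above
    return identify_by_suffix(path), identify_by_magic(header)
```

The two guesses can disagree: a file named slides.ppt whose content starts with %PDF- would be reported as PowerPoint by its suffix and as PDF by its magic number. Note also that legacy Word, Excel, and PowerPoint files all share the OLE2 signature, so a magic number alone may narrow the format down to a family rather than a single type.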
Format Information
- Common Format Types
  - General source of info http://www.wotsit.org
  - Word processing
  - Spreadsheet
  - Presentation
  - Database
  - GIS
  - Image
  - Audio
  - Video
    - QuickTime
    - MPEG
  - Disk formats
    - FAT
    - ISO9660
  - Compression: common standards
- Format Identification
  - Suffix Analysis
  - Magic Numbers
- Risks
- Case Study
  - PowerPoint
Short Assignment
Each student should study one common format (suggestions: WordPerfect, Lotus 1-2-3, ArcView, Excel, PowerPoint, PDF, ...) and answer the following questions:
- What information is publicly available about the format?
- Can files in this format be reliably identified?
- What differences exist between versions?
- What embedded or linked resources need to be considered in preserving files in this format?
- What open source libraries and conversion tools are available?
- Are there free viewers available for this format?
We will need to negotiate on format choices to ensure broad coverage.
You should prepare a 10-minute presentation on your findings.
Readings
- Survey and assessment of sources of information on file formats and software documentation http://www.jisc.ac.uk/uploaded_documents/FileFormatsreport.pdf
- Automatic Format Identification Using PRONOM and DROID http://droid.sourceforge.net/wiki/images/b/b4/Technical_Paper_1_-_Automatic_Format_Identification_v2.pdf
- JHOVE http://hul.harvard.edu/jhove/index.html, "Knowing what you've got"
- Digital Formats: Factors for Sustainability, Functionality, and Quality. Paper by Caroline R. Arms and Carl Fleischhauer for IS&T Archiving 2005 Conference, Washington, D.C. http://memory.loc.gov/ammem/techdocs/digform/Formats_IST05_paper.pdf
- Library of Congress page on formats http://www.digitalpreservation.gov/formats/intro/intro.shtml
- Smithsonian Recommendations for Preservation Formats
- Recommended Data Formats for Preservation Purposes in the FCLA Digital Archive
- Unix file command for format identification http://www.darwinsys.com/file/
- T. Reichherzer and G. Brown. "Quantifying Software Requirements for Supporting Archived Office Documents Using Emulation", Joint Conference on Digital Libraries (JCDL), 2006. http://doi.acm.org/10.1145/1141753.1141770.
- S. Abrams ... "Harvard's Perspective on the Archive Ingest and Handling Test" http://www.dlib.org/dlib/december05/abrams/12abrams.html
- Stephen Abrams "Knowing What You've Got: Format Identification, Validation, and Characterization" http://www.dcc.ac.uk/events/archives-2006/Abrams_LUCAS-2006.ppt


