Recommended Formats Statement



X. Web Archives

This format specification covers the Library’s preferred format for archived web content, or web archives. The Library is aware that websites, including the blogs, social media and other web content that make them up, are created and presented in formats designed for viewing in a web browser, and that these formats often differ from the standard format recommended for preservation and long-term access. Because the focus of this document is preservation and long-term access, the following format preferences favor those outcomes. For best practices that make web content easier to preserve, see the Library of Congress Web Archiving Team’s recommendations on creating preservable websites.

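i. Websites

Preferences are listed by category. Within each category the Preferred option comes first and the Acceptable option, where one exists, follows.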
A. Formats

Preferred: The Library, and other organizations involved in web archiving, are preserving web content in the Web Archive (WARC) format using record-at-a-time GZIP compression, as described in Appendix A of the WARC Standard.

Acceptable:

  • Internet Archive's ARC_IA format, a precursor to the WARC format
  • Web Archive Collection Zipped (WACZ), as used in the Webrecorder project 
  • CDX as a component file for WARC file content
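
Record-at-a-time GZIP compression, as named in the preferred format above, can be sketched as follows. This is a minimal illustration using only the Python standard library, not the Library's tooling: each WARC record is compressed as its own GZIP member and the members are concatenated, so a record can be read back from its byte offset without decompressing the whole file. The function names, the `resource` record type and the example URIs are assumptions made for the sketch, and the header fields shown should be checked against the WARC standard.

```python
import gzip
import io
import uuid
import zlib
from datetime import datetime, timezone


def build_warc_record(record_type: str, payload: bytes, target_uri: str) -> bytes:
    """Serialize one uncompressed WARC record: header, blank line, block, two CRLFs."""
    headers = [
        "WARC/1.1",
        f"WARC-Type: {record_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: text/plain",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


def write_record_at_a_time(records):
    """Compress each record as a separate GZIP member and concatenate the members.

    Returns the file bytes and the byte offset at which each member starts
    (the kind of offset an index file could store for each record).
    """
    out = io.BytesIO()
    offsets = []
    for record in records:
        offsets.append(out.tell())
        out.write(gzip.compress(record))
    return out.getvalue(), offsets


def read_record_at(data: bytes, offset: int) -> bytes:
    """Decompress only the GZIP member that starts at `offset`."""
    # wbits = MAX_WBITS | 16 selects the gzip container; the decompressor
    # stops at the end of the first member and ignores the rest.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    return decompressor.decompress(data[offset:])


if __name__ == "__main__":
    recs = [
        build_warc_record("resource", b"first page", "http://example.com/a"),
        build_warc_record("resource", b"second page", "http://example.com/b"),
    ]
    data, offsets = write_record_at_a_time(recs)
    print(read_record_at(data, offsets[1]).decode("utf-8"))
```

Because every member is a complete GZIP stream, the concatenated file is still a valid `.gz` file for ordinary tools, while an index of offsets allows a single record to be fetched on its own.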
B. Delivery Method

Preferred: Capture using tools that produce non-proprietary output, to conform with standard formats and requirements

Acceptable: Transmission of WARC or ARC_IA files created by web content producers or other archiving organizations

C. Metadata
Preferred:
  1. Refer to the WARC ISO-standard specification for mandatory and recommended metadata fields
  2. When displaying archived content, the following should be clearly indicated:
    1. archiving institution,
    2. date and time of capture,
    3. statements about functionality within the archive to distinguish it from the live site
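
The two metadata points above can be sketched in code. The example below is illustrative only and rests on stated assumptions: the four names in `MANDATORY_FIELDS` are the header fields understood to be required on every record by the WARC 1.0/1.1 specification and should be confirmed against the standard, the parser ignores folded (continuation) header lines, and `replay_banner` with its wording is a hypothetical helper rather than any replay tool's actual interface.

```python
from datetime import datetime, timezone

# Assumption: fields required on every WARC record; verify against the standard.
MANDATORY_FIELDS = ("WARC-Record-ID", "Content-Length", "WARC-Date", "WARC-Type")


def parse_warc_header(record: bytes) -> dict:
    """Return the named header fields of one uncompressed WARC record."""
    head, _, _ = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    fields = {}
    for line in lines[1:]:  # lines[0] is the version line, e.g. "WARC/1.1"
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


def missing_mandatory_fields(record: bytes) -> list:
    """List mandatory fields absent from the record (names compared case-insensitively)."""
    present = {name.lower() for name in parse_warc_header(record)}
    return [field for field in MANDATORY_FIELDS if field.lower() not in present]


def replay_banner(institution: str, captured: datetime, note: str) -> str:
    """Build display text naming the archive, the capture time and a functionality note."""
    stamp = captured.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    return f"Archived by {institution} on {stamp}. {note}"


if __name__ == "__main__":
    sample = (
        b"WARC/1.1\r\n"
        b"WARC-Type: resource\r\n"
        b"Content-Length: 0\r\n"
        b"\r\n\r\n\r\n"
    )
    print(missing_mandatory_fields(sample))
    print(replay_banner(
        "Example Library",
        datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
        "Some features may not work as they did on the live site.",
    ))
```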

Acceptable: The ARC_IA should be named in a manner that easily identifies the archiving institution (see WARC standard for recommended naming conventions)
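
A file name that identifies the archiving institution might be generated as in the sketch below. It assumes the pattern `Prefix-Timestamp-Serial-Crawlhost.warc.gz`, which an informative annex of the WARC standard is understood to describe; confirm the exact convention against the standard before relying on it. The function name, the prefix and the host name are hypothetical.

```python
from datetime import datetime, timezone


def warc_filename(prefix: str, created: datetime, serial: int, crawl_host: str) -> str:
    """Build a name of the assumed form Prefix-Timestamp-Serial-Crawlhost.warc.gz.

    prefix     -- short label identifying the institution or collection
    created    -- timezone-aware creation time, rendered as 14 UTC digits
    serial     -- increasing counter, zero-padded to five digits
    crawl_host -- host name or IP address of the machine that wrote the file
    """
    timestamp = created.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{prefix}-{timestamp}-{serial:05d}-{crawl_host}.warc.gz"


if __name__ == "__main__":
    print(warc_filename(
        "EXAMPLE-LIB",
        datetime(2024, 3, 5, 14, 30, 0, tzinfo=timezone.utc),
        7,
        "crawler01.example.org",
    ))
```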

D. Technological Measures

Tools currently available cannot capture all web content, so certain types of web content may not be preservable through web capture at this time. These include:

  • Multimedia-rich content
  • Streaming media
  • Deep web content
  • Databases
E. Referencing

Web materials in any web archive can be referred to persistently using the URN Namespace Registration for Persistent Web IDentifiers (PWID).
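
A PWID reference could be assembled as in the sketch below. It assumes the form `urn:pwid:<archive-id>:<archival-time>:<precision-spec>:<archived-item>` and the precision values listed in `PRECISIONS`, as understood from the PWID URN namespace registration; both should be verified against the registration itself. The function name `build_pwid` is hypothetical.

```python
from datetime import datetime, timezone

# Assumption: precision values as understood from the PWID registration.
PRECISIONS = {
    "part", "page", "subsite", "site",
    "collection", "recording", "snapshot", "other",
}


def build_pwid(archive_id: str, captured: datetime, precision: str, archived_item: str) -> str:
    """Assemble a PWID URN from an archive identifier, capture time, precision and item."""
    if precision not in PRECISIONS:
        raise ValueError(f"unknown precision: {precision!r}")
    # Capture time is rendered in UTC with a trailing "Z".
    stamp = captured.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"urn:pwid:{archive_id}:{stamp}:{precision}:{archived_item}"


if __name__ == "__main__":
    print(build_pwid(
        "archive.org",
        datetime(2016, 1, 22, 11, 20, 29, tzinfo=timezone.utc),
        "page",
        "http://www.dr.dk",
    ))
```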
