Web Harvesting Pilot Project

As reported in the October issue of FDLP Connection, the Office of Archival Management (OAM) is responsible for harvesting and archiving Web-based Federal agency information for the Federal Depository Library Program (FDLP). Harvesting and archiving Web publications have been an integral part of our work since 1996.

Since then, the nature of Federal agency Web content has changed dramatically. No longer are Web sites just places to post reports and other printable documents. Today, they exhibit a diversity of file formats and a mixture of complex structures and content.

The exponential increase in Federal agency Web content, coupled with its changing nature, has created challenges for GPO's harvesting and archiving processes. OAM is looking for ways to leverage technology and further automate our Web harvesting and archiving efforts in order to better meet the increased need to acquire Federal online content for permanent public access.

We need to expand our current Web archiving efforts to be able to harvest working facsimile copies of Web sites, as much as current technology allows, and archive the harvested content for long-term preservation and access. In keeping with our commitment to provide access to the harvested content, catalog records with links to the harvested sites need to be created and included in the Catalog of U.S. Government Publications (CGP).

In late 2011, Library Services and Content Management (LSCM) and OAM staff developed a pilot project to test an implementation of the Internet Archive's Heritrix-based Archive-It, which is a subscription-based Web harvesting and archiving service. In developing the pilot project, the project team networked with Web harvesting teams from the Library of Congress, the National Archives and Records Administration, and the University of North Texas Library (a GPO library partner already well-known for establishing the CyberCemetery and its leadership in digital preservation initiatives).

While each of these GPO partners and more than 228 libraries and agencies had proven the basic concept and viability of Heritrix and Archive-It, the Web Harvesting Task Force was charged with determining whether Archive-It would work within LSCM's operational budget and staffing parameters.

Test crawls were conducted on ten test Web sites, and the resulting facsimile harvested copies were reviewed for performance. MARC records were created in the CGP by performing a crosswalk from the Archive-It Dublin Core metadata to MARC. Links in the CGP MARC records were created to the archived content on the Internet Archive's Wayback server for each harvested Web site.
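The crosswalk described above can be illustrated with a minimal sketch. This is a simplified, hypothetical mapping for illustration only; the field names, MARC tag choices, and example values are assumptions, not GPO's actual crosswalk specification.

```python
# A minimal, illustrative Dublin Core -> MARC crosswalk.
# The tag choices below are simplified assumptions for illustration;
# they are not GPO's production mapping.

DC_TO_MARC = {
    "title": "245",       # title statement
    "creator": "110",     # corporate main entry
    "date": "260",        # publication/date information
    "identifier": "856",  # link to the archived copy on the Wayback server
}

def crosswalk(dc_record: dict) -> dict:
    """Map Dublin Core elements to MARC tags, keeping values as plain strings."""
    marc = {}
    for dc_field, marc_tag in DC_TO_MARC.items():
        if dc_field in dc_record:
            marc[marc_tag] = dc_record[dc_field]
    return marc

# Hypothetical harvested-site metadata (placeholder values):
example = {
    "title": "Example Federal Commission Web site",
    "creator": "Example Federal Commission",
    "date": "2012",
    "identifier": "https://wayback.archive-it.org/...",
}
print(crosswalk(example))
```

In practice a crosswalk of this kind also sets indicators, subfield codes, and fixed fields required by MARC; the sketch shows only the core element-to-tag mapping.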

Having successfully achieved the proof of concept, Laurie Hall, LSCM's Director of Library Technical Information Services, charged the Task Force to:

  • Form a Web Archiving Team and develop a project plan toward a full implementation of a Web harvesting and archiving service.
  • Develop the modifications needed to LSCM workflows for acquisition, cataloging, classification, archiving, and access, to include whole Web sites as well as individual publications.
  • Develop cost and staff resource plans for continuation and expansion of the project, including a budget for FY 2013.

Implementation of the Internet Archive's Archive-It Web harvesting and hosting service has allowed us to create a new acquisitions and processing model. It enables LSCM staff to focus on site selection, scope determination, metadata creation, and cataloging – activities that librarians and archivists do best – while outsourcing the more expensive IT-based operation of the Heritrix Web harvester, along with the archiving and hosting of harvested content, to Archive-It. This model also provides us with a scalable program that can grow to meet increased harvesting demands.

Members of the new Web Archiving Team reported on the success of their work at the 2012 Depository Library Council Meeting and Federal Depository Library Conference. For fiscal year 2013, the Web Archiving Team is focusing on SuDoc Y Class content from Federal agency commissions, as well as special requests to harvest sites received through askGPO and Lost Docs.

The Archive-It Web harvesting service is just one of many tools LSCM will use to harvest Web-based Government publications for the FDLP.

If you have questions or comments about the Web Archiving Team's work, or suggestions of Federal Web sites to harvest, please contact us.