Harvesting Pilot Team
- Category: Get to Know GPO
- Published: March 01 2013
Federal agencies increasingly publish their information on the Web rather than in print. At one time, most Federal agency Web sites were simply places to post printable PDF documents; these sites now contain dynamic content such as video and Java, along with a design structure that integrates information into the site itself. To capture this content, Library Services and Content Management (LSCM) initiated the Web Harvesting Pilot. To expand GPO’s Web archiving efforts and ensure permanent public access to U.S. Government information, a team of LSCM staff members was assembled to:
- Design a Web archiving workflow that focuses on archiving facsimiles of Web sites
- Archive and preserve the content for public access and use
- Provide metadata and catalog records for content in the Catalog of U.S. Government Publications (CGP).
The Web Harvesting Pilot Team
David Walls, Preservation Librarian, 2.5 years with GPO
David Walls is the Chair of GPO’s task force that is pursuing a life-cycle management approach to harvesting, cataloging, and archiving Web content based on Heritrix and the Internet Archive’s Archive-It hosted Web archiving service.
Dory Bower, Archive Specialist, 2 years with GPO
Dory Bower identifies and selects Government Web sites for harvesting. She harvests those sites using the Heritrix Web harvester. Dory is also responsible for evaluating the harvesting workflow and developing quality assurance procedures and training.
Valerie Furino, Supervisory Librarian for Collection Development and Classification, 10 years with GPO
As the Pilot evolves and as LSCM implements the project full-scale, Valerie’s staff members will integrate this method of harvesting into their daily duties.
Fang Gao, Supervisory Librarian for Bibliographic Control, 2 years with GPO
Fang Gao, working closely with LSCM Cataloging Librarian Liselle Drake, creates Dublin Core records for harvested Web sites. They create metadata at two levels of granularity, the collection level and the seed level, and the resulting records are RDA-compatible.
Stacey Kinsel, Technical Services Librarian, 2 years with GPO
Stacey Kinsel identifies and selects Government Web sites for harvesting. She also harvests those sites using the Heritrix Web harvester.
According to David Walls, “Using the Internet Archive’s Archive-It Web harvesting and hosting service allowed LSCM to create a more efficient workflow, while outsourcing the more expensive IT-based operation of the Heritrix Web harvester and the archiving and hosting of harvested content.”
Stacey Kinsel also commented, “Visiting a Government Web site one day only to return the next to find a redesign, different structure, or missing content can be very frustrating. Web harvesting is important for maintaining permanent public access to Federal Government information.”
GPO is a member of the International Internet Preservation Consortium (IIPC), an active international group dedicated to Web harvesting using the Internet Archive’s Heritrix Web harvester. Through participation in the IIPC, the team has also been able to collaborate with peer institutions such as the Library of Congress and the National Archives and Records Administration on best practices and to coordinate the harvesting of the Federal Web to avoid duplication of effort.