Environmental Protection Agency (EPA) Web Publication Harvesting Pilot Project

Web Publication Harvesting Pilot Project White Paper

GPO is pleased to announce the release of a white paper on the results of the completed Web Harvesting pilot project to capture official Environmental Protection Agency (EPA) publications in scope of GPO's information dissemination programs.

The white paper places the results of the pilot in context. It includes a summary of the analysis done on the work performed, an assessment of lessons learned, and planned next steps for further development of the harvesting function, which is to be implemented during Release 2 of GPO's Future Digital System (FDsys).

As a first step in learning about automated Web publication discovery and harvesting technologies and methodologies, GPO contracted with two private companies on this pilot. We collaborated to develop rules and instructions that would determine whether EPA content discovered was in scope for GPO's dissemination programs. Three separate crawls were conducted on the sites over a six-month period, and harvester rules and instructions were refined and revised between crawls.

Automated publication harvesting was a topic of discussion at the spring 2007 Depository Library Council Meeting (see the session handout, PDF, 209 KB).

Sample Publications from GPO's Web Harvesting Pilot

LSCM staff processed a sample of 300 publications harvested during the EPA Pilot Project. The purpose of working through this sample was to determine workflow and staffing implications as well as to estimate the amount of time that would be required to process all the publications acquired during the EPA Pilot Project.

LSCM tested two mechanisms for making the publications found to be within scope of the FDLP accessible. The majority of publications in the sample were made accessible through cataloging records in the Catalog of U.S. Government Publications (CGP). Monographs were cataloged using the new brief bibliographic record format, while serials were cataloged following the CONSER abridged standard.

Following the procedures established during the brief bibliographic records project, the brief records for the monograph publications included in the sample were created directly in the CGP and have not been exported to OCLC. Given the large number of monographs harvested during the EPA Pilot Project, the brief bibliographic records were not forwarded to the Cataloging Section for enhancement. To allow for an additional searching mechanism, an added entry for the Environmental Protection Agency was included in each record.

Currently, LSCM assigns PURLs to live content on the publishing agency’s Web site. PURLs are redirected to GPO’s archived copy only if the live site is no longer available. As part of this project, LSCM is reconsidering this policy. While processing the sample, LSCM directed a portion of the PURLs to the copy of the publication archived on GPO’s server rather than to the live version.
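
The two PURL policies described above can be illustrated with a minimal sketch. This is not GPO's implementation: the record fields, the function name, the `prefer_archive` flag, and the example URLs are all hypothetical, chosen only to show how the current policy (live copy first, archive as fallback) differs from the alternative tested on part of the sample (archive first).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PurlRecord:
    """Hypothetical record pairing one PURL with its two possible targets."""
    purl: str
    live_url: str      # copy on the publishing agency's Web site
    archive_url: str   # copy archived on GPO's server


def resolve_target(record: PurlRecord, live_site_available: bool,
                   prefer_archive: bool = False) -> str:
    """Return the URL a PURL should redirect to.

    prefer_archive=False models the current policy: point at the live copy
    and fall back to the archived copy only when the live site is gone.
    prefer_archive=True models the alternative tested in the sample:
    always point at the archived copy.
    """
    if prefer_archive or not live_site_available:
        return record.archive_url
    return record.live_url


# Placeholder URLs for illustration only.
example = PurlRecord(
    purl="https://purl.example.org/doc/1",
    live_url="https://agency.example.org/report.pdf",
    archive_url="https://archive.example.org/report.pdf",
)
```

Under the current policy, `resolve_target(example, live_site_available=True)` returns the live URL, while passing `prefer_archive=True` returns the archived URL regardless of whether the live site is still up.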

At the request of the Depository Library Council, LSCM also sought to determine whether a mechanism could enable public access to Web-harvested content while these publications are in the queue for brief bibliographic records. LSCM posted a small portion of the sample to GPO Access using a browse table. Publications made accessible through this mechanism were later cataloged in the CGP.

An analysis of the time required to process this sample from the results of the EPA harvesting pilot project is available here (PDF, 73 KB).

To review the sample publications:

GPO appreciates the input from the 78 respondents who reviewed and submitted comments on the processing of the 300 publications from the results of the EPA Pilot Project. View a summary of the comments received.
