Web Archiving

Project Description

The Federal Depository Library Program (FDLP) Web Archive comprises selected U.S. Government Web sites, harvested and archived in their entirety by the U.S. Government Publishing Office (GPO) to create working “snapshots” of the Web sites at various points in time. The aim is to provide permanent public access to Federal agency Web content. GPO harvests and archives the Web sites with Archive-It, a subscription-based Web harvesting and archiving service offered by the Internet Archive.

Ways to Access the Archived Sites

Catalog of U.S. Government Publications

  • Bibliographic records for the archived Web sites, which describe the sites and link to them via PURL (Persistent URL), are searchable and accessible through the Catalog of U.S. Government Publications (CGP). (Resolving a PURL is sketched after this list.)
  • To view all of the records in the FDLP Web Archive, search for the term “webarch” from the Basic search page.
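
A PURL resolves like any other URL, via an HTTP redirect maintained by GPO. As a minimal sketch in Python, assuming the requests library (the PURL path below is a hypothetical placeholder, not a real record; real PURLs appear in CGP bibliographic records):

    # Follow a PURL redirect to its current target.
    # The PURL path is a hypothetical placeholder.
    import requests

    purl = "https://purl.fdlp.gov/GPO/gpo00000"  # hypothetical
    response = requests.get(purl, allow_redirects=True, timeout=30)
    print("Resolves to:", response.url)
    print("Status:", response.status_code)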

Archive-It Website

The archived Web sites can also be searched and accessed through the FDLP Web Archive Collection page on the Archive-It Web site.

Frequently Asked Questions

Why is GPO archiving Federal Web sites?

Federal Web sites have become an important way for agencies to communicate information to the public. However, Web content often appears or disappears without warning. Archiving these Web sites is part of fulfilling GPO’s mission to provide permanent public access to Government information. The content is made available in accordance with Title 44 of the U.S. Code.

How does Web archiving with Archive-It work?

Archive-It uses a robotic crawler called Heritrix to gather content. Heritrix searches and captures an entire content-rich Web site, creating a working facsimile of the site as it appeared at a particular day and time. After the first crawl, the Web site is periodically re-crawled: Heritrix captures the entire site again, creating a new working facsimile of the site as it appeared at the day and time of the re-crawl. All the facsimiles of the Web site are then accessible through the Wayback Machine, the Internet Archive’s digital archive of the World Wide Web. Links to our content in the Wayback Machine are available in the CGP and on the FDLP Web Archive collection page on the Archive-It Web site.
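
Individual captures can also be looked up programmatically. As a minimal sketch in Python, assuming the requests library, the Wayback Machine’s public availability API returns the capture of a URL closest to a requested date; the target site and date here are just illustrative examples:

    # Query the Wayback Machine's public availability API for the
    # capture of a URL closest to a given date (YYYYMMDD).
    import requests

    def closest_snapshot(url, timestamp):
        resp = requests.get(
            "https://archive.org/wayback/available",
            params={"url": url, "timestamp": timestamp},
            timeout=30,
        )
        resp.raise_for_status()
        closest = resp.json().get("archived_snapshots", {}).get("closest", {})
        return closest.get("url") if closest.get("available") else None

    # Illustrative example: the capture of archives.gov nearest to 2015-01-01.
    print(closest_snapshot("www.archives.gov", "20150101"))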

What’s the FDLP workflow for harvesting and archiving Web sites with Archive-It?

    1. Determine if the Web site is in the scope of the FDLP.
    2. Decide if Archive-It is the best way to capture and archive the site.
    3. Notify the agency of the harvesting.
    4. Manually review the Web site to create a seed list of URLs to include in the crawl.
    5. Run and QA a test crawl, looking for any out-of-scope or missing content (a simplified scope check is sketched below).
    6. Run and QA the Archive crawl.
    7. Run any necessary patch crawls.
    8. Create the catalog record for the Web site.

Steps 3-7 are repeated for each re-crawl of the Web site.
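
As a simplified illustration of the scope check performed in steps 4 and 5, here is a Python sketch; the seed list and crawled URLs are hypothetical, and Archive-It’s real scoping rules are considerably more sophisticated:

    # Flag crawled URLs whose host is not covered by any seed URL.
    from urllib.parse import urlparse

    seeds = [
        "https://www.example-agency.gov/",           # hypothetical seed
        "https://reports.example-agency.gov/pubs/",  # hypothetical seed
    ]
    crawled = [
        "https://www.example-agency.gov/about/",
        "https://cdn.thirdparty.example.com/banner.js",  # out of scope
    ]

    seed_hosts = {urlparse(s).hostname for s in seeds}

    for url in crawled:
        verdict = "in scope" if urlparse(url).hostname in seed_hosts else "OUT OF SCOPE"
        print(f"{verdict:>12}: {url}")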

Where is the harvested data stored?

All the harvested data is stored on Archive-It’s servers.

Who owns the harvested data?

GPO owns all the harvested data. The data is in the public domain.

What file format is used?

Archive-It uses the WARC (Web ARChive) file format, which conforms to ISO 28500:2009.
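
WARC files can be inspected with standard open-source tooling. A minimal sketch in Python, assuming the third-party warcio library (pip install warcio) and a hypothetical local file name:

    # List the URLs captured as HTTP responses in a WARC file.
    from warcio.archiveiterator import ArchiveIterator

    with open("example.warc.gz", "rb") as stream:  # hypothetical file name
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":  # one per captured URL
                uri = record.rec_headers.get_header("WARC-Target-URI")
                ctype = record.http_headers.get_header("Content-Type")
                print(uri, ctype)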

How is the FDLP Web Archive data backed up?

The Internet Archive (the parent organization of Archive-It) creates and keeps a primary copy and a backup copy, both stored at the Internet Archive Data Center.

Can Archive-It capture and play back video?

Videos in simpler formats, such as WMV or MPEG-4, can easily be captured and played back; results vary with more complex formats. Heritrix can capture videos from other formats and platforms, such as Flash or Vimeo, but playback can vary depending on how the video is encoded or embedded on a page. Archive-It is continuously working to improve video playback for all formats and platforms, and we regularly see enhancements being made.

What Federal Web sites is GPO archiving?

We started with Web sites in the Y3 SuDocs class, which is used for commissions, committees, and independent agencies. We chose this class stem because many of these agencies, commissions, and committees were disseminating information solely through their Web sites.

Now, we are harvesting and archiving other sites. We are looking at agencies to determine the number of online publications they produce, and we plan to move from the smaller sites to the larger sites. Additionally, we may harvest Web sites based on special request or if there is high interest in a Web site because, for example, it concerns a current topic of interest or the agency is doing work that is currently in the public eye.

Are there any Federal Web sites GPO will not archive?

To avoid duplicative effort, we are not harvesting or archiving anything in GPO’s Federal Digital System (FDsys), anything already archived by other Archive-It partners, or anything already archived by our FDLP partners who are digitizing specific content from their FDLP collections (FDLP Partnerships). We are also not harvesting anything outside the scope of the FDLP.

Additionally, some Web sites, such as databases or Web sites where content is generated “on the fly” by a content management system, cannot be properly harvested or archived by Archive-It. In these instances, we try to create partnerships with the providing agencies to ensure permanent public access to their Web sites.

Can I recommend a Federal Web site to be archived?

Absolutely! Please contact us through askGPO or Document Discovery to let us know.

After a Web site is harvested and archived, how frequently is it re-crawled for new content?

In the beginning, our focus was on building the collection; now we have increased our efforts to maintain and enhance what we have built. After a crawl is complete, the site is analyzed to determine how often it is updated, and a re-crawl frequency is assigned accordingly: annual, biannual, or quarterly. We do not run re-crawls automatically, but follow a workflow much like the one for a new site. For every re-crawl, the site is fully analyzed to evaluate whether it has changed, whether the seed list needs to be updated, and whether any modifications need to be made before the new crawl is run.
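
As a purely hypothetical Python illustration of that assignment (the thresholds are invented for the example and are not GPO policy):

    # Map an observed update cadence to one of the three re-crawl
    # frequencies named above. Thresholds are invented for illustration.
    def recrawl_frequency(updates_per_year):
        if updates_per_year >= 12:
            return "quarterly"
        if updates_per_year >= 2:
            return "biannual"
        return "annual"

    for n in (1, 4, 24):
        print(n, "updates/year ->", recrawl_frequency(n))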

When GPO harvests the Web sites, does GPO ever get copyrighted, non-government, or other extraneous material?

The test crawl function of Archive-It eliminates this problem. If a test crawl brings back undesired results, you can modify the seed list accordingly to make sure resources aren’t wasted capturing unwanted material during the actual Archive crawl.

How does GPO handle copyrighted, proprietary or Personally Identifiable Information (PII) in an archived website?

Because our main collection development practice is to archive content that would traditionally be included in the Federal Depository Library Program, we only seek to archive Federal Government information that is publicly available. We would never intentionally harvest any copyrighted material, proprietary material, or PII. If you suspect that we have unintentionally harvested such content, please contact us through askGPO with the details, including the Wayback URL, and we will review the content for possible removal following Superintendent of Documents policy.

At what level does GPO catalog the Web sites?

The granularity of the cataloging depends on the content of the Web site. Some Web sites warrant individual catalog records for individual seeds of the Web site, while other Web sites only need one record that links to the full site.

If I think that a harvested Web site needs more granular cataloging, can I suggest that?

Absolutely! Please contact us through askGPO to let us know.

Does GPO update the catalog/bibliographic record every time a Web site is re-crawled?

After a re-crawl, we look at the Web site to see if there were any major changes to the content. If there were major changes, we update the record.

Are Web sites in the FDLP Web Archive given SuDocs classes and item numbers?

Yes. The Web sites are classified under the agency’s general publications category from the List of Classes, with INTERNET added to the end of the class. The archived Web sites are assigned the regular item number that accompanies each agency’s general publications class.

For example, the SuDoc class for “NARAtions: the blog of the United States National Archives” is AE 1.102:INTERNET, and the Item Number is 0569-B-02 (online).
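
The convention is mechanical, as this small Python sketch of the NARA example shows:

    # Build an FDLP Web Archive SuDoc class from an agency's general
    # publications class stem, per the convention described above.
    def web_archive_class(general_publications_class):
        return f"{general_publications_class}:INTERNET"

    print(web_archive_class("AE 1.102"))  # -> AE 1.102:INTERNET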

How can FDLP libraries obtain the bibliographic records in MARC format for the Web sites in the FDLP Web Archive?

Because records for the Web sites are created in the normal workflow, libraries can obtain these records the same way they obtain all their other CGP records.

Why are records for the FDLP Web Archive cataloged?

These Web sites are cataloged in the CGP because they are in scope of GPO’s Cataloging and Indexing Program, which aims to develop a comprehensive and authoritative national bibliography of U.S. Government publications, to increase the visibility and use of Government information products, and to develop a premier destination for information searchers.

What do Archive-It’s error messages mean?

There are two kinds of error messages that users might encounter:

    1. “Not in Archive” – This means the content underlying the selected link was never captured.
    2. “Error” – This most likely indicates a problem with the local media player software; check whether the software needs updating.

What’s the difference between the FDLP Web Archive and GPO’s other archiving and cataloging tools?

GPO’s Permanent Server (Permanent) is our longest-standing archive and the basis for the bulk of the FDLP Electronic Collection. Permanent is used to store archived versions of monographs, serials, and some video and audio recordings. Publications on Permanent are harvested manually by GPO staff or, at an archivist’s prompt, by Teleport Pro, a robotic crawler that captures simpler formats, such as PDF and HTML files.

FDsys is composed of deposited content ingested by agreement with agencies. It includes documents from all three branches of Government, including the majority of the Congressional publications that we catalog. Unlike Permanent, FDsys lets users directly search and browse the archived content, as well as access the metadata in XML format.

The Federal Depository Library Program Web Archive is used exclusively for permanent access to entire Federal agency Web sites. The Web archive was created through a partnership between GPO and Archive-It. The content is gathered using the robotic crawler Heritrix, which searches and captures entire content-rich Web sites. The harvested sites are stored on Archive-It’s servers and are searchable and accessible through the CGP as well as the Archive-It Web site.

The CGP itself presents no full text, but rather MARC records with PURLs that link to digital content stored in any of these repositories.


GPO has conducted two webinars that offer more information about the FDLP Web Archive.

Contact Us

For questions or suggestions about our Web archiving program, contact us through askGPO.