FDLP

Web Archiving

Project Description

The FDLP Web Archive provides point-in-time captures of U.S. Federal agency websites, preserving the functionality of each site to the extent possible. The aim is to provide permanent public access to Federal agency web content. GPO harvests and archives the websites with Archive-It, a subscription-based web harvesting and archiving service offered by the Internet Archive.

Ways to Access the Archived Sites

Catalog of U.S. Government Publications

Archive-It Website

The archived websites can also be searched and accessed through the FDLP Web Archive collection page on the Archive-It website.

Frequently Asked Questions

Why is GPO archiving Federal websites?

Federal websites have become an important way that agencies communicate information to the public. However, web content often appears or disappears without warning. Archiving these websites is part of fulfilling GPO’s mission to provide permanent public access to Government information. The content is made available in accordance with Title 44 of the U.S. Code.

How does Web archiving with Archive-It work?

Archive-It uses a combination of crawling tools developed by the Internet Archive, including Heritrix and Umbra, to gather content. The crawler crawls and captures an entire content-rich website, creating a working facsimile of the site as it appeared at that point in time. After the first crawl, the website is periodically re-crawled; the crawler captures the entire website again, creating a new working facsimile of the site as it appeared on the day and time of the re-crawl. All facsimiles of the website are then accessible through the Wayback Machine, the Internet Archive’s digital archive of the World Wide Web. Links to our content in the Wayback Machine are available in the CGP and on the FDLP Web Archive collection page on the Archive-It website.
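
For illustration, captures collected this way can be looked up programmatically through the Wayback Machine’s public availability API. The sketch below is a minimal Python example of that lookup, not part of GPO’s workflow, and the URL queried is a placeholder.

    import json
    import urllib.parse
    import urllib.request

    def closest_capture(url, timestamp=None):
        """Ask the Wayback Machine availability API for the capture
        closest to a YYYYMMDD timestamp (the latest capture if omitted)."""
        params = {"url": url}
        if timestamp:
            params["timestamp"] = timestamp
        query = urllib.parse.urlencode(params)
        with urllib.request.urlopen(
                "https://archive.org/wayback/available?" + query) as resp:
            data = json.load(resp)
        snapshot = data.get("archived_snapshots", {}).get("closest")
        return snapshot["url"] if snapshot else None

    # Placeholder example: find the capture of a site nearest to mid-2015.
    print(closest_capture("www.example.gov", "20150601"))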

Why Archive-It?

While other web archiving software is available that could potentially meet some of our needs, a service that provides technical support, training, storage, and a user interface in one package is ideal for us.

What is our workflow for harvesting and archiving using Archive-It?

    1. Determine if the website is in the scope of the FDLP.
    2. Determine whether Archive-It is the best tool for archiving the website and whether the site can feasibly be archived.
    3. Notify the agency of intent to harvest data from their website.
    4. Manually review the website to create a seed list of URLs to include in the crawl (a sketch of a seed-list pre-check appears after this list).
    5. Run and QA a test crawl, looking for any out-of-scope or missing content.
    6. Run additional test crawls as needed to ensure the production crawl will be effective and efficient; multiple test crawls are usually required.
    7. Run and QA the production crawl.
    8. Run any necessary patch crawls.
    9. Create a record for the website in the CGP.

Steps 3-7 are repeated for each re-crawl of the website.
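
As a rough illustration of step 4, a seed list can be sanity-checked before a test crawl by verifying that each seed URL still responds. The helper below is a hypothetical Python sketch, not part of Archive-It or GPO’s tooling, and the seed URLs shown are placeholders.

    import urllib.request

    def check_seeds(seed_urls, timeout=10):
        """Report the HTTP status (or error) for each seed URL so broken
        seeds can be corrected before a test crawl is submitted."""
        results = {}
        for url in seed_urls:
            try:
                req = urllib.request.Request(url, method="HEAD")
                with urllib.request.urlopen(req, timeout=timeout) as resp:
                    results[url] = resp.status
            except Exception as exc:
                results[url] = str(exc)
        return results

    # Placeholder seeds for illustration only.
    seeds = ["https://www.example.gov/", "https://www.example.gov/reports/"]
    for url, status in check_seeds(seeds).items():
        print(url, "->", status)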

Where is the harvested data stored?

All the harvested data is stored on Archive-It’s servers.

Who owns the harvested data?

GPO owns all the harvested data. The data is in the public domain.

What file format is used?

Archive-It uses the WARC (Web ARChive) file format, which conforms to ISO 28500:2009.
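
For illustration, WARC files produced by any conforming tool can be read with the open-source warcio Python library. This is a minimal sketch assuming a local file named example.warc.gz; it is not a GPO or Archive-It tool.

    # Requires: pip install warcio
    from warcio.archiveiterator import ArchiveIterator

    # Print the URL and capture date of each response record in a WARC file.
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                print(
                    record.rec_headers.get_header("WARC-Target-URI"),
                    record.rec_headers.get_header("WARC-Date"),
                )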

How is the FDLP Web Archive data backed up?

The Internet Archive (the parent organization of Archive-It) creates and keeps a primary copy and a backup copy, both stored at the Internet Archive Data Center.

Can Archive-It capture and play back video?

Videos in simpler formats, such as WMV or MPEG-4, can easily be captured and played back; results vary with more complex formats. Archive-It crawling technology can capture videos in other formats and platforms, such as Flash or Vimeo, but playback can vary depending on the complexity of the format and how the video is embedded on a page. Archive-It is continuously working to improve video playback for all formats and platforms, and we regularly see enhancements being made.

What Federal websites is GPO archiving?

We started with websites in the Y3 SuDocs class, which is used for commissions, committees, and independent agencies. We began with this class stem because many of these agencies, commissions, and committees were disseminating information solely through their websites.

Now, we are harvesting and archiving other sites. We are assessing agencies to determine the number of online publications they produce, and we plan to move from the smaller sites to the larger sites. Additionally, we may harvest a website by special request or when there is high interest in it, for example because it concerns a current topic or the agency is doing work that is in the public eye.

Are there any Federal websites GPO will not archive?

To avoid duplicative effort, we are not harvesting or archiving anything in GPO’s govinfo, anything already archived by other Archive-It partners, or anything already archived by our FDLP partners who are digitizing specific content from their FDLP collections (FDLP Partnerships). We also are not harvesting anything outside the scope of the FDLP.

Additionally, some websites, such as databases or websites where content is generated “on the fly” by a content management system, cannot be properly harvested or archived by Archive-It. In these instances, we try to create partnerships with the providing agencies to ensure permanent public access to their web content.

Can I recommend a Federal website to be archived?

Absolutely! Please contact us through askGPO to let us know.

After a website is harvested and archived, how frequently is it re-crawled for new content?

In the beginning our focus was on building the web archive; since then, our efforts have shifted to maintaining and enhancing what we have built. After a crawl is complete, the site is analyzed to determine how often it is updated, and a re-crawl frequency is assigned accordingly: annual, biannual, or quarterly. We do not run re-crawls automatically; we follow a workflow very much like the one for a new site. For every re-crawl the site is fully analyzed to evaluate whether it has changed, whether the seed list needs to be updated, and whether any other modifications are needed before the new crawl is run.

When GPO harvests the websites, does GPO ever get copyrighted, non-government, or other extraneous material?

The test crawl function of Archive-It addresses this problem. If a test crawl brings back undesired results, the seed list can be modified accordingly so that resources are not wasted capturing unwanted material during the production crawl.

How does GPO handle copyrighted, proprietary or Personally Identifiable Information (PII) in an archived website?

Because our main collection development practice is to archive content that would traditionally be included in the Federal Depository Library Program, we only seek to archive Federal Government information that is publicly available. We would never intentionally harvest any copyrighted or proprietary material or PII. If you suspect that we have unintentionally harvested such content, please contact us through askGPO and provide the details, including the Wayback URL, and we will review the content for possible removal following Superintendent of Documents policy.

At what level does GPO catalog the websites?

The granularity of the cataloging depends on the content of the website. Some websites warrant individual catalog records for individual seeds of the website, while other websites only need one record that links to the full site.

If I think that a harvested website needs more granular cataloging, can I suggest that?

Absolutely! Please contact us through askGPO to let us know.

Does GPO update the catalog/bibliographic record every time a website is re-crawled?

After a re-crawl, we look at the website to see if there were any major changes to the content. If there were major changes, we update the record.

Are websites in the FDLP Web Archive given SuDocs classes and item numbers?

Yes. The websites are classified under the agency’s general publications category from the List of Classes, and INTERNET is added to the end of the class. The archived websites are assigned the regular item number that accompanies the general publications class for each agency.

For example, the SuDoc class for “NARAtions: the blog of the United States National Archives” is AE 1.102:INTERNET, and the Item Number is 0569-B-02 (online).
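
As a toy illustration of this convention, the archived-site class is simply the agency’s general publications class with the INTERNET suffix appended. The helper below is hypothetical; the class value comes from the NARA example above.

    def web_archive_sudoc(general_publications_class: str) -> str:
        """Build the SuDoc class used for an archived website by appending
        the INTERNET suffix to an agency's general publications class."""
        return f"{general_publications_class}:INTERNET"

    # NARA's general publications class stem, from the example above.
    print(web_archive_sudoc("AE 1.102"))  # -> AE 1.102:INTERNET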

How can FDLP libraries obtain the bibliographic records in MARC format for the websites in the FDLP Web Archive?

Because records for the websites are created in the normal workflow, libraries can obtain these records the same way they obtain all their other CGP records.

Why are records for the FDLP Web Archive cataloged?

These websites are cataloged in the CGP because they are in scope of GPO’s Cataloging and Indexing Program, which aims to develop a comprehensive and authoritative national bibliography of U.S. Government publications, to increase the visibility and use of Government information products, and to develop a premier destination for information searchers.

What do Archive-It’s error messages mean?

There are two kinds of error messages that users might encounter:

    1. “Not in Archive” – This means the content underlying the selected link was never captured.
    2. “Error” – This most likely indicates a problem with the local media player software; users should check whether their software needs updating.

What’s the difference between the FDLP Web Archive and GPO’s other archiving and cataloging tools?

GPO’s Permanent Server (Permanent) is our longest-standing archive and the basis for the bulk of the FDLP Electronic Collection. Permanent stores archived versions of monographs, serials, and some video and audio recordings. All publications on Permanent are harvested manually by GPO staff or by an archivist-initiated run of Teleport Pro, a robotic crawler that captures simpler formats, such as PDF and HTML files.

govinfo is composed of deposited content ingested under agreements with agencies. govinfo includes documents from all three branches of Government, among them the majority of the Congressional publications that we catalog. Unlike Permanent, govinfo lets users directly search and browse content, as well as access the metadata in XML format.

The Federal Depository Library Program Web Archive is used exclusively for permanent access to entire Federal agency websites. The web archive was created through a partnership between GPO and Archive-It. The content is gathered using the robotic crawler Heritrix, which crawls and captures entire content-rich websites. The harvested sites are stored on Archive-It’s servers and are searchable and accessible through the CGP as well as the Archive-It website.

The CGP itself presents no full text; rather, it provides MARC records with PURLs that link to digital content stored in any of these repositories.

Webinars

GPO has conducted a number of webinars and presentations that offer more information about the FDLP Web Archive.

Contact Us

For questions or suggestions about the FDLP Web Archive, please contact us through askGPO.
