- Last Updated: November 27, 2015
- Published: July 21, 2014
The Federal Depository Library Program (FDLP) Web Archive comprises selected U.S. Government Web sites, harvested and archived in their entirety by the U.S. Government Publishing Office (GPO) to create working “snapshots” of the Web sites at various points in time. The aim is to provide permanent public access to Federal agency Web content. GPO harvests and archives the Web sites with Archive-It, a subscription-based Web harvesting and archiving service offered by the Internet Archive.
Ways to Access the Archived Sites
Catalog of U.S. Government Publications
- Bibliographic records for the archived Web sites, which describe the sites and link to them via PURL (Persistent URL), are searchable and accessible through the Catalog of U.S. Government Publications (CGP).
- To view all of the records in the FDLP Web Archive, search for the term “webarch” from the Basic search page.
The archived Web sites can also be searched and accessed through the FDLP Web Archive Collection page on the Archive-It Web site.
Frequently Asked Questions
- Why is GPO archiving Federal Web sites?
- How does Web archiving with Archive-It work?
- What’s the FDLP workflow for harvesting and archiving Web sites with Archive-It?
- Where is the harvested data stored?
- Who owns the harvested data?
- What file format is used?
- How is the FDLP Web Archive data backed up?
- Can Archive-It capture and play back video?
- What Federal Web sites is GPO archiving?
- Are there any Federal Web sites GPO will not archive?
- Can I recommend a Federal Web site to be archived?
- After a Web site is harvested and archived, how frequently is it re-crawled for new content?
- When GPO harvests the Web sites, does GPO ever get copyrighted, non-government, or other extraneous material?
- How does GPO handle copyrighted, proprietary, or Personally Identifiable Information (PII) in an archived Web site?
- At what level does GPO catalog the Web sites?
- If I think that a harvested Web site needs more granular cataloging, can I suggest that?
- Does GPO update the catalog/bibliographic record every time a Web site is re-crawled?
- Are Web sites in the FDLP Web Archive given SuDocs classes and item numbers?
- How can FDLP libraries obtain the bibliographic records in MARC format for the Web sites in the FDLP Web Archive?
- Why are records for the FDLP Web Archive cataloged?
- What do Archive-It’s error messages mean?
- What’s the difference between the FDLP Web Archive and GPO’s other archiving and cataloging tools?
Federal Web sites have become an important way that agencies communicate information to the public. However, Web content often appears or disappears without warning. Archiving these Web sites is part of fulfilling GPO’s mission to provide permanent public access to Government information. The content is made available in accordance with Title 44 of the US Code.
Archive-It uses a robotic crawler called Heritrix to gather content. Heritrix searches and captures an entire content-rich Web site, creating a working facsimile of the site as it appeared on that day and time. After the first crawl, the Web site is periodically re-crawled: Heritrix searches and captures the entire Web site again, creating a new working facsimile of the site as it appeared at the time of the re-crawl. All the facsimiles of the Web site are then accessible through the Wayback Machine, the Internet Archive’s digital archive of the World Wide Web. Links to our content in the Wayback Machine are available in the CGP and on the FDLP Web Archive collection page on the Archive-It Web site.
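Each crawl thus produces a snapshot addressable at a timestamped Wayback Machine URL. As an illustrative sketch (the 14-digit `YYYYMMDDhhmmss` timestamp form is standard Wayback Machine practice; the helper function name and the example timestamp are ours, not GPO's):

```python
def wayback_url(original_url, timestamp):
    """Build a Wayback Machine URL for one snapshot of a page.

    timestamp uses the Wayback Machine's 14-digit form: YYYYMMDDhhmmss.
    """
    return f"https://web.archive.org/web/{timestamp}/{original_url}"

# A hypothetical snapshot of the GPO home page taken January 1, 2015:
print(wayback_url("https://www.gpo.gov/", "20150101000000"))
# https://web.archive.org/web/20150101000000/https://www.gpo.gov/
```

In practice, users reach these snapshot URLs through the PURLs in the CGP or through the collection page on the Archive-It Web site rather than constructing them by hand.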
1. Determine if the Web site is in the scope of the FDLP.
2. Decide if Archive-It is the best way to capture and archive the site.
3. Notify the agency of the harvesting.
4. Manually review the Web site to create a seed list of URLs to include in the crawl.
5. Run and QA a test crawl, looking for any out-of-scope or missing content.
6. Run and QA the Archive crawl.
7. Run any necessary patch crawls.
8. Create the catalog record for the Web site.
Steps 3-7 are repeated for each re-crawl of the Web site.
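The seed list at the heart of this workflow defines what is in scope for a crawl. Archive-It's actual scoping rules are considerably richer, but the basic idea — a URL is captured only if it falls under one of the seeds — can be sketched as follows (the function name and example URLs are hypothetical):

```python
from urllib.parse import urlparse

def in_scope(url, seeds):
    """Return True if url falls under one of the seed URLs.

    A simple prefix rule: the URL must share a seed's host, and its
    path must begin with the seed's path. Real Archive-It scoping
    supports additional rules (e.g. expanding or constraining scope).
    """
    parsed = urlparse(url)
    for seed in seeds:
        s = urlparse(seed)
        if parsed.netloc == s.netloc and parsed.path.startswith(s.path):
            return True
    return False

seeds = ["https://www.example.gov/reports/"]
print(in_scope("https://www.example.gov/reports/2014/annual.pdf", seeds))  # True
print(in_scope("https://cdn.adnetwork.com/banner.js", seeds))              # False
```

QA of the test crawl then amounts to checking the captured URL list against expectations and adjusting the seeds before the Archive crawl is run.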
All the harvested data is stored on Archive-It’s servers.
GPO owns all the harvested data. The data is in the public domain.
Archive-It uses the WARC (Web ARChive) file format, which conforms to ISO 28500:2009.
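A WARC file is a sequence of records, each consisting of a text header block (a version line plus named fields) followed by a blank line and the captured payload. The minimal sketch below, using only the Python standard library, serializes and parses one such record; it is illustrative only — real WARC records carry additional mandatory fields (e.g. WARC-Record-ID) and digests, and the URI and date shown are made up:

```python
def build_warc_record(target_uri, payload):
    """Serialize a minimal, illustrative WARC 1.0 'response' record."""
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        "WARC-Date: 2015-11-27T00:00:00Z",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8")
    # A blank line separates the header block from the payload;
    # two CRLFs terminate the record.
    return head + b"\r\n\r\n" + payload + b"\r\n\r\n"

def parse_warc_headers(record):
    """Parse the header block of a WARC record into a dict."""
    head, _, _ = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    assert lines[0] == "WARC/1.0"
    return dict(line.split(": ", 1) for line in lines[1:])

rec = build_warc_record("http://www.example.gov/", b"<html>snapshot</html>")
hdrs = parse_warc_headers(rec)
print(hdrs["WARC-Type"], hdrs["Content-Length"])  # response 21
```

Because WARC is an open ISO standard, the archived data remains readable by tooling independent of Archive-It.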
The Internet Archive (the parent organization of Archive-It) creates and keeps a primary copy and a back-up copy that are stored at the Internet Archive Data Center.
Videos in simpler formats, such as WMV or MPEG-4, can easily be captured and played back; results vary with more complex formats. Heritrix can capture videos from other formats and platforms, such as Flash or Vimeo, but playback can vary depending on how the video is encoded or embedded on a page. Archive-It is continuously working to improve video playback for all formats and platforms, and we regularly see enhancements being made.
We started with Web sites in the Y3 SuDocs class. The Y3 class is used for commissions, committees, and independent agencies. We started with this class stem because many of these agencies, commissions, and committees were disseminating information solely through their Web sites.
Now, we are harvesting and archiving other sites. We are looking at agencies to determine the number of online publications they produce, and we plan to move from the smaller sites to the larger sites. Additionally, we may harvest Web sites based on special request or if there is high interest in a Web site because, for example, it concerns a current topic of interest or the agency is doing work that is currently in the public eye.
To avoid duplication of effort, we are not harvesting or archiving anything in GPO’s Federal Digital System (FDsys), anything already archived by other Archive-It partners, or anything already archived by our FDLP partners who are digitizing specific content from their FDLP collections (FDLP Partnerships). We are also not harvesting anything outside the scope of the FDLP.
Additionally, some Web sites, such as databases or Web sites where content is generated “on the fly” by a content management system, cannot be properly harvested or archived by Archive-It. In these instances, we try to create partnerships with the providing agencies to ensure permanent public access to their Web sites.
In the beginning our focus was on building the collection; now we have increased our efforts to maintain and enhance what we have built. After a crawl is complete, the site is analyzed to determine how often it is updated, and a re-crawl frequency is assigned accordingly: annual, biannual, or quarterly. We do not run re-crawls automatically, but follow a workflow much like the one for any new site. For every re-crawl the site is fully analyzed to evaluate whether it has changed, whether the seed list needs to be updated, and whether any other modifications are needed before the new crawl is run.
The test crawl function of Archive-It eliminates this problem. If a test crawl brings back undesired results, you can modify the seed list accordingly to make sure resources aren’t wasted capturing unwanted material during the actual Archive crawl.
The granularity of the cataloging depends on the content of the Web site. Some Web sites warrant individual catalog records for individual seeds of the Web site, while other Web sites only need one record that links to the full site.
After a re-crawl, we look at the Web site to see if there were any major changes to the content. If there were major changes, we update the record.
Yes. The Web sites are classified under the agency’s general publications category from the list of classes, and then INTERNET is added to the end of the class. The archived Web sites are assigned the regular item number that accompanies the general publications class for each agency.
For example, the SuDoc class for “NARAtions: the blog of the United States National Archives” is AE 1.102:INTERNET, and the Item Number is 0569-B-02 (online).
Because records for the Web sites are created in the normal workflow, libraries can obtain these records the same way they obtain all their other CGP records.
These Web sites are cataloged in the CGP because they are in scope of GPO’s Cataloging and Indexing Program, which aims to develop a comprehensive and authoritative national bibliography of U.S. Government publications, to increase the visibility and use of Government information products, and to develop a premier destination for information searchers.
There are two kinds of error messages that users might encounter:
- “Not in Archive”: the content underlying the selected link was never captured.
- “Error”: this most likely indicates a problem with the local media player software; the user should check whether the software needs updating.
GPO’s Permanent Server (Permanent) is our longest-standing archive and the basis for the bulk of the FDLP Electronic Collection. Permanent is used to store archived versions of monographs, serials, and some video and audio recordings. All the publications on Permanent are harvested manually by GPO staff or captured, at an archivist’s direction, by Teleport Pro, a robotic crawler that handles simpler formats, such as PDF and HTML files.
FDsys contains deposited content ingested by agreement with agencies. It includes documents from all three branches of Government, including the majority of the Congressional publications that we catalog. Unlike Permanent, FDsys lets users directly search and browse the archived content, as well as access the metadata in XML format.
The Federal Depository Library Program Web Archive is used exclusively for permanent access to entire Federal agency Web sites. The web archive was created through a partnership between GPO and Archive-It. The content is gathered using the robotic crawler Heritrix, which searches and captures entire content-rich Web sites. The harvested sites are then stored on Archive-It’s servers. The archived sites are searchable and accessible through the CGP as well as the Archive-It Web site.
The CGP itself presents no full text; rather, it provides MARC records with PURLs that link to digital content stored in any of these repositories.
GPO has conducted two webinars which offer more information about the FDLP Web Archive:
- Bringing Order to Chaos: Capturing and Preserving the Federal Web for Permanent Public Access
- Archiving & Cataloging Federal Agency Web Sites - GPO’s Web Archiving Project