Web Archiving

Project Description

The FDLP Web Archive provides point-in-time captures of U.S. Federal agency websites while preserving the functionality of each site to the extent possible. The aim is to provide permanent public access to Federal agency web content. GPO harvests and archives the websites with Archive-It, a subscription-based web harvesting and archiving service offered by the Internet Archive.

Ways to Access the Archived Sites

Catalog of U.S. Government Publications

Bibliographic records for the archived websites, which describe the sites and link to them via PURL (Persistent URL), are searchable and accessible through the Catalog of U.S. Government Publications (CGP) FDLP Web Archive Page.
To view all of the records in the FDLP Web Archive, go here.

Archive-It Website

The archived websites can also be searched and accessed through the FDLP Web Archive Collection page on the Archive-It website.

Frequently Asked Questions

How does Web archiving with Archive-It work?

Archive-It gathers content with a combination of crawling tools it has developed, including Heritrix and Umbra. The crawler searches and captures an entire content-rich website, creating a working facsimile of the site as it appeared on that day and time. After the first crawl, the website is periodically re-crawled: the crawler searches and captures the entire website again, creating a new working facsimile of the website as it appeared on the day and time of the re-crawl. All the facsimiles of the website are then accessible through the Wayback Machine, the Internet Archive’s digital archive of the World Wide Web. Links to our content in the Wayback Machine are available in the CGP and on the FDLP Web Archive collection page on the Archive-It website.
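Because each facsimile is tied to the day and time of its crawl, the capture time can usually be read from the address of the archived page. The following is a minimal sketch, assuming the Wayback-style URL carries a 14-digit timestamp (YYYYMMDDhhmmss, treated here as UTC) followed by the original URL. The function name `parse_capture`, the example address, and the collection number `0000` are placeholders for illustration, not values taken from the FDLP Web Archive.

```python
import re
from datetime import datetime, timezone

# Assumed layout: .../<14-digit timestamp>/<original URL>
_CAPTURE = re.compile(r"/(\d{14})/(https?://.+)$")


def parse_capture(wayback_url):
    """Return (capture_time, original_url) from a Wayback-style URL."""
    match = _CAPTURE.search(wayback_url)
    if match is None:
        raise ValueError(f"no 14-digit capture timestamp in {wayback_url!r}")
    stamp, original = match.groups()
    # Assumption: the timestamp is expressed in UTC.
    captured = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return captured, original


# Hypothetical example; 0000 stands in for a real collection number.
url = "https://wayback.archive-it.org/0000/20150301120000/http://www.example.gov/"
captured, original = parse_capture(url)
print(captured.isoformat(), original)
```

Comparing the timestamps of two captures of the same original URL shows which crawl each facsimile came from.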

What is our workflow for harvesting and archiving using Archive-It?

1. Determine whether the website is in the scope of the FDLP.
2. Determine whether Archive-It is the best tool for archiving the website and whether archiving it is achievable.
3. Notify the agency of intent to harvest data from their website.
4. Manually review the website to create a seed list of URLs to include in the crawl.
5. Run and QA a test crawl, looking for any out-of-scope or missing content.
6. Run additional test crawls as needed to ensure the production crawl will be effective and efficient. Usually multiple test crawls are done.
7. Run and QA the production crawl.
8. Run any necessary patch crawls.
9. Create a record for the website in the CGP.

Steps 3-7 are repeated for each re-crawl of the website.
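The relationship between a first harvest and a re-crawl can be sketched as a list and a slice of it. This is only an illustration of the step numbering described above. The names `WORKFLOW`, `RECRAWL_STEPS`, and `steps_for` are hypothetical, and the step wording is abbreviated.

```python
# The nine workflow steps, in order (wording abbreviated).
WORKFLOW = [
    "Determine whether the website is in scope of the FDLP",
    "Determine whether Archive-It is the best tool",
    "Notify the agency of intent to harvest",
    "Review the website and create a seed list",
    "Run and QA a test crawl",
    "Run additional test crawls as needed",
    "Run and QA the production crawl",
    "Run any necessary patch crawls",
    "Create a record for the website in the CGP",
]

# Steps 3-7 (1-based) are repeated for each re-crawl.
RECRAWL_STEPS = WORKFLOW[2:7]


def steps_for(is_recrawl):
    """Return (step_number, step) pairs for a new site or a re-crawl."""
    first = 3 if is_recrawl else 1
    steps = RECRAWL_STEPS if is_recrawl else WORKFLOW
    return list(enumerate(steps, start=first))


for number, step in steps_for(is_recrawl=True):
    print(f"{number}. {step}")
```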

Can Archive-It capture and play back video?

Videos in simpler formats such as WMV or MPEG4 can easily be captured and played back, but results vary with more complex formats. Archive-It’s crawling technology can capture videos in other formats and on other platforms, such as Flash or Vimeo, but playback can vary depending on the complexity of the video’s make-up or how it is embedded on a page. Archive-It is continuously working to improve video playback for all formats and platforms, and we regularly see enhancements being made.

What Federal websites is GPO archiving?

We started with websites in the Y3 SuDocs class. The Y3 class is used for commissions, committees, and independent agencies. We started with this class stem because many of these agencies, commissions, and committees were disseminating information solely through their websites.

Now, we are harvesting and archiving other sites. We are looking at agencies to determine the number of online publications they produce, and we plan to move from the smaller sites to the larger sites. Additionally, we may harvest websites based on special request or if there is high interest in a website because, for example, it concerns a current topic of interest or the agency is doing work that is currently in the public eye.

Are there any Federal websites GPO will not archive?

To avoid duplicating effort, we are not harvesting or archiving anything in GPO’s govinfo, anything already archived by other Archive-It partners, or anything already archived by our FDLP partners who are digitizing specific content from their FDLP collections (FDLP Partnerships). We are also not harvesting anything outside the scope of the FDLP.

Additionally, some websites, such as databases or websites where content is generated “on the fly” by a content management system, cannot be properly harvested or archived by Archive-It. In these instances, we try to create partnerships with the providing agencies to ensure permanent public access to their web content.

After a website is harvested and archived, how frequently is it re-crawled for new content?

In the beginning, our focus was on building the web archive; we then had to shift our efforts to maintaining and enhancing what we had built. After a crawl is complete, the site is analyzed to determine how frequently it is updated. The re-crawl frequency is then assigned accordingly: annual, biannual, or quarterly. We do not run re-crawls automatically, but follow a workflow very much like the one we use for any new site. For every re-crawl, the site is fully analyzed to evaluate whether it has changed, whether the seed list needs to be updated, and whether any new modifications need to be made before the new crawls are run.
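Since re-crawls are started by staff rather than run automatically, an assigned frequency is best read as the date when a site next comes up for review. The sketch below computes such a date under two assumptions that the text does not state: "biannual" is taken to mean twice a year (every 6 months), and the interval is counted from the date of the last crawl. The names `MONTHS_BETWEEN_CRAWLS`, `add_months`, and `next_review` are illustrative only.

```python
import calendar
from datetime import date

# Assumption: "biannual" means twice a year, i.e. every 6 months.
MONTHS_BETWEEN_CRAWLS = {"annual": 12, "biannual": 6, "quarterly": 3}


def add_months(start, months):
    """Add whole months to a date, clamping to the last day of the month."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def next_review(last_crawl, frequency):
    """Date on which a site is next due for analysis before a re-crawl."""
    return add_months(last_crawl, MONTHS_BETWEEN_CRAWLS[frequency])


print(next_review(date(2020, 11, 30), "quarterly"))  # 2021-02-28
```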

How does GPO handle copyrighted, proprietary or Personally Identifiable Information (PII) in an archived website?

Because our main collection development practice is to archive content that would traditionally be included in the Federal Depository Library Program, we only seek to archive Federal Government information that is publicly available. We would never intentionally harvest any copyrighted or proprietary content or PII. If you suspect that we have unintentionally harvested any such content, please contact us at [email protected] and provide us with the information, including the Wayback URL, and we will review this content for possible removal following Superintendent of Documents policy.

Are websites in the FDLP Web Archive given SuDocs classes and item numbers?

Yes. The websites are classified under the agency’s general publications category from the list of classes, and then INTERNET is added to the end of the class. The archived websites are assigned the regular item number that accompanies the general publications class for each agency.

For example, the SuDoc class for “NARAtions: the blog of the United States National Archives” is AE 1.102:INTERNET, and the Item Number is 0569-B-02 (online).
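The classification rule can be expressed as a short function. This is a minimal sketch that assumes the general publications class stem is already known (here `AE 1.102`, taken from the example above) and that the suffix is joined with a colon, as that example shows. The function name `web_archive_sudoc` is hypothetical.

```python
def web_archive_sudoc(general_publications_class):
    """Build the SuDoc class for an archived website from the agency's
    general publications class stem by appending INTERNET."""
    stem = general_publications_class.strip().rstrip(":")
    if not stem:
        raise ValueError("a general publications class stem is required")
    return f"{stem}:INTERNET"


print(web_archive_sudoc("AE 1.102"))  # AE 1.102:INTERNET
```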

What’s the difference between the FDLP Web Archive and GPO’s other archiving and cataloging tools?

GPO’s Permanent Server (Permanent) is our longest-standing archive and the basis for the bulk of the FDLP Electronic Collection. Permanent is used to store archived versions of monographs, serials, and some video and audio recordings. All the publications on Permanent are harvested manually by GPO staff or by an archivist’s prompt to Teleport Pro, a robotic crawler that captures simpler formats, such as PDF and HTML files.

govinfo consists of deposited content ingested by agreement with agencies. It includes documents from all three branches of Government, including the majority of the Congressional publications that we catalog. Unlike Permanent, govinfo lets users search and browse content directly and access the metadata in XML format.

The Federal Depository Library Program Web Archive is used exclusively for permanent access to entire Federal agency websites. The web archive was created through a partnership between GPO and Archive-It. The content is gathered using the robotic crawler Heritrix, which searches and captures entire content-rich websites. The harvested sites are then stored on Archive-It’s servers. The archived sites are searchable and accessible through the CGP as well as the Archive-It website.

The CGP presents no full text, but rather MARC records with PURLs that link to digital content stored on any of these repositories.

Webinars

GPO has conducted a number of webinars and presentations which offer more information about the FDLP web archive.

Web Archiving for the FDLP
Archiving & Cataloging Federal Agency Web Sites – GPO’s Web Archiving Project
A Time Machine for Federal Information – Using Web Archiving Content in Government Information Reference Work
Tangible and Digital Preservation: Bridging the Divide by Preserving Government Information in All Formats (Update on FDLP Web Archive begins 30 minutes in)
Archive-It Advanced Training – Access to Archive-It Collections (Information about our Analytics of the FDLP Web Archive begins 24 minutes in)

Contact Us

For questions or suggestions about the FDLP Web Archive, please contact us at [email protected].

Frequently Asked Questions

Why is GPO archiving Federal websites?

How does Web archiving with Archive-It work?

Why Archive-It?

What is our workflow for harvesting and archiving using Archive-It?

Where is the harvested data stored?

Who owns the harvested data?

What file format is used?

How is the FDLP Web Archive data backed-up?

Can Archive-It capture and play back video?

What Federal websites is GPO archiving?

Are there any Federal websites GPO will not archive?

Can I recommend a Federal website to be archived?

After a website is harvested and archived, how frequently is it re-crawled for new content?

When GPO harvests the websites, does GPO ever get copyrighted, non-government, or other extraneous material?

How does GPO handle copyrighted, proprietary or Personally Identifiable Information (PII) in an archived website?

At what level does GPO catalog the websites?

If I think that a harvested website needs more granular cataloging, can I suggest that?

Does GPO update the catalog/bibliographic record every time a website is re-crawled?

Are websites in the FDLP Web Archive given SuDocs classes and item numbers?

How can FDLP libraries obtain the bibliographic records in MARC format for the websites in the FDLP Web Archive?

Why are records for the FDLP Web Archive cataloged?

What do Archive-It’s error messages mean?

What’s the difference between the FDLP Web Archive and GPO’s other archiving and cataloging tools?
