Understanding How GPO’s Archive-It Content is Meeting the Needs of GPO’s Users

The U.S. Government Publishing Office (GPO) has been an Archive-It partner since 2011. All the content captured must be within scope of the Federal Depository Library Program (FDLP). As agencies increasingly disseminate information in digital format only, posting to their websites, web archiving proved a good option for continuing to provide permanent public access to Federal Agency web content. Each collection represents a single Government agency or website and includes its social media. The FDLP Web Archive may be accessed via the Catalog of U.S. Government Publications (CGP) and our Archive-It Collection Page. Understanding how our archived content is meeting the needs of users has become increasingly important to us, to ensure we are meeting our goals for access. This article describes some of the methodologies we use to evaluate our data, as well as some of the challenges.

In August 2015, GPO began exploring methods to analyze our Archive-It data collected through Google Analytics. We had been requesting user metrics from Archive-It on a quarterly basis, but we wanted to see what more we could learn about our users.

Based on current industry research, we decided to begin collecting standard data: new and return users, the top ten referrals, landing pages, and the paths visitors used to reach the archived sites (direct, referral, or organic search).

To explain a little further, a referral means people clicked a link leading to a website. Direct means people landed on a website through a bookmark, by typing in the address, or by clicking a link within an email. Organic search means people used a search engine like Google or Bing to reach a website, and a landing page is the first page a person views inside a website. To help us answer these questions, each month we create a spreadsheet with the questions as headers of individual sections.

Once we started reviewing the data, visitor numbers seemed out of the ordinary. Between June and October 2015, our average number of monthly users was approximately 2,600. This number seemed high compared to the metrics previously provided by Archive-It. There was some digging to do!

Our referral data revealed the source of the high numbers: sites like free-social-buttons.com, floating-share-buttons.com, and get-free-social-traffic.com (Image 1). Each of these bizarre URLs is Ghost Spam, whose goal is to induce you to click on the website in your analytics reports and lead you to spam sites. To clean our data of Ghost Spam, we created a regular expression filter listing our valid hostnames, wayback.archive\-it\.org|archive\-it\.org, so that hits with invalid hostnames (Ghost Spam) are excluded. The filter worked, and we continue to be Ghost Spam free.
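The effect of a hostname filter like this can be sketched in a few lines of Python. This is only an illustration, not the Google Analytics implementation: the sample hits, the field names, and the "(not set)" and example.invalid hostnames are assumptions made up for the example, and the sketch assumes the pattern is applied as a partial match against the hostname of each hit.

```python
import re

# The filter pattern quoted in the article. Note the first "." in
# "wayback.archive" is unescaped, so it matches any character; that is
# harmless here because the second alternative already covers the host.
VALID_HOSTNAME = re.compile(r"wayback.archive\-it\.org|archive\-it\.org")

# Hypothetical hits: ghost spam never loads the real site, so its
# hostname is typically missing or fake (an assumption for this sketch).
hits = [
    {"hostname": "wayback.archive-it.org", "referrer": "catalog.gpo.gov"},
    {"hostname": "archive-it.org", "referrer": "(direct)"},
    {"hostname": "(not set)", "referrer": "free-social-buttons.com"},
    {"hostname": "example.invalid", "referrer": "get-free-social-traffic.com"},
]

def keep_valid_hostnames(rows):
    """Keep only hits whose hostname matches the valid-hostname pattern."""
    return [row for row in rows if VALID_HOSTNAME.search(row["hostname"])]

clean_hits = keep_valid_hostnames(hits)
for row in clean_hits:
    print(row["hostname"], "<-", row["referrer"])
```

Filtering on the hostname rather than on the referrer means new spam domains are excluded without having to list each one.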

Image 1: Ghost Spam

Once the data was spam free, we could begin answering emerging questions such as, “How much internal traffic is reflected in the data?” No matter how small internal traffic may be, it skews the data. Google Analytics does not make IP address data available, so separating internal traffic from external traffic is challenging. Another layer of difficulty arises because, at times, visitors view the same collections we are currently archiving.

To begin analyzing the data, we set the calendar dates to cover the month on which we are concentrating. Next, we access data from the Landing Page report and add a secondary dimension of City (Image 2). Once the dimensions are added, the key things to check are the city of Washington and whether sessions from Washington have a high ‘Pages per Session’ and ‘Average Session Duration.’ Depending on the size of the website, our team can spend multiple hours on a single collection, and sessions of that length make the data unreliable. We also have a list of the collections our team worked on during the month, which allows quick reference and confirmation that a member of our team touched a website.

Traffic originating from Washington does not automatically mean internal traffic. In Image 2, we have three colored circles. The yellow circles mark URLs that our team was working on at the time and that have a high ‘Average Session Duration,’ while the red marks a URL with a high ‘Average Session Duration’ that our team was not working on at the time. It is important to compare the collections our team has been working on against our list of Washington hits.
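The comparison described above can be sketched as a small Python routine. Everything specific here is hypothetical: the row fields, the example landing pages, the work log, and especially the thresholds for what counts as a "high" value, which would need to be tuned against real data.

```python
from dataclasses import dataclass

@dataclass
class LandingRow:
    landing_page: str
    city: str
    pages_per_session: float
    avg_session_seconds: float

def classify(row, worked_on, min_pages=10.0, min_seconds=600.0):
    """Label one Landing Page/City row. Thresholds are illustrative assumptions."""
    heavy = (row.pages_per_session >= min_pages
             and row.avg_session_seconds >= min_seconds)
    if row.city != "Washington" or not heavy:
        return "unflagged"
    # Heavy Washington session: check it against the team's monthly work log.
    if any(site in row.landing_page for site in worked_on):
        return "likely internal"          # the yellow case in Image 2
    return "heavy, not in work log"       # the red case in Image 2

# Hypothetical work log and report rows.
worked_on = {"example-agency.gov"}
rows = [
    LandingRow("/http://www.example-agency.gov/", "Washington", 40.0, 3600.0),
    LandingRow("/http://www.other-agency.gov/", "Washington", 25.0, 1800.0),
    LandingRow("/http://www.example-agency.gov/", "Chicago", 3.0, 120.0),
    LandingRow("/http://www.example-agency.gov/", "Washington", 2.0, 45.0),
]
labels = [classify(r, worked_on) for r in rows]
for row, label in zip(rows, labels):
    print(f"{row.city:<11} {row.landing_page:<34} {label}")
```

A label of "likely internal" is only a prompt for a closer look, since an outside visitor in Washington could be browsing the same collection.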

Image 2: Internal versus external hits

To begin creating a filter to block internal traffic, we found our IP address range. We then used the free IP Range Regular Expression Builder tool on AnalyticsMarket to build an IP range filter (Image 3), and confirmed the validity of the IP Range Regular Expression Builder. Filters require a regular expression. Once our regular expression was created, it looked something like the following example: ^100\.100\.05\.([1-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$
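One way to check what an expression like this covers is to test it against every candidate address. The Python sketch below uses the example expression from the text, not our real range, and the helper name is made up for illustration.

```python
import re

# The example expression from the article (not a real address range).
IP_RANGE = re.compile(
    r"^100\.100\.05\.([1-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$"
)

def matched_last_octets(pattern, prefix="100.100.05.", upper=300):
    """Return every final octet in 0..upper that the pattern accepts."""
    return [n for n in range(upper + 1) if pattern.match(f"{prefix}{n}")]

octets = matched_last_octets(IP_RANGE)
print(f"accepts final octets {octets[0]} through {octets[-1]} "
      f"({len(octets)} addresses)")

# Addresses outside the range should be rejected.
for address in ("100.100.05.0", "100.100.05.256", "100.100.06.10"):
    print(address, "matches" if IP_RANGE.match(address) else "does not match")
```

The four alternatives in the final group cover 1-9, 10-99, 100-199, and 200-255, so the expression accepts final octets 1 through 255 and nothing else.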

Once the IP range filter was applied, we waited one week to allow data to build before checking the success of the filter. Unfortunately, the filter did not work. Blocking internal traffic is trickier than first expected because our addresses are assigned through the Dynamic Host Configuration Protocol (DHCP). Currently we are working towards configuring our dynamic IP range.

As our knowledge of the data increases, so do our questions. For example, because we track referrals, we noticed a large increase in visitors from the U.S. Department of Health and Human Services (HHS) website. One month, we had 53 visitors from HHS, and the next month the number increased to 1,090. We then discovered that HHS had created links to FDLP Web Archive content through their archive.hhs.gov site. From then on, visitor numbers from HHS grew, and HHS remains a steady source of visitors. About half of their visits are bounces (meaning they click the link to open our archived site and immediately close the window), but the visitors who do stay browse for an average of 10 minutes. These statistics indicate that users are finding information of interest, and that we are archiving information that is of value.

Image 3: Internal traffic filter

Since HHS is now a major source of visits per month, we updated our data spreadsheet. There is now a separate section for HHS to enable us to track the most viewed sites that come in through the HHS archive. This data will supply a picture of which major topics their users are interested in and help point us toward other valuable HHS sites to be archived.
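As a rough illustration of what such a spreadsheet section summarizes, the Python sketch below computes a bounce rate, the average time spent by visitors who stay, and the most viewed landing pages for one referral source. The session records, field names, and landing pages are invented for the example, and a bounce is assumed to be a session with a single page view.

```python
from collections import Counter

# Hypothetical session records; real data would come from an analytics export.
sessions = [
    {"source": "archive.hhs.gov", "landing_page": "/hhs-topic-a/", "pages": 1, "seconds": 0},
    {"source": "archive.hhs.gov", "landing_page": "/hhs-topic-a/", "pages": 6, "seconds": 540},
    {"source": "archive.hhs.gov", "landing_page": "/hhs-topic-a/", "pages": 1, "seconds": 0},
    {"source": "archive.hhs.gov", "landing_page": "/hhs-topic-b/", "pages": 9, "seconds": 660},
    {"source": "library.example.edu", "landing_page": "/other-site/", "pages": 3, "seconds": 120},
]

def summarize(rows, source, top_n=3):
    """Summarize one referral source; a bounce is assumed to be a one-page session."""
    segment = [r for r in rows if r["source"] == source]
    if not segment:
        return None
    stayed = [r for r in segment if r["pages"] > 1]
    avg_minutes = sum(r["seconds"] for r in stayed) / len(stayed) / 60 if stayed else 0.0
    return {
        "sessions": len(segment),
        "bounce_rate": (len(segment) - len(stayed)) / len(segment),
        "avg_minutes_stayed": avg_minutes,
        "top_landing_pages": Counter(r["landing_page"] for r in segment).most_common(top_n),
    }

hhs = summarize(sessions, "archive.hhs.gov")
print(hhs)
```

The same function could be pointed at any other referrer, such as a university domain, to compare segments side by side.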

Besides HHS, our next largest source of traffic is universities accessing the FDLP Web Archive collections through the PURLs in our CGP records. Most of the referring universities are part of the FDLP, but several are outside of the FDLP, and a few are Canadian universities. Data from these universities shows about a 50/50 split between new and return visitors and a bounce rate of only 45%; those who stay browse an average of three pages, spending up to two minutes in our archive. From this data we can surmise that visitors are willing to click on the PURL and do some searching. The information we are gathering could be very useful for future collection development.

Soon, we will begin answering new questions. For example, in the CGP we have been working towards enhancing accessibility of the FDLP Web Archive by supplying each collection’s catalog record with a second access point. One PURL leads to the calendar page of the homepage, and a second PURL leads to the collection page in Archive-It. How will our users utilize these options? Will one PURL prove more useful, or will both be equally useful?

A second change we made was adding broad subject facets and creator facets to our Archive-It metadata, allowing users to narrow down the 138 collections currently part of the FDLP Web Archive. Will there be subjects in which users are more interested? Will the broad subjects prove useful? All of these questions will be answered when we have enough data to complete the picture.

Looking at the data from the last 1.5 years, we find it encouraging. People are locating our collections, and the numbers have been steadily increasing. Government organizations are also becoming more aware of our work, and we hope more will begin linking to our web archive.