**Loading 2021 Biennial Survey Data for Question 6 into R**

This supplimental README file was draft in Spring 2023 by William & Mary student Sasan Faraj at the request of the DLC's Long-Term Impacts of COVID-19 working group.
Slight edits and additions were provided by the working group members.
The purpose was to educate members of the working group on how RStudio can be used to examine the 2021 Biennial Survey Data for Question 6.
The focus of the document is largely technical, with an emphasis placed upon explaining how the R code works. Less emphasis is placed on running searches in R.
Many of the commands below (ex. the setwd and install.packages commands) are automatically carried out by the R code provided by the working group. 
The discussion below largely explains how the code works.

**Practical Tutorials**

For a detailed tutorial on using RStudio to examine the 2021 Biennial Survey Data for Question 6, see Appendix IV of the working group's 2023 report.
That report includes detailed instructions on using RStudio and it provides search templates that researchers can use to run their own searches.
This readme document will provide a more technical summary of how R code works.
 
**Introduction**

The goal of this README is to provide examples of how to import an Excel sheet into RStudio, search the Excel file with RStudio (using the subset a dataframe function), and save/export the search results. 
The code for this document was produced using RStudio. The code is provided at the end of this document. 

Begin by downloading R, RStudio, and the Excel file "DLC-COVID19WG-2021-biennial-survey-q6-2023".
Open RStudio, and copy the following into the textbox after the > but do not hit Enter yet.


**Subsetting (searching) the Dataframe** 
We use logical statements every day of our lives. For instance, "I want X and Y". Maybe, "I want X and (Y or Z)". 
Like most database, RStudio uses Boolean Logic. It allows uses to find entries in a dataframe in which, for example: 
"the (library size is small) and (the library is from Alabama or Virginia) and (they changed outreach services)". 
In R, there are specific characters that match with those logical words when subsetting a dataframe:

Logical Statement ; Logical Operator
And	&
Or	|
Equal to	==
Not equal to	!=
Greater than	>
Less than	<
Greater than or equal to	>=
Less than or equal to	<=
In	%in%
Grouping	()
Is null	is.na()

**Some Rules about the Operators** 
Boolean rules apply in R, and the grouping line is especially important because parenthesis are how characteristics are grouped together. 
For instance, if a set has objects X, Y, and Z, the search X or (Y and Z) brings up results for X, Y and Z, or X Y Z.
A search for (X or Y) and Z yields results for criteria XY, YZ, or XYZ.
Commands for and/or are used to group conditionals in a specific order where the other operators are used to specifically compare elements against a threshold/condition.
It is important that each element within the subsetting/search is grouped into their own set of parentheses.

**When to use "is not" versus "is null"** 
Generally when looking for element that is not equal to a certain condition, use the != command. 
This works best when there is a specific and defined characteristic you want to exclude. 
For example, to exclude libraries in Virginia, us df[which((df$'State' != 'VA')),]. 

The "is null" command will return results in which a field was left blank.
For example, to run a search to find libraries that did not check the "partnership agreement" box, the command is:
is.na('Partnership Agreement') 
Many fields in the Excel sheet are either X or null, indicating they marked yes or no to an answer on the survey.
This command can be combined with others. For example, to find libraries that are not in a partnership agreement, or are formerly in a partnership agreement, the command would be: 
df[which( (is.na('Partnership Agreement') | (df$'Partnership Agreement' == 'Former') )),] .
There are more examples below.

**Structure of Subsetting/Searching**

The structure of subsetting a dataframe in R is roughly as follows (depends on number of conditions).
The code has automatically labeled the whole Excel file, or dataframe, as "df" so this will denote the full dataset and includes all datapoints.

Example search:
df[which((df$"column_name1" == X) & (df$"column_name2" %in% cbind(X,Y))), ]
This translates to the following statement: Within the dataframe, return all the rows where the elements of column 1 are equal to X and the elements of column two are either X or Y. 
If X and Y are "strings" (a written text reply), write "X" or "y" exactly as presented in the Excel. For example, "Academic Library (AL)" needs to be typed exactly as "Academic Library (AL)" in the R command.
Note, cbind() is a function that creates a vector (or group) in R, so it allows you to combine multiple elements. 
This is useful when searching for libraries from two or more states. The which() function is not necessary, but helps to ensure that R is not returning a dataframe of null rows. 
For full documentation on using Which, see https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/which  
Lastly, the comma is what tells R to return all the rows.

**Some Search Examples**

For more example searches, and detailed discussion of how they work, see "Quantitative Data Summary" session of the working group's 2023 report.
This section is meant to provide a mechanical understanding of the code itself.

To look at libraries that were located in either Alabama or Virginia AND marked X (yes) for changes to outreach services, the command is:
df[which((df$'State' %in% cbind('VA','AL')) & df$'Outreach Services' == 'X'),]

To look at small libraries where either staffing or reference services are X (yes):
df[which((df$'Library Size' =='Small (less than 250,000 volumes in the library)') & (df$'Staffing' == 'X'| df$"Reference services"=='X')),]
Important note: the whole expression Small (less than 250,000 volumes in the library) needs to be typed out, exactly as it appears in the Excel.
If a character is different, RStudio will not find it in the Excel. Capitalization, space, and punctuation matter.
R processes strings such that “Small” is different from “small” and are both different from “ small ”. 
The search much match exactly whatever is written in the Excel column.
There are direction below on how to find the exact capitalization, space, and punctuation, in R without needing to look in the Excel file directly

To run a search in which library type is academic law and there were no new policies or procedures implements:
df[which((df$"Library Type" == "Academic, Law Library (AL)") & (df$"No new policies or procedures implemented." == "X")),]

This are very basic searches to show the general mechanics. See the working group report section for a much more detail tutorial.

**Quick Workarounds** 
The more frequently R is used, users will be able to writ queries more swiftly. There are a few ways to help this learning process.

To list of all the column names, write colnames(df) and set that equal to a variable names provided in RStudio.
To write a new query, print out the names to copy and past the variable names needed.
It is cumbersome to always have to type of each value, such as "Small (less than 250,000 volumes in the library)" 
To get around this, use the unique() function. It takes in a column and returns a list of its unique values, showing the exact spelling and format of every possible variable.
For example, print out unique(df$"Library Size"), copy the string associated with it, and place it in thequery.

**Downloading Your Dataframe Back into Excel** 
Searches conducted in RStudio can be saved back as new xlsx files.
To do this, run install.packages("writexl") to install the package needed. NOTE: this is already built into the R code above.
This allows users to use the (writexl) command.
To use this function, type write_xlsx(#,"#.xlsx") where # is the name of a search.
For example, for the search
k=df[which((df$"Library Type" == "Academic, Law Library (AL)") & (df$"No new policies or procedures implemented." == "X")),]
The results for this search "k" can be exported into an Excel file with the command
write_xlsx(k,"k.xlsx")
To export the entire datafile (the df) back into an xlsx file, use the command
write_xlsx(df,"df.xlsx")
The name of the search can be changed to something else, provided it still ends with .xlsx.
Example: write_xlsx(k,"academiclibrarys.xlsx") will download search k as an Excel file called academiclibrarys.xlsx
Here is the documentation for this function: https://www.rdocumentation.org/packages/xlsx/versions/0.6.5/topics/write.xlsx 

**Further Information**
A detailed walkthru on loading the code and using RStudio to examine the data for Question 6 is included in Appendix IV of 
First Report of the Depository Library Council’s Working Group on the Long-Term Impacts of COVID-19 on Federal Depository Library Access, Collections, and Services (2023).

**Code**
This is the complete start-up code, which is included for reasons of documentation.
For instructions on using this to begin your own searches, see StartupCode.txt and Appendix IV of the 2023 working group report.

#------------WORKING DIRECTORY------------

#The command below sets the r session to 
#work out of a specific folder.
#Copy and paste the file path location
#of your data into the quotations
# next to filepath 
# (right click it in files
# and look for the location). 
#Paste it inside the quotation marks.
#ENSURE TO USE FORWARD SLASHES AND NOT BACKWARDS

filepath = ''
setwd(filepath)

#------------INSTALL NECESSARY PACKAGES TO COMPUTER------------
if ("readxl" %in%  installed.packages()[,1]){
  print('readxl has already installed.')
} else {
  install.packages('readxl')
}

if ("writexl" %in%  installed.packages()[,1]){
  print('writexl has already been installed.')
} else {
  install.packages('writexl')
}

#------------BRING LIBRARIES INTO R SESSION------------
library(readxl)
library(writexl)

#------------BRING IN DATAFRAME------------

#Replace "DLC-COVID19WG-2021-biennial-survey-q6-2023.xlsx" with the actual name of the file
#with the name of another xlsx file, if you would like to change
#the data you want to work with.

df <- read_xlsx('DLC-COVID19WG-2021-biennial-survey-q6-2023.xlsx') #assigns the xlsx to the variable df

str(df) #shows what type of data each column is

names = colnames(df) #to make life easier, use colnames and print them out so 
                      #you can copy and past the names when you subset.
print(names)

#------------SUBSET DATAFRAME------------

#subset a dataframe based on one column 
a = df[which((df$'Outreach Services' == 'X')),]

#---------------------------------------
#subset a dataframe based on two columns

#Libraries in either Virginia or Alabama who marked X for outreaches services.
b = df[which((df$'State' %in% cbind('VA','AL')) & (df$'Outreach Services' == 'X')),]

print(unique(df$'Library Size')) #Unique returns the unique values within a 
                                 #column. You can copy and past a value you 
                                 #want to subset for!

#----------------------------------------
#subset a dataframe based on three columns
#small library size and either staffing or reference services are X
c = df[which((df$'Library Size' =='Small (less than 250,000 volumes in the library)') & (df$'Staffing' == 'X'| df$'Reference Services'=='X')),]

print(unique(df$'Library Type')) 
#Library type is academic law and there were no new policies or procedures implements
d = df[which((df$'Library Type' == 'Academic, Law Library (AL)') & (df$'No new policies or procedures implemented' == 'X')),]

#Libraries that are selective type Regional 
#AND made changes to Collection maintenance and weeding 
#BUT did NOT make changes to the partnership agreement
e = df[which(((df$'Depository Type'=='Selective')
         & (df$'Collection Maintenance and Weeding'=='X')) & 
          ( (is.na(df$'Partnership Agreement'))|
           (df$'Partnership Agreement' =='Former'))),] 

#------------EXPORT DATAFRAME------------

#save your dataframe as an xlsx
write_xlsx(c,'c.xlsx')

**Importing Data from Excel** 
To import data into RStudio, the code will rely on functions. A function is a coding tool used to make repetitive coding tasks easier to perform. 
Insert specified data into the function, and it returns a specified output. With any function used to perform tasks in R, users must first ensure that the function’s library is installed. 
The library is the overall software package that contains different functions (and sometimes data).

First, ensure the RStudio session is working out of the same folder/directory in which the data is stored. If R is already working out of the same folder, there is also no need for this step.
If not, run the setwd function that is preloaded into RStudio. This stands for set working directory.

setwd("File Path to Data")

To use the correct file path, right click the Excel file from the folder view, click the option for more information, and copy the location or file path. 
To run a piece of code, highlight it and click the run button on the top right of the script window.

To bring the Excel sheet into RStudio, the provided code uses the readxl package. 
More can be read about the readxl using the source documentation.

The R code that is provide achieves this using the following codeblock:

install.packages("readxl")
library(readxl)

The first line only needs to be run once to ensure that the computer has access to the package ‘readxl’. 
The second line actually brings the package into the R session, and it needs to be run each time entering into an R session.

With this package, the Excel file can be uploaded. 
For this specific example using a .xlsx file, the following piece of code is used. 
The file tag (.xlsx) needs to be included within the quotations.

df <- read_xlsx("DLC-COVID19WG-2021-biennial-survey-q6-2023.xlsx")

This line of code uses the ‘read_xlsx’ function to bring the xlsx sheet into our R session as a dataframe, which here is labeled as "df". 
From this point on, the dataframe (or all the data in the .xlsx file) is identifed as df. Write and run view(df) to view all the data, or click "df" on the top right panel of the RStudio session.

There are analogous functions for other types of Excel sheets that can find in the documentation.