In this lab, we will learn how to find out about functions and the principles of lexical scoping.

Goal: by the end of this lab, you should be able to retrieve information about functions and understand how lexical scoping works in R.

Learning about functions

There are many functions in R, and nobody knows how all of them work. From time to time it is helpful to retrieve information about these functions. The documentation in help() is great to read, but it doesn’t always provide the information you need.

Arguments

The first question you may have is what arguments a function takes. We can learn this by running args() on the bare name of the function.

args(dplyr::filter)
## function (.data, ..., .preserve = FALSE) 
## NULL
args(print)
## function (x, ...) 
## NULL

Note that for S3 generic function like print(), you can also find the arguments for specific methods.

args(print.data.frame)
## function (x, ..., digits = NULL, quote = FALSE, right = TRUE, 
##     row.names = TRUE, max = NULL) 
## NULL
  1. Find the arguments for tidyr::gather().
# SAPLE SOLUTION

args(tidyr::gather)
## function (data, key = "key", value = "value", ..., na.rm = FALSE, 
##     convert = FALSE, factor_key = FALSE) 
## NULL
  1. Think of another function whose arguments you always forget. Find them using args().

The body

The body() function returns the actual code that runs inside a function. There are a several reasons why you might want to see this:

  • It’s the only way to actually see what a function is doing
  • Reading other people’s code is a great way to improve your own ability to write code
  • If the documentation doesn’t answer your question, perhaps the code will

Inspecting the code inside a function will help you to think more like a developer, as opposed to a user.

body(tibble)
## {
##     xs <- quos(...)
##     tibble_quos(xs, .rows, .name_repair)
## }

Note that a function may call unexported functions (e.g., tibble_quos()) that you won’t recognize. Since the function is not exported, you have to use the triple colon operator to view its source.

body(tibble:::tibble_quos)
## {
##     col_names <- given_col_names <- names2(xs)
##     empty_col_names <- which(col_names == "")
##     col_names[empty_col_names] <- names(quos_auto_name(xs[empty_col_names]))
##     lengths <- rep_along(xs, 0L)
##     output <- rep_along(xs, list(NULL))
##     env <- new_environment()
##     mask <- new_data_mask_with_data(env)
##     first_size <- .rows
##     for (j in seq_along(xs)) {
##         res <- eval_tidy(xs[[j]], mask)
##         if (!is.null(res)) {
##             if (single_row) {
##                 if (vec_is(res)) {
##                   if (vec_size(res) != 1) {
##                     cnd_signal(error_tibble_row_size_one(j, given_col_names[[j]], 
##                       vec_size(res)))
##                   }
##                 }
##                 else {
##                   res <- list(res)
##                 }
##             }
##             else {
##                 res <- check_valid_col(res, col_names[[j]], j)
##                 lengths[[j]] <- current_size <- vec_size(res)
##                 if (is.null(first_size)) {
##                   first_size <- current_size
##                 }
##                 else if (first_size == 1L && current_size != 
##                   1L) {
##                   idx_to_fix <- seq2(1L, j - 1L)
##                   output[idx_to_fix] <- fixed_output <- map(output[idx_to_fix], 
##                     vec_recycle, current_size)
##                   map2(output[idx_to_fix], col_names[idx_to_fix], 
##                     add_to_env2, env = env)
##                   first_size <- current_size
##                 }
##                 else {
##                   res <- vectbl_recycle_rows(res, first_size, 
##                     j, given_col_names[[j]])
##                 }
##             }
##             output[[j]] <- res
##             col_names[[j]] <- add_to_env2(res, given_col_names[[j]], 
##                 col_names[[j]], env)
##         }
##     }
##     names(output) <- col_names
##     is_null <- map_lgl(output, is.null)
##     output <- output[!is_null]
##     output <- splice_dfs(output)
##     output <- set_repaired_names(output, repair_hint = TRUE, 
##         .name_repair = .name_repair)
##     new_tibble(output, nrow = first_size %||% 0L)
## }

If you try to view the body of an S3 generic function, you will find that it is rather short.

body(print)
## UseMethod("print")

A generic function usually doesn’t do anything other than call one of its methods. Those methods have the interesting code.

body(print.data.frame)
## {
##     n <- length(row.names(x))
##     if (length(x) == 0L) {
##         cat(sprintf(ngettext(n, "data frame with 0 columns and %d row", 
##             "data frame with 0 columns and %d rows"), n), "\n", 
##             sep = "")
##     }
##     else if (n == 0L) {
##         print.default(names(x), quote = FALSE)
##         cat(gettext("<0 rows> (or 0-length row.names)\n"))
##     }
##     else {
##         if (is.null(max)) 
##             max <- getOption("max.print", 99999L)
##         if (!is.finite(max)) 
##             stop("invalid 'max' / getOption(\"max.print\"): ", 
##                 max)
##         omit <- (n0 <- max%/%length(x)) < n
##         m <- as.matrix(format.data.frame(if (omit) 
##             x[seq_len(n0), , drop = FALSE]
##         else x, digits = digits, na.encode = FALSE))
##         if (!isTRUE(row.names)) 
##             dimnames(m)[[1L]] <- if (isFALSE(row.names)) 
##                 rep.int("", if (omit) 
##                   n0
##                 else n)
##             else row.names
##         print(m, ..., quote = quote, right = right, max = max)
##         if (omit) 
##             cat(" [ reached 'max' / getOption(\"max.print\") -- omitted", 
##                 n - n0, "rows ]\n")
##     }
##     invisible(x)
## }
body(print.default)
## {
##     args <- pairlist(digits = digits, quote = quote, na.print = na.print, 
##         print.gap = print.gap, right = right, max = max, width = width, 
##         useSource = useSource, ...)
##     missings <- c(missing(digits), missing(quote), missing(na.print), 
##         missing(print.gap), missing(right), missing(max), missing(width), 
##         missing(useSource))
##     .Internal(print.default(x, args, missings))
## }
  1. Inspect the code for the summary method for an lm object. Do you see where the \(R^2\) is computed?
# SAMPLE SOLUTION

body(summary.lm)
## {
##     z <- object
##     p <- z$rank
##     rdf <- z$df.residual
##     if (p == 0) {
##         r <- z$residuals
##         n <- length(r)
##         w <- z$weights
##         if (is.null(w)) {
##             rss <- sum(r^2)
##         }
##         else {
##             rss <- sum(w * r^2)
##             r <- sqrt(w) * r
##         }
##         resvar <- rss/rdf
##         ans <- z[c("call", "terms", if (!is.null(z$weights)) "weights")]
##         class(ans) <- "summary.lm"
##         ans$aliased <- is.na(coef(object))
##         ans$residuals <- r
##         ans$df <- c(0L, n, length(ans$aliased))
##         ans$coefficients <- matrix(NA_real_, 0L, 4L, dimnames = list(NULL, 
##             c("Estimate", "Std. Error", "t value", "Pr(>|t|)")))
##         ans$sigma <- sqrt(resvar)
##         ans$r.squared <- ans$adj.r.squared <- 0
##         ans$cov.unscaled <- matrix(NA_real_, 0L, 0L)
##         if (correlation) 
##             ans$correlation <- ans$cov.unscaled
##         return(ans)
##     }
##     if (is.null(z$terms)) 
##         stop("invalid 'lm' object:  no 'terms' component")
##     if (!inherits(object, "lm")) 
##         warning("calling summary.lm(<fake-lm-object>) ...")
##     Qr <- qr.lm(object)
##     n <- NROW(Qr$qr)
##     if (is.na(z$df.residual) || n - p != z$df.residual) 
##         warning("residual degrees of freedom in object suggest this is not an \"lm\" fit")
##     r <- z$residuals
##     f <- z$fitted.values
##     w <- z$weights
##     if (is.null(w)) {
##         mss <- if (attr(z$terms, "intercept")) 
##             sum((f - mean(f))^2)
##         else sum(f^2)
##         rss <- sum(r^2)
##     }
##     else {
##         mss <- if (attr(z$terms, "intercept")) {
##             m <- sum(w * f/sum(w))
##             sum(w * (f - m)^2)
##         }
##         else sum(w * f^2)
##         rss <- sum(w * r^2)
##         r <- sqrt(w) * r
##     }
##     resvar <- rss/rdf
##     if (is.finite(resvar) && resvar < (mean(f)^2 + var(c(f))) * 
##         1e-30) 
##         warning("essentially perfect fit: summary may be unreliable")
##     p1 <- 1L:p
##     R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
##     se <- sqrt(diag(R) * resvar)
##     est <- z$coefficients[Qr$pivot[p1]]
##     tval <- est/se
##     ans <- z[c("call", "terms", if (!is.null(z$weights)) "weights")]
##     ans$residuals <- r
##     ans$coefficients <- cbind(Estimate = est, `Std. Error` = se, 
##         `t value` = tval, `Pr(>|t|)` = 2 * pt(abs(tval), rdf, 
##             lower.tail = FALSE))
##     ans$aliased <- is.na(z$coefficients)
##     ans$sigma <- sqrt(resvar)
##     ans$df <- c(p, rdf, NCOL(Qr$qr))
##     if (p != attr(z$terms, "intercept")) {
##         df.int <- if (attr(z$terms, "intercept")) 
##             1L
##         else 0L
##         ans$r.squared <- mss/(mss + rss)
##         ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n - 
##             df.int)/rdf)
##         ans$fstatistic <- c(value = (mss/(p - df.int))/resvar, 
##             numdf = p - df.int, dendf = rdf)
##     }
##     else ans$r.squared <- ans$adj.r.squared <- 0
##     ans$cov.unscaled <- R
##     dimnames(ans$cov.unscaled) <- dimnames(ans$coefficients)[c(1, 
##         1)]
##     if (correlation) {
##         ans$correlation <- (R * resvar)/outer(se, se)
##         dimnames(ans$correlation) <- dimnames(ans$cov.unscaled)
##         ans$symbolic.cor <- symbolic.cor
##     }
##     if (!is.null(z$na.action)) 
##         ans$na.action <- z$na.action
##     class(ans) <- "summary.lm"
##     ans
## }

The environment

It is also important to know in what environment a function is created. The environment() function tells us.

environment(print)
## <environment: namespace:base>
environment(filter)
## <environment: R_GlobalEnv>

We’ll learn more about environments in the next chapter.

Scoping

R uses lexical scoping. This means that when R looks for the value of names, it depends on the state of things when the function is defined.

Name masking

R looks for variables inside a function definition first (local variables), but if it can’t find them, it looks in the parent environment. In this case, when my_fun() is executed, R looks for the value of global_var. Many programming languages will throw an error in this situation, because global_var is not defined in the environment created by my_fun(). However, R just keeps looking for global_var in the parent environment (which in this case is the global environment).

global_var <- 21

my_fun <- function(x) {
  x + global_var
}

my_fun(3)
## [1] 24
environment(my_fun)
## <environment: R_GlobalEnv>

Note also that if we try to re-define global_var inside the function definition, we are masking the name global_var.

my_fun2 <- function(x) {
  global_var <- 99
  x + global_var
}

my_fun2(3)
## [1] 102

Note the difference reported by findGlobals().

codetools::findGlobals(my_fun)
## [1] "{"          "+"          "global_var"
codetools::findGlobals(my_fun2)
## [1] "{"  "+"  "<-"
  1. Write a function analogous to my_fun() in Python or Java. Does it work?

A fresh start

Local variables that are created inside a function never see the light of day in the parent environment.

my_fun3 <- function(x) {
  local_var <- 7
  x + global_var
}

my_fun3(3)
## [1] 24
local_var
## Error in eval(expr, envir, enclos): object 'local_var' not found
  1. Why would it be a bad idea to allow the value of local variables to persist outside of the function?

Dynamic lookup

Because of name masking, values of objects in the parent environment can affect the behavior of a function. This behavior is often useful in data analysis scripts, but can be problematic in more formal programming.

This means that if we reset the value of global_var, we change the behavior of my_fun().

global_var <- 15

my_fun(3)
## [1] 18
  1. Give an example of a situation in which dynamic lookup is helpful.

  2. Give an example of a situation in which dynamic lookup could be problematic.

A function that depends only on its inputs is called pure.

Engagement

Take a minute to think about what questions you still have about subsetting. Review what questions have been posted (in the #questions channel) recently by other students and either:

  • respond (e.g., react, comment, clarify, or answer)
  • post a new question

Prompt: Which topic(s) from the chapter on functions could use further elaboration?