In this lab, we will learn how to find out about functions and the principles of lexical scoping.
Goal: by the end of this lab, you should be able to retrieve information about functions and understand how lexical scoping works in R.
There are many functions in R, and nobody knows how
all of them work. From time to time it is helpful to
retrieve information about these functions. The documentation in
help()
is great to read, but it doesn’t always provide the
information you need.
The first question you may have is what arguments a function takes.
We can learn this by running args()
on the bare name of the
function.
args(dplyr::filter)
## function (.data, ..., .preserve = FALSE)
## NULL
args(print)
## function (x, ...)
## NULL
Note that for S3 generic function like print()
, you can
also find the arguments for specific methods.
args(print.data.frame)
## function (x, ..., digits = NULL, quote = FALSE, right = TRUE,
## row.names = TRUE, max = NULL)
## NULL
tidyr::gather()
.# SAPLE SOLUTION
args(tidyr::gather)
## function (data, key = "key", value = "value", ..., na.rm = FALSE,
## convert = FALSE, factor_key = FALSE)
## NULL
args()
.The body()
function returns the actual code that runs
inside a function. There are a several reasons why you might want to see
this:
Inspecting the code inside a function will help you to think more like a developer, as opposed to a user.
body(tibble)
## {
## xs <- quos(...)
## tibble_quos(xs, .rows, .name_repair)
## }
Note that a function may call unexported functions (e.g.,
tibble_quos()
) that you won’t recognize. Since the function
is not exported, you have to use the triple colon operator to view its
source.
body(tibble:::tibble_quos)
## {
## col_names <- given_col_names <- names2(xs)
## empty_col_names <- which(col_names == "")
## col_names[empty_col_names] <- names(quos_auto_name(xs[empty_col_names]))
## lengths <- rep_along(xs, 0L)
## output <- rep_along(xs, list(NULL))
## env <- new_environment()
## mask <- new_data_mask_with_data(env)
## first_size <- .rows
## for (j in seq_along(xs)) {
## res <- eval_tidy(xs[[j]], mask)
## if (!is.null(res)) {
## if (single_row) {
## if (vec_is(res)) {
## if (vec_size(res) != 1) {
## cnd_signal(error_tibble_row_size_one(j, given_col_names[[j]],
## vec_size(res)))
## }
## }
## else {
## res <- list(res)
## }
## }
## else {
## res <- check_valid_col(res, col_names[[j]], j)
## lengths[[j]] <- current_size <- vec_size(res)
## if (is.null(first_size)) {
## first_size <- current_size
## }
## else if (first_size == 1L && current_size !=
## 1L) {
## idx_to_fix <- seq2(1L, j - 1L)
## output[idx_to_fix] <- fixed_output <- map(output[idx_to_fix],
## vec_recycle, current_size)
## map2(output[idx_to_fix], col_names[idx_to_fix],
## add_to_env2, env = env)
## first_size <- current_size
## }
## else {
## res <- vectbl_recycle_rows(res, first_size,
## j, given_col_names[[j]])
## }
## }
## output[[j]] <- res
## col_names[[j]] <- add_to_env2(res, given_col_names[[j]],
## col_names[[j]], env)
## }
## }
## names(output) <- col_names
## is_null <- map_lgl(output, is.null)
## output <- output[!is_null]
## output <- splice_dfs(output)
## output <- set_repaired_names(output, repair_hint = TRUE,
## .name_repair = .name_repair)
## new_tibble(output, nrow = first_size %||% 0L)
## }
If you try to view the body of an S3 generic function, you will find that it is rather short.
body(print)
## UseMethod("print")
A generic function usually doesn’t do anything other than call one of its methods. Those methods have the interesting code.
body(print.data.frame)
## {
## n <- length(row.names(x))
## if (length(x) == 0L) {
## cat(sprintf(ngettext(n, "data frame with 0 columns and %d row",
## "data frame with 0 columns and %d rows"), n), "\n",
## sep = "")
## }
## else if (n == 0L) {
## print.default(names(x), quote = FALSE)
## cat(gettext("<0 rows> (or 0-length row.names)\n"))
## }
## else {
## if (is.null(max))
## max <- getOption("max.print", 99999L)
## if (!is.finite(max))
## stop("invalid 'max' / getOption(\"max.print\"): ",
## max)
## omit <- (n0 <- max%/%length(x)) < n
## m <- as.matrix(format.data.frame(if (omit)
## x[seq_len(n0), , drop = FALSE]
## else x, digits = digits, na.encode = FALSE))
## if (!isTRUE(row.names))
## dimnames(m)[[1L]] <- if (isFALSE(row.names))
## rep.int("", if (omit)
## n0
## else n)
## else row.names
## print(m, ..., quote = quote, right = right, max = max)
## if (omit)
## cat(" [ reached 'max' / getOption(\"max.print\") -- omitted",
## n - n0, "rows ]\n")
## }
## invisible(x)
## }
body(print.default)
## {
## args <- pairlist(digits = digits, quote = quote, na.print = na.print,
## print.gap = print.gap, right = right, max = max, width = width,
## useSource = useSource, ...)
## missings <- c(missing(digits), missing(quote), missing(na.print),
## missing(print.gap), missing(right), missing(max), missing(width),
## missing(useSource))
## .Internal(print.default(x, args, missings))
## }
summary
method for an
lm
object. Do you see where the \(R^2\) is computed?# SAMPLE SOLUTION
body(summary.lm)
## {
## z <- object
## p <- z$rank
## rdf <- z$df.residual
## if (p == 0) {
## r <- z$residuals
## n <- length(r)
## w <- z$weights
## if (is.null(w)) {
## rss <- sum(r^2)
## }
## else {
## rss <- sum(w * r^2)
## r <- sqrt(w) * r
## }
## resvar <- rss/rdf
## ans <- z[c("call", "terms", if (!is.null(z$weights)) "weights")]
## class(ans) <- "summary.lm"
## ans$aliased <- is.na(coef(object))
## ans$residuals <- r
## ans$df <- c(0L, n, length(ans$aliased))
## ans$coefficients <- matrix(NA_real_, 0L, 4L, dimnames = list(NULL,
## c("Estimate", "Std. Error", "t value", "Pr(>|t|)")))
## ans$sigma <- sqrt(resvar)
## ans$r.squared <- ans$adj.r.squared <- 0
## ans$cov.unscaled <- matrix(NA_real_, 0L, 0L)
## if (correlation)
## ans$correlation <- ans$cov.unscaled
## return(ans)
## }
## if (is.null(z$terms))
## stop("invalid 'lm' object: no 'terms' component")
## if (!inherits(object, "lm"))
## warning("calling summary.lm(<fake-lm-object>) ...")
## Qr <- qr.lm(object)
## n <- NROW(Qr$qr)
## if (is.na(z$df.residual) || n - p != z$df.residual)
## warning("residual degrees of freedom in object suggest this is not an \"lm\" fit")
## r <- z$residuals
## f <- z$fitted.values
## w <- z$weights
## if (is.null(w)) {
## mss <- if (attr(z$terms, "intercept"))
## sum((f - mean(f))^2)
## else sum(f^2)
## rss <- sum(r^2)
## }
## else {
## mss <- if (attr(z$terms, "intercept")) {
## m <- sum(w * f/sum(w))
## sum(w * (f - m)^2)
## }
## else sum(w * f^2)
## rss <- sum(w * r^2)
## r <- sqrt(w) * r
## }
## resvar <- rss/rdf
## if (is.finite(resvar) && resvar < (mean(f)^2 + var(c(f))) *
## 1e-30)
## warning("essentially perfect fit: summary may be unreliable")
## p1 <- 1L:p
## R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
## se <- sqrt(diag(R) * resvar)
## est <- z$coefficients[Qr$pivot[p1]]
## tval <- est/se
## ans <- z[c("call", "terms", if (!is.null(z$weights)) "weights")]
## ans$residuals <- r
## ans$coefficients <- cbind(Estimate = est, `Std. Error` = se,
## `t value` = tval, `Pr(>|t|)` = 2 * pt(abs(tval), rdf,
## lower.tail = FALSE))
## ans$aliased <- is.na(z$coefficients)
## ans$sigma <- sqrt(resvar)
## ans$df <- c(p, rdf, NCOL(Qr$qr))
## if (p != attr(z$terms, "intercept")) {
## df.int <- if (attr(z$terms, "intercept"))
## 1L
## else 0L
## ans$r.squared <- mss/(mss + rss)
## ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n -
## df.int)/rdf)
## ans$fstatistic <- c(value = (mss/(p - df.int))/resvar,
## numdf = p - df.int, dendf = rdf)
## }
## else ans$r.squared <- ans$adj.r.squared <- 0
## ans$cov.unscaled <- R
## dimnames(ans$cov.unscaled) <- dimnames(ans$coefficients)[c(1,
## 1)]
## if (correlation) {
## ans$correlation <- (R * resvar)/outer(se, se)
## dimnames(ans$correlation) <- dimnames(ans$cov.unscaled)
## ans$symbolic.cor <- symbolic.cor
## }
## if (!is.null(z$na.action))
## ans$na.action <- z$na.action
## class(ans) <- "summary.lm"
## ans
## }
It is also important to know in what environment a function is
created. The environment()
function tells us.
environment(print)
## <environment: namespace:base>
environment(filter)
## <environment: R_GlobalEnv>
We’ll learn more about environments in the next chapter.
R uses lexical scoping. This means that when R looks for the value of names, it depends on the state of things when the function is defined.
R looks for variables inside a function definition first (local
variables), but if it can’t find them, it looks in the parent
environment. In this case, when my_fun()
is executed, R
looks for the value of global_var
. Many programming
languages will throw an error in this situation, because
global_var
is not defined in the environment created by
my_fun()
. However, R just keeps looking for
global_var
in the parent environment (which in this case is
the global environment).
<- 21
global_var
<- function(x) {
my_fun + global_var
x
}
my_fun(3)
## [1] 24
environment(my_fun)
## <environment: R_GlobalEnv>
Note also that if we try to re-define global_var
inside
the function definition, we are masking the name
global_var
.
<- function(x) {
my_fun2 <- 99
global_var + global_var
x
}
my_fun2(3)
## [1] 102
Note the difference reported by findGlobals()
.
::findGlobals(my_fun) codetools
## [1] "{" "+" "global_var"
::findGlobals(my_fun2) codetools
## [1] "{" "+" "<-"
my_fun()
in Python or
Java. Does it work?Local variables that are created inside a function never see the light of day in the parent environment.
<- function(x) {
my_fun3 <- 7
local_var + global_var
x
}
my_fun3(3)
## [1] 24
local_var
## Error in eval(expr, envir, enclos): object 'local_var' not found
Because of name masking, values of objects in the parent environment can affect the behavior of a function. This behavior is often useful in data analysis scripts, but can be problematic in more formal programming.
This means that if we reset the value of global_var
, we
change the behavior of my_fun()
.
<- 15
global_var
my_fun(3)
## [1] 18
Give an example of a situation in which dynamic lookup is helpful.
Give an example of a situation in which dynamic lookup could be problematic.
A function that depends only on its inputs is called pure.
Take a minute to think about what questions you still have about
subsetting. Review what questions have been posted (in the
#questions
channel) recently by other students and
either:
Prompt: Which topic(s) from the chapter on functions could use further elaboration?