Introduction to Simulation

Foundations of writing your own custom functions, drawing samples and conducting a simulation.
Author

COMPLETE

Published

January 25, 2023

Writing your own R functions

Abstracting your code into many small functions is key for writing nice R code. Many people are initially reluctant to create their own functions. R has many built in functions and you can access many more by installing new packages. So there is no doubt that you are already using functions. This section will show you how to write your own functions.

Writing a custom function is similar to writing down a mathematical formula with variables. Here is an example function called sum.of.squares which requires two arguments and returns the sum of the squares of these arguments.

sum.of.squares <- function(x,y){
  x^2 + y^2
}

Once you run this code chunk the function is now available, like any of the built in functions within R. Try running sum.of.squares(3,4) in your console.

sum.of.squares(3,4)
[1] 25

The procedure for writing any other functions is similar and involves three key steps:

  1. Define the function
  2. Load the function into the R session
  3. Use the function

Defining functions

Functions are defined by code with a specific format:

function.name <- function(arg1, arg2, arg3=2, ...){
  newVar <- sin(arg1) + sin(arg2) # do some useful stuff
  newVar/ arg3   # the result of the last line is the returned value
}

Terminology:

  • function.name is the function’s name. This can be any valid variable name but you should avoid using names that are used elsewhere in R.
  • arg1,arg2,arg3 are the arguments of the function. You can write a function with any number of arguments. - Arguments are the needed in order to make the function run. For instance, for the sum.of.squares function, x and y are the two arguments because R needs two numbers in order to make the function work.
    • You can find out the default arguments for any base R function in the help file.
  • Function body: The function code between the {} brackets is run every time the function is called. This code might be very long or very short.
  • Return value: The last line of code is the value that will be returned by the function.
  • Load the function: For R to be able to execute your function, it needs first to be read into memory. Similar to loading a library, until you do it functions contained within it cannot be called.
    • To run a function you highlight the function and run it. Once you run it you don’t need to run it again.
  • Using your function: You can now use your function. You just need to give the proper arguments and the function will return the desired value(s). This may also be called “calling” your function.
You try it. Write functions for the following.
  1. Converts degrees Fahrenheit to degrees Celsius. The formula for the conversion is \(C=(F-32)*5/9\). Note that F and c are special characters in R and should not be used as variable names.

  2. Compute the population standard deviation of c(320,50,88,98,215,128) using the following formula: \(\frac{1}{n}\sqrt{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}\)

F.to.C <- function(fahr){
  (fahr - 32)*5/9
}

F.to.C(85) # mmm warm.
[1] 29.44444
calc.pop.sd <- function(x){
  n <- length(x)
  mean.x <- sum(x)/n
  squared.diff <- (x-mean.x)^2
  (1/n)*sqrt(sum(squared.diff))
}

calc.pop.sd(x=c(320,50,88,98,215,128))
[1] 37.3262

Simulation

What is Simulation?

Simulation is a way to use high-speed computer power to substitute for analytical calculation. The law of large numbers tells us that if we observe a large sample of i.i.d. random variables with finite mean, then the average of these random variables should be close to their true mean. If we can get a computer to produce such a large i.i.d. sample, then we can average the random variables instead of trying to calculate their mean analytically. Simulations can be useful to check assumptions, examine a sampling distribution, or check the validity of the CLT in certain scenarios.

Notes on the Uniform Distribution https://math350.netlify.app/notes/cn4_5_unif_exp
Example: Finding the mean of a distribution.

The mean of the Uniform distribution on the interval [0,1] is known to be 1/2. How could we have found that out if we did not know the functional form of the uniform distribution?

  1. Create a large random sample from a Uniform distribution on the interval [0,1]
  2. Calculate the observed mean of that sample.
samp.unif <- runif(10000, 0, 1) # runif(n, min, max)
mean(samp.unif)
[1] 0.4979474
You try it
  1. Use simulation to approximate the mean of the exponential distribution with parameter 1. Code starter: rexp(n=____, 1)
  2. Let \(X\) stand for a random variable having the normal distribution with mean 2 and variance 49, \(X\sim N(2, 49)\). Estimate the mean of \(Z = log|X| +1\) using simulation. Code starter: rnorm(n=____, mean=___, sd=___), and you’ll need the absolute value function: abs()
samp.exp <- rexp(10000, 1)
mean(samp.exp)
[1] 1.012539
samp.norm <- rnorm(10000, 2, 49)
Z = log(abs(samp.norm))+1
mean(Z)
[1] 4.273311

Why is simulation useful?

Statistical simulations are used to estimate features of distributions such as means of functions, quantiles, and other features that we cannot compute in closed form. When using a simulation estimator, it is good to compute a measure of how precise the estimator is, in addition to the estimate itself.

Simulation is a technique that can be used to help shed light on how a complicated system works even if detailed analysis is unavailable. For example, engineers can simulate traffic patterns in the vicinity of a construction project to see what effects various proposed restrictions might have. Statistical simulations are used to estimate probabilistic features of our models that we cannot compute analytically.

Example: Server wait time

Two servers A and B in a fast food restaurant start serving customers at the same time. They agree to meet for a break after each of them has served 10 customers. Presumably, one of them will finish before the other and have to wait. How long, on average, will one of the servers have to wait for the other?

Suppose that we model all service times, regardless of the server as a random variable with an exponential distribution with parameter 0.3 customers per minute (\(Exp(.3)\)). The time it takes one server to serve 10 customers would follow a gamma distribution with parameters 10 and 0.3 (\(Gamma(10, .3)\)).

Let \(X\) be the time it takes \(A\) to serve 10 customers, and let \(Y\) be the time it takes server \(B\) to serve 10 customers. We are asked to compute the average difference in times, which is written as

\[ E(|X-Y|) \]

The most straightforward way of finding this mean analytically would require a two-dimensional integral over the union of two non-rectangular regions.

On the other hand, a computer can provide us with as many independent gamma random variables as we desire. We can then obtain a pair \((X,Y)\) and compute \(Z=|X-Y|\). We can repeat this process many times (10000 in the example below) and get an average of all the observed \(Z\) values.

X <- rgamma(10000, 10, .3)
Y <- rgamma(10000, 10, .3)
Z <- abs(X-Y)
mean(Z)
[1] 11.7168

Variability in simulation results

An important part of simulation is to assess the uncertainty in simulations. Even if we did a large number of simulations, if we repeated the simulation process we would get slightly different results. What if we wanted to repeat the above simulation several times? We can keep clicking the run button and record the results each time. The function replicate() will do this for us and much more efficiently.

Using replicate to repeat simulations

To use the function replicate(), wrap the desired code in brackets and tell R how many times we want to repeat (or replicate) the experiment.

replicate(n=5, {
  samp.unif <- runif(10000, 0, 1)
  mean(samp.unif)
})
[1] 0.4974733 0.4975519 0.4999091 0.5029264 0.5001993

We will come back to this idea of repeating simulations multiple times to explore the variability in simulation results in a few sections.

You try it:

Use replicate to compute the average difference in server wait times for five different simulations.

replicate(n=5, {
  X <- rgamma(10000, 10, .3)
  Y <- rgamma(10000, 10, .3)
  Z <- abs(X-Y)
  mean(Z)  
})
[1] 11.87032 11.68123 11.79934 11.78951 11.75983

Basic Sampling

In the above examples, we were sampling values from known distributions, like from the normal or gamma distribution. Sometimes we will want to draw randomly from a list of objects. The sample() function is used for this.

Example:
sample(1:6, size=5) # without replacement
[1] 2 5 6 1 3
sample(1:6, size=5, replace=TRUE) # with replacement
[1] 2 4 1 2 4

Wrap up

The functions demonstrated in this section are to provide examples of functions and tactics we will be using throughout the semester.