<- function(x,y){
sum.of.squares ^2 + y^2
x }
Introduction to Simulation
Writing your own R functions
Abstracting your code into many small functions is key for writing nice R code. Many people are initially reluctant to create their own functions. R has many built in functions and you can access many more by installing new packages. So there is no doubt that you are already using functions. This section will show you how to write your own functions.
Writing a custom function is similar to writing down a mathematical formula with variables. Here is an example function called sum.of.squares
which requires two arguments and returns the sum of the squares of these arguments.
Once you run this code chunk the function is now available, like any of the built in functions within R. Try running sum.of.squares(3,4)
in your console.
sum.of.squares(3,4)
[1] 25
The procedure for writing any other functions is similar and involves three key steps:
- Define the function
- Load the function into the R session
- Use the function
Defining functions
Functions are defined by code with a specific format:
<- function(arg1, arg2, arg3=2, ...){
function.name <- sin(arg1) + sin(arg2) # do some useful stuff
newVar / arg3 # the result of the last line is the returned value
newVar }
Terminology:
function.name
is the function’s name. This can be any valid variable name but you should avoid using names that are used elsewhere in R.arg1,arg2,arg3
are the arguments of the function. You can write a function with any number of arguments. - Arguments are the needed in order to make the function run. For instance, for thesum.of.squares
function, x and y are the two arguments because R needs two numbers in order to make the function work.- You can find out the default arguments for any base R function in the help file.
- Function body: The function code between the
{}
brackets is run every time the function is called. This code might be very long or very short. - Return value: The last line of code is the value that will be returned by the function.
- Load the function: For R to be able to execute your function, it needs first to be read into memory. Similar to loading a library, until you do it functions contained within it cannot be called.
- To run a function you highlight the function and run it. Once you run it you don’t need to run it again.
- Using your function: You can now use your function. You just need to give the proper arguments and the function will return the desired value(s). This may also be called “calling” your function.
Simulation
What is Simulation?
Simulation is a way to use high-speed computer power to substitute for analytical calculation. The law of large numbers tells us that if we observe a large sample of i.i.d. random variables with finite mean, then the average of these random variables should be close to their true mean. If we can get a computer to produce such a large i.i.d. sample, then we can average the random variables instead of trying to calculate their mean analytically. Simulations can be useful to check assumptions, examine a sampling distribution, or check the validity of the CLT in certain scenarios.
The mean of the Uniform distribution on the interval [0,1] is known to be 1/2. How could we have found that out if we did not know the functional form of the uniform distribution?
- Create a large random sample from a Uniform distribution on the interval [0,1]
- Calculate the observed mean of that sample.
<- runif(10000, 0, 1) # runif(n, min, max)
samp.unif mean(samp.unif)
[1] 0.4979474
Why is simulation useful?
Statistical simulations are used to estimate features of distributions such as means of functions, quantiles, and other features that we cannot compute in closed form. When using a simulation estimator, it is good to compute a measure of how precise the estimator is, in addition to the estimate itself.
Simulation is a technique that can be used to help shed light on how a complicated system works even if detailed analysis is unavailable. For example, engineers can simulate traffic patterns in the vicinity of a construction project to see what effects various proposed restrictions might have. Statistical simulations are used to estimate probabilistic features of our models that we cannot compute analytically.
Two servers A and B in a fast food restaurant start serving customers at the same time. They agree to meet for a break after each of them has served 10 customers. Presumably, one of them will finish before the other and have to wait. How long, on average, will one of the servers have to wait for the other?
Suppose that we model all service times, regardless of the server as a random variable with an exponential distribution with parameter 0.3 customers per minute (\(Exp(.3)\)). The time it takes one server to serve 10 customers would follow a gamma distribution with parameters 10 and 0.3 (\(Gamma(10, .3)\)).
Let \(X\) be the time it takes \(A\) to serve 10 customers, and let \(Y\) be the time it takes server \(B\) to serve 10 customers. We are asked to compute the average difference in times, which is written as
\[ E(|X-Y|) \]
The most straightforward way of finding this mean analytically would require a two-dimensional integral over the union of two non-rectangular regions.
On the other hand, a computer can provide us with as many independent gamma random variables as we desire. We can then obtain a pair \((X,Y)\) and compute \(Z=|X-Y|\). We can repeat this process many times (10000 in the example below) and get an average of all the observed \(Z\) values.
<- rgamma(10000, 10, .3)
X <- rgamma(10000, 10, .3)
Y <- abs(X-Y)
Z mean(Z)
[1] 11.7168
Variability in simulation results
An important part of simulation is to assess the uncertainty in simulations. Even if we did a large number of simulations, if we repeated the simulation process we would get slightly different results. What if we wanted to repeat the above simulation several times? We can keep clicking the run button and record the results each time. The function replicate()
will do this for us and much more efficiently.
Using replicate
to repeat simulations
To use the function replicate()
, wrap the desired code in brackets and tell R how many times we want to repeat (or replicate) the experiment.
replicate(n=5, {
<- runif(10000, 0, 1)
samp.unif mean(samp.unif)
})
[1] 0.4974733 0.4975519 0.4999091 0.5029264 0.5001993
We will come back to this idea of repeating simulations multiple times to explore the variability in simulation results in a few sections.
Basic Sampling
In the above examples, we were sampling values from known distributions, like from the normal or gamma distribution. Sometimes we will want to draw randomly from a list of objects. The sample()
function is used for this.
sample(1:6, size=5) # without replacement
[1] 2 5 6 1 3
sample(1:6, size=5, replace=TRUE) # with replacement
[1] 2 4 1 2 4
Wrap up
The functions demonstrated in this section are to provide examples of functions and tactics we will be using throughout the semester.