Chapter 4 notes

Why things are normal

Let’s simulate an experiment. Suppose 1,000 people each flip a coin 16 times, taking a step of random size left or right with each flip. What we see is that any process that adds together random values from the same distribution converges to a normal distribution. There are multiple ways to conceptualize why this happens. One is that extreme fluctuations tend to cancel each other out as we add more of them: a large positive fluctuation cancels a large negative one. From a combinatorial perspective, there are also far more paths (possible sequences of steps) that sum to 0 than to, say, 10, so the most likely sum is the one where all the fluctuations cancel one another out, leaving a sum of 0 relative to the mean.
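
As a quick check of that path-counting claim (my own numbers, using exact ±1 steps rather than the random step sizes in the simulation below): a sum of 0 after 16 flips requires exactly 8 heads, while a sum of +10 requires 13, and there are far more ways to get the former.

choose(16, 8)    # 16-flip paths that sum to 0: 12870
choose(16, 13)   # 16-flip paths that sum to +10: 560

The simulation below draws step sizes uniformly between -1 and 1 instead of using fixed ±1 steps, but the same cancellation logic applies.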

# pos: each value is one person's final position, the sum of 16 random steps
pos <- replicate(1000, sum(runif(16, -1, 1)))

# dens() is from the rethinking package; norm.comp = TRUE overlays a normal curve
dens(pos, norm.comp = TRUE, col = "blue", main = "Results of 1,000 people flipping a coin 16 times")

Normality also emerges when we multiply, but only when the individual effects are sufficiently small. For example, consider 12 loci that interact with one another, each increasing an organism’s growth rate by some small percentage. The reason this approaches normality is that multiplying small effects is very closely approximated by adding them. Take for example \(1.1 \times 1.1 = 1.21\), which is nearly equivalent to \(1 + 0.1 + 0.1 = 1.2\). The smaller the effect of each locus, the better this sum approximation becomes.
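A quick numeric check of this approximation, with some made-up small effect sizes:

effects <- c(0.02, 0.05, 0.01, 0.03)
prod(1 + effects)    # about 1.114
1 + sum(effects)     # 1.11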

# a single random growth factor: the product of 12 small multiplicative effects
# prod(1 + runif(12, 0, 0.1))
growth <- replicate(1000, prod(1 + runif(12, 0.0, 0.1)))

# sample where the growth rate has a larger effect
growth.big <- replicate(1000, prod(1 + runif(12, 0.0, 0.5)))

# norm.comp = TRUE overlays the normal distribution
dens(growth, norm.comp = TRUE, col = "blue", main = "Small growth, between 0.0 and 0.1")
dens(growth.big, norm.comp = TRUE, col = "blue", main = "Large growth, between 0.0 and 0.5")

Even though large deviates multiplied together don’t produce Gaussian distributions, they do on a log scale, because taking the log turns the product back into a sum.

log.big <- replicate(1000, log(prod(1 + runif(12, 0.0, 0.5))))
dens(log.big, norm.comp = TRUE, col = "blue", main = "Large growth on log scale")

A quick note on notation: the Gaussian is a continuous distribution, whereas the binomial is discrete. Probability distributions with only discrete outcomes, like the binomial, are usually called probability mass functions and are denoted \(Pr()\). Continuous ones, like the Gaussian, are called probability density functions and are denoted \(p()\).
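
One practical consequence of the distinction, shown with a quick check of my own: masses are probabilities and can never exceed 1, while densities are not probabilities and can.

dbinom(6, size = 9, prob = 0.5)   # a probability mass, about 0.164
dnorm(0, mean = 0, sd = 0.1)      # a density, about 3.99 -- exceeding 1 is fine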

Linear Regression

Let’s revisit the globe tossing model.

\[w \sim Binomial(N,p) \\ p \sim Uniform(0,1)\]

Interpretation: We say the outcome \(w\) is distributed binomially with \(N\) trials, each having probability \(p\) of success. The prior for \(p\) is distributed uniformly between 0 and 1. It’s important to recognize that the binomial is the distribution of the data and the uniform is the distribution of the prior. This should make sense: in the previous homeworks we were constantly assigning equal weight to all values of \(p\).
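
As a quick generative reading of that notation (assuming N = 9 tosses, which the formulas above leave unspecified), we can sample \(p\) from its prior and push each sample through the binomial likelihood:

# prior predictive for the globe-tossing model, assuming N = 9 tosses
p_prior <- runif(1e4, 0, 1)
w_sim <- rbinom(1e4, size = 9, prob = p_prior)
simplehist(w_sim)   # simplehist() comes with the rethinking package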

Now let’s look at the variables of a linear regression model. \[ \begin{array}{l} y_{i} \sim Normal(\mu_{i}, \sigma) \\ \mu_{i} = \beta x_{i} \\ \beta \sim Normal(0,10) \\ \sigma \sim Exponential(1) \\ x_{i} \sim Normal(0,1) \end{array} \]
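
Read top to bottom, each line of that model is a sampling statement we can execute; here is a minimal sketch of simulating from it:

# simulate directly from the model above
n <- 1e3
beta  <- rnorm(n, 0, 10)        # beta ~ Normal(0, 10)
sigma <- rexp(n, 1)             # sigma ~ Exponential(1)
x     <- rnorm(n, 0, 1)         # x_i ~ Normal(0, 1)
mu    <- beta * x               # mu_i = beta * x_i
y     <- rnorm(n, mu, sigma)    # y_i ~ Normal(mu_i, sigma)
dens(y, col = "blue")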

We’ll keep this in mind and expand on it later as we work through an example using height data on Kalahari foragers collected by Nancy Howell. We’re going to focus on adult heights and build a model to describe the Gaussian distribution of those heights.

data("Howell1")
d <- Howell1

# look at adults over the age of 18
d %<>%
  subset(age > 18)
precis(d)
##               mean         sd      5.5%     94.5%       histogram
## height 154.6443688  7.7735641 142.87500 167.00500       ▁▃▇▇▅▇▂▁▁
## weight  45.0455429  6.4552197  35.28946  55.79536         ▁▅▇▇▃▂▁
## age     41.5397399 15.8093044  21.00000  70.02500 ▂▅▇▅▃▇▃▃▂▂▂▁▁▁▁
## male     0.4739884  0.5000461   0.00000   1.00000      ▇▁▁▁▁▁▁▁▁▇
dens(d$height, norm.comp = TRUE, col = "blue")

Looking at the outcome data, we can see that it resembles a Gaussian distribution, so we’ll assume that the model’s likelihood should be Gaussian too. In this example we’re going to start by saying an individual’s height \(h_{i}\) is distributed normally with mean \(\mu\) and standard deviation \(\sigma\).

\[h_{i} \sim Normal(\mu, \sigma)\]

Now, \(\mu\) and \(\sigma\) have not been observed; we need to infer them from what we have measured, which in this case is \(h\). But we still need to define distributions for them, and these will act as our priors.

\[ \mu \sim Normal(178, 20) \\ \sigma \sim Uniform(0, 50) \]

Above we’re defining \(\mu\) as being normally distributed around 178 centimeters with a standard deviation of 20. We started with 178 as the mean height because that’s the height of Richard McElreath. This is a broad assumption: it places 95% of the probability within \(178 \pm 40\) (two standard deviations), and while this might not be the best choice of prior, it’s good enough for now. For \(\sigma\) we’re saying that it is uniformly distributed from 0 to 50. We know it needs to be positive, so bounding it at 0 makes sense; as for the upper bound, choosing a standard deviation of 50 cm implies that 95% of individuals lie within 100 cm of the average height.
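
A quick numeric check (my own) of what that “broad” claim amounts to:

# roughly 95% of the prior probability for mu lies within two standard deviations
178 + c(-2, 2) * 20                 # 138 to 218 cm
qnorm(c(0.025, 0.975), 178, 20)     # exact 95% interval: about 138.8 to 217.2 cm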

Whatever your priors are, it’s always a good idea to plot them. This will give you a chance to see what assumptions your priors are building into the model.

curve(dnorm(x, 178, 20), from = 100, to = 250, main = "Mu")
curve(dunif(x, 0, 50), from = -10, to = 60, main = "Sigma")

Prior predictive distribution

Now we can use these priors to simulate heights before actually seeing the data. This is very powerful: it lets us see what our model believes before we feed it any data, and it gives us a chance to build good priors first. Note that this is not p-hacking, because we’re not using the data to educate the model.

To run this simulation, all we need to do is sample values from each of the variables:

sample_mu <- rnorm(1e4, 178, 20)
sample_sigma <- runif(1e4, 0, 50)
prior_h <- rnorm(1e4, sample_mu, sample_sigma)
dens(prior_h, norm.comp = TRUE, col = "red")

Looking at this, we can see that the distribution is not quite normal: because there’s uncertainty about the standard deviation, the result has fatter tails than a Gaussian, somewhat like a t distribution. These might not be the best priors in the world, but at least we’re in the realm of possibility.

Below we’ll consider the same example, but with \(\mu\) having a standard deviation of 100. Now we can see that this distribution includes some pretty unrealistic possibilities, like humans smaller than an unfertilized egg and a large number of people taller than the tallest recorded man in history. We don’t need to have seen the data beforehand to know that we can do better than this distribution.
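
The wider-prior simulation isn’t shown above; here is a sketch of it, reusing sample_sigma from before and checking how much prior mass lands in impossible or extreme territory (about 272 cm is the tallest recorded human height):

# prior predictive with a much wider prior on mu
sample_mu_wide <- rnorm(1e4, 178, 100)
prior_h_wide <- rnorm(1e4, sample_mu_wide, sample_sigma)
dens(prior_h_wide, norm.comp = TRUE, col = "red", main = "Prior predictive, mu ~ Normal(178, 100)")

# fraction of simulated heights below zero or above ~272 cm
mean(prior_h_wide < 0)
mean(prior_h_wide > 272)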

Grid approximation of the posterior distribution

Below we are going to use brute force to map out the posterior distribution. Often this is an impractical approach because it’s computationally expensive, but it’s worth knowing what the target actually looks like before we start accepting approximations of it. The code below is commented step by step.

# Establish the range of mu and sigma
mu.list <- seq( from=140, to=160 , length.out=200 )
sigma.list <- seq( from=4 , to=9 , length.out=200 )

# Take all possible combinations of mu and sigma
post <- expand.grid( mu=mu.list , sigma=sigma.list )

# Compute the log-likelihood at each combination of mu and sigma. 
post$LL <- sapply( 1:nrow(post) , function(i) sum( dnorm(
                d$height ,
                mean=post$mu[i] ,
                sd=post$sigma[i] ,
                log=TRUE ) ) )

# Multiply the prior by the likelihood. Since the priors and likelihood are on
# the log scale, we add them, which is equivalent to multiplying on the natural scale.
post$prod <- post$LL + dnorm( post$mu , 178 , 20 , TRUE ) +
    dunif( post$sigma , 0 , 50 , TRUE )

# Get back to the probability scale. Subtracting the maximum log-product before
# exponentiating avoids rounding everything to zero; the result is a relative
# (unnormalized) posterior.
post$prob <- exp( post$prod - max(post$prod) )

# Plotting a simple heat map
# image_xyz( post$mu , post$sigma , post$prob )
plot_ly(
  x = post$mu, 
  y = post$sigma, 
  z = post$prob, 
  type = "contour",
  colorscale = list(
    list(0, 'black'), list(0.25, 'purple'), list(0.5, 'red'),
    list(0.75, 'orange'), list(1, 'white'))
  )