Grok Correlation
November 23, 2019
Correlation has never quite sat right in my mind. I understand its intuitive meaning, I know its bounds, and I know how to interpret it, but it does not have the same natural feel to me as the mean and variance calculations.
So, to make correlation sit more comfortably in my mind, I'm going to dive deep into correlation, the components of its calculation, and how they affect the result.
The term correlation generally refers to the "co-relationship" between two variables. Mathematically, we will look at a specific formula that has been defined to measure and quantify the relationship between two variables, the Pearson Product Moment Correlation Coefficient.
Let's start by looking at the equation for correlation to make sense of it, especially the "Product Moment" part.
At first glance, the definition of Pearson's Correlation from Wikipedia is not very useful or intuitive:
The correlation coefficient \(\rho_{XY}\) between two random variables \(X\) and \(Y\) with expected values \(\mu_{X}\) and \(\mu_{Y}\) and standard deviations \(\sigma_{X}\) and \(\sigma_{Y}\) is defined as
$${\rho_{XY} = \frac{cov(X,Y)}{\sigma_{X}\sigma_{Y}}}$$
$${= \frac{E[(X - \mu_{X})(Y - \mu_{Y})]}{\sigma_{X}\sigma_{Y}}}$$
I like the following definition better, even though it is just a light tweak to the previous formula. I prefer this way of thinking about correlation because it puts everything into perspective by using data only. You don't have to think about relationships between expected values and standard deviations when reading the formula; you only have to think about the computation using the set of data points \((X_{i}, Y_{i})\) and the means \(\mu_{X}\) and \(\mu_{Y}\).
$${\rho_{XY} = \frac{\sum_{i=1}^{n}{(X_{i}-\mu_{X})(Y_{i}-\mu_{Y})}}{\left[\sum_{i=1}^{n}{(X_{i}-\mu_{X})^{2}}\sum_{i=1}^{n}{(Y_{i}-\mu_{Y})^{2}}\right]^{\frac{1}{2}}}}$$
This formula clearly defines correlation as a ratio of measures of the \(X_{i}\)s and \(Y_{i}\)s only.
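To make the data-only view concrete, here is a minimal sketch that computes the ratio directly from a handful of points, with the product of the two sums of squares in the denominator. The function name `pearson_from_sums` and the five data points are made up for illustration; `np.corrcoef` is only there as a cross-check.

```python
import numpy as np


def pearson_from_sums(x, y):
    """Pearson correlation computed only from the data points and their means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of each X_i from mu_X
    dy = y - y.mean()  # deviations of each Y_i from mu_Y
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))


# Illustrative data, not taken from any real data set
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

rho = pearson_from_sums(x, y)
print(rho, np.corrcoef(x, y)[0, 1])  # both are approximately 0.8
```

For these points the numerator is 8 and both sums of squares are 10, so the ratio works out to 8 / 10.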
Let's dissect this equation to make sense of it. I think it is easiest to start with the denominator.
The Denominator
When thinking about the denominator, I think it is first useful to draw an analogy between the calculations for the Z-Score and correlation. Remember that the Z-Score is the number of standard deviations between a variable and its mean. We express the Z-Score in units of standard deviations by dividing the distance from the mean by the standard deviation of the variable.
Similarly, to understand Pearson's correlation, we really just need to understand what the numerator is doing. Then we know that we are just turning the numerator measurement into units of the denominator.
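The Z-Score analogy can be pushed one step further: if both variables are converted to Z-Scores first, the correlation is simply the average of the products of those Z-Scores. The sketch below assumes population standard deviations (NumPy's default `ddof=0`); the helper name `z_scores` and the data are illustrative.

```python
import numpy as np


def z_scores(values):
    """Number of (population) standard deviations each value is from the mean."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


# Illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Average product of Z-Scores equals the Pearson correlation
rho_from_z = np.mean(z_scores(x) * z_scores(y))
print(rho_from_z, np.corrcoef(x, y)[0, 1])
```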
The Numerator
The numerator is the covariance between the variables \(X\) and \(Y\). The covariance is the average of the products \((X_{i}-\mu_{X})(Y_{i}-\mu_{Y})\), that is, the average product of each point's distances from the joint mean \((\mu_{X}, \mu_{Y})\), measured in units of \(X\) times units of \(Y\). This is nice, because by summing these products over every point we are also measuring the strength of the linear relationship between \(X\) and \(Y\). If \(X\) tends to be far above its mean when \(Y\) is far above its mean (and far below when \(Y\) is far below), then the products are mostly positive and the covariance will be large and positive. If one is far above its mean when the other is far below, then the products are mostly negative and the covariance will be large and negative (and the correlation will also be negative). Notice that with the covariance measure, the strength of the linear relationship is unbounded, so covariance can take any value in the interval \((-\infty, +\infty)\).
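A small sketch of how the sign and size of the covariance behave. The `covariance` helper uses the \(\frac{1}{n}\) (population) convention, and the data is illustrative only.

```python
import numpy as np


def covariance(x, y):
    """Average product of deviations from the means (1/n convention)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((x - x.mean()) * (y - y.mean()))


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

cov_up = covariance(x, 2 * x)            # Y rises with X: positive covariance
cov_down = covariance(x, -2 * x)         # Y falls as X rises: negative covariance
cov_scaled = covariance(10 * x, 20 * x)  # same pattern, bigger numbers

print(cov_up, cov_down, cov_scaled)
```

The last case has exactly the same linear pattern as the first one, yet its covariance is 100 times larger, which is the unboundedness at work.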
Note: In the correlation formula, you don't actually see the full covariance formula in the numerator, because when dividing by the product of the standard deviations \(\sigma_{X}\sigma_{Y}\), a \(\frac{1}{n}\) term is cancelled from the covariance formula in the numerator and from the \(\sigma_{X}\sigma_{Y}\) formula in the denominator.
Here is a quick breakdown of where the \(\frac{1}{n}\) term goes:
We start with the formulas for the covariance and the product of the standard deviations \(\sigma_{X}\sigma_{Y}\):
$${cov(X,Y) = \frac{1}{n}\sum_{i=1}^{n}{(X_{i}-\mu_{X})(Y_{i}-\mu_{Y})}}$$
$${\sigma_{X}\sigma_{Y} = \left[\frac{1}{n}\sum_{i=1}^{n}{ (X_{i}-\mu_{X})^{2}}\right]^{\frac{1}{2}} \left[\frac{1}{n}\sum_{i=1}^{n}{ (Y_{i}-\mu_{Y})^{2}}\right]^{\frac{1}{2}} }$$
$${= \frac{1}{n}\left[\sum_{i=1}^{n}{(X_{i}-\mu_{X})^{2}}\sum_{i=1}^{n}{(Y_{i}-\mu_{Y})^{2}}\right]^{\frac{1}{2}}}$$
Using these formulas, we can see that the \(\frac{1}{n}\) terms in the numerator and denominator cancel, leaving the correlation formula.
$${\frac{cov(X,Y)}{\sigma_{X}\sigma_{Y}} = \frac{\frac{1}{n}\sum_{i=1}^{n}{(X_{i}-\mu_{X})(Y_{i}-\mu_{Y})}}{\frac{1}{n}\left[\sum_{i=1}^{n}{(X_{i}-\mu_{X})^{2}}\sum_{i=1}^{n}{(Y_{i}-\mu_{Y})^{2}}\right]^{\frac{1}{2}}}}$$
$${= \frac{\sum_{i=1}^{n}{(X_{i}-\mu_{X})(Y_{i}-\mu_{Y})}}{\left[\sum_{i=1}^{n}{(X_{i}-\mu_{X})^{2}}\sum_{i=1}^{n}{(Y_{i}-\mu_{Y})^{2}}\right]^{\frac{1}{2}}} = \rho_{XY}}$$
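The cancellation is easy to check numerically. The sketch below computes the ratio once with the \(\frac{1}{n}\) terms and once without them. As an extra observation that goes beyond the derivation above, the same cancellation happens if the sample convention \(\frac{1}{n-1}\) is used consistently in both the covariance and the standard deviations. The data is illustrative.

```python
import numpy as np

# Illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
n = len(x)
dx, dy = x - x.mean(), y - y.mean()

# Full formulas, including the 1/n terms
cov_xy = np.sum(dx * dy) / n
sigma_x = np.sqrt(np.sum(dx ** 2) / n)
sigma_y = np.sqrt(np.sum(dy ** 2) / n)
with_n = cov_xy / (sigma_x * sigma_y)

# Same ratio with the 1/n terms already cancelled
without_n = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Sample convention (1/(n-1)) used consistently on top and bottom
sample_version = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

print(with_n, without_n, sample_version)
```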
Back to the Denominator...
So... I lied earlier when I said that by dividing the covariance \(cov(X,Y)\) by \(\sigma_{X}\sigma_{Y}\), we turned the covariance measurement into units of the denominator. What we are really doing is cancelling out the units in the numerator using the units in the denominator, and simultaneously bounding the covariance measurement in the range \([-1,1]\), yielding the unitless correlation measurement.
So, the units are removed from the covariance to produce the unitless correlation measure, because the numerator and denominator both have \(units_{X} * units_{Y}\).
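One way to see the units cancel is to change the units of one variable and watch what happens. In this sketch the heights and weights are made-up illustrative numbers, not real measurements. Converting height from meters to centimeters multiplies the covariance by 100, but leaves the correlation untouched.

```python
import numpy as np

# Made-up illustrative measurements
height_m = np.array([1.60, 1.70, 1.75, 1.80, 1.90])
weight_kg = np.array([55.0, 68.0, 70.0, 80.0, 88.0])
height_cm = 100 * height_m  # same heights, different units

cov_m = np.cov(height_m, weight_kg, ddof=0)[0, 1]    # units: m * kg
cov_cm = np.cov(height_cm, weight_kg, ddof=0)[0, 1]  # units: cm * kg

rho_m = np.corrcoef(height_m, weight_kg)[0, 1]    # unitless
rho_cm = np.corrcoef(height_cm, weight_kg)[0, 1]  # unitless

print(cov_m, cov_cm)
print(rho_m, rho_cm)
```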
How does dividing by \(\sigma_{X}\sigma_{Y}\) bound the correlation measure in [-1, 1]?
We will look at the relationship between the \(X_{i}-\mu_{X}\) terms in the covariance calculation and the \(X_{i}-\mu_{X}\) terms from the \(\sigma_{X}\) calculation to get an intuitive understanding of how division by \(\sigma_{X}\sigma_{Y}\) bounds the correlation measure in \([-1, 1]\).
First, notice that the same deviations appear in both places: the numerator sums the products \((X_{i} - \mu_{X})(Y_{i} - \mu_{Y})\), while the denominator contains \(\left[\sum_{i=1}^{n}{(X_{i} - \mu_{X})^{2}}\right]^{\frac{1}{2}}\) and \(\left[\sum_{i=1}^{n}{(Y_{i} - \mu_{Y})^{2}}\right]^{\frac{1}{2}}\). This means that if the \(X_{i}-\mu_{X}\) terms are all scaled up or down, the numerator and the denominator grow or shrink at the same rate, so the ratio does not change. The numerator is largest in magnitude when the two sets of deviations line up perfectly, that is, when \(Y_{i}-\mu_{Y} = c(X_{i}-\mu_{X})\) for every point. In that case the numerator is \(c\sum_{i=1}^{n}{(X_{i} - \mu_{X})^{2}}\) and the denominator is \(|c|\sum_{i=1}^{n}{(X_{i} - \mu_{X})^{2}}\), so the ratio is exactly \(+1\) when \(c > 0\) and exactly \(-1\) when \(c < 0\). For any other arrangement of the points, some of the products partially cancel each other and the numerator is smaller in magnitude than the denominator (this is the Cauchy-Schwarz inequality), so the correlation calculation lands strictly between \(-1\) and \(+1\).
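The bound can also be checked numerically. The sketch below is only an experiment, not a proof: it draws random data at wildly different scales (the scales and the seed are arbitrary choices) and confirms that the ratio never leaves \([-1, 1]\), reaching the endpoints only when the points sit exactly on a line.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability


def pearson(x, y):
    """Ratio of the sum of products to the root of the product of sums of squares."""
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))


# However large or small the deviations get, the ratio stays inside [-1, 1]
rhos = []
for scale in [1e-3, 1.0, 1e3, 1e6]:
    x = rng.normal(size=200) * scale
    y = rng.normal(size=200) / scale + 0.5 * x
    rhos.append(pearson(x, y))
print(rhos)

# The endpoints are reached only when the points lie exactly on a line
x = rng.normal(size=200)
rho_line_up = pearson(x, 3.0 * x + 7.0)     # approximately +1
rho_line_down = pearson(x, -3.0 * x + 7.0)  # approximately -1
print(rho_line_up, rho_line_down)
```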
So why do we want to count in units of \(\sigma_{X}\sigma_{Y}\)?
When the covariance is divided by the product of the standard deviations, the resulting measure is restricted to the interval \([-1, 1]\). This allows nice things, like comparing two different correlations, where you could not really compare two covariances before standardizing them.
So if I told you that two variables have a covariance of 10, is that strong or weak? You can't really tell unless you know the standard deviations (we just showed that the covariance cannot be greater in magnitude than the product of the standard deviations), but if I told you that two variables have a Pearson correlation of 0.95, then you immediately know that there is a strong positive linear relationship between the two variables.
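As a quick worked example of why a covariance of 10 says little by itself, here are two hypothetical pairs of standard deviations that both go with a covariance of 10. The function name and the numbers are invented purely for illustration.

```python
def correlation_from_cov(cov_xy, sigma_x, sigma_y):
    """Standardize a covariance by the product of the standard deviations."""
    return cov_xy / (sigma_x * sigma_y)


# Same covariance, very different stories (hypothetical standard deviations)
strong = correlation_from_cov(10.0, 2.0, 5.25)  # 10 / 10.5, roughly 0.95
weak = correlation_from_cov(10.0, 20.0, 50.0)   # 10 / 1000 = 0.01

print(strong, weak)
```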
The relationship between \(\rho_{XY}\), \(\sigma_{X}\), and \(\sigma_{Y}\)
To understand correlation, it is also important to understand how correlation changes as the volatilities of the underlying variables \(X\) and \(Y\) change. So let's look at this with regression in mind. In the formulas below, \(\beta\) is the slope of the linear regression \(Y = \alpha + \beta X\).
In terms of linear regression, the correlation between \(X\) and \(Y\) can be expressed as
$${\rho_{XY} = \beta\frac{\sigma_{X}}{\sigma_{Y}}}$$
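This identity follows from the least-squares slope being \(\beta = cov(X,Y) / \sigma_{X}^{2}\), so multiplying by \(\sigma_{X} / \sigma_{Y}\) gives back \(cov(X,Y) / (\sigma_{X}\sigma_{Y})\). Here is a small numerical check, assuming an ordinary least-squares fit via `np.polyfit` and illustrative data.

```python
import numpy as np

# Illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 3.0, 2.0, 6.0, 8.0])

# Degree-1 least-squares fit; polyfit returns [slope, intercept]
beta, alpha = np.polyfit(x, y, 1)

rho_from_beta = beta * x.std() / y.std()  # beta * sigma_X / sigma_Y
rho_direct = np.corrcoef(x, y)[0, 1]

print(beta, rho_from_beta, rho_direct)
```

For this data the slope is 1.7 while the correlation is about 0.92, so the two are clearly different numbers that are linked only through the ratio of the standard deviations.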
Let's shock some of the components of this equation to some special cases (like their extremes) to see how they affect \(\rho_{XY}\).
Case 1: \(\beta = 0\)
When \(\beta = 0\), there is no linear relationship between \(X\) and \(Y\), so we expect there to be no correlation as well, i.e. \(\rho_{XY} = 0\). The scatterplot below shows this intuitive relationship: for any \(X\), \(Y\) is always expected to be the same value \(\mu_{Y}\). Note that in this case, we are not saying anything about the variance of \(Y\), but in the chart below we assume that it is positive.
[Chart: scatterplot of \(X\) and \(Y\) for Case 1, \(\beta = 0\)]
Case 2: \(\sigma_{X} = 0\)
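When \(X\) has no variance, it is always the same value \(\mu_{X}\), so every \(Y\) is paired with the only \(X\) value available. Since \(X\) never moves, there is nothing for \(Y\) to move with, and the formula above gives \(\rho_{XY} = 0\), meaning there is no correlation between the two variables. (Strictly speaking, the original definition divides by \(\sigma_{X} = 0\) here, so the correlation is formally undefined; reading it as "no correlation" is the intuitive interpretation.)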
[Chart: scatterplot of \(X\) and \(Y\) for Case 2, \(\sigma_{X} = 0\)]
Case 3: \(\sigma_{Y} = 0\)
This is very similar to Case 1, because when \(\sigma_{Y} = 0\), \(\beta\) must also be equal to 0. So as with Case 1, we also expect \(\rho_{XY}\) to be \(0\), because there is no relationship between \(X\) and \(Y\). (As in Case 2, the formula divides by zero here, so this is the intuitive reading rather than a strict calculation.) Note that unlike in Case 1, here we are making an assumption about \(\sigma_{Y}\). Basically, the only difference between Case 3 and Case 1 is that \(\sigma_{Y} > 0\) in Case 1, and \(\sigma_{Y} = 0\) in Case 3.
[Chart: scatterplot of \(X\) and \(Y\) for Case 3, \(\sigma_{Y} = 0\)]
Case 4: \(\sigma_{X} = \sigma_{Y}\)
Here we are left with \(\rho_{XY} = \beta\). Breaking it down, this means that the volatilities contribute nothing to the relationship between the two variables, because the variances are equal and the ratio \(\sigma_{X}/\sigma_{Y}\) is 1. So whatever relationship exists between the two must come from somewhere else, and the only other component of the linear relationship \(Y = \alpha + \beta X\) that we have left in the correlation formula is \(\beta\).
[Chart: scatterplot of \(X\) and \(Y\) for Case 4, \(\sigma_{X} = \sigma_{Y}\)]
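Case 4 can be checked with the equal-variance covariance matrix \([[5, 3], [3, 5]]\) used by the chart functions at the end of this article. The sketch first reads \(\beta\) and \(\rho_{XY}\) straight off the matrix, then repeats the comparison on a random sample; the sample size and seed are arbitrary choices.

```python
import numpy as np

# Equal variances (5 and 5) with covariance 3
cov = np.array([[5.0, 3.0], [3.0, 5.0]])

sigma_x = np.sqrt(cov[0, 0])
sigma_y = np.sqrt(cov[1, 1])
beta = cov[0, 1] / cov[0, 0]           # regression slope: cov / var(X)
rho = cov[0, 1] / (sigma_x * sigma_y)  # correlation: cov / (sigma_X * sigma_Y)
print(beta, rho)  # both approximately 0.6

# The same comparison on sampled data (values vary slightly with the seed)
rng = np.random.default_rng(0)
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=5000).T
sample_beta = np.polyfit(x, y, 1)[0]
sample_rho = np.corrcoef(x, y)[0, 1]
print(sample_beta, sample_rho)
```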
So generally, as \(\sigma_{X}\) grows for a given \(\sigma_{Y}\) and \(\beta\), the magnitude of \(\rho_{XY}\) grows as well (up to its limit of 1).
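A small sketch of that relationship, holding \(\beta\) and \(\sigma_{Y}\) fixed while \(\sigma_{X}\) grows. The function name and the chosen values are illustrative. The guard reflects the bound on correlation: a combination that would push the result past 1 in magnitude cannot come from any real data set.

```python
def rho_from_regression(beta, sigma_x, sigma_y):
    """Correlation implied by the slope and the two standard deviations."""
    rho = beta * sigma_x / sigma_y
    if abs(rho) > 1:
        # No data set can have this slope together with these volatilities
        raise ValueError("beta, sigma_x and sigma_y are not jointly attainable")
    return rho


beta, sigma_y = 0.5, 2.0  # illustrative fixed values
rhos = [rho_from_regression(beta, sigma_x, sigma_y)
        for sigma_x in (0.0, 1.0, 2.0, 4.0)]
print(rhos)  # [0.0, 0.25, 0.5, 1.0]
```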
I hope that this helps you to understand correlation as much as it helped me while writing this article. Below are the functions that I used to generate charts above if you are interested in building them for yourself.
```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats


def generate_corr_plot(cov, name=None):
    """Scatter 500 draws from a zero-mean bivariate normal with covariance `cov`."""
    sample_size = 500
    x, y = np.random.multivariate_normal([0, 0], cov, sample_size).T
    # linregress returns the slope (beta) and the Pearson correlation (r_value)
    beta, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    plt.title(
        r'$\rho = $' + str(round(r_value, 2)) +
        r'$, \beta = $' + str(round(beta, 2)) +
        r'$, N = $' + str(sample_size)
    )
    plt.scatter(x, y, color='#146054', marker='o')
    plt.xlabel('X')
    plt.ylabel('Y')
    #plt.axis('equal')
    if name is None:
        # No file name given: display the chart (assumed behavior)
        plt.show()
    else:
        plt.savefig(name)


def generate_no_corr_plot(name=None):
    # Case 1: unit variances, zero covariance
    generate_corr_plot([[1, 0], [0, 1]], name)


def generate_zero_x_variance_plot(name=None):
    # Case 2: X is constant, so the regression slope is undefined
    generate_corr_plot([[0, 0], [0, 1]], name)


def generate_zero_y_variance_plot(name=None):
    # Case 3: Y is constant
    generate_corr_plot([[1, 0], [0, 0]], name)


def generate_equal_variance_plot(name=None):
    # Case 4: equal variances, so rho equals beta
    generate_corr_plot([[5, 3], [3, 5]], name)


def generate_high_positive_corr_plot():
    generate_corr_plot([[80, 10], [10, 1]])


def generate_high_negative_corr_plot():
    generate_corr_plot([[80, -10], [-10, 1]])


# mean is a vector of means, ex. [0, 0]
# cov is a covariance matrix, ex. [[1, 0.5], [0.5, 1]]
def generate_correlation_plot(mean, cov, sample_size):
    x, y = np.random.multivariate_normal(mean, cov, sample_size).T
    plt.plot(x, y, 'x')
    plt.axis('equal')
```