Correlation, the sister tool of regression
Do your inputs (X's) correlate with your outputs (Y's)?
In a previous article we looked at regression: finding a relationship between your process inputs and outputs. In this article we explore correlation. Correlation is a measure of the strength of association between two continuous variables (e.g. temperature and pressure, or time of day and call rate). The strength of the relationship can be expressed as a simple coefficient and used to determine whether or not the association is significant, i.e. whether the two variables are genuinely related. This is essential to know if you are to rely on your regression equations to improve your process.
Correlation measures the degree of linearity between two variables and can be expressed by means of Pearson's correlation coefficient, 'ρ' (rho) for whole populations or 'r' for samples. The calculation of 'r' for two variables x and y is:
$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}$$
Where
r_xy = Pearson's correlation coefficient for samples
s_x, s_y = sample standard deviations of the data sets x and y
x̄, ȳ = means of the data sets x and y
x_i = each x data point, where 'i' starts at 1 and ends at n, i.e. all x points
y_i = each y data point, where 'i' starts at 1 and ends at n, i.e. all y points
It is not necessary to know this equation to use Pearson's correlation coefficient, since good software packages like Minitab and MS Excel have standard functions to calculate it (see later).
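For readers who prefer to see the arithmetic, the following is a minimal Python sketch of the sample calculation. It assumes the common form of the formula, in which the summed products of deviations from the means are divided by (n - 1) times the two sample standard deviations. The function name `pearson_r` and the temperature and pressure figures are made up purely for illustration.

```python
import statistics

def pearson_r(x, y):
    """Sample Pearson correlation coefficient for two equal-length data sets."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2 points)")
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    # statistics.stdev is the sample standard deviation (divides by n - 1)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)
    # Sum of the products of each point's deviation from its mean
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)

# Hypothetical readings, for illustration only
temperature = [20, 22, 25, 27, 30, 33]
pressure = [101, 104, 110, 113, 119, 126]
print(f"r = {pearson_r(temperature, pressure):.3f}")
```

A perfectly proportional pair such as [1, 2, 3, 4] and [2, 4, 6, 8] should return 1.0 (to within rounding), which is a quick sanity check on any implementation.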
The value of 'r' can range between -1.0 and +1.0. A value of +1.0 means a perfect positive correlation: when x goes up, y also goes up in proportion. A value of -1.0 means a perfect negative correlation: when x goes up, y goes down in proportion. When there is no correlation, i.e. r = 0.0, x has no apparent linear effect on y. See the charts below to visualise this.
Now what about values of 'r' between 0 and 1.0 (or -1.0)? If your data is scattered and not a perfect straight line fit between the two variables your data may look like the following:
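As a rough illustration of such in-between values, the Python sketch below takes a perfect straight line and adds increasing amounts of random scatter to it. The data are simulated, and the slope, intercept, noise levels and random seed are arbitrary assumptions chosen only to show the trend: the more scatter, the further 'r' falls below 1.0.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation via sums of squares (equivalent to the sample formula)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the run is repeatable
x = list(range(1, 201))
y_line = [2.0 * xi + 5.0 for xi in x]               # perfect straight line
y_some = [yi + rng.gauss(0, 40) for yi in y_line]   # moderate scatter
y_lots = [yi + rng.gauss(0, 250) for yi in y_line]  # heavy scatter

r_line = pearson_r(x, y_line)
r_some = pearson_r(x, y_some)
r_lots = pearson_r(x, y_lots)
print(f"no scatter:       r = {r_line:.3f}")
print(f"moderate scatter: r = {r_some:.3f}")
print(f"heavy scatter:    r = {r_lots:.3f}")
```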
Calculating Pearson's Correlation Coefficient 'r'
The formula for rxy above can be used to find the correlation coefficient. In Excel use the formula =CORREL(arrayX,arrayY) to calculate 'r' between two variables stored in arrays X and Y (i.e. two rows or two columns of numbers).
In Minitab this feature can be found at Stat>Basic Statistics>Correlation as shown below.
Select your X and Y columns as variables and Minitab will calculate Pearson's correlation coefficient and the level of significance for you.
Precautions to Observe
Correlation relies on there being a linear relationship between the two variables. Always plot your data to check this. Also, correlation does not always imply causation! Look out for lurking variables in your data.
Consider the following graph:
There appears to be a reasonable correlation between ice-lolly sales and the number of shark attacks. Does this really mean that if you buy an ice lolly you are more likely to be attacked by a shark? No, of course not! The two variables are independent of each other, but both correlate with warm weather. On nice warm days more people buy ice lollies and more people swim in the sea, making them more susceptible to shark attacks. The 'lurking' variable here is temperature. The operative word in this scenario is 'independent'. So make sure that your variables are genuinely dependent on one another, with sound engineering reasons for the relationship, before trying to draw any deductions from a calculated 'r'.
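The lurking-variable effect can be reproduced with simulated data. In the Python sketch below, ice-lolly sales and a shark-attack index are each generated from temperature plus independent random noise, with no link between them. All numbers (temperature range, slopes, noise levels, seed) are invented for illustration. The partial correlation used at the end is a technique not covered in this article; it is included only as one way to show that the apparent relationship disappears once temperature is accounted for.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation via sums of squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(42)  # fixed seed so the run is repeatable
# Hypothetical daily data: both outputs depend on temperature only
temperature = [rng.uniform(10, 35) for _ in range(500)]
lolly_sales = [20 * t + rng.gauss(0, 60) for t in temperature]
shark_index = [0.3 * t + rng.gauss(0, 1.0) for t in temperature]

r_sales_sharks = pearson_r(lolly_sales, shark_index)
r_sales_temp = pearson_r(lolly_sales, temperature)
r_sharks_temp = pearson_r(shark_index, temperature)
r_partial = partial_r(r_sales_sharks, r_sales_temp, r_sharks_temp)

print(f"sales vs sharks:               r = {r_sales_sharks:.2f}")
print(f"sales vs sharks, temp removed: r = {r_partial:.2f}")
```

The raw coefficient comes out high even though neither variable influences the other, while the temperature-adjusted value sits close to zero.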
Do not draw any conclusions based on the correlation coefficient until you have tested it for significance as follows. To determine significance you need two pieces of information: the number of degrees of freedom (equal to the sample size N minus 2) and the level of significance you wish to have.
A significance level of 0.05 means that you are willing to accept a 5 in 100 risk of saying there is a relationship, based on your sample, when in fact there is none in your population. The table below shows the minimum value of Pearson's correlation coefficient 'r' needed for significance at each sample size and significance level.
SIGNIFICANCE LEVEL (columns 0.10 to 0.00)

| N | df (N-2) | 0.10 | 0.05 | 0.02 | 0.01 | 0.00 |
|---|---|---|---|---|---|---|
| 3 | 1 | 0.988 | 0.997 | 1.000 | 1.000 | 1.000 |
| 5 | 3 | 0.805 | 0.878 | 0.934 | 0.959 | 1.000 |
| 12 | 10 | 0.497 | 0.576 | 0.658 | 0.708 | 0.866 |
| 17 | 15 | 0.412 | 0.482 | 0.558 | 0.606 | 0.728 |
| 22 | 20 | 0.360 | 0.423 | 0.492 | 0.537 | 0.640 |
| 32 | 30 | 0.296 | 0.349 | 0.409 | 0.449 | 0.530 |
| 52 | 50 | 0.231 | 0.273 | 0.322 | 0.354 | 0.416 |
| 82 | 80 | 0.183 | 0.217 | 0.256 | 0.283 | 0.331 |
| 102 | 100 | 0.164 | 0.195 | 0.230 | 0.254 | 0.297 |
From the above table you can see that the larger the sample size (and therefore the higher the number of degrees of freedom), the lower the value of 'r' required to show a significant correlation between your variables. Also, if you can accept a higher significance level, i.e. a greater risk of concluding that there is a relationship when there really isn't one, a lower value of 'r' is required.
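Critical values of this kind are normally derived from the t distribution through the standard relationship r = t / sqrt(t² + df). The Python sketch below assumes that relationship and uses three two-tailed t critical values for the 0.05 level, hard-coded from standard t tables rather than computed. The function names `critical_r` and `is_significant` are illustrative only.

```python
import math

# Two-tailed t critical values at the 0.05 significance level, keyed by
# degrees of freedom. Hard-coded from standard t tables (an assumption of
# this sketch, not something calculated here).
T_CRIT_005 = {10: 2.228, 20: 2.086, 30: 2.042}

def critical_r(t_crit, df):
    """Smallest |r| that is significant for the given t critical value and df."""
    return t_crit / math.sqrt(t_crit ** 2 + df)

def is_significant(r, df, t_crit):
    """True if the observed r meets or exceeds the critical value."""
    return abs(r) >= critical_r(t_crit, df)

for df, t_crit in T_CRIT_005.items():
    print(f"N = {df + 2}, df = {df}: critical r = {critical_r(t_crit, df):.3f}")

# Example: an observed r of 0.60 from N = 12 data points (df = 10)
print(is_significant(0.60, 10, T_CRIT_005[10]))
```

For df of 10, 20 and 30 this gives 0.576, 0.423 and 0.349, matching the 0.05 column of the table, so an observed 'r' of 0.60 from 12 data points would count as significant at that level.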
The following equation provides a very conservative rule-of-thumb estimate of the minimum value of 'r' required to show a correlation between two variables, where n is the number of data points:
$$r_{min} \approx \frac{3}{\sqrt{n}}$$
For example if you had 14 data points then, using the above equation, you would require a value of 'r' of around 0.8 to be quite sure of a relationship. For significance levels of 0.05 a value of 2 can be used in the above equation.
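The worked example can be checked with a few lines of Python. This sketch assumes the rule of thumb has the form k / sqrt(n), with k = 3 for the conservative estimate and k = 2 for a significance level of around 0.05; that form is an inference from the 14-point example rather than something stated outright. The function name `rule_of_thumb_r` is hypothetical, and the result is capped at 1.0 because 'r' cannot exceed that value.

```python
import math

def rule_of_thumb_r(n, k=3.0):
    """Approximate minimum |r| for n data points, assuming the form k / sqrt(n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # For very small n the estimate exceeds 1.0, so cap it at the maximum r
    return min(1.0, k / math.sqrt(n))

print(f"n = 14, k = 3: r >= {rule_of_thumb_r(14):.2f}")
for n in (12, 22, 102):
    print(f"n = {n}, k = 2: r >= {rule_of_thumb_r(n, k=2):.3f}")
```

With 14 data points and k = 3 this gives about 0.80, in line with the example above. With k = 2 and 12 data points it gives about 0.577, close to the tabulated 0.576 at the 0.05 level.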
In the above we have seen how correlation checks for a significant linear relationship between two variables, how to calculate Pearson's correlation coefficient, and how to use it to determine significance. With this information at hand you can move forward with confidence to further analyse your data and develop your transfer function in order to improve your process.
Correlation is taught at yellow, green and black belt levels. SigmaPro work with organisations to identify, select and run Lean Six Sigma projects and support and coach staff to get the best out of their processes. Call us now for help with data analysis techniques such as correlation and with improving your business-critical processes, or see the website www.sigmapro.co.uk for details of our Lean Six Sigma training.
Author Biography
Dr David Cowburn - Lean Six Sigma Specialist
David has 25 years of experience running companies up to Managing Director level and is experienced in utilising Lean Six Sigma in a wide variety of businesses, including manufacturing, process industry, service, and administrative.
In a people-based, hands-on style, he works and trains at all levels in an organisation, from Board to shop floor, to bring about rapid, measurable step changes in performance.
David was originally trained in the Toyota Production System and has since developed a high level blend of Lean and Six Sigma philosophies and tools through working with businesses all around the World.


