
Regression - the key to the transfer function

Finding an equation to fit your data

Regression is at the heart of Lean Six Sigma as it enables us to explore the existence of relationships between variables with the aid of data, determine which input variables have the biggest impact on the output response, build a transfer function describing the nature of the relationship between the inputs and outputs of a process, and make predictions about future process performance.

This article shows how to fit an equation to continuous data and highlights some dos and don'ts of regression.

Being able to describe how your process works can be really useful as a key step to being able to improve it. Knowing what settings to use and what the response will be is a great position to be in as some of the time we don't have a clue how a process really works, and most of the time we are not really sure!

Regression is the technique of fitting a mathematical equation (called a transfer function) to data in order to be able to predict an output from a given input or inputs. For example: is there a relationship between shelf space and sales? What are the best settings for wheel speed, oil viscosity and wheel hardness to maximise grinding output? Is there a relationship between pay and time in the job?

So how do we go about fitting an equation to data? The simplest form of relationship is a straight line or linear relationship between a single input variable X and the output variable Y.

The equation of a straight line is Y = a + bX, where 'a' is the value of Y at which the line crosses the Y axis (the intercept) and 'b' is the slope of the line.

The simplest way to find this equation is to plot your data on an XY graph with Y on the vertical axis and X on the horizontal axis. If there is a linear relationship between X and Y, the data points will lie roughly along a straight line. The more closely the points follow that line, the better the fit. Draw a line of best fit by eye through the data points and extend it back to cross the Y axis; the value of Y at that crossing point is 'a'. The slope 'b' is the change in Y divided by the corresponding change in X: pick two points on the line, work out how much Y increases or decreases between them, and divide by how much X changes.
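The arithmetic for reading 'a' and 'b' off a hand-drawn line can be sketched in a few lines of Python. This is only an illustration: the function name line_from_two_points and the two points used are made up for the example and are not taken from any real data set.

```python
def line_from_two_points(p1, p2):
    """Return (a, b) for the line Y = a + bX passing through two (X, Y) points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("The two points need different X values")
    b = (y2 - y1) / (x2 - x1)  # slope: change in Y per unit change in X
    a = y1 - b * x1            # intercept: value of Y where X = 0
    return a, b


# Two hypothetical points read off a hand-drawn line of best fit
a, b = line_from_two_points((2.0, 7.0), (12.0, 17.0))
print(f"Y = {a:.2f} + {b:.2f}X")
```

Choosing two points that are far apart on the line tends to reduce the effect of small reading errors on the slope.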

Let's have a look at an example. Suppose that a company pays its employees according to a pay scale with bands of £4.77, £6.00, £7.00, £9.00, £11.00, £15.00, £18.00, £20.00 and £25.00 per hour, where £4.77 represents the national minimum wage. With age and experience an employee can expect to move up the pay scale. However, not all employees progress at the same pace. In the graph shown in figure 1 below, all employees' pay rates (vertical Y axis) are plotted against their years of experience (horizontal X axis) in the job.

Fig. 1 Linear Regression - Pay Rate vs Experience

Notice that the data looks to roughly fit a straight line. Some employees are on the same rate of pay but have a different number of years' experience. If we want to find out the average progression in pay for an employee, we can use linear (straight line) regression to find the relationship between experience and pay rate.

The straight line in fig. 1 has been drawn in by hand to best fit most of the data. Using the equation for a straight line, Y = a + bX, 'a' is where the line crosses the Y axis and can be seen to be approximately 5, and 'b' is the slope of the line and is approximately 1. So the equation which fits this data is Y = 5 + X. To check that this is correct, take an employee who has worked for 10 years: he or she would expect to receive a pay rate of 5 + 10 = £15 per hour. By reading the rate per hour from the graph where the 10 years' experience point meets the sloping line, we can see that the rate is £15. Predicting a value within the range of the data in this way is called 'interpolation'.

Now just a word of caution: using the formula to predict pay outside the range of the data would be dangerous. For example, after 30 years' service we might predict that an employee would earn £35 per hour. However, since the company's top level of pay is £25 per hour, this would be incorrect. You could also use the equation with negative values of X, but this would clearly be nonsense in this case since there is no such thing as negative years' experience.

Using the equation beyond the range of the data is called 'extrapolation' and should only be done when you are confident that the equation will work in this range. The purists would say not to use extrapolation at all!
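One way to keep yourself honest about interpolation versus extrapolation is to build the data range into the prediction itself. The sketch below uses the hand-drawn equation Y = 5 + X from the pay example. The function name predict_pay and the range limits of 0 to 20 years are assumptions made for illustration (20 years is simply where Y = 5 + X reaches the £25 top band), not figures read from the actual data.

```python
def predict_pay(years, a=5.0, b=1.0, x_min=0.0, x_max=20.0):
    """Predict hourly pay from Y = a + bX, refusing to extrapolate.

    x_min and x_max are the assumed limits of the data the line was fitted to.
    """
    if not x_min <= years <= x_max:
        raise ValueError(
            f"{years} years is outside the fitted range {x_min} to {x_max}; "
            "using the equation here would be extrapolation"
        )
    return a + b * years


print(predict_pay(10))  # interpolation: inside the assumed data range

try:
    predict_pay(30)     # extrapolation: outside the assumed data range
except ValueError as err:
    print("Refused:", err)
```

Raising an error rather than returning a number is a deliberate choice here: it forces whoever uses the equation to decide consciously whether they trust it beyond the data.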

In practice, drawing a straight line through data by hand and 'eye' is reasonably accurate if the data is not too spread out, but fortunately we have some easy-to-use tools to help us do this more accurately even when the data is more scattered.

Microsoft Excel and Minitab can both draw scatter plots (X vs Y graphs), fit lines to the data and provide the regression equation. The graph in fig. 2 below is drawn using Minitab and the Fitted Line Plot function. The equation is given as Rate per Hour (Y) = 4.718 + 1.153 Experience Years (X), which can be seen to be close to the hand-drawn solution in fig. 1 above. The R-Sq figure of 95.6% given in the data box shows the percentage of the variation in Y explained by the equation and represents how well the fitted line represents the data. The closer to 100%, the better the fit.

Fig. 2 Linear Regression - Pay Rate vs Experience using Minitab

To fit a linear regression to data in Excel, first plot the data using the 'Scatter' graph function and then, depending on the version of Excel, add a Trend Line. Select the linear type (default) and check the options 'Display Equation on chart' and 'Display R-squared value on chart', and Excel will fit the line and provide the equation and the R-Sq value as in Minitab.

Just a couple of words of warning about fitting equations to data. If the data is very scattered it may not be easy to find the best line to draw through the data, and there may appear to be several alternative lines that could fit. In this case take note of the R-Sq figure: if it is less than 70%, the equation of the line may not be a reliable representation of the data. Ideally an R-Sq value greater than 80% is required to feel confident that the regression analysis is representative.
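For readers who would like to see the calculation that sits behind a fitted line and its R-Sq value, here is a minimal sketch of ordinary least squares in plain Python. It is not a reproduction of how Minitab or Excel work internally, and the experience and pay figures are hypothetical values invented for the example, not the data behind the figures in this article.

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = a + bX. Returns (a, b, r_sq) with r_sq between 0 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    # R-Sq: share of the variation in Y that the line explains
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_sq = 1.0 - ss_res / ss_tot
    return a, b, r_sq


# Hypothetical years of experience and hourly pay rates
years = [1, 3, 5, 8, 10, 12, 15, 18]
rates = [6.00, 7.00, 11.00, 15.00, 15.00, 18.00, 20.00, 25.00]

a, b, r_sq = fit_line(years, rates)
print(f"Rate per hour = {a:.3f} + {b:.3f} x Experience (years)")
print(f"R-Sq = {100 * r_sq:.1f}%")
```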
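The rule of thumb above can be written down as a simple check. This is only a sketch: the function name judge_fit and the label 'borderline' for values between 70% and 80% are assumptions added for the example, since the guidance above only names the two outer cases.

```python
def judge_fit(r_sq_percent):
    """Classify an R-Sq value (as a percentage) using the 70% / 80% rule of thumb."""
    if not 0.0 <= r_sq_percent <= 100.0:
        raise ValueError("R-Sq must be between 0 and 100 percent")
    if r_sq_percent < 70.0:
        return "unreliable"   # the line may not represent the data
    if r_sq_percent > 80.0:
        return "confident"    # the regression is likely to be representative
    return "borderline"       # assumed label for the 70% to 80% gap


print(judge_fit(95.6))  # the R-Sq figure quoted for the Minitab fitted line plot
```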

Secondly, be aware of the data you have plotted and the relationships shown. Two variables may appear to be related without one causing the other. For example, as the sale of ice creams increases, so do shark attacks! This does not mean that selling ice creams causes shark attacks. In fact there is a third variable, warm weather, which drives both events. A hidden common cause of this kind is often called a 'lurking variable'.

So now that we have a method of carrying out linear regression, what about data that does not follow a straight line? A linear regression is called a 'first order' regression because in the equation Y = a + bX the power of X is 1. For non-straight line data, higher powers of X are required to describe the data. A second order equation would be Y = a + bX + cX². An example of a second order regression can be seen in fig. 3 below.
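The ice cream and shark attack effect is easy to reproduce with made-up numbers. In the sketch below, both series are generated from temperature plus random noise, and neither is calculated from the other, yet they still come out strongly correlated. All of the figures, including the multipliers and noise levels, are synthetic assumptions chosen purely for illustration.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)  # fixed seed so the example is repeatable

# Synthetic daily temperatures, and two outcomes that each depend only on temperature
temperature = [rng.uniform(10.0, 30.0) for _ in range(200)]
ice_cream_sales = [10.0 * t + rng.gauss(0.0, 5.0) for t in temperature]
shark_attacks = [0.5 * t + rng.gauss(0.0, 1.0) for t in temperature]

r = pearson_r(ice_cream_sales, shark_attacks)
print(f"Correlation between ice cream sales and shark attacks: {r:.2f}")
```

A regression of shark attacks on ice cream sales would fit this data well, which is exactly why a good fit on its own should not be read as proof of cause and effect.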

Fig. 3 Second Order Regression - Output vs Input using Minitab

A third order equation, Y = a + bX + cX² + dX³, would be required to describe data with a peak and a trough, i.e. two turns in the data, and a fourth order equation is required for data with three turns, and so on. Excel can also cope with higher order regression plots and other forms of relationship.

Finally, let us consider the case where there is more than one input variable (X), e.g. Y = a + b₁X₁ + b₂X₂. This is called multiple regression. Here Minitab comes into its own: it can handle a large number of input variables and tell you which variables are significant, plus a range of other useful insights.
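As a rough sketch of what a higher order fit involves, the plain Python below fits a polynomial of any chosen order by least squares, using the so-called normal equations solved by Gaussian elimination. This is one textbook way of doing it and is not claimed to be how Excel or Minitab do it internally. The function names and the input/output figures are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Bring the row with the largest entry in this column into the pivot position
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("Not enough distinct data points to fit this order")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back-substitution
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_polynomial(xs, ys, order):
    """Least-squares coefficients [a, b, c, ...] for Y = a + bX + cX^2 + ..."""
    size = order + 1
    # Normal equations: sums of X^(i+j) on the left, sums of Y * X^i on the right
    matrix = [[sum(x ** (i + j) for x in xs) for j in range(size)] for i in range(size)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    return solve_linear_system(matrix, rhs)


# Hypothetical curved data: output rises, peaks and falls away (one turn)
inputs = [0, 1, 2, 3, 4, 5, 6]
outputs = [1.0, 5.8, 9.1, 10.2, 8.9, 6.1, 0.9]

a, b, c = fit_polynomial(inputs, outputs, order=2)
print(f"Y = {a:.3f} + {b:.3f}X + {c:.3f}X^2")
```

For high orders or large X values the normal equations can become numerically unstable, so a statistics package is the safer choice for real analysis.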
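For the two-input case, the least-squares coefficients can be worked out directly, which gives a feel for what a package does on a larger scale. The sketch below is a plain Python illustration only: it does not report which variables are significant, the function name fit_two_inputs is made up, and the data is invented so that the true relationship is known in advance.

```python
def fit_two_inputs(x1s, x2s, ys):
    """Least-squares fit of Y = a + b1*X1 + b2*X2. Returns (a, b1, b2)."""
    n = len(ys)
    m1, m2, my = sum(x1s) / n, sum(x2s) / n, sum(ys) / n
    # Sums of squares and cross-products about the means
    s11 = sum((u - m1) ** 2 for u in x1s)
    s22 = sum((v - m2) ** 2 for v in x2s)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1s, x2s))
    s1y = sum((u - m1) * (y - my) for u, y in zip(x1s, ys))
    s2y = sum((v - m2) * (y - my) for v, y in zip(x2s, ys))
    det = s11 * s22 - s12 ** 2
    if abs(det) < 1e-12:
        raise ValueError("X1 and X2 move together exactly; their effects cannot be separated")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2


# Hypothetical data generated from Y = 2 + 3*X1 - 1*X2, so the answer is known
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [2 + 3 * u - 1 * v for u, v in zip(x1, x2)]

a, b1, b2 = fit_two_inputs(x1, x2, y)
print(f"Y = {a:.2f} + {b1:.2f} X1 + {b2:.2f} X2")
```

The error raised when the two inputs move in lockstep reflects a real limitation of multiple regression: if two inputs always change together, no amount of calculation can tell you which one is responsible for the change in the output.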

An example of where multiple regression was useful recently was in a moulding company that made polystyrene packaging in different shapes and sizes. The problem was that they did not know precisely how long each moulding needed in the oven before it was dry, and consequently set the oven time longer than necessary to ensure all products were dry. The moulding process involves injecting steam into the material.

By measuring the weight of the parts before and after drying and recording the time in the oven, the company used multiple regression to derive a transfer function describing the relationship between the surface area, thickness and density of the mouldings and the amount of water lost in the drying process. It could therefore predict how long each part would take to dry. Oven drying times were reduced by up to 20%, saving on energy and increasing throughput.

Regression is a useful tool in developing a transfer function to describe how your process is behaving and therefore how you can influence and control it. Regression can be employed to separate important from unimportant input variables, quantify the individual terms and predict an output.

Linear regression is taught at green belt level and multiple regression at black belt. SigmaPro work with organisations to identify, select and run Lean Six Sigma projects and support and coach staff to get the best out of their processes. Call us now for help in data analysis techniques such as regression and improving your business-critical processes.

Author Biography

Dr David Cowburn

Lean Six Sigma Specialist

David has 25 years of running companies to Managing Director level and is experienced in utilising Lean Six Sigma in a wide variety of businesses including manufacturing, process industry, service and administrative.

In a people-based, hands-on style, he works and trains at all levels in an organisation from Board to shop floor to bring about rapid, measurable step changes in performance.

David was originally trained in the Toyota Production System whilst at GKN and has since developed a high-level blend of Lean and Six Sigma philosophies and tools through working with businesses all over the world.
