CHAPTER-14: INTRODUCTION TO REGRESSION ANALYSIS
A bivariate data set consists of pairs of observations, where each pair records the numerical values of two variables. Put differently, a bivariate distribution is used to find or analyze the relationship between the two variables under study. In any scientific study, a basic interest of the researcher is to find out the possible co-movement of two or more variables. Two important statistical tools are used for this purpose: correlation analysis and regression analysis. Correlation analysis measures the strength of association between two or more variables, whereas regression analysis examines the nature or direction of the association. Regression analysis classifies the variables into two classes, dependent variables and independent variables, and tries to estimate the average value of one variable (the dependent variable) from the given values of the other variable(s) (the independent variables). Correlation analysis, by contrast, makes no such distinction; the researcher's focus is on measuring the strength of the relationship between the variables. In other words, correlation analysis measures the depth of the relationship between two variables, whereas regression analysis measures its width. Furthermore, in regression analysis the dependent variable is considered random (stochastic) and the independent variable(s) are assumed to be fixed (non-random), whereas in correlation analysis all variables are treated symmetrically and are all considered random.
INTRODUCTION TO CORRELATION ANALYSIS
The magnitude of association or relationship between two variables can be measured by calculating their correlation. Correlation analysis can be defined as a quantitative measure of the strength of the relationship that exists between two variables. Four types of relationship may exist between two variables. They are:
- Positive correlation
- Negative correlation
- Linear correlation and
- Non-linear correlation.
1. Positive correlation:
Two variables are said to be positively correlated when a movement in one variable is accompanied by a movement in the other variable in the same direction. In other words, there is a direct relationship between the two variables. Examples include the height of human beings and their weight, a person's income and expenditure, and the price of a commodity and its supply. In all such cases an increase (or decrease) in the value of one variable goes with an increase (or decrease) in the value of the other. A positive relationship between two variables can also be shown graphically. If the data are plotted against the two axes of a graph paper, one finds an upward trend rising from the lower left-hand corner of the graph paper to the upper right-hand corner. The supply curve of economic theory is a familiar example.
2. Negative correlation:
On the other hand, the correlation between two variables is said to be negative when a movement in one variable is accompanied by a movement in the other variable in the opposite direction. Here there is an inverse relationship between the two variables. Examples include the volume and pressure of a perfect gas, income and the share of expenditure on food items (Engel's law), and the price and quantity demanded of necessary goods. In all such cases an increase (or decrease) in the value of one variable goes with a corresponding decrease (or increase) in the value of the other. With negative correlation, the plotted points show a downward trend from the upper left-hand corner of the graph paper towards the x-axis. The demand curve of economic theory is a familiar example.
3. Linear correlation:
The correlation between two variables is said to be linear when the points, plotted on a graph, lie on a straight line. For two variables X and Y, the equation of a straight line can be written as $Y = a + bX$, where a and b are real numbers. With constant values of a and b, plotting the resulting pairs of X and Y values on a graph sheet gives a straight line. A linear relationship between two variables means that a one-unit change in one variable (say X) changes the other variable (say Y) by a constant amount, so that the ratio of the change in Y to the change in X is fixed.
However, such exact relationships rarely exist in management and the social science disciplines.
4. Non-linear correlation:
A relationship between two variables is said to be non-linear if a unit change in one variable does not change the other variable by a constant amount. In other words, if X is changed, the corresponding values of Y do not change in the same proportion. Hence, when the data on X and Y are plotted on a graph paper, one gets a curve rather than a straight line. One example of such a relationship is the second-degree polynomial $Y = a + bX + cX^2$, where a, b and c are real constants and c is not zero.
There can also be instances where no relationship exists between two variables, i.e., no correlation can be found between them. Such a case is called 'no correlation'. For instance, suppose one wants to compare the growth of population in India with road accidents in the United States. No logical relationship exists between the two. Hence the correlation between such variables is said to be nil.
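The difference between a linear and a non-linear relationship described above can be checked numerically. The following is a minimal Python sketch; the constants a = 2, b = 3 and c = 1 and the X values are assumptions chosen only for illustration. It shows that equal steps in X produce equal steps in Y in the linear case but not in the quadratic case.

```python
def first_differences(values):
    """Return the change between each pair of consecutive values."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Hypothetical constants and X values, chosen only for illustration.
a, b, c = 2, 3, 1
xs = [0, 1, 2, 3, 4]

linear = [a + b * x for x in xs]                   # Y = a + bX
quadratic = [a + b * x + c * x ** 2 for x in xs]   # Y = a + bX + cX^2

linear_steps = first_differences(linear)
quadratic_steps = first_differences(quadratic)

print(linear_steps)      # [3, 3, 3, 3]
print(quadratic_steps)   # [4, 6, 8, 10]
```

The constant step of 3 in the linear series is the coefficient b; in the quadratic series the step itself changes with X, which is why the plotted points bend away from a straight line.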
METHODS OF MEASURING CORRELATION:
Correlation between two variables can be measured in the following ways.
- The Graphical method (through Scatter Diagram)
- Karl Pearson's coefficient of correlation
1. The Graphical Method:
Correlation can be shown graphically by using scatter diagrams. A scatter diagram reveals two useful pieces of information. First, one can observe the pattern between the two variables, which indicates whether or not there is some association between them. Second, if an association is found, the nature of the relationship (whether the two variables are linearly or non-linearly related) can easily be identified.
2. Karl Pearson's coefficient of correlation:
Karl Pearson's coefficient of correlation (developed in 1896) measures the linear relationship between two variables under study. Because the relationship is taken to be linear, the two variables are assumed to change in a fixed proportion. The measure expresses the degree of relationship as a real number that is independent of the units in which the variables are expressed, and it also indicates the direction of the correlation.
It is known that ____ serves as an absolute measure of the correlation between two variables. Being an absolute measure, like the absolute measures of dispersion, it depends on two things: (i) the number of observations, denoted by 'n', and (ii) the units of measurement of the variables under study. To explain this, assume a data set consisting of two variables X and Y, with the paired observations denoted (Xi, Yi), where i = 1, 2, 3, ..., n.
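As an illustration, the coefficient can be computed from the paired observations (Xi, Yi) using the standard product-moment form, $r_{XY} = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}$. The Python sketch below assumes this form; the function name `pearson_r` and the height and weight figures are invented for illustration and are not taken from the chapter.

```python
import math


def pearson_r(xs, ys):
    """Pearson's coefficient of correlation from deviations about the means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    sum_xy = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    sum_xx = sum(dx * dx for dx in dev_x)
    sum_yy = sum(dy * dy for dy in dev_y)
    # Undefined (division by zero) if either series has no variation.
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical data, invented for illustration only.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 75, 86]

r = pearson_r(heights, weights)
print(round(r, 4))
```

Because the numerator and the denominator are measured in the same product of units, the units cancel, which is why the result is a pure number.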
Assumed mean method:
The assumed mean method for calculating the coefficient of correlation can be used when the data set is large and it would be difficult for the researcher to calculate the means of the series by the direct method. In such a case, a value from each series is assumed as its mean, and the deviations of the actual observations are taken from that assumed mean. That is, if X and Y are the two series of observations, then $d_x = X - L$ and $d_y = Y - K$ are the deviations of X and Y respectively, where L and K are the assumed means of series X and Y. Karl Pearson's coefficient of correlation is then calculated from these deviations instead of from deviations about the actual means.
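The assumed mean calculation can be sketched in Python as follows. The sketch uses the usual textbook form $r = \dfrac{n\sum d_x d_y - \sum d_x \sum d_y}{\sqrt{[n\sum d_x^2 - (\sum d_x)^2][n\sum d_y^2 - (\sum d_y)^2]}}$, which is an assumption here because the chapter's own formula is not shown; the symbols $d_x$ and $d_y$, the function name and the data are likewise illustrative.

```python
import math


def pearson_r_assumed_mean(xs, ys, assumed_x, assumed_y):
    """Pearson's r from deviations about assumed means L (for X) and K (for Y)."""
    n = len(xs)
    dx = [x - assumed_x for x in xs]   # d_x = X - L
    dy = [y - assumed_y for y in ys]   # d_y = Y - K
    sum_dx, sum_dy = sum(dx), sum(dy)
    sum_dxdy = sum(u * v for u, v in zip(dx, dy))
    sum_dx2 = sum(u * u for u in dx)
    sum_dy2 = sum(v * v for v in dy)
    numerator = n * sum_dxdy - sum_dx * sum_dy
    denominator = math.sqrt(
        (n * sum_dx2 - sum_dx ** 2) * (n * sum_dy2 - sum_dy ** 2)
    )
    return numerator / denominator


# Hypothetical data, invented for illustration only.
xs = [150, 160, 170, 180, 190]
ys = [52, 58, 69, 75, 86]

r_first = pearson_r_assumed_mean(xs, ys, assumed_x=160, assumed_y=60)
r_second = pearson_r_assumed_mean(xs, ys, assumed_x=180, assumed_y=75)
print(round(r_first, 4), round(r_second, 4))
```

Both calls give the same coefficient, which suggests that the choice of assumed mean is only a matter of computational convenience.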
The methods described above cannot be used directly to calculate the correlation between two variables when the series of observations are in grouped form, i.e., given as a frequency distribution. In such a case, the formula for Karl Pearson's coefficient of correlation must weight the deviations by their corresponding frequencies.
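For reference, the form usually given in textbooks for a bivariate frequency table is sketched below. The notation is an assumption, since the chapter's own formula is not shown: $f_{ij}$ is the frequency of the cell in row i and column j, $f_{i\cdot}$ and $f_{\cdot j}$ are the marginal frequencies, $N$ is the total frequency, and $d_{x_i}$ and $d_{y_j}$ are the deviations of the class mid-values from the assumed means.

```latex
r_{XY} =
\frac{N \sum_i \sum_j f_{ij}\, d_{x_i} d_{y_j}
      - \Bigl(\sum_i f_{i\cdot}\, d_{x_i}\Bigr)\Bigl(\sum_j f_{\cdot j}\, d_{y_j}\Bigr)}
     {\sqrt{\Bigl[N \sum_i f_{i\cdot}\, d_{x_i}^2 - \Bigl(\sum_i f_{i\cdot}\, d_{x_i}\Bigr)^2\Bigr]
            \Bigl[N \sum_j f_{\cdot j}\, d_{y_j}^2 - \Bigl(\sum_j f_{\cdot j}\, d_{y_j}\Bigr)^2\Bigr]}}
```

When every cell frequency is 1 for an observed pair and 0 otherwise, this reduces to the ungrouped assumed mean form.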
Assumptions of coefficient of correlation:
Karl Pearson's coefficient of correlation rests on certain assumptions. The following are the assumptions on which the validity of the coefficient depends.
1. The value of the coefficient of correlation lies between -1 (minus one) and +1 (plus one).
When the two variables considered in a study are in no way related to each other, the value of the coefficient of correlation is zero (0). On the other hand, if the relationship between the two variables is perfect, so that all points on the scatter diagram fall on a straight line, the value of the correlation coefficient (rXY) reaches +1 or -1, depending on the direction of the line. It is positive when the slope of the line is positive and negative when the slope of the line is negative. Put differently, if X and Y are directly related to each other, the coefficient of correlation is positive; if there is an inverse relationship between them, the coefficient is negative.
2. The value of the coefficient of correlation is independent of the change of origin and change of scale of measurement.
To verify this property, change the origin and scale of both variables. After the change of origin and scale of X and Y, the new variables are $U = (X - A)/p$ and $V = (Y - B)/l$, where the constants A and B measure the change in origin and the constants p and l (both taken as positive) measure the change in scale. Simplifying shows that $r_{UV} = r_{XY}$.
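This invariance can be checked numerically. In the Python sketch below, the data, the origins (A = 150, B = 50) and the positive scale factors (p = 10, l = 2) are all assumptions made for illustration, and the transformed series are named `us` and `vs` for convenience.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from deviations about the actual means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data and constants, invented for illustration only.
xs = [150, 160, 170, 180, 190]
ys = [52, 58, 69, 75, 86]
origin_x, scale_x = 150, 10   # A and p in the text
origin_y, scale_y = 50, 2     # B and l in the text

us = [(x - origin_x) / scale_x for x in xs]   # change of origin and scale for X
vs = [(y - origin_y) / scale_y for y in ys]   # change of origin and scale for Y

r_xy = pearson_r(xs, ys)
r_uv = pearson_r(us, vs)
print(round(r_xy, 6), round(r_uv, 6))
```

The two printed values agree. If one of the scale factors were negative, the magnitude would be unchanged but the sign of the coefficient would flip, which is why positive scale factors are assumed here.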
RANK CORRELATION COEFFICIENT:
In research, the nature of the data cannot always be predicted in advance. The information collected from respondents may be expressed in numbers, in qualitative terms, or quite often in the form of ranks. The greatest disadvantage of Karl Pearson's coefficient of correlation is that, as discussed above, it works best only when the data are quantitative, i.e., expressed in numbers. When the data are qualitative (for example, honest, good, best, average, excellent, efficient) and/or are expressed only in ranks, one has to apply Spearman's method of rank differences to find the degree of correlation. There are three different situations in which Spearman's rank correlation coefficient is applied.
- When ranks of both the variables are given
- When ranks of both the variables are not given and
- When ranks between two or more observations in a series are equal
Each of the cases listed above is handled with its own procedure or formula.
a. When ranks of both the variables are given
This is the simplest case of calculating correlation between two series. Here the ranks of both series are given and no two observations in a series are awarded the same rank. The formula is
$$R_{XY} = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$
where RXY denotes the coefficient of rank correlation between the two series of observations X and Y, d is the difference between the two ranks of an observation, and n is the number of observations in the series.
To calculate RXY, arrange the given observations in sequence and then calculate the difference in ranks, d, for each observation.
The result shows a positive correlation between the judgments of the two judges. However, since the value is not very close to 1, the relationship between the ranks assigned by the two judges can be described as moderate.
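The rank-difference calculation for this case can be sketched in Python as follows, assuming the usual Spearman form $R = 1 - 6\sum d^2 / (n(n^2 - 1))$. The function name and the two judges' ranks are hypothetical and are not the data of the chapter's own example.

```python
def spearman_from_ranks(ranks_x, ranks_y):
    """Spearman's rank correlation when ranks are given and there are no ties."""
    if len(ranks_x) != len(ranks_y):
        raise ValueError("both series must have the same number of observations")
    n = len(ranks_x)
    # d is the difference between the two ranks of the same observation.
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


# Hypothetical ranks awarded by two judges to the same six entries.
judge_a = [1, 2, 3, 4, 5, 6]
judge_b = [3, 1, 2, 6, 4, 5]

r_rank = spearman_from_ranks(judge_a, judge_b)
print(round(r_rank, 4))   # 0.6571
```

Here the squared differences sum to 12, so the coefficient is 1 - 72/210, a positive but far from perfect agreement between the two sets of ranks.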
b. When ranks of both the variables are not given
There may be situations where the ranks of the two series are not given. In such cases, each observation in each series must first be ranked. The direction of ranking is the researcher's choice: either the highest value or the lowest value may be ranked 1 (one), provided the same convention is used for both series. After the observations have been ranked, d and d² are calculated and the formula given above is applied. The following example makes the concept clear.
The result shows a positive degree of correlation between the grade point average and total marks obtained by the students.
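The ranking step described above can be sketched in Python as follows. The grade point averages and total marks are invented for illustration (they are not the chapter's example data), the highest value is ranked 1 in both series, and the helper assumes there are no tied values.

```python
def rank_descending(values):
    """Rank values so that the largest gets rank 1 (assumes no tied values)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_from_ranks(ranks_x, ranks_y):
    """Spearman's rank correlation from two series of untied ranks."""
    n = len(ranks_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


# Hypothetical raw scores for five students.
gpa = [3.2, 3.8, 2.9, 3.5, 3.0]
marks = [410, 480, 400, 430, 390]

gpa_ranks = rank_descending(gpa)       # [3, 1, 5, 2, 4]
marks_ranks = rank_descending(marks)   # [3, 1, 4, 2, 5]

r_rank = spearman_from_ranks(gpa_ranks, marks_ranks)
print(gpa_ranks, marks_ranks, round(r_rank, 4))
```

Ranking the lowest value as 1 in both series instead would leave every rank difference d unchanged in size, so the coefficient would be the same.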
c. When ranks between two or more observations in a series are equal
In empirical analysis, two or more observations may be assigned the same rank. In that case, each tied observation is given the average of the ranks that these observations would have received had they differed slightly from each other. For example, suppose two observations are ranked equal at 6th place. Ranked separately, one would get rank 6 and the other rank 7, so each is given the rank (6 + 7)/2 = 13/2 = 6.5. Similarly, more than two observations in a series may be ranked equal, and the same averaging technique is applied to obtain their ranks. The formula for the rank correlation coefficient in the case of equal ranks differs slightly from the formula given above. It is
$$R_{XY} = 1 - \frac{6\left[\sum d^2 + \sum_i \frac{1}{12}\left(m_i^3 - m_i\right)\right]}{n(n^2 - 1)}$$
where d is the difference between the ranks in the two series and mi (i = 1, 2, 3, ...) denotes the number of observations that share the i-th repeated rank in a series. The example below makes the concept clearer.
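The averaging of tied ranks and the correction term can be sketched in Python as follows. The sketch assumes the usual tie-corrected form, in which $(m^3 - m)/12$ is added to $\sum d^2$ for every group of m tied observations in either series; the function names and the two series of scores are hypothetical.

```python
from collections import Counter


def average_ranks(values):
    """Rank values (largest = rank 1); tied values share the average of their positions."""
    ordered = sorted(values, reverse=True)
    positions = {}
    for position, value in enumerate(ordered, start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every group of m tied observations."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)


def spearman_with_ties(xs, ys):
    """Spearman's rank correlation with the correction for equal ranks."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    correction = tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))


# Hypothetical scores for ten candidates; 64 and 75 repeat in X, 68 repeats in Y.
scores_x = [68, 64, 75, 50, 64, 80, 75, 40, 55, 64]
scores_y = [62, 58, 68, 45, 81, 60, 68, 48, 50, 70]

r_tied = spearman_with_ties(scores_x, scores_y)
print(average_ranks(scores_x))
print(round(r_tied, 4))
```

In this data the three observations equal to 64 would occupy positions 5, 6 and 7, so each receives rank 6; the sum of squared differences is 72 and the total correction is 3, giving 1 - 450/990.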
Interpretation of the rank correlation coefficient:
- If the value of the rank correlation coefficient RXY is greater than zero (RXY > 0), the ranks in one data series are positively and directly related to the ranks in the other series. In other words, an observation that ranks high in one series tends to rank high in the other series as well.
- If the rank correlation coefficient is less than zero (RXY < 0), the sign of the coefficient is negative and there is an inverse relationship between the two series. In other words, a higher rank for an observation in one series tends to go with a lower rank for the same observation in the other series.
- If the value of the rank correlation coefficient is exactly +1 (RXY = +1), there is perfect positive correlation between the two series of observations. Each observation receives exactly the same rank in both series.
- If the rank correlation coefficient is -1 (RXY = -1), there is perfect negative correlation between the ranks of the two series. The observation that receives the highest rank in one series receives the lowest rank in the other series.
- Finally, if the rank correlation coefficient is 0 (RXY = 0), there is no relationship between the ranks of the two series of observations.
LINEAR REGRESSION ANALYSIS:
When the methods of correlation show that two variables (or data series) are correlated with each other, and it has also been checked that such a relationship between the variables is theoretically permissible, the next step in the analysis is to predict and/or estimate the value of one variable from the known value of the other. In the econometrics literature this task is called 'regression analysis'. Literally, the word 'regression' means a backward movement. In the general sense, 'regression' means the estimation and/or prediction of the unknown value of one variable from the known value of another variable. Hence, it is a study of the dependence of one variable on one or more other variables.
Predicting or estimating the relationship between two or more variables is one of the major areas of discussion in almost all branches of knowledge that involve human activity. Regression, one of the most important econometric tools, is used extensively in almost all branches of knowledge, including the natural, social and physical sciences. However, by the very nature of most branches of the social sciences (such as economics and commerce) and of the business environment, the basic concern in these disciplines is to establish an econometric (or statistical) relationship between the variables rather than an exact mathematical relationship (the core tool of analysis in the natural sciences). For this reason, if one is able to establish some kind of relationship between two variables (where one is treated as the dependent variable and the other(s) as independent variables), a large part of the purpose of the analysis is already served.
The credit for first developing this technique goes to Sir Francis Galton, in the year 1877. Galton used the word for the first time in a study in which he estimated the relationship between the heights of fathers and sons. The study concluded that tall fathers are more likely to have tall sons, and vice versa. It also observed that the mean height of the sons of tall fathers was lower than the mean height of their fathers, and the mean height of the sons of short fathers was higher than the mean height of their fathers. Galton published this study in his research paper 'Regression towards mediocrity in hereditary stature'.
Regression as a tool:
Econometricians use regression analysis to make quantitative estimates of theoretical relationships in the literature of the social sciences and management that were previously purely theoretical in nature. For example, the well-known demand theory of economics says that the quantity demanded of a product will increase when the price of the commodity falls, and vice versa, on the assumption that other things remain constant. Hence, anybody can claim that the quantity demanded of blank DVDs will increase if the price of those DVDs decreases (holding all other factors constant), but not many people can actually put numbers into an equation and estimate by how many DVDs the quantity demanded will increase for each reduction in price of Rs. 1/-. To predict the direction of the change, one needs knowledge of economic theory and of the general characteristics of the product in question. To predict the amount of the change, however, one needs a data set and a way to estimate the relationship. The method most frequently used to estimate such a relationship in econometrics is regression analysis. As already discussed, regression analysis describes the dependence of one variable on one or more other variables. It is now important to clarify the terms dependent variable and independent variable, which are at the core of regression analysis.
Dependent Variables and Independent Variables
Regression analysis is a statistical technique that attempts to explain movements in one variable, the dependent variable, as a function of movements in a set of other variables, called the independent (or explanatory) variables, through the quantification of a single equation. To make this concept clearer, let us start with a simple example, the generalized demand function of economic theory:
$$Q_d = f(P, M, P_R, P_e, T, N) \qquad (1)$$
Equation (1) expresses a functional relationship between six factors (on the right-hand side of the equation) and one variable (on the left-hand side). In other words, theoretically, the quantity demanded ($Q_d$) of a good or service depends on six factors: the price of the good itself ($P$), the money income of the consumer ($M$), the prices of related goods ($P_R$), the expected future price of the product ($P_e$), the taste pattern of consumers ($T$) and the number of consumers in the market ($N$). In equation (1), quantity demanded is the dependent variable and the other six variables are independent variables.
Much of economics and business is concerned with cause-and-effect propositions: If the price of a good increases by one unit, then the quantity demanded decreases on average by a certain amount, depending on the price elasticity of demand (defined as the percentage change in the quantity demanded that is caused by a one percent change in price). Propositions such as these pose an if-then, or causal, relationship that logically postulates a dependent variable (Qd in our example) having movements that are causally determined by movements in a number of specified independent variables (six factors discussed above).
The Linear Regression Model:
In the regression model, Y always denotes the dependent variable and X always denotes the independent variable. Here are three equivalent ways to mathematically describe a linear regression model.
The simplest single-equation linear regression model can be written as:
$$Y = \beta_0 + \beta_1 X \qquad (1.3)$$
The above equation states that Y, the dependent variable, is a single-equation linear function of variable X, the independent variable. The model is a single-equation model because no equation for X as a function of Y (or any other variable) has been specified. The model is linear because it expresses the relationship of a straight line and if plotted on graph paper, it would be a straight line rather than a curve.
The constants in the equation, $\beta_0$ and $\beta_1$, are the coefficients (or parameters) that determine the coordinates of the straight line at any point. $\beta_0$ is the constant or intercept term; it indicates the value of Y when X equals zero, and is therefore the point at which the regression line intercepts the y-axis. $\beta_1$ is the slope coefficient; it indicates the amount by which Y changes when X changes by one unit. Figure 1.1 illustrates the relationship between the coefficients and the graphical meaning of the regression equation. As can be seen from the diagram, Equation 1.3 is indeed linear.
The slope coefficient, $\beta_1$, shows the response of Y to a change in X. Since being able to explain and predict changes in the dependent variable is the essential reason for quantifying behavioral relationships, most of the emphasis in regression analysis is on slope coefficients such as $\beta_1$. In Figure 1.1, for example, if X were to increase from X1 to X2, the value of Y in Equation 1.3 would increase from Y1 to Y2. For linear (i.e., straight-line) regression models, the response in the predicted value of Y due to a change in X is constant and equal to the slope coefficient:
$$\frac{Y_2 - Y_1}{X_2 - X_1} = \frac{\Delta Y}{\Delta X} = \beta_1$$
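The constancy of the slope follows directly from the linear form of the model. As a short check, assume the model is written $Y = \beta_0 + \beta_1 X$ (the symbols $\beta_0$ for the intercept and $\beta_1$ for the slope are a notational assumption here). Evaluating it at X1 and X2 and subtracting gives:

```latex
\begin{aligned}
Y_2 - Y_1 &= (\beta_0 + \beta_1 X_2) - (\beta_0 + \beta_1 X_1) \\
          &= \beta_1 (X_2 - X_1), \\[4pt]
\frac{Y_2 - Y_1}{X_2 - X_1} &= \beta_1 \qquad \text{for any } X_1 \neq X_2 .
\end{aligned}
```

The intercept cancels in the subtraction, so the ratio does not depend on where along the line the two points are taken.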
We must distinguish between an equation that is linear in the variables and one that is linear in the coefficients (or parameters). This distinction is necessary because, while linear regressions need to be linear in the coefficients, they do not necessarily need to be linear in the variables. An equation is linear in the variables if plotting the function in terms of X and Y generates a straight line.
An equation is linear in the coefficients (or parameters) only if the coefficients (the $\beta$s) appear in their simplest form: they are not raised to any powers (other than one), are not multiplied or divided by other coefficients, and do not themselves include some sort of function (like logs or exponents). For example, Equation 1.3 is linear in the coefficients, but Equation 1.5:
$$Y = \beta_0 + X^{\beta_1} \qquad (1.5)$$
is not linear in the coefficients $\beta_0$ and $\beta_1$. Equation 1.5 is not linear because there is no rearrangement of the equation that will make it linear in the $\beta$s of original interest, $\beta_0$ and $\beta_1$. In fact, of all possible equations for a single explanatory variable, only functions of the general form:
$$f(Y) = \beta_0 + \beta_1 f(X) \qquad (1.6)$$
are linear in the coefficients $\beta_0$ and $\beta_1$. In essence, any sort of configuration of the Xs and Ys can be used and the equation will continue to be linear in the coefficients. However, even a slight change in the configuration of the $\beta$s will cause the equation to become nonlinear in the coefficients. For example, Equation 1.4, $Y = \beta_0 + \beta_1 X^2$, is not linear in the variables but is linear in the coefficients. The reason is that if you define $f(X) = X^2$, Equation 1.4 fits into the general form of Equation 1.6.
All this is important because if linear regression techniques are going to be applied to an equation, that equation must be linear in the coefficients. Linear regression analysis can be applied to an equation that is nonlinear in the variables if the equation can be formulated in a way that is linear in the coefficients. Indeed, when econometricians use the phrase "linear regression", they usually mean "regression that is linear in the coefficients". The application of regression techniques to equations that are nonlinear in the coefficients will be discussed in Section 7.6.
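The point that an equation may be nonlinear in the variables yet linear in the coefficients can be illustrated with a small Python sketch. It assumes a relationship of the form $Y = \beta_0 + \beta_1 X^2$ and estimates the two coefficients by ordinary least squares, a technique this section has not derived and which is used here only as an assumption. The data are generated artificially from $\beta_0 = 5$ and $\beta_1 = 2$ with no error term, and the function name `ols_fit` is hypothetical.

```python
def ols_fit(xs, ys):
    """Least-squares intercept and slope for a straight line Y = b0 + b1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Artificial data generated from Y = 5 + 2 * X^2 (no error term), for illustration.
xs = [1, 2, 3, 4, 5]
ys = [5 + 2 * x ** 2 for x in xs]

# Redefine the regressor as f(X) = X^2; the model is then a straight line in f(X).
zs = [x ** 2 for x in xs]

b0, b1 = ols_fit(zs, ys)
print(b0, b1)   # 5.0 2.0
```

Regressing Y on the transformed variable recovers the coefficients used to generate the data. No comparable redefinition of the variables would work if a coefficient itself appeared as an exponent.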