I’m currently enrolled in Andrew Ng’s Machine Learning class on coursera.org (highly recommended!) and thought I’d write a post about one of the first topics we’re covering, linear regression. Though Dr. Ng explains the mechanics of this technique very clearly, there aren’t many examples of situations in which the technique is commonly used, or of the types of problems it solves. I’ll do my best to fill in this gap!
What is linear regression?
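Linear regression is the technique of approximating output values for a given set of inputs using a function that is linear in its coefficients. Given an input matrix X and an output vector Y, find coefficients A and B such that XA + B produces a vector that is as close to Y as possible, usually in the least-squares sense. The columns of X can include powers of an input (x, x^2, and so on), so the fitted function may be a polynomial in the inputs while still being linear in A and B. Once A and B are found, they can be used to predict output values for new, similar inputs.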
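To make the definition concrete, here is a minimal sketch of the single-input case, where A and B are just a slope and an intercept. The function names (fit_line, predict) and the data are made up for illustration, and the closed-form least-squares formulas are used here in place of the gradient descent approach covered in the class.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b for a single input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    b = mean_y - a * mean_x
    return a, b


def predict(a, b, x):
    """Predict an output for a new input using the fitted coefficients."""
    return a * x + b


# Made-up data that roughly follows y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]

a, b = fit_line(xs, ys)
print("slope:", a, "intercept:", b)
print("prediction for x = 6:", predict(a, b, 6.0))
```

With several input columns the same idea applies, but the coefficients are normally found with a linear algebra library rather than by hand.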
When to use it?
Linear regression assumes that the input values in X and the dependent values in Y have a linear relationship. Also, linear regression produces coefficients for a continuous function, so its predictions can take any value in a range. So, if you suspect that your inputs and outputs have a linear relationship, and the output is effectively continuous (rather than discrete), try using linear regression to approximate the relationship.
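One rough way to test a suspicion of linearity before fitting anything is to compute the Pearson correlation coefficient between an input and the output. The sketch below uses made-up data and a hypothetical helper named pearson_r. A value near 1 or -1 suggests a strong linear relationship and a value near 0 suggests little or none, but this is only a quick check and not a substitute for plotting the data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up inputs, with one linear output and one curved (y = x^2) output
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear_ys = [1.0, 3.0, 5.0, 7.0, 9.0]
curved_ys = [4.0, 1.0, 0.0, 1.0, 4.0]

r_linear = pearson_r(xs, linear_ys)
r_curved = pearson_r(xs, curved_ys)
print("linear data:", r_linear)
print("curved data:", r_curved)
```

The curved data is strongly related to its input, just not linearly, which is why a low correlation only rules out a straight-line fit and says nothing about other kinds of models.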
Before moving on to examples of good candidates for using linear regression, I think it’s worth mentioning that a linear regression model can fit small numbers of training examples perfectly, but in this case the model probably won’t have good predictive value. For example, since two points define a line, a 1st degree polynomial trained on two points will pass through both points, and the training error will be zero. However, most relationships needing linear regression are probably not perfectly linear, so the model will probably produce large errors when trying to predict outputs for real-world inputs. Therefore, using linear regression is usually only a good idea when the number of training examples is much larger than the degree of the polynomial that is used to model the relationship.
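The two-point case can be sketched in a few lines. The data below is made up to roughly follow y = 2x + 1 with some noise, and fit_line is a hypothetical helper using the closed-form least-squares formulas for a single input. The line trained on two points has zero training error but misses a held-out point badly, while the line trained on all eight points predicts it much more closely.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b for a single input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x


# Made-up noisy data, roughly y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.5, 4.5, 7.2, 8.8, 11.1, 13.0, 14.9, 17.2]

# A held-out example the models never see during training (also made up)
test_x, test_y = 10.0, 21.0

# Train on only the first two points: the line passes through both exactly
a2, b2 = fit_line(xs[:2], ys[:2])
train_error_two = sum(abs(a2 * x + b2 - y) for x, y in zip(xs[:2], ys[:2]))
test_error_two = abs(a2 * test_x + b2 - test_y)

# Train on all eight points: training error is nonzero, prediction is better
a8, b8 = fit_line(xs, ys)
test_error_all = abs(a8 * test_x + b8 - test_y)

print("two-point model: training error", train_error_two, "test error", test_error_two)
print("eight-point model: test error", test_error_all)
```

How far off the two-point model lands depends entirely on which two noisy points it happens to see, which is the reason to prefer many more training examples than coefficients.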
Good candidates for using linear regression
- Predicting home prices using recent sales, architectural, and location data for nearby houses
- Predicting weight of children as they grow using biometric data from the parents & other children
- Predicting river depth using historical & current river depth, snowfall, rainfall, and temperature data
Common attributes in these examples include:
- Prediction – a person would use linear regression to predict an unknown future value based on known present values
- Continuous linear relationship – Of course things like prices, weights, and depths are discrete in practice, since they are measured with real-world tools of limited precision. However, for modeling purposes, they are effectively continuous since the output could take any value in a range. Also, it’s reasonable to expect the relationship between the inputs and outputs in each case to be approximately linear.
Poor candidates for using linear regression
It’s helpful to contrast good candidates with poor candidates for linear regression. Here are some situations where using linear regression would probably not provide good predictive value.
Predicting whether or not a tumor is cancerous based on the size, color, and shape
The output is yes or no, which is a discrete output. This is a better candidate for logistic regression.
Predicting which color car a customer is likely to buy
Again, the output is discrete. There may be as many as 30 colors available, but the customer has to choose from that fixed list rather than picking any color in a continuous range.
Estimating flight time from BWI to JFK given data on previous flights
This value could be estimated using linear regression with historical flight data, weather, and plane information. However, a good estimate could be computed directly from the expected flight speed, the time to perform the takeoff and landing maneuvers, and the estimated wind speeds at the cruising altitudes. Another reason that this is a poor candidate for linear regression is that it may be difficult or impossible to collect important input information, such as historical wind speed data, for any given flight.
Predicting a HS junior’s SAT score based on the current day’s temperature
There is probably no strong linear relationship between any particular person’s SAT score and the current day’s weather (except maybe in the case where the temperature is -50F and everybody decides to skip the test, resulting in a score of 0). A better problem for linear regression would be to predict SAT score based on grades.
So there you have it – use linear regression when you think your inputs and outputs have a linear relationship, and the output is effectively continuous.
I’d love to hear any other situations where linear regression would be valuable! I’d also love to get any corrections or feedback on this topic – please leave a comment below if you can help improve this article.