John, nice work.
I encountered a minor glitch.
When specifying the modelterms as an array of exponents, the variable 'varlist' was not assigned a value.
I fixed this by inserting the following at line 160:
else varlist = repmat({''},1,p);
Despite my having written polyfitn, high order polynomials (especially in multiple dimensions) are rarely a good substitute for a nonlinear model that is chosen to have the proper behavior. That is, IF you have such a model. Those polynomials tend to do nasty things to you, especially in extrapolation. And unless you have sufficient data, then the model may behave strangely even inside your data. I call this the problem of interpolation.
Basically, you need to be very careful with high order polynomial fits.
By the way, nlinfit DOES handle multiple independent variables. You suggest it does not, but there is no constraint on that. The problem is choosing an intelligent model in multiple dimensions. This is EXTREMELY difficult in general. One good reason is that a functional form like w = f(x,y,z) is so difficult to visualize.
Personally, my preference has always been to move to a spline-like model for these problems. I am often unable to pull a model out of thin air, and I would prefer to avoid a high order polynomial. My own gridfit is one simple example, with 2 independent variables.
Hello John,
Thank you for your inputs. I really appreciate the help. From what i have understood is that the data is not sufficient enough for using such a high order polynomial fit. I am actually looking at fitting the data to my own custom equation but haven't found a model for it yet- similar to nlin fit but with multiple independent variables. If you do have any suggestions please do let me know.
Thank you.
Karan
Hi Karan,
You've fallen into a common fallacy in regression modeling. It goes like this: "If n is good, then n+1 is better, and n+2 must be really wonderful."
The fallacy starts out easily. We try a linear model on our data. It works, all right, but not very well. After all, must problems are only pretty locally linear. So we try a quadratic model. It fits better. The lack of fit is reduced. But there are still problems in the model. It is inadequate to fit the shape of our data.
Since we did so well with the quadratic, try out a third order model. Yep, it does better still. Wow. Keep pushing things, until we have a 6th order model, or 10th order. A polynomial model is just like a Taylor series approximation, so more terms must give a better result. Eventually though, it all just turns to numerical crap before our eyes. So what happened?
Here you have 3 variables. A first order model has a constant term, as well as x, y, and z terms. So 4 terms in that model. You have 74 data points, and surely had no problem estimating the corresponding coefficients.
A quadratic model adds in 6 more terms, x^2, y^2, z^2, but also the cross product terms x*y, x*z, and y*z, so a total of 10 coefficients to estimate.
The curse of our fallacy is that as we move up in order, the number of terms grows faster and faster. At 3rd order, we have 20 terms to estimate, 35 at 4th order, 56 at 5th order, and 84 at 6th order. A 10th order model would have 286 terms.
How about your data? You have only 74 data points. (In my early days when I did some statistical consulting, my clients would always ask how much data they needed. My answer was almost always "More than you have.")
Every distinct data point is at most one additional piece of information. (Replicate points do not count here. All they do is reduce the uncertainty in your estimates. So replicates are good, but not quite as valuable as distinct points in some respects.)
Worse, new data points do not always provide new information. For example, two points determine a line. A third point on that line adds no new information towards the slope or intercept of that line. Again, like a replicate, all it does is reduce the uncertainty in your estimate of the coefficients.
So what happened when you tried to estimate a 6th order model with only 74 data points? You had too little information in your data to support such a large model. The message that comes back is from MATLAB, telling you that your matrix is numerically singular (rank deficient.)
Having said all this, things can get worse yet, making the problem yet more difficult. Some sets of data are worse than others. For example, if you add 1e6 to all of your independent variables, a polynomial of some given order will still be estimable mathematically (in theory), yet it may result in a matrix that is numerically singular. This is because the linear algebra is written using double precision floating point arithmetic. The numbers may simply lie over too wide of a dynamic range for the algebra to succeed. (What happens when you raise numbers that are on the order of 1e6 to successively higher powers?)
So when you have more coefficients to estimate than data points, you will always have a rank deficient matrix, but depending on the data itself, you may still have rank deficiency problems.
Some polynomial models will become more tractable to estimate if you center and scale the variables. Thus make each variable so it has a mean of zero, and lie roughly in the interval [-1,1]. See that raising numbers that are on the order of +/-1 to higher powers does NOT create a dynamic range issue. This will force you to transform your problem, but it often allows you to gain viable estimates for an otherwise intractable problem. (The use of orthogonal polynomials is another approach that often helps here.)
I hope this helps. Really, this is a subject that I've seen taught as a semester long course, so I have only touched the surface.
John
Thank you so much for this. I am getting this warning-
Warning: Rank deficient, rank = 64, tol = 5.8216e-010.
I have used 3 independent variables, 74 data points and used degree of 6. Could you please help me understand what this means? Thank you.
Comment only