Partitioned least squares also gains you speed, because it reduces the effective size of the problem: only the truly nonlinear parameters are left for the optimizer to search over. But you will need to learn how to split the variables into conditionally linear and truly nonlinear sets. I can't help you there, because you haven't told us what model you use.
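To illustrate the split on a made-up model (not yours, since I don't know it), take y = a*exp(-b*t). Here a is conditionally linear: once b is fixed, the best a has a closed form. Only b is truly nonlinear, so the search is one-dimensional instead of two. The sketch below is plain Python rather than MATLAB, and it is not my batched code. The function name, the bracket on b, and the golden-section search are assumptions chosen to keep the example short.

```python
import math

def fit_exponential(t, y, b_lo=0.0, b_hi=10.0, tol=1e-8):
    """Fit y ~ a * exp(-b * t) by partitioned least squares (illustrative only)."""

    def solve_linear(b):
        # For a fixed b the model is linear in a, so a comes from a
        # one-parameter linear least squares solve in closed form.
        basis = [math.exp(-b * ti) for ti in t]
        a = sum(bi * yi for bi, yi in zip(basis, y)) / sum(bi * bi for bi in basis)
        rss = sum((yi - a * bi) ** 2 for bi, yi in zip(basis, y))
        return a, rss

    # Golden-section search over the nonlinear parameter b alone.
    # Assumes the residual sum of squares has a single minimum in [b_lo, b_hi].
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = b_lo, b_hi
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if solve_linear(c)[1] < solve_linear(d)[1]:
            hi = d
        else:
            lo = c

    b = 0.5 * (lo + hi)
    a, rss = solve_linear(b)
    return a, b, rss


# Noise-free synthetic data generated with a = 3 and b = 1.5.
t_demo = [0.25 * k for k in range(20)]
y_demo = [3.0 * math.exp(-1.5 * ti) for ti in t_demo]
a_hat, b_hat, rss_hat = fit_exponential(t_demo, y_demo)
```

With more linear parameters the closed-form division becomes a small linear solve, but the idea is the same: the optimizer only ever sees the nonlinear ones.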
I don't know of a better solution. It has been a while, but I recall speedups of as much as 250 to 1 compared to a simple loop when I originally developed the idea. That will depend on your problem, of course. My comments note that a 3 variable example showed a 13 to 1 speed boost.
You will need to compute the various statistics yourself, but that should be trivial in most cases.
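As a rough idea of what computing them yourself might look like, here is a minimal sketch in plain Python that takes the data, the fitted values, and the total parameter count (linear plus nonlinear) and returns a few common summaries. The function name and the choice of statistics are my assumptions. Which statistics you actually need depends on your problem, and R-squared in particular should be read with care for nonlinear models.

```python
import math

def fit_statistics(y, y_fit, n_params):
    """Basic goodness-of-fit summaries from data and fitted values (illustrative)."""
    n = len(y)
    residuals = [yi - fi for yi, fi in zip(y, y_fit)]
    rss = sum(r * r for r in residuals)           # residual sum of squares
    mean_y = sum(y) / n
    tss = sum((yi - mean_y) ** 2 for yi in y)     # total sum of squares
    dof = n - n_params                            # residual degrees of freedom
    return {
        "rss": rss,
        "r_squared": 1.0 - rss / tss,
        "resid_std_error": math.sqrt(rss / dof),
        "dof": dof,
    }


stats_demo = fit_statistics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.1, 3.9], n_params=2)
```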
You will also need the Optimization Toolbox, since I don't believe the Curve Fitting Toolbox has the sparse Jacobian capabilities that I rely on.