On Jul 17, 9:52 pm, Greg Heath <he...@alumni.brown.edu> wrote:
> On Jul 17, 5:40 am, Greg Heath <he...@alumni.brown.edu> wrote:
> > On Jul 15, 3:32 am, Greg Heath <he...@alumni.brown.edu> wrote:
> > > On Jun 25, 12:04 pm, paulvbi...@gmail.com wrote:
> > > > On Jun 24, 11:20 pm, idea_fo...@yahoo.com wrote:
>
> > > > > I recently acquired a copy of some very powerful GP software, but I am
> > > > > new to machine learning and I am not sure where to start. The software
> > > > > allows for classification and regression problems. I am most
> > > > > interested in regression problems (for forecasting), but I'm still
> > > > > learning how to find inputs/outputs.
>
> > > > > My question is, can anyone out there provide me with some inputs and
> > > > > outputs for a fairly simple regression problem that I can solve? The
> > > > > nature of the data can be anything (sun spots, stock market, weather,
> > > > > etc). I simply want to test the software so that I can get a better
> > > > > understanding of how it works. Obviously, the more data the better.
>
> > > > > I would prefer there to be at least 2 inputs and 1 output.
>
> > > > > Any help would be greatly appreciated.
>
> > > > *********************************************
>
> > > > for a round robin test with a colleague in Germany I recently
> > > > investigated the compressive strength of concrete
>
> > > > 1030 data points with 8 variables
>
> > > > all continuous
>
> > > > if you run this I would be interested in what you obtained via 10fcv
>
> > > > best
> > > > Paul
>
> > > > > http://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength
>
> > > Lazy me has found that stagewise input variable
> > > subset selection on Linear and Quadratic
> > > Polynomial models is a quick and dirty way to
> > > choose inputs.
>
> > > Stagewise is preferable to stepwise (one-way
> > > greedy forward or backward search) because it
>
> > > 1. combines forward (p-to-enter) and
> > > backward (p-to-remove) search
> > > 2. allows the specification of an
> > > initial subset which is neither
> > > full nor empty
> > > 3. allows the further specification
> > > of initial variables which are not
> > > allowed to be removed.
>
> > > The stagewise MATLAB functions are misnamed
> > > STEPWISEFIT and STEPWISE (interactive GUI version).
>
> > > Since N/(p+1) = 1030/9 ~ 114 >> 10, lazy me used
> > > all of the data for both training and validation
> > > with penter = 0.05 and premove = 0.1. Although
> > > the R^2 values were adjusted for design bias by
> > > using the reduced degrees of freedom, they aren't
> > > as unbiased as 10-fold XVAL. However, they should
> > > be sufficient for input variable subset selection.
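
A minimal Python sketch of the degrees-of-freedom adjustment mentioned above, assuming it is the usual adjusted R^2. The function name and the example R^2 value are illustrative and are not taken from the STEPWISEFIT runs.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    r2: ordinary R^2 on the design data
    n:  number of data points
    p:  number of input variables (the intercept is counted separately)
    """
    dof = n - p - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need n > p + 1 for the adjustment to be defined")
    return 1.0 - (1.0 - r2) * (n - 1) / dof


# Illustrative value only: with N = 1030 and p = 8 the correction is small,
# which is consistent with N/(p+1) being much larger than 10.
example = adjusted_r2(0.60, 1030, 8)
```

With N this large relative to p, the adjusted value differs from the raw R^2 only in the third decimal place.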
>
> > > For Linear Regression STEPWISEFIT removed no
> > > variables in the Backward Elimination mode.
> > > In contrast, variables x6 and x7 were not chosen
> > > in the Forward Selection mode. However, none of
> > > the Quadratic Regression models indicated that
> > > x6 or x7 had insignificant prediction capability;
> > > merely that the capability was second order via
> > > cross products and squares.
>
> > > Also, none of the results indicated that x3 had
> > > insignificant prediction capability.
>
> > > Therefore, I used all 8 original variables for
> > > a MLP NN design.
>
> > > The variables were standardized to zero mean and
> > > unit standard deviation. Although 10 pts had
> > > x5 > 3.6 and 18 pts had x8 > 4.9, I had no
> > > convincing reason to remove data points just
> > > because the distributions were skewed.
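
The standardization step can be sketched in Python as below. This assumes the sample standard deviation (N-1 divisor) and that the thresholds 3.6 and 4.9 are in standardized units; the function names are hypothetical.

```python
from statistics import mean, stdev


def standardize(column):
    """Rescale one variable to zero mean and unit (sample) standard deviation."""
    m = mean(column)
    s = stdev(column)  # N-1 divisor; an assumption about the original run
    if s == 0:
        raise ValueError("constant column cannot be standardized")
    return [(x - m) / s for x in column]


def count_above(z_values, threshold):
    """Count standardized values that exceed a threshold, e.g. 3.6 for x5."""
    return sum(1 for z in z_values if z > threshold)
```

Counting the points above a threshold this way flags the skewed tails without removing any data.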
>
> > > In contrast, Paul removed x3, x6 and 10 outliers.
>
> > > For 10-fold XVAL with I-H-O = 8-H-1,
>
> > > Ntrn = 0.9*N = 927
> > > Neq = Ntrn*O = 927
> > > Nw = (I+1)*H+(H+1)*O = 10*H + 1
> > > Neq > 10*Nw ==> H < 9.17
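
The arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and the factor of 10 is the equations-per-weight ratio used in the post.

```python
def weight_count(n_in, n_hidden, n_out):
    """Nw = (I+1)*H + (H+1)*O for an I-H-O MLP with bias terms."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out


def hidden_upper_bound(n_train, n_in, n_out, ratio=10):
    """Largest H (as a real number) satisfying Neq > ratio * Nw.

    Neq = n_train * n_out, so
    n_train*O > ratio * ((I + 1 + O)*H + O)
    =>  H < (n_train*O/ratio - O) / (I + 1 + O)
    """
    return (n_train * n_out / ratio - n_out) / (n_in + 1 + n_out)


n_total = 1030
n_train = n_total * 9 // 10                # 10-fold XVAL trains on 9 of 10 folds
bound_8 = hidden_upper_bound(n_train, 8, 1)  # all 8 inputs
```

Reducing the number of inputs loosens the bound, because each hidden node then carries fewer weights.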
>
> > > Using MATLAB's TRAINBR for regularized training
> > > with weight-decay, the R^2 summary statistics are
>
> > > H min median mean stdv max
> > > 1 0.6027 0.6790 0.6831 0.0419 0.7461
> > > 2 0.7640 0.8198 0.8160 0.0295 0.8498
> > > 3 0.8314 0.8647 0.8640 0.0190 0.8910
> > > 4 0.8337 0.8733 0.8702 0.0211 0.8980
> > > 5 0.8435 0.8777 0.8765 0.0182 0.9023
> > > 6 0.8500 0.8905 0.8859 0.0179 0.9129
> > > 7 0.8635 0.8870 0.8892 0.0199 0.9215
> > > 8 0.8799 0.9000 0.8964 0.0128 0.9148
> > > 9 0.8645 0.8918 0.8975 0.0215 0.9361
>
> > > Resulting in the quote
>
> > > R^2 = 0.90 +/- 0.02 for H = 9.
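
A rough Python sketch of how per-fold R^2 values could be condensed into the table columns and the "mean +/- stdv" quote. It assumes R^2 = 1 - SSE/SST on each held-out fold and a sample standard deviation across folds; the helper names are hypothetical and no fold-level values from the actual runs are reproduced here.

```python
from statistics import mean, median, stdev


def r_squared(targets, predictions):
    """R^2 = 1 - SSE/SST for one validation fold."""
    t_bar = mean(targets)
    sse = sum((t - y) ** 2 for t, y in zip(targets, predictions))
    sst = sum((t - t_bar) ** 2 for t in targets)
    return 1.0 - sse / sst


def summarize(fold_r2):
    """Return the five summary columns for a list of per-fold R^2 values."""
    return {
        "min": min(fold_r2),
        "median": median(fold_r2),
        "mean": mean(fold_r2),
        "stdv": stdev(fold_r2),  # N-1 divisor; an assumption
        "max": max(fold_r2),
    }


def quote(fold_r2):
    """Format the summary as 'R^2 = mean +/- stdv' to two decimals."""
    s = summarize(fold_r2)
    return "R^2 = {:.2f} +/- {:.2f}".format(s["mean"], s["stdv"])
```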
>
> > > The program ran for 183 sec on a 3.2GHz DELL with
> > > Windows XP.
>
> > > However, since regularization is used, there is no
> > > compelling reason to limit H to <= 9. Therefore,
> > > H = 20 was run with the result
>
> > > H min median mean stdv max
> > > 20 0.7803 0.9216 0.9050 0.0524 0.9540
>
> > > or
>
> > > R^2 = 0.91 +/- 0.05 for H = 20.
>
> > Removing x3 and x6 yields
>
> > For 10-fold XVAL with I-H-O = 6-H-1,
>
> > Ntrn = 0.9*N = 927
> > Neq = Ntrn*O = 927
> > Nw = (I+1)*H+(H+1)*O = 8*H + 1
> > Neq > 10*Nw ==> H < 11.4625
>
> > Using MATLAB's TRAINBR for regularized training
> > with weight-decay, the R^2 summary statistics are
>
> > H min median mean stdv max
> > 1 0.6115 0.6799 0.6769 0.0384 0.7260
> > 2 0.7579 0.8177 0.8101 0.0293 0.8433
> > 3 0.7896 0.8388 0.8327 0.0255 0.8620
> > 4 0.8117 0.8536 0.8475 0.0233 0.8753
> > 5 0.8252 0.8602 0.8578 0.0187 0.8821
> > 6 0.8347 0.8649 0.8697 0.0239 0.8992
> > 7 0.8419 0.8753 0.8739 0.0164 0.8952
> > 8 0.8330 0.8896 0.8833 0.0220 0.9058
> > 9 0.8543 0.8876 0.8800 0.0193 0.9009
> > 10 0.8395 0.8986 0.8881 0.0248 0.9109
> > 11 0.8462 0.8953 0.8901 0.0233 0.9141
>
> > Resulting in the quotes
>
> > R^2 = 0.88 +/- 0.02 for H = 9.
> > R^2 = 0.89 +/- 0.02 for H = 11.
>
> > The program ran for 219 sec on a 3.2GHz DELL with
> > Windows XP.
>
> > However, since regularization is used, there is no
> > compelling reason to limit H to <= 11. Therefore,
> > H = 20 was run with the result
>
> > H min median mean stdv max
> > 20 0.8377 0.9212 0.9057 0.0373 0.9505
>
> > or
>
> > R^2 = 0.91 +/- 0.04 for H = 20.
>
> > So, ... excluding x3 and x6 doesn't appear to
> > significantly degrade performance.
>
> Moving on ...
>
> Removing x3, x6 and x7 yields
>
> For 10-fold XVAL with I-H-O = 5-H-1,
>
> Ntrn = 0.9*N = 927
> Neq = Ntrn*O = 927
> Nw = (I+1)*H+(H+1)*O = 7*H + 1
> Neq > 10*Nw ==> H < 13.1
>
> H min median mean stdv max
> 9 0.8413 0.8922 0.8837 0.0243 0.9167
> 11 0.8508 0.8887 0.8865 0.0244 0.9208
> 20 0.8552 0.9092 0.8970 0.0259 0.9269
>
> 9 R^2 = 0.88 +/- 0.02
> 11 R^2 = 0.89 +/- 0.02
> 20 R^2 = 0.90 +/- 0.03

Why stop here?

H min median mean stdv max
10 0.8540 0.8856 0.8835 0.0201 0.9152
20 0.8372 0.8902 0.8944 0.0298 0.9379
30 0.8350 0.9096 0.9003 0.0304 0.9381
40 0.8414 0.9150 0.9064 0.0282 0.9314
50 0.8575 0.9079 0.9050 0.0273 0.9494
60 0.8550 0.9009 0.9026 0.0230 0.9387
70 0.8138 0.9069 0.9030 0.0353 0.9414

Hope this helps.

Greg