On Jul 17, 5:40 am, Greg Heath <he...@alumni.brown.edu> wrote:
> On Jul 15, 3:32 am, Greg Heath <he...@alumni.brown.edu> wrote:
>
> > On Jun 25, 12:04 pm, paulvbi...@gmail.com wrote:
> > > On Jun 24, 11:20 pm, idea_fo...@yahoo.com wrote:
>
> > > > I recently acquired a copy of some very powerful GP software, but I am
> > > > new to machine learning and I am not sure where to start. The software
> > > > allows for classification and regression problems. I am most
> > > > interested in regression problems (for forecasting), but I'm still
> > > > learning how to find inputs/outputs.
>
> > > > My question is, can anyone out there provide me with some inputs and
> > > > outputs for a fairly simple regression problem that I can solve? The
> > > > nature of the data can be anything (sun spots, stock market, weather,
> > > > etc). I simply want to test the software so that I can get a better
> > > > understanding of how it works. Obviously, the more data the better.
>
> > > > I would prefer there to be at least 2 inputs and 1 output.
>
> > > > Any help would be greatly appreciated.
>
> > > *********************************************
>
> > > for a round robin test with a colleague in Germany I recently
> > > investigated the compressive strength of concrete
>
> > > 1030 data points with 8 variables
>
> > > all continuous
>
> > > if you run this I would be interested in what you obtained via 10fcv
>
> > > best
> > > Paul
>
> > > > http://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength
>
> > Lazy me has found that stagewise input variable
> > subset selection on Linear and Quadratic
> > Polynomial models is a quick and dirty way to
> > choose inputs.
>
> > Stagewise is preferable to stepwise (one-way
> > greedy forward or backward search) because it
>
> > 1. combines forward (p-to-enter) and
> >    backward (p-to-remove) search
> > 2. allows the specification of an
> >    initial subset which is neither
> >    full nor empty
> > 3. allows the further specification
> >    of initial variables which are not
> >    allowed to be removed.
>
> > The stagewise MATLAB functions are misnamed
> > STEPWISEFIT and STEPWISE (interactive GUI version).
>
> > Since N/(p+1) = 1030/9 ~ 114 >> 10, lazy me used
> > all of the data for both training and validation
> > with penter = 0.05 and premove = 0.1. Although
> > the R^2 values were adjusted for design bias by
> > using the reduced degrees of freedom, they aren't
> > as unbiased as 10-fold XVAL. However, they should
> > be sufficient for input variable subset selection.
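The post does not say exactly which degrees-of-freedom adjustment was applied. As a minimal sketch, assuming the textbook adjusted R^2 for a linear model with p predictors plus an intercept, the adjustment and the N/(p+1) ratio can be written as follows. The function name `adjusted_r2` and the example R^2 of 0.60 are illustrative only and do not come from the thread.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Textbook adjusted R^2: penalize R^2 for the p + 1 fitted coefficients.

    Assumption: this is the standard formula, not necessarily the exact
    adjustment used in the post.
    """
    if n - p - 1 <= 0:
        raise ValueError("need more data points than coefficients (n > p + 1)")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


n, p = 1030, 8                 # data points and inputs quoted in the thread
ratio = n / (p + 1)            # data points per estimated coefficient
print(f"N/(p+1) = {ratio:.1f}")

# 0.60 is a made-up R^2 used only to show the size of the correction.
print(f"adjusted R^2 = {adjusted_r2(0.60, n, p):.4f}")
```

With N much larger than p + 1 the correction is small, which is consistent with the remark that the adjusted values should be good enough for subset selection.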
>
> > For Linear Regression STEPWISEFIT removed no
> > variables in the Backward Elimination mode.
> > In contrast, variables x6 and x7 were not chosen
> > in the Forward Selection mode. However, none of
> > the Quadratic Regression models indicated that
> > x6 or x7 had insignificant prediction capability;
> > merely that the capability was second order via
> > cross products and squares.
>
> > Also, none of the results indicated that x3 had
> > insignificant prediction capability.
>
> > Therefore, I used all 8 original variables for
> > a MLP NN design.
>
> > The variables were standardized to zero mean and
> > unit standard deviation. Although 10 pts had
> > x5 > 3.6 and 18 pts had x8 > 4.9, I had no
> > convincing reason to remove data points just
> > because the distributions were skewed.
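For readers following along, the standardization step and the tail counts can be sketched as below. This assumes the thresholds 3.6 and 4.9 refer to standardized values and that the sample standard deviation (N-1 divisor) is used; the helper names and the short data list are hypothetical, not the concrete data.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: subtract the mean, divide by the sample std (N-1)."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        raise ValueError("constant column cannot be standardized")
    return [(v - m) / s for v in values]


def count_above(z_scores, threshold):
    """Count standardized points lying above a threshold."""
    return sum(1 for z in z_scores if z > threshold)


# Hypothetical skewed column, only to exercise the helpers.
column = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 9.0]
z = standardize(column)
print(count_above(z, 2.0))
```

Counting points beyond a threshold flags skewness, but as the post notes, a long tail alone is not a reason to delete data.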
>
> > In contrast, Paul removed x3, x6 and 10 outliers.
>
> > For 10-fold XVAL with I-H-O = 8-H-1,
>
> > Ntrn = 0.9*N = 927
> > Neq  = Ntrn*O = 927
> > Nw   = (I+1)*H+(H+1)*O = 10*H + 1
> > Neq > 10*Nw ==> H < 9.17
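The bound on H follows from solving Neq > 10*Nw for H. A small sketch of that arithmetic, with function names (`n_weights`, `max_hidden`) that are my own and not from the post:

```python
def n_weights(i: int, h: int, o: int) -> int:
    """Weights in an I-H-O MLP with biases: (I+1)*H + (H+1)*O."""
    return (i + 1) * h + (h + 1) * o


def max_hidden(n_train: int, i: int, o: int, ratio: float = 10.0) -> float:
    """Upper bound on H from Neq > ratio * Nw, where Neq = n_train * o.

    Nw = H*(I + 1 + O) + O, so H < (Neq/ratio - O) / (I + 1 + O).
    """
    n_eq = n_train * o
    return (n_eq / ratio - o) / (i + 1 + o)


n_train = 1030 * 9 // 10          # 927 training points per fold
print(n_weights(8, 9, 1))         # weights for an 8-9-1 network
print(round(max_hidden(n_train, 8, 1), 4))
```

The factor of 10 equations per weight is the rule of thumb used in the post; passing a different `ratio` shows how sensitive the bound is to that choice.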
>
> > Using MATLAB's TRAINBR for regularized training
> > with weight-decay, the R^2 summary statistics are
>
> > H     min       median    mean      stdv      max
> > 1     0.6027    0.6790    0.6831    0.0419    0.7461
> > 2     0.7640    0.8198    0.8160    0.0295    0.8498
> > 3     0.8314    0.8647    0.8640    0.0190    0.8910
> > 4     0.8337    0.8733    0.8702    0.0211    0.8980
> > 5     0.8435    0.8777    0.8765    0.0182    0.9023
> > 6     0.8500    0.8905    0.8859    0.0179    0.9129
> > 7     0.8635    0.8870    0.8892    0.0199    0.9215
> > 8     0.8799    0.9000    0.8964    0.0128    0.9148
> > 9     0.8645    0.8918    0.8975    0.0215    0.9361
>
> > Resulting in the quote
>
> > R^2 = 0.90 +/- 0.02 for H = 9.
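The post does not show how the folds were formed or on which data each R^2 was computed. One plausible reading, sketched here under the assumption that R^2 is evaluated on each held-out fold, is a shuffled 10-fold split with the usual 1 - SSE/SST definition. The names `kfold_indices` and `r_squared` and the fixed seed are illustrative choices.

```python
import random


def kfold_indices(n: int, k: int = 10, seed: int = 0):
    """Yield (train, test) index lists for shuffled k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]   # k nearly equal folds
    for f in range(k):
        test = folds[f]
        train = [j for g in range(k) if g != f for j in folds[g]]
        yield train, test


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSE/SST."""
    m = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# With N = 1030 every fold holds out 103 points and trains on 927.
for train, test in kfold_indices(1030, k=10):
    assert len(train) == 927 and len(test) == 103
```

Each pass would fit the network on `train`, predict on `test`, and record one R^2; the ten values per H then feed the summary table.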
>
> > The program ran for 183 sec on a 3.2GHz DELL with
> > Windows XP.
>
> > However, since regularization is used, there is no
> > compelling reason to limit H to <= 9. Therefore,
> > H = 20 was run with the result
>
> > H     min       median    mean      stdv      max
> > 20    0.7803    0.9216    0.9050    0.0524    0.9540
>
> > or
>
> > R^2 = 0.91 +/- 0.05 for H = 20.
>
> Removing x3 and x6 yields
>
> For 10-fold XVAL with I-H-O = 6-H-1,
>
> Ntrn = 0.9*N = 927
> Neq  = Ntrn*O = 927
> Nw   = (I+1)*H+(H+1)*O = 8*H + 1
> Neq > 10*Nw ==> H < 11.4625
>
> Using MATLAB's TRAINBR for regularized training
> with weight-decay, the R^2 summary statistics are
>
> H     min       median    mean      stdv      max
> 1     0.6115    0.6799    0.6769    0.0384    0.7260
> 2     0.7579    0.8177    0.8101    0.0293    0.8433
> 3     0.7896    0.8388    0.8327    0.0255    0.8620
> 4     0.8117    0.8536    0.8475    0.0233    0.8753
> 5     0.8252    0.8602    0.8578    0.0187    0.8821
> 6     0.8347    0.8649    0.8697    0.0239    0.8992
> 7     0.8419    0.8753    0.8739    0.0164    0.8952
> 8     0.8330    0.8896    0.8833    0.0220    0.9058
> 9     0.8543    0.8876    0.8800    0.0193    0.9009
> 10    0.8395    0.8986    0.8881    0.0248    0.9109
> 11    0.8462    0.8953    0.8901    0.0233    0.9141
>
> Resulting in the quotes
>
> R^2 = 0.88 +/- 0.02 for H = 9.
> R^2 = 0.89 +/- 0.02 for H = 11.
>
> The program ran for 219 sec on a 3.2GHz DELL with
> Windows XP.
>
> However, since regularization is used, there is no
> compelling reason to limit H to <= 11. Therefore,
> H = 20 was run with the result
>
> H     min       median    mean      stdv      max
> 20    0.8377    0.9212    0.9057    0.0373    0.9505
>
> or
>
> R^2 = 0.91 +/- 0.04 for H = 20.
>
> So, ... excluding x3 and x6 doesn't appear to
> significantly degrade performance.

Moving on ...

Removing x3, x6 and x7 yields

For 10-fold XVAL with I-H-O = 5-H-1,

Ntrn = 0.9*N = 927
Neq  = Ntrn*O = 927
Nw   = (I+1)*H+(H+1)*O = 7*H + 1
Neq > 10*Nw ==> H < 13.1

H     min       median    mean      stdv      max
9     0.8413    0.8922    0.8837    0.0243    0.9167
11    0.8508    0.8887    0.8865    0.0244    0.9208
20    0.8552    0.9092    0.8970    0.0259    0.9269

R^2 = 0.88 +/- 0.02 for H = 9.
R^2 = 0.89 +/- 0.02 for H = 11.
R^2 = 0.90 +/- 0.03 for H = 20.

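For anyone reproducing the tables, each row is just a five-number summary of the per-fold R^2 values, and each quote is the mean and stdv rounded to two decimals. A minimal sketch follows, with hypothetical helper names (`summarize`, `quote`) and made-up fold values, since the individual fold results are not listed in the post.

```python
from statistics import mean, median, stdev


def summarize(r2_values):
    """Return the table columns for one value of H."""
    return {
        "min": min(r2_values),
        "median": median(r2_values),
        "mean": mean(r2_values),
        "stdv": stdev(r2_values),   # sample standard deviation (N-1)
        "max": max(r2_values),
    }


def quote(mean_r2: float, stdv_r2: float) -> str:
    """Format a summary as 'R^2 = mean +/- stdv' with two decimals."""
    return f"R^2 = {mean_r2:.2f} +/- {stdv_r2:.2f}"


# Made-up per-fold values, only to exercise the helpers.
folds = [0.86, 0.88, 0.90, 0.89, 0.91, 0.87, 0.90, 0.92, 0.85, 0.89]
row = summarize(folds)
print(quote(row["mean"], row["stdv"]))

# Mean and stdv taken from the H = 9 row of the 5-H-1 table.
print(quote(0.8837, 0.0243))
```

Whether the sample or population standard deviation was used in the post is not stated; with ten folds the difference is small.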
Hope this helps.
Greg