<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/172806</link>
    <title>MATLAB Central Newsreader - Re: Give me a Regression Problem</title>
    <description>Feed for thread: Re: Give me a Regression Problem</description>
    <language>en-us</language>
    <copyright>&amp;copy;1994-2012 by MathWorks, Inc.</copyright>
    <webmaster>webmaster@mathworks.com</webmaster>
    <generator>MATLAB Central Newsreader</generator>
    <docs>http://blogs.law.harvard.edu/tech/rss</docs>
    <ttl>60</ttl>
    <image>
      <title>MathWorks</title>
      <url>http://www.mathworks.com/images/membrane_icon.gif</url>
    </image>
    <item>
      <pubDate>Fri, 18 Jul 2008 01:52:51 -0400</pubDate>
      <title>Re: Give me a Regression Problem</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/172806#443803</link>
      <author>Greg Heath</author>
      <description>On Jul 17, 5:40 am, Greg Heath &amp;lt;he...@alumni.brown.edu&amp;gt; wrote:&lt;br&gt;
&amp;gt; On Jul 15, 3:32 am, Greg Heath &amp;lt;he...@alumni.brown.edu&amp;gt; wrote:&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; On Jun 25, 12:04 pm, paulvbi...@gmail.com wrote:&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; On Jun 24, 11:20 pm, idea_fo...@yahoo.com wrote:&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt; I recently acquired a copy of some very powerful GP software, but I am&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt; new to machine learning and I am not sure where to start. The software&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt; allows for classification and regression problems. I am most&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt; interested in regression problems (for forecasting), but I'm still&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt; learning how to find inputs/outputs.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt; My question is, can anyone out there provide me with some inputs and&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt; outputs for a fairly simple regression problem that I can solve? The&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt; nature of the data can be anything (sun spots, stock market, weather,&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt; etc). I simply want to test the software so that I can get a better&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt; understanding of how it works. Obviously, the more data the better.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt; I would prefer there to be at least 2 inputs and 1 output.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt; Any help would be greatly appreciated.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; *********************************************&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; for a round robin test with a colleague in Germany I recently&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; investigated the compressive strength of concrete&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; 1030 data points with 8 variables&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; all continuous&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; if you run this I would be interested in what you obtained via 10fcv&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; best&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt; Paul&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; &amp;gt;&lt;a href=&quot;http://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength&quot;&gt;http://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength&lt;/a&gt;&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; Lazy me has found that stagewise input variable&lt;br&gt;
&amp;gt; &amp;gt; subset selection on Linear and Quadratic&lt;br&gt;
&amp;gt; &amp;gt; Polynomial models is a quick and dirty way to&lt;br&gt;
&amp;gt; &amp;gt; choose inputs.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; Stagewise is preferable to stepwise (one-way&lt;br&gt;
&amp;gt; &amp;gt; greedy forward or backward search) because it&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; 1. combines forward (p-to-enter) and&lt;br&gt;
&amp;gt; &amp;gt; backward (p-to-remove) search&lt;br&gt;
&amp;gt; &amp;gt; 2. allows the specification of an&lt;br&gt;
&amp;gt; &amp;gt; initial subset which is neither&lt;br&gt;
&amp;gt; &amp;gt; full nor empty&lt;br&gt;
&amp;gt; &amp;gt; 3. allows the further specification&lt;br&gt;
&amp;gt; &amp;gt; of initial variables which are not&lt;br&gt;
&amp;gt; &amp;gt; allowed to be removed.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; The stagewise MATLAB functions are misnamed&lt;br&gt;
&amp;gt; &amp;gt; STEPWISEFIT and STEPWISE (interactive GUI version).&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; Since N/(p+1) = 1030/9 ~ 114 &amp;gt;&amp;gt; 10, lazy me used&lt;br&gt;
&amp;gt; &amp;gt; all of the data for both training and validation&lt;br&gt;
&amp;gt; &amp;gt; with penter = 0.05 and premove = 0.1. Although&lt;br&gt;
&amp;gt; &amp;gt; the R^2 values were adjusted for design bias by&lt;br&gt;
&amp;gt; &amp;gt; using the reduced degrees of freedom, they aren't&lt;br&gt;
&amp;gt; &amp;gt; as unbiased as 10-fold XVAL. However, they should&lt;br&gt;
&amp;gt; &amp;gt; be sufficient for input variable subset selection.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; For Linear Regression STEPWISEFIT removed no&lt;br&gt;
&amp;gt; &amp;gt; variables in the Backward Elimination mode.&lt;br&gt;
&amp;gt; &amp;gt; In contrast, variables x6 and x7 were not chosen&lt;br&gt;
&amp;gt; &amp;gt; in the Forward Selection mode. However, none of&lt;br&gt;
&amp;gt; &amp;gt; the Quadratic Regression models indicated that&lt;br&gt;
&amp;gt; &amp;gt; x6 or x7 had insignificant prediction capability;&lt;br&gt;
&amp;gt; &amp;gt; merely that the capability was second order via&lt;br&gt;
&amp;gt; &amp;gt; cross products and squares.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; Also, none of the results indicated that x3 had&lt;br&gt;
&amp;gt; &amp;gt; insignificant prediction capability.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; Therefore, I used all 8 original variables for&lt;br&gt;
&amp;gt; &amp;gt; a MLP NN design.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; The variables were standardized to zero mean and&lt;br&gt;
&amp;gt; &amp;gt; unit standard deviation. Although 10 pts had&lt;br&gt;
&amp;gt; &amp;gt; x5 &amp;gt; 3.6 and 18 pts had x8 &amp;gt; 4.9, I had no&lt;br&gt;
&amp;gt; &amp;gt; convincing reason to remove data points just&lt;br&gt;
&amp;gt; &amp;gt; because the distributions were skewed.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; In contrast, Paul removed x3, x6 and 10 outliers.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; For 10-fold XVAL with I-H-O = 8-H-1,&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; Ntrn = 0.9*N  = 927&lt;br&gt;
&amp;gt; &amp;gt; Neq  = Ntrn*O = 927&lt;br&gt;
&amp;gt; &amp;gt; Nw = (I+1)*H+(H+1)*O = 10*H + 1&lt;br&gt;
&amp;gt; &amp;gt; Neq &amp;gt; 10*Nw  ==&amp;gt; H &amp;lt; 9.17&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; Using MATLAB's TRAINBR for regularized training&lt;br&gt;
&amp;gt; &amp;gt; with weight-decay, the R^2 summary statistics are&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; H     min      median     mean      stdv      max&lt;br&gt;
&amp;gt; &amp;gt; 1    0.6027    0.6790    0.6831    0.0419    0.7461&lt;br&gt;
&amp;gt; &amp;gt; 2    0.7640    0.8198    0.8160    0.0295    0.8498&lt;br&gt;
&amp;gt; &amp;gt; 3    0.8314    0.8647    0.8640    0.0190    0.8910&lt;br&gt;
&amp;gt; &amp;gt; 4    0.8337    0.8733    0.8702    0.0211    0.8980&lt;br&gt;
&amp;gt; &amp;gt; 5    0.8435    0.8777    0.8765    0.0182    0.9023&lt;br&gt;
&amp;gt; &amp;gt; 6    0.8500    0.8905    0.8859    0.0179    0.9129&lt;br&gt;
&amp;gt; &amp;gt; 7    0.8635    0.8870    0.8892    0.0199    0.9215&lt;br&gt;
&amp;gt; &amp;gt; 8    0.8799    0.9000    0.8964    0.0128    0.9148&lt;br&gt;
&amp;gt; &amp;gt; 9    0.8645    0.8918    0.8975    0.0215    0.9361&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; Resulting in the quote&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; R^2 = 0.90 +/- 0.02 for H = 9.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; The program ran for 183 sec on a 3.2GHz DELL with&lt;br&gt;
&amp;gt; &amp;gt; Windows XP.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; However, since regularization is used, there is no&lt;br&gt;
&amp;gt; &amp;gt; compelling reason to limit H to &amp;lt;= 9. Therefore,&lt;br&gt;
&amp;gt; &amp;gt; H = 20 was run with the result&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; H      min      median     mean      stdv      max&lt;br&gt;
&amp;gt; &amp;gt; 20    0.7803    0.9216    0.9050    0.0524    0.9540&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; or&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; &amp;gt; R^2 = 0.91 +/- 0.05 for H = 20.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; Removing x3 and x6 yields&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; For 10-fold XVAL with I-H-O = 6-H-1,&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; Ntrn = 0.9*N  = 927&lt;br&gt;
&amp;gt; Neq  = Ntrn*O = 927&lt;br&gt;
&amp;gt; Nw = (I+1)*H+(H+1)*O = 8*H + 1&lt;br&gt;
&amp;gt; Neq &amp;gt; 10*Nw  ==&amp;gt; H &amp;lt; 11.4625&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; Using MATLAB's TRAINBR for regularized training&lt;br&gt;
&amp;gt; with weight-decay, the R^2 summary statistics are&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt;  H     min      median     mean      stdv      max&lt;br&gt;
&amp;gt;  1    0.6115    0.6799    0.6769    0.0384    0.7260&lt;br&gt;
&amp;gt;  2    0.7579    0.8177    0.8101    0.0293    0.8433&lt;br&gt;
&amp;gt;  3    0.7896    0.8388    0.8327    0.0255    0.8620&lt;br&gt;
&amp;gt;  4    0.8117    0.8536    0.8475    0.0233    0.8753&lt;br&gt;
&amp;gt;  5    0.8252    0.8602    0.8578    0.0187    0.8821&lt;br&gt;
&amp;gt;  6    0.8347    0.8649    0.8697    0.0239    0.8992&lt;br&gt;
&amp;gt;  7    0.8419    0.8753    0.8739    0.0164    0.8952&lt;br&gt;
&amp;gt;  8    0.8330    0.8896    0.8833    0.0220    0.9058&lt;br&gt;
&amp;gt;  9    0.8543    0.8876    0.8800    0.0193    0.9009&lt;br&gt;
&amp;gt; 10    0.8395    0.8986    0.8881    0.0248    0.9109&lt;br&gt;
&amp;gt; 11    0.8462    0.8953    0.8901    0.0233    0.9141&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; Resulting in the quotes&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; R^2 = 0.88 +/- 0.02 for H = 9.&lt;br&gt;
&amp;gt; R^2 = 0.89 +/- 0.02 for H = 11.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; The program ran for 219 sec on a 3.2GHz DELL with&lt;br&gt;
&amp;gt; Windows XP.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; However, since regularization is used, there is no&lt;br&gt;
&amp;gt; compelling reason to limit H to &amp;lt;= 11. Therefore,&lt;br&gt;
&amp;gt; H = 20 was run with the result&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; H      min      median     mean      stdv      max&lt;br&gt;
&amp;gt; 20    0.8377    0.9212    0.9057    0.0373    0.9505&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; or&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; R^2 = 0.91 +/- 0.04 for H = 20.&lt;br&gt;
&amp;gt;&lt;br&gt;
&amp;gt; So, ... excluding x3 and x6 doesn't appear to&lt;br&gt;
&amp;gt; significantly degrade performance.&lt;br&gt;
&lt;br&gt;
Moving on ...&lt;br&gt;
&lt;br&gt;
Removing x3, x6 and x7 yields&lt;br&gt;
&lt;br&gt;
For 10-fold XVAL with I-H-O = 5-H-1,&lt;br&gt;
&lt;br&gt;
Ntrn = 0.9*N  = 927&lt;br&gt;
Neq  = Ntrn*O = 927&lt;br&gt;
Nw = (I+1)*H+(H+1)*O = 7*H + 1&lt;br&gt;
Neq &amp;gt; 10*Nw  ==&amp;gt; H &amp;lt; 13.1&lt;br&gt;
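&lt;br&gt;
The bound on H above can be checked with a few lines of code. This is only&lt;br&gt;
a rough sketch in Python rather than MATLAB; the helper name max_hidden_nodes&lt;br&gt;
and its argument names are made up for illustration, and the factor of 10 is&lt;br&gt;
the rule of thumb used in the inequality above.&lt;br&gt;

```python
def max_hidden_nodes(n_train, n_inputs, n_outputs=1, ratio=10.0):
    """Bound on H from requiring Neq to exceed ratio*Nw for an I-H-O MLP."""
    n_eq = n_train * n_outputs  # number of training equations, Neq = Ntrn*O
    # Nw = (I+1)*H + (H+1)*O = (I+1+O)*H + O, solved for H at Neq = ratio*Nw
    return (n_eq / ratio - n_outputs) / (n_inputs + 1 + n_outputs)


n_train = round(0.9 * 1030)  # 927 training points per fold in 10-fold XVAL
for n_inputs in (8, 6, 5):
    # H must stay strictly below the printed value
    print(n_inputs, round(max_hidden_nodes(n_train, n_inputs), 4))
```

With I = 8, 6 and 5 this should give back the limits 9.17, 11.4625 and 13.1.&lt;br&gt;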
&lt;br&gt;
&amp;nbsp;H     min      median     mean      stdv      max&lt;br&gt;
&amp;nbsp;9    0.8413    0.8922    0.8837    0.0243    0.9167&lt;br&gt;
11    0.8508    0.8887    0.8865    0.0244    0.9208&lt;br&gt;
20    0.8552    0.9092    0.8970    0.0259    0.9269&lt;br&gt;
&lt;br&gt;
&amp;nbsp;9    R^2 = 0.88 +/- 0.02&lt;br&gt;
11    R^2 = 0.89 +/- 0.02&lt;br&gt;
20    R^2 = 0.90 +/- 0.03&lt;br&gt;
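&lt;br&gt;
These quotes appear to be the mean and stdv columns of the summary table above,&lt;br&gt;
rounded to two decimals. A small sketch of that rounding, in Python, assuming&lt;br&gt;
that reading is right (the function name quote is just for illustration):&lt;br&gt;

```python
def quote(mean, stdv):
    """Format a mean and standard deviation as a two-decimal R^2 quote."""
    return "R^2 = %.2f +/- %.2f" % (mean, stdv)


# (H, mean, stdv) copied from the summary table above
rows = [(9, 0.8837, 0.0243), (11, 0.8865, 0.0244), (20, 0.8970, 0.0259)]
for h, mean, stdv in rows:
    print(h, quote(mean, stdv))
```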
&lt;br&gt;
Hope this helps.&lt;br&gt;
&lt;br&gt;
Greg</description>
    </item>
  </channel>
</rss>

