Thread Subject: Give me a Regression Problem

From: Greg Heath

Date: 17 Jul, 2008 11:29:20

Message: 1 of 1

On Jul 16, 6:54 pm, baldrick <philbrier...@hotmail.com> wrote:
> On Jul 17, 5:12 am, Greg Heath <he...@alumni.brown.edu> wrote:
>
> > On Jul 15, 7:14 am, baldrick <philbrier...@hotmail.com> wrote:
> > -----SNIP
>
> > > An even quicker and not so dirty method to find the variable
> > > importance is to use the neural network model itself!
>
> > > Build a neural network model and then systematically randomise each of
> > > the inputs in turn and see how much using a random value (from the
> > > same distribution) rather than the actual value destroys the model.
> > > Repeat this many times and take an average.
>
> > How many times did you repeat the randomizations?
> > How did you calculate the tabulated values for
> >   a. scrambled correlation
> >   b. relative importance
>
> > This procedure estimates the importance of each
> > predictor when all other variables are present. It is
> > the ranking at the first step of stepwise (and stage-
> > wise) backward elimination. In general, if correlations
> > between predictors are not insignificant, the rankings
> > above the last will change as the procedure is
> > continued; i.e., the last ranked variable is removed,
> > a new net is designed, and the randomization is
> > repeated to obtain the next lowest ranked variable.
>
> > In this sense, the technique does qualify as dirty.
>
> > However, as previously indicated by the dirtier, but
> > quicker, linear and quadratic regression stagewise
> > backward elimination procedures, none of the variables
> > are insignificant.
>
> > Therefore, these rankings are credible.
>
> > Hope this helps.
>
> > Greg
>
> The randomizations were repeated 100 times for each variable being
> tested. It is no good doing it just once - the more the merrier. If
> you do it only a few times you will not get consistent results.
>
> The scrambled correlation is the new model r^2 when the variable in
> question is messed around with, or 'scrambled', averaged over the
> number of times the 'scrambling' is repeated.
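
A minimal Python sketch of the scrambled correlation described above. Assumptions, none of which come from the thread: r^2 is taken as the coefficient of determination, "scrambling" is done by shuffling the variable's column across rows (which keeps its empirical distribution), and `predict` is a hypothetical stand-in for an already trained network.

```python
import random

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (an assumed definition)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def scrambled_r2(predict, X, y, col, repeats=100, rng=None):
    """Average r^2 of a fixed model when column `col` is shuffled across rows."""
    if rng is None:
        rng = random.Random(0)
    total = 0.0
    for _ in range(repeats):
        values = [row[col] for row in X]
        rng.shuffle(values)  # same values, but their link to y is broken
        X_scr = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, values)]
        total += r_squared(y, [predict(row) for row in X_scr])
    return total / repeats

# Hypothetical "trained model": uses x0 strongly, x1 weakly, ignores x2.
def predict(row):
    return 3.0 * row[0] + 0.5 * row[1]

rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [predict(row) + rng.gauss(0, 0.1) for row in X]

full_r2 = r_squared(y, [predict(row) for row in X])
scores = [scrambled_r2(predict, X, y, c, rng=rng) for c in range(3)]
print(full_r2, scores)
```

Scrambling the ignored input leaves r^2 unchanged, while scrambling the heavily used input lowers it the most. Note that the model is never refitted here; only its inputs are disturbed.
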
>
> The relative importance is calculated by simple linear transformation
> based on the scrambled correlations, such that the variable whose
> scrambled correlation is lowest gets an importance of 1 and any
> variable whose scrambled correlation is the same as the normal model
> gets an importance of 0 (which means that it does not matter what
> value that variable has). It is possible to get negative importance
> which would mean randomizing that variable is actually improving the
> model!
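
The linear transformation described above can be sketched in Python as follows. The formula is inferred from the two anchor points given (lowest scrambled r^2 maps to 1, the unscrambled model's r^2 maps to 0); the variable names and all numbers are made up for illustration and are not results from the thread.

```python
def relative_importance(full_r2, scrambled):
    """Map scrambled r^2 values linearly so that lowest -> 1 and full_r2 -> 0."""
    lowest = min(scrambled.values())
    span = full_r2 - lowest
    if span == 0:
        raise ValueError("scrambling the most important variable changes nothing")
    return {name: (full_r2 - r2) / span for name, r2 in scrambled.items()}

# Illustrative values only (hypothetical variables x1..x4).
full_r2 = 0.90
scrambled = {"x1": 0.40, "x2": 0.65, "x3": 0.90, "x4": 0.92}

importance = relative_importance(full_r2, scrambled)
for name, value in importance.items():
    print(name, round(value, 3))
```

Here x4 comes out slightly negative because its scrambled r^2 exceeds that of the unscrambled model, which is the "randomizing improves the model" case mentioned above.
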
>
> You have to drop statistical thinking to understand what this method
> is telling you. It is saying, 'if I use this current model, what will
> happen to the performance if one of my variables goes belly up?'.
>
> This is particularly important in areas such as credit risk. If you
> are using a field such as FICO score in your model, and FICO suddenly
> decide they are going to calculate it in a different way without
> telling you, or you lose the data feed, then you need to know what
> will happen to your model. Another example is personal income, which
> gradually increases over time - if your model is heavily reliant on
> income, then it will start to deteriorate quite quickly.
>
> There is also no reason why strongly correlated variables should not be
> used together. Historically, the advice against it comes from the maths
> behind finding the coefficients using certain techniques - inverting matrices
> and so on. Logically, I would rather have say, both income and bank
> balance in my model even if they were highly correlated. How do I know
> which one is the real driver of whatever is being predicted? There
> is no reason why they should stay correlated (banks know your balance,
> but you could start lying about your income). Having both in the
> model is kind of hedging your bets against things going wrong (you
> would want them to have similar importance though).
>
> Personally I use these importance calculations for initially trimming
> out the rubbish (as in the random numbers I put in the concrete data)
> and getting down to the variables of interest. It does save a lot of
> time. I have come across model builders who have thousands of
> candidate variables and spend months inspecting each one in turn -
> only to end up getting rid of 95% of them.

This technique should be better than just clamping an input to its
mean value.

Is this used in a backward elimination mode, i.e., toss out the
worst variable, design a new net and repeat?
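
For concreteness, the backward elimination mode could be sketched in Python like this. `rank_fn` is a hypothetical hook that would design a new net on the remaining variables and return their importances; `toy_rank` below is a made-up stand-in with invented scores, chosen so that two correlated inputs each look weak while the other is still present.

```python
def backward_elimination(variables, rank_fn, keep=1):
    """Repeatedly drop the least important variable and re-rank the rest.

    rank_fn(active) is assumed to refit a model on `active` and return
    {name: importance}. It is a hypothetical hook, not a library call.
    """
    active = list(variables)
    dropped = []
    while len(active) > keep:
        importance = rank_fn(active)
        worst = min(active, key=lambda name: importance[name])
        active.remove(worst)
        dropped.append(worst)
    return dropped, active

def toy_rank(active):
    # Invented scores: "income" and "balance" share credit while both are present.
    scores = {"income": 0.3, "balance": 0.35, "age": 0.5, "noise": 0.0}
    if not {"income", "balance"} <= set(active):
        # Once one of the pair is gone, the survivor carries the full effect.
        scores["income"] = scores["balance"] = 0.9
    return {name: scores[name] for name in active}

dropped, kept = backward_elimination(
    ["income", "balance", "age", "noise"], toy_rank
)
print(dropped, kept)
```

In this toy run "balance" starts below "age" but ends up as the last variable standing, which is the kind of ranking change that a single first-step ranking cannot show.
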

Hope this helps.

Greg

