Initially, I was interested in dimension reduction because I wanted to reduce the plot from 27 dimensions to a 2- or 3-dimensional plot; that is why I decided to use PCA.
I also understand that in PCA the orthogonally transformed inputs give the most spread, and I thought this could help in visualizing my data with maximum spread over the input variables. Just to clarify, I am not doing any form of classification; I do not have classes that I expect my data to fall into.
But after plotting, I noticed what looks like an overfitting issue, and I thought maybe I had used too many input variables. So I decided to remove some of the variables, but I do not know which are more significant and which are less significant and can be removed. I tried removing the variables that produce the smaller projections on the plot, but the result did not improve; it got worse. Hence, I thought maybe I should find out which variables contribute most to the first 2 PCs for a 2D plot.
Is my line of thought right? If so, will STEPWISEFIT, as mentioned earlier in the discussion, help in finding the variable importance? Or would some other method be more effective?
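
To make the question concrete, here is a rough sketch of what "contribution to the first 2 PCs" could mean. It is written in Python/NumPy rather than MATLAB purely as an illustration, the data are synthetic (not my real 350 x 27 set), and the function name `pc_contributions` and the scoring rule (squared loadings weighted by the fraction of variance each PC explains) are assumptions, only one of several possible ways to rank variables.

```python
import numpy as np

def pc_contributions(X, n_components=2):
    """Loadings of the leading PCs plus a per-variable contribution score.

    Hypothetical helper for illustration; the scoring rule is an assumption.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each column to zero mean / unit variance (same idea as zscore).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized data: the rows of Vt are the principal directions.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    coeff = Vt[:n_components].T                        # (n_vars, n_components)
    explained = (s**2 / np.sum(s**2))[:n_components]   # variance fraction per PC
    # Squared loadings sum to 1 within each PC, so they act as shares of that PC.
    # Weight each PC by the variance it explains and sum over the kept PCs.
    contrib = (coeff**2) @ explained
    return coeff, explained, contrib

# Synthetic stand-in data: 350 observations, 27 variables.
# Variables 0-2 share one hidden factor, variables 3-5 share another,
# and the remaining 21 variables are independent noise.
rng = np.random.default_rng(0)
n_obs, n_vars = 350, 27
X = rng.normal(size=(n_obs, n_vars))
f1 = rng.normal(size=n_obs)
f2 = rng.normal(size=n_obs)
X[:, 0:3] = f1[:, None] + 0.3 * rng.normal(size=(n_obs, 3))
X[:, 3:6] = f2[:, None] + 0.3 * rng.normal(size=(n_obs, 3))

coeff, explained, contrib = pc_contributions(X, n_components=2)
ranking = np.argsort(contrib)[::-1]   # variable indices, largest contribution first
print("fraction of variance in PC1, PC2:", explained)
print("six highest-ranked variables:", ranking[:6])
```

A variable with a low score here has little weight in the PC1/PC2 plane, but that alone does not say it is unimportant for the data as a whole.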
Please advise. Thanks.
"Greg Heath" <heath@alumni.brown.edu> wrote in message <ke349g$gpu$1@newscl01ah.mathworks.com>...
> "Maureen " <maureen_510@hotmail.com> wrote in message <ke191s$t5t$1@newscl01ah.mathworks.com>...
> > "Greg Heath" <heath@alumni.brown.edu> wrote in message <kduohk$nrj$1@newscl01ah.mathworks.com>...
> > > "Maureen " <maureen_510@hotmail.com> wrote in message <kdrm1o$e57$1@newscl01ah.mathworks.com>...
> > > > I have 350 observations and 27 variables. I want to use PCA for dimension reduction so that I can plot the 350 observations on a 2D plot, which effectively means that I will only be using PC1 and PC2. My purpose is just to see their relationship on a 2D plot.
> > > >
> > > > But how do I determine which of my original variables contribute most to the first two principal components, and which are less important and can be discarded? I have seen many similar posts online but have not come up with a solution. Where should I go from here?
> > >
> > > You have not indicated
> > >
> > > 1. whether the task is classification or regression
> > > 2. if any of the 27 are outputs
> > > 3. the number of output variables
> > >
> > > The most important is 1 because PCA is inappropriate for classification. Therefore, I'll
> > > assume the task is regression.
> > >
> > My task is definitely not classification, but I am not sure whether it is regression (I am not too familiar with regression even after reading through some materials). My main objective is to plot the 350 observations on a 2D plot and examine their relationship, in the sense of whether observations that are plotted closer together are more similar based on the 27 features.
> >
> > > The next most important is 2 because PCA is only used to transform the input space.
> > > Therefore I'll assume 27 original input variables.
> > >
> > > 3 is still important because it affects what algorithms/techniques should be used.
> > >
> > All 27 original variables are inputs, and I do not have an output variable.
> >
> > > > I have read through the documentation on feature selection, and some people suggested using stepwisefit and other regression methods.
> > >
> > > Yes. The best criterion to use is one that optimizes a specific function of the output variables.
> > >
> > I have no idea what the criterion should be, since I do not have a specific function of the output variables.
> > I just want to examine the relationship between the observations and the variables, to find out whether the closer the points (observations) are on the plot, the higher the similarity between the observations.
> >
> > > > I do not have much background in regression, so do correct me if I am wrong. Based on my reading, I believe I would need a set of criteria to select the features, but I have no idea what the criteria should be.
> > >
> > > If it's regression, it is simple, just read the STEPWISEFIT documentation.
> > >
> > > If it is classification, then you should not be using PCA because there is no reason why
> > > PCA space should be preferred over the original.
> > >
> > This is not for classification purposes.
> >
> > Am I right to say that if I use STEPWISEFIT, the variable Y used would be PC1? Similarly, I would have to do it for PC2 (and perhaps PC3, depending on how many dimensions I go for later), since I am interested in how much the original variables contribute to PC1 and PC2. After that, I would add up the absolute values of the 2 results from STEPWISEFIT (PC1 and PC2) to determine which original variables contribute the most to the first two PCs.
> >
> > > > Also, there should be a set of outputs, Y, in order to perform stepwisefit. But in my case, all 27 variables are my features, which are the inputs so to speak, and I do not have a set of outputs.
> > > >
> > > > So if I am not using regression, where do I go from here to determine the importance of my original set of variables? In other words, I need to find the contribution of the original variables to PC1 and PC2.
> > > >
> > > > I would appreciate any help or suggestions. Thanks in advance!
> > >
> > > If you don't know what you want to optimize, then there is no reason to use PCA
> > > over the original variables.
> > >
> > > What do you want to do with the data? What is your ultimate goal?
> > >
> > My main purpose is to optimise the plot so that observations that are generally similar are plotted closer together and those that differ greatly are plotted further away from each other. But right now I seem to have a bit of an overfitting problem whereby, amidst the similar observations, there are a few points that are plotted in the wrong place.
>
> You seem rather confused.
>
> Originally you postulated that you want to do something based on PCA without fully understanding why you are using PCA.
>
> PCA is predominantly used to discover and rank the orthogonal directions in which the input data has the most spread without considering the task for which the data will be used.
>
> PCA is used in regression based on the idea that the orthogonal transformed input variables with the most spread are probably the variables that best explain the spread in the output data. This is not always true but the transformation can still be useful if the subset of PCs that is selected is based on the ability to represent the spread in the output data.
>
> PCA is used in classification based on the idea that the orthogonal transformed input variables with the most spread are probably the variables that best explain the separation between classes of data. This is not always true but the transformation can still be useful if the subset of PCs that is selected is based on the ability to represent the class separation.
>
> Whenever the output data is available, PLS tends to be better than PCA because it ranks
> transformed input variables based on how much they contribute to the understanding of the I/O relationship in both classification and regression.
>
> Since you are just interested in visualizing general relationships between all variables without regard to spread or separation,
>
> 1. Standardize (help zscore) the variables to zero mean/unit variance
> 2. Project the results on all xj vs xi (j > i) planes
> 3. Transform to an orthogonal basis e.g., PCA
> 4. Repeat 2.
>
> You may also want to cluster the data and color code the projections based
> on cluster membership.
>
> Hope this helps.
>
> Greg
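
For reference, the four steps quoted above (standardize, project on all variable pairs, transform to an orthogonal basis, project again), plus the optional clustering, might look roughly like the sketch below. It is in Python/NumPy rather than MATLAB, the data are synthetic, and every function name (`zscore`, `pairwise_projections`, `pca_scores`, `kmeans_labels`) is a hypothetical helper written for this illustration. The clustering is a bare-bones k-means with an arbitrary k = 3, chosen only as an example; the quoted advice does not name a clustering method.

```python
import numpy as np
from itertools import combinations

def zscore(X):
    """Step 1: standardize each column to zero mean / unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pairwise_projections(Z):
    """Steps 2 and 4: yield (i, j, points) for every column pair with j > i."""
    for i, j in combinations(range(Z.shape[1]), 2):
        yield i, j, Z[:, [i, j]]   # points[:, 0] is x_i, points[:, 1] is x_j

def pca_scores(Z):
    """Step 3: rotate the data onto its principal axes (an orthogonal basis)."""
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt.T               # columns ordered by decreasing variance

def kmeans_labels(Z, k, n_iter=50, seed=0):
    """Optional: plain Lloyd-style k-means, returning one label per row."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every observation to every center.
        d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):          # leave an empty cluster's center as is
                centers[c] = Z[labels == c].mean(axis=0)
    return labels

# Synthetic stand-in for the real 350 x 27 data.
rng = np.random.default_rng(1)
X = rng.normal(size=(350, 27))

Z = zscore(X)                                   # step 1
views = list(pairwise_projections(Z))           # step 2: 27*26/2 variable pairs
scores = pca_scores(Z)                          # step 3
pc_views = list(pairwise_projections(scores))   # step 4: PC2 vs PC1, PC3 vs PC1, ...
labels = kmeans_labels(Z, k=3)                  # cluster membership for color coding

print(len(views), "variable-pair views and", len(pc_views), "PC-pair views")
```

Each `points` array can be handed to any scatter-plot routine, with `labels` as the color argument. The rotation in step 3 keeps the distances between observations in the full 27-dimensional space unchanged; only the choice of which two axes to look at changes what a given 2D view shows.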
