No BSD License  

Highlights from
Ordinal Data Modeling

image thumbnail
from Ordinal Data Modeling by Valen Johnson
Companion Software

data_analysis.m
echo on

% ***********************************************************************
%                   INTRODUCTION TO MATLAB
% ***********************************************************************

% Matlab is a system for doing numerical and graphical computation.
% It was initially written to provide a interface for the LINPACK
% and EISPACK linear algebra packages.  It has evolved over the years
% to become one of the premier high level languages for doing
% mathematical computation. 

% The goal of this tutorial to describe some basic features of the
% Matlab language with particular emphasis on its abilities in the
% areas of data manipulation, data analysis, and graphing.

%***********************************************************************
%                           NOTE
%-----------------------------------------------------------------------
%        WHENEVER YOU SEE THE WORD PAUSE THE M-FILE WILL PAUSE
%        PRESS ANY KEY TO CONTINUE.
%***********************************************************************

pause

%------------------------------------------------------------------------
%                           MATRIX
%------------------------------------------------------------------------
% The basic structure in Matlab is a matrix.  Let's look at a dataset
% that is discussed in Chapters 3 and 4.  For 30 students in a statistics
% class, we record the grade (A, B, C, D, F) and the score on the SAT
% exam that was taken during high school.  

% We represent this data by a matrix named 'class' that consists of 30 rows
% and 2 columns -- the first column contains the grades (coded 1 = F, 2 = D,
% 3 = C, 4 = B, 5 = A) and the second column contains the SAT scores.

% We use square brackets are used to describe the matrix.  Carriage
% returns to describe different rows.  Alternatively, a semicolon
% can be used to separate rows. Thee command is concluded with
% a semicolon.  

echo off

class=[ 3   525
        2   533
        3   545
        4   582
        2   581
        1   576
        3   572
        4   609
        2   559
        1   543
        3   576
        4   525
        0   574
        1   582
        2   574
        3   471
        3   595
        2   557
        4   557
        4   584
        3   599
        2   517
        4   649
        2   584
        1   463
        3   591
        2   488
        3   563
        3   553
        4   549 ];

echo on
pause

% We display 'class' by typing it without a semicolon at the end.

class

pause


%------------------------------------------------------------------------
%                        MATRIX MANIPULATIONS
%------------------------------------------------------------------------		
% Let's illustrate some basic manipulations of this data matrix.

% We define a list of numbers by a colon:

1:5

pause

% We look at the data for students 1-5 by use of the command

class(1:5,:)

pause

% Here we are extracting rows 1-5 and all columns (indicated by a :).
% Similarly, suppose we wish to extract the first column of the
% data matrix and assign it a column vector called 'grade'.  We extract
% the second column of 'class' and assign it to a variable called 'sat'.

grade=class(:,1); sat=class(:,2);

pause

% Note that we can have multiple statements on a line -- we separate the
% statements by ; (or ,)

% We can perform various computations on vectors and matrices.

% Suppose that we wish to scale the grades by multiplying by 10 and adding 5
% -- we'll call the new grade 'new_grade':

new_grade=10*grade+5

pause

% Suppose that we wish to square each grade -- we do this by the .^ operator.

grade.^2

pause

% grade is currently a column vector, we can convert it to a row vector by
% the transpose (') operator:

grade'

pause

% Suppose that we wish to add 'new_grade' to 'sat'.  The addition (+) 
% operator works for matrices if the dimensions line up.

new_grade+sat

pause

% Matlab also supports logical operators.  

% Suppose that we want to define a passing grade -- if grade is 3 or higher,
% then pass = 1, otherwise pass = 0.  We define this using >=

pass=(grade>=3)

pause

% We want to define a variable 'C', which is equal to 1 if the grade is 3
% and 0 otherwise.  We use the logical equal (==) operator:

C=(grade==3)

pause

%------------------------------------------------------------------------
%                     DATA ANALYSIS COMMANDS
%------------------------------------------------------------------------

% Matlab supports many data analysis operations.  The ones described below
% are intrinsic to Matlab.  Many others are available in the Statistics
% Toolbox.

% The 'length' command gives the number of entries in a vector:

length(grade)

pause

% The commands below find the mean and standard deviation of the column 
% vector 'sat':

mean(sat), std(sat)

pause

% Here is an alternative way of computing a standard deviation -- it 
% illustrates the 'sum' and 'sqrt' commands:

sqrt(sum((sat-mean(sat)).^2/(length(grade)-1)))

pause

% Suppose we wish to tally the different grades in the vector 'grade'.
% This can be done using the Matlab 'hist' function (the graphical version
% of this command will be illustrated later).  

freq=hist(grade,0:4)

% We see that there is 1 F, 4 D's, 8 C's, 10 B's, and 7 A's.

pause

% We can find the number of D's by use of a == logical operator and a 'sum':

num_D=sum(grade==1)

pause

% Likewise, the proportion of A's is given by

prop_A=sum(grade==4)/length(grade)

pause

%------------------------------------------------------------------------
%                     GRAPHING COMMANDS
%------------------------------------------------------------------------

% Matlab has a large number of graphing commands.  Here we illustrate some
% basic Matlab commands that are helpful in data analysis.

% A histogram of the SAT scores is constructed by the 'hist' command.

hist(sat)

pause

% Suppose that we wish to use bin midpoints of 460, 480, ..., 660. We first
% define a variable, called 'bins', which contains 460 to 660 in steps of 20:

bins=460:20:600;

pause

% The histogram using these bin midpoints is given by

hist(sat,bins)

pause

% Suppose that we wish to construct a scatterplot of 'grade' (vertical axis)
% against 'sat' (horizontal axis).  We use the 'plot' command.  The horizontal
% variable goes first, and we indicate a plotting symbol at the end.

plot(sat,grade,'o')

pause

% We label this plot by the 'xlabel', 'ylabel', 'title' commands:

xlabel('SAT'); ylabel('GRADE'); title('SCATTERPLOT OF GRADE AGAINST SAT')

pause

% We look at this graph -- we see a slight positive relationship between
% SAT and GRADE.  This motivates us to fit a line.  

% We first form a regression matrix.  The matrix X contains two columns - the first
% column is a ones vector (called ones(30,1)) and the second column are the sat scores.

X=[ones(30,1) sat];

pause

% A least-squares fit of SAT on GRADE is found in Matlab using a matrix divide notation -- 
% the column vector 'b' contains the regression coefficients:

b=X\grade

pause

% Let's put this least-squares fit on the graph.  We define points on the line:

xpt=[450 750]; ypt=b(1)+b(2)*xpt;

pause

% The 'line' command will connect the points and put the line on the current plot:

line(xpt,ypt)

pause

%------------------------------------------------------------------------
%                    WORKING WITH SIMULATED DATA
%------------------------------------------------------------------------

% Let's review some of the above commands and introduce some new ones by
% doing some data analysis on simulated data.

% We simulate numbers from a standard normal distribution.  We create a
% matrix of 10 rows and 100 columns and assign it to the matrix 'rand_norm':

rand_norm=randn(10,100);

pause

% Let's construct a histogram of the first row of this matrix -- it should
% look bell shaped and centered about the value 0.

hist(rand_norm(1,:))

pause

% Another property of these simulated values is that they are uncorrelated.
% To show this, we plot the entire stream of simulated values.  We collapse
% the matrix 'rand_norm' to a column vector 'rand_c' (using the colon 
% operation and then plot the elements of 'rand_c' as a line graph.

rand_c=rand_norm(:);
plot(rand_c)

% Note that there is no pattern or trend in the graph which indicates that
% they are uncorrelated.

pause

% Next, let's explore characteristics of sample means of normal samples.
% The following command finds the mean of each column of 'rand_norm' -- the
% resulting row vector is stored in 'means':

means=mean(rand_norm);

pause

% The values in 'means' are normally distributed with mean 0 and standard
% deviation sqrt(1/10) -- let's display the distribution.

hist(means)

pause

% We can use the random matrix 'rand_norm' to illustrate other functions of
% normals.  First we'll square each element of the matrix (by the .^ operator)
% and put the result in 'chisqs' -- the elements are independent
% chi-square(1):

chisqs=rand_norm.^2;

pause

% If we sum across rows, we'll get a random sample from a chi square 
% distribution with 10 degrees of freedom.

chisq10=sum(chisqs);

pause

% We'll display a histogram of these values.

hist(chisq10)

pause

% Looking at a chi-square table, we see that the upper 10 percentage point
% of a chi-square (10) is 14.68.  So we expect that roughly 10 percent of
% our random chi-square(10)'s are larger than 14.68.  Let's check:

sum(chisq10>14.68)

% This is the number of simulated values larger than 14.68 out of 100.
% Hopefully this is about 10 percent.


echo off

Contact us at files@mathworks.com