This example shows how to fit multiple probability distribution objects to the same set of sample data, and obtain a visual comparison of how well each distribution fits the data.

Load the sample data.

```
load carsmall;
```

This data contains miles per gallon (`MPG`) measurements for different makes and models of cars, grouped by country of origin (`Origin`), model year (`Model_Year`), and other vehicle characteristics.

Transform `Origin` into a nominal array and remove the Italian car from the sample data.

```
Origin = nominal(Origin);
MPG2 = MPG(Origin~='Italy');
Origin2 = Origin(Origin~='Italy');
Origin2 = droplevels(Origin2,'Italy');
```

Since there is only one Italian car, `fitdist` cannot fit a distribution to that group. Removing the Italian car from the sample data prevents `fitdist` from producing an error.
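To see which groups are too small before fitting, the group sizes can be counted directly. The following is a minimal sketch rather than part of the original example: it loads the data into a struct `S` (a name chosen here for illustration) so that existing workspace variables are not overwritten, and it uses only base MATLAB functions.

```matlab
% Count the cars in each country-of-origin group.
S = load('carsmall');             % load into a struct; workspace variables stay untouched
labels = cellstr(S.Origin);       % one label per car
[names,~,idx] = unique(labels);   % idx maps each car to its group
counts = accumarray(idx,1);       % number of cars per group

% Print one line per group.
for k = 1:numel(names)
    fprintf('%-8s %d\n',names{k},counts(k));
end
```

A group with a single observation, such as Italy here, is the kind of group to remove before calling `fitdist` with the `'by'` option.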

Use `fitdist` to fit Weibull, normal, logistic, and kernel distributions to each country of origin group in the `MPG` data.

```
[WeiByOrig,Country] = fitdist(MPG2,'weibull','by',Origin2);
[NormByOrig,Country] = fitdist(MPG2,'normal','by',Origin2);
[LogByOrig,Country] = fitdist(MPG2,'logistic','by',Origin2);
[KerByOrig,Country] = fitdist(MPG2,'kernel','by',Origin2);
WeiByOrig
Country
```

```
WeiByOrig = 
  Column 1
    [1x1 prob.WeibullDistribution]
  Column 2
    [1x1 prob.WeibullDistribution]
  Column 3
    [1x1 prob.WeibullDistribution]
  Column 4
    [1x1 prob.WeibullDistribution]
  Column 5
    [1x1 prob.WeibullDistribution]

Country = 
    'France'
    'Germany'
    'Japan'
    'Sweden'
    'USA'
```

Each country group now has four distribution objects associated with it. For example, the cell array `WeiByOrig` contains five Weibull distribution objects, one for each country represented in the sample data. Likewise, the cell array `NormByOrig` contains five normal distribution objects, and so on. Each object contains properties that hold information about the data, distribution, and parameters. The array `Country` lists the country of origin for each group in the same order as the distribution objects are stored in the cell arrays.
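As an illustration of the properties mentioned above, the sketch below fits a single Weibull distribution to the USA data and reads a few of its properties. It is a standalone sketch, not a step of the original example: the variable names `S`, `usaMPG`, `pd`, and `ci` are chosen here for illustration, and the data is loaded into a struct so the workspace is left unchanged.

```matlab
% Fit one Weibull distribution to the USA data and inspect the object.
S = load('carsmall');
usaMPG = S.MPG(strcmp(cellstr(S.Origin),'USA'));
usaMPG = usaMPG(~isnan(usaMPG));      % drop missing measurements explicitly

pd = fitdist(usaMPG,'weibull');

disp(pd.DistributionName)             % name of the fitted distribution
disp(pd.ParameterNames)               % names of the estimated parameters
disp(pd.ParameterValues)              % estimated parameter values

% Confidence intervals for the parameters (one column per parameter).
ci = paramci(pd);
disp(ci)
```

The same property access should apply to the objects stored in the cell arrays, for example `WeiByOrig{5}.ParameterValues`.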

Extract the four probability distribution objects for USA and compute the pdf for each distribution. As the `Country` output shows, USA is in position 5 in each cell array.

```
WeiUSA = WeiByOrig{5};
NormUSA = NormByOrig{5};
LogUSA = LogByOrig{5};
KerUSA = KerByOrig{5};
x = 0:1:50;
pdf_Wei = pdf(WeiUSA,x);
pdf_Norm = pdf(NormUSA,x);
pdf_Log = pdf(LogUSA,x);
pdf_Ker = pdf(KerUSA,x);
```
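Hard-coding position 5 works for this data, but the position can also be looked up from the group names returned by `fitdist`. The sketch below is one possible way to do that, under stated assumptions: it is standalone, it groups by a cell array of character vectors instead of a nominal array (so the group order may differ from the order shown above), and the names `S`, `keep`, `pdca`, `groupNames`, and `idxUSA` are illustrative.

```matlab
% Look up a group's position by name instead of hard-coding the index.
S = load('carsmall');
labels = cellstr(S.Origin);
keep = ~strcmp(labels,'Italy') & ~isnan(S.MPG);   % drop the single Italian car and missing values

[pdca,groupNames] = fitdist(S.MPG(keep),'weibull','by',labels(keep));

idxUSA = find(strcmp(groupNames,'USA'));          % position of the USA group
usaDist = pdca{idxUSA};
disp(usaDist)
```

A name-based lookup keeps the code correct if groups are added, removed, or returned in a different order.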

Plot the pdf for each distribution fit to the USA data, superimposed on a histogram of the sample data. Scale the density by the histogram area for easier display.

```
% Create a histogram of the USA sample data
data = MPG2(Origin2=='USA');   % index MPG2 so the data and the group labels align
figure;
[n,y] = hist(data,10);
b = bar(y,n,'hist');
set(b,'FaceColor',[1,0.8,0]);

% Scale the density by the histogram area for easier display
area = sum(n)*(y(2)-y(1));
pdfWei = pdf(WeiUSA,x);
pdfNorm = pdf(NormUSA,x);
pdfLog = pdf(LogUSA,x);
pdfKer = pdf(KerUSA,x);

% Plot the pdf of each fitted distribution
line(x,pdfWei*area,'LineStyle','-','Color','r');
hold on;
line(x,pdfNorm*area,'LineStyle','-.','Color','b');
line(x,pdfLog*area,'LineStyle','--','Color','g');
line(x,pdfKer*area,'LineStyle',':','Color','k');
l = legend('Data','Weibull','Normal','Logistic','Kernel');
set(l,'Location','Best');
title('MPG for Cars from USA');
xlabel('MPG');
hold off;
```
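The scaling factor is the total area of the histogram bars: the number of observations times the bin width. Multiplying a pdf (which integrates to 1) by that factor gives a curve whose area matches the histogram. The sketch below checks this numerically for a normal fit. It is a standalone illustration with names chosen here (`S`, `histArea`, `curveArea`), not part of the original example.

```matlab
% Check that the scaled pdf encloses the same area as the histogram.
S = load('carsmall');
data = S.MPG(strcmp(cellstr(S.Origin),'USA'));
data = data(~isnan(data));

[n,centers] = hist(data,10);                 % counts and bin centers
binWidth = centers(2) - centers(1);
histArea = sum(n)*binWidth;                  % total area of the bars

pd = fitdist(data,'normal');
xx = linspace(pd.mu - 8*pd.sigma, pd.mu + 8*pd.sigma, 2001);
curveArea = trapz(xx,pdf(pd,xx)*histArea);   % numerical area under the scaled pdf

fprintf('Histogram area: %.3f, scaled pdf area: %.3f\n',histArea,curveArea);
```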

Superimposing the pdf plots over a histogram of the sample data provides a visual comparison of how well each type of distribution fits the data. Only the nonparametric kernel distribution `KerUSA` comes close to revealing the two modes in the original data.
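The visual comparison can be complemented with a numeric one. The sketch below is an addition that is not in the original example: it compares the three parametric fits by negative log-likelihood and by the Akaike information criterion (AIC), computed here by hand as `2*k + 2*nll` with `k = 2` parameters for each of these distributions. The kernel fit is left out because a likelihood evaluated on the same data used to build a kernel estimate is not directly comparable. Variable names are illustrative.

```matlab
% Compare parametric fits to the USA data numerically.
S = load('carsmall');
data = S.MPG(strcmp(cellstr(S.Origin),'USA'));
data = data(~isnan(data));

distNames = {'weibull','normal','logistic'};
nll = zeros(size(distNames));
for k = 1:numel(distNames)
    pd = fitdist(data,distNames{k});
    nll(k) = negloglik(pd);        % negative log-likelihood of the fit
end

numParams = 2;                     % each of these distributions has two parameters
aic = 2*numParams + 2*nll;         % lower values indicate a better trade-off

for k = 1:numel(distNames)
    fprintf('%-9s NLogL = %8.3f   AIC = %8.3f\n',distNames{k},nll(k),aic(k));
end
```

None of these single-peaked distributions can represent two modes, so a lower AIC here only ranks them against each other.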

To investigate the two modes revealed in the previous plot, group the `MPG` data by both country of origin (`Origin`) and model year (`Model_Year`), and use `fitdist` to fit kernel distributions to each group.

```
[KerByYearOrig,Names] = fitdist(MPG,'Kernel','By',{Origin Model_Year});
```

Each unique combination of origin and model year now has a kernel distribution object associated with it.

Extract the three probability distributions for each USA model year, which are in positions 12, 13, and 14 in the cell array `KerByYearOrig`. Compute each pdf.

```
USA70 = KerByYearOrig{12};
USA76 = KerByYearOrig{13};
USA82 = KerByYearOrig{14};
pdf70 = pdf(USA70,x);
pdf76 = pdf(USA76,x);
pdf82 = pdf(USA82,x);
```

Plot the pdf for each group on the same figure.

```
figure;
plot(x,pdf70,'r-');
hold on;
plot(x,pdf76,'b-.');
plot(x,pdf82,'k:');
legend({'1970','1976','1982'},'Location','NW');
title('MPG in USA Cars by Model Year');
xlabel('MPG');
hold off;
```

When further grouped by model year, the pdf plots reveal two distinct peaks in the `MPG` data for cars made in the USA: one for the model year 1970, and one for the model year 1982. This explains why the smooth curve produced by the kernel distribution for the combined USA miles per gallon data shows two peaks instead of one.
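The shift between model years can also be seen in simple summary statistics. The following sketch, which is an addition to the original example, computes the mean `MPG` of USA cars for each model year using only base MATLAB functions. The variable names are illustrative, and the data is loaded into a struct so the workspace is not modified.

```matlab
% Mean MPG of USA cars for each model year.
S = load('carsmall');
isUSA = strcmp(cellstr(S.Origin),'USA');
years = unique(S.Model_Year(isUSA));

meanMPG = zeros(size(years));
for k = 1:numel(years)
    v = S.MPG(isUSA & S.Model_Year == years(k));
    v = v(~isnan(v));              % ignore missing measurements
    meanMPG(k) = mean(v);
    fprintf('Model year %d: mean MPG = %.2f (n = %d)\n',years(k),meanMPG(k),numel(v));
end
```

Well-separated group means are consistent with the separate peaks in the per-year kernel densities.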
