How to know the distribution of my data

Dear MATLAB Community, I have attached an Excel file with some of my data. The data represent the percentage of loads in each load bin, along with their histogram. My question is: how can I find out, using MATLAB, which distribution my data follow? Is it normal, exponential, or something else? And after that, how do I find the parameters of the distribution? Thanks a lot.

8 Comments

Do you have the Statistics Toolbox?
There are fitting functions therein for quite a number of distributions, and specific hypothesis tests such as lillietest for the normal, exponential, and extreme value distributions.
Do you have any idea of the theoretical shape or where the data came from? For example, they're particle sizing measurements so it's likely to be log-normal? Or it's two Gaussians?
dpb, I am not sure whether I have this Toolbox. How can I get it, please?
Image Analyst, since it is traffic data (a huge data set), I believe it must be normally distributed, but I am not sure yet.
All products from The Mathworks come from their normal distribution channel for your user type -- student, individual, industry, ... for your locale. How did you get access to the copy you're currently using of base product?
ver
at the command line will show which version you have and which Toolboxes are installed.
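If scrolling through the output of ver is inconvenient, the same check can be done programmatically. This is only a sketch: it assumes the toolbox's license feature name is 'Statistics_Toolbox' and that its product folder is 'stats', which is the usual case for the Statistics and Machine Learning Toolbox.

```matlab
% Check whether the Statistics and Machine Learning Toolbox can be used.
% Assumption: feature name 'Statistics_Toolbox' and product folder 'stats'.
hasLicense  = license('test', 'Statistics_Toolbox');  % 1 if a license exists
isInstalled = ~isempty(ver('stats'));                 % true if the toolbox is installed

if hasLicense && isInstalled
    disp('Statistics and Machine Learning Toolbox is available.');
else
    disp('Toolbox not available; functions such as fitdist and lillietest will not run.');
end
```

A license can exist without the toolbox being installed (and vice versa), which is why both conditions are checked.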
I am a PhD student. MATLAB Version: 9.2.0.538062 (R2017a)
Your institution may have access to more; ask your advisor for what is available for your use on university machines.
thank you for your help


 Accepted Answer

When I fit the data to the sum of 3 Gaussians, the fit looks pretty reasonable. What do you think? And why do you need analytical equation(s) for the distribution rather than just using the ACTUAL distribution obtained from the histogram?
% Uses fitnlm() to fit a non-linear model (sum of three Gaussians with an offset) through noisy data.
% Requires the Statistics and Machine Learning Toolbox, which is where fitnlm() is contained.
% Initialization steps.
clc; % Clear the command window.
close all; % Close all figures (except those of imtool.)
clear; % Erase all existing variables. Or clearvars if you want.
workspace; % Make sure the workspace panel is showing.
format long g;
format compact;
fontSize = 20;
% % Create 4000 X coordinates from 0 to 40000.
% X = linspace(0, 40000, 4000);
% mu1 = 6000; % Mean, center of Gaussian.
% sigma1 = 2000; % Standard deviation.
% mu2 = 13000; % Mean, center of Gaussian.
% sigma2 = 2500; % Standard deviation.
%
% % Define function that the X values obey.
% a = 0 % Arbitrary sample values I picked.
% b = 3
% c = 18
% Y = a + b * exp(-((X - mu1)/sigma1) .^ 2) + ...
% c * exp(-((X - mu2)/sigma2) .^ 2); % Get a vector. No noise in this Y yet.
% X=X';
% Y=Y';
data = xlsread('matlab.xlsx');
X = data(:, 1);
Y = data(:, 2);
% Now we have noisy training data that we can send to fitnlm().
% Plot the noisy initial data.
plot(X, Y, 'b.', 'LineWidth', 2, 'MarkerSize', 15);
grid on;
drawnow;
% Convert X and Y into a table, which is the form fitnlm() likes the input data to be in.
tbl = table(X, Y);
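% Define the model as a constant offset plus three Gaussians:
% Y = b1 + b2*exp(-((x-b3)/b4)^2) + b5*exp(-((x-b6)/b7)^2) + b8*exp(-((x-b9)/b10)^2)
% Because the exponent has no factor of 1/2, each width (b4, b7, b10) is sqrt(2) times the standard deviation.
% Note how this "x" of modelfun is related to big X and big Y.
% x(:, 1) is the predictor column X of the table; the last column of the table, Y, is the response.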
modelfun = @(b,x) b(1) + b(2) * exp(-((x(:, 1) - b(3))/b(4)).^2) + ...
b(5) * exp(-((x(:, 1) - b(6))/b(7)).^2) + ...
b(8) * exp(-((x(:, 1) - b(9))/b(10)).^2);
beta0 = [0, 2, 6000, 2000, 18, 13000, 2000, 2, 14000, 9000]; % Guess values to start with. Just make your best guess.
% Now the next line is where the actual model computation is done.
mdl = fitnlm(tbl, modelfun, beta0);
% Now the model creation is done and the coefficients have been determined.
% YAY!!!!
% Extract the coefficient values from the model object.
% The actual coefficients are in the "Estimate" column of the "Coefficients" table that's part of the model.
coefficients = mdl.Coefficients{:, 'Estimate'}
% Let's do a fit, but let's get more points on the fit, beyond just the widely spaced training points,
% so that we'll get a much smoother curve.
X = linspace(min(X), max(X), 1920); % Let's use 1920 points, which will fit across an HDTV screen about one sample per pixel.
% Create smoothed/regressed data using the model:
yFitted = coefficients(1) + coefficients(2) * exp(-((X - coefficients(3))/ coefficients(4)) .^2) + ...
coefficients(5) * exp(-((X - coefficients(6))/ coefficients(7)) .^2) + ...
coefficients(8) * exp(-((X - coefficients(9))/ coefficients(10)) .^2);
% Now we're done and we can plot the smooth model as a red line going through the noisy blue markers.
hold on;
plot(X, yFitted, 'r-', 'LineWidth', 2);
grid on;
title('Sum of Three Gaussians Fit with fitnlm()', 'FontSize', fontSize);
xlabel('X', 'FontSize', fontSize);
ylabel('Y', 'FontSize', fontSize);
legendHandle = legend('Noisy Y', 'Fitted Y', 'Location', 'northeast');
legendHandle.FontSize = 25;
% Set up figure properties:
% Enlarge figure to full screen.
set(gcf, 'Units', 'Normalized', 'OuterPosition', [0 0 1 1]);
% Get rid of tool bar and pulldown menus that are along top of figure.
% set(gcf, 'Toolbar', 'none', 'Menu', 'none');
% Give a name to the title bar.
set(gcf, 'Name', 'Demo by ImageAnalyst', 'NumberTitle', 'Off')
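
On the question of using the ACTUAL distribution from the histogram instead of an analytical formula: here is a minimal sketch of what that could look like in base MATLAB, with no toolbox required. The bin centers and percentages below are made-up placeholder values, not the numbers from the attached spreadsheet.

```matlab
% Hypothetical binned data (placeholders, NOT the poster's values).
binCenters = (1000:2000:19000)';                 % Center of each load bin.
pct = [2; 8; 15; 10; 12; 20; 18; 9; 4; 2];       % Percent of loads per bin (sums to 100).

% Turn percentages into probabilities and summarize the empirical distribution.
p = pct / sum(pct);
meanLoad = sum(binCenters .* p);                          % Probability-weighted mean.
stdLoad  = sqrt(sum(p .* (binCenters - meanLoad).^2));    % Probability-weighted std. dev.

% Empirical CDF and the bin that contains the median.
cdfEmp = cumsum(p);
cdfEmp(end) = 1;                                 % Guard against floating-point round-off.
medianBin = binCenters(find(cdfEmp >= 0.5, 1));

% Draw random samples from the histogram by inverse-CDF lookup.
rng(0);                                          % Reproducible random numbers.
u = rand(1000, 1);
idx = arrayfun(@(ui) find(cdfEmp >= ui, 1), u);
samples = binCenters(idx);

fprintf('Mean = %.1f, Std = %.1f, Median bin = %d\n', meanLoad, stdLoad, medianBin);
```

The samples take only the bin-center values, so this reproduces the histogram rather than a smooth curve. If a smooth model is needed, a fitted function like the one above is the better tool.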

1 Comment

This is genius and beautiful. Thank you very, very much for your amazing help.


More Answers (2)

dpb on 25 Sep 2018
Plotting the data, it definitely is not normal; it has a long RH tail and isn't symmetric.
For hypothesis testing it would be better to go back to the underlying data from which the histogram was made if you have it.
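
A rough sketch of such a test on raw observations, assuming the Statistics and Machine Learning Toolbox is installed. The raw vector here is synthetic (a made-up two-component mixture) and merely stands in for the real underlying data.

```matlab
% Synthetic stand-in for the raw observations (assumed values, not real traffic data).
rng(0);                                          % Reproducible random numbers.
raw = [6000 + 2000 * randn(3000, 1); ...
       13000 + 2500 * randn(7000, 1)];
raw = raw(raw > 0);                              % Loads cannot be negative.

% Lilliefors test: the null hypothesis is that the data are normal with
% unspecified mean and variance. h = 1 means the null is rejected at the 5% level.
[hNormal, pNormal] = lillietest(raw);

% Same test against the exponential family.
[hExp, pExp] = lillietest(raw, 'Distribution', 'exponential');

% Maximum-likelihood parameters of a single normal, for reference.
pd = fitdist(raw, 'Normal');

fprintf('Normal:      h = %d, p = %.3g\n', hNormal, pNormal);
fprintf('Exponential: h = %d, p = %.3g\n', hExp, pExp);
fprintf('Fitted normal: mu = %.1f, sigma = %.1f\n', pd.mu, pd.sigma);
```

lillietest may warn that the p-value lies outside its tabulated range; in that case the reported p is only a bound. With tens of thousands of rows, even small departures from normality will be rejected, so it is worth looking at a histogram or normplot alongside the test result.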

4 Comments

Yes, I did this: I went back to the original data and found it was normal.
Doesn't look normal to me. At least not Gaussian. Maybe that's a "normal" (i.e. typical) distribution for traffic volume, but it doesn't look very Gaussian.
Actually, when I went back to the original data (55000 rows), I found that it is normal.
By what measure? As IA says, it looks bimodal (if not trimodal; that's a kinda suspicious hump on the LH side of the central lobe) and the RH tail is definitely not consistent with Gaussian.
If the raw data look markedly different that would be surprising.
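
One way to put a number on "bimodal or not" with the raw data is to fit Gaussian mixtures with 1, 2, and 3 components and compare them. The following is only a sketch under assumptions: it needs the Statistics and Machine Learning Toolbox, and the raw vector is synthetic, standing in for the 55000 rows.

```matlab
% Synthetic stand-in for the raw observations (assumed values, not real traffic data).
rng(0);                                          % Reproducible random numbers.
raw = [6000 + 2000 * randn(3000, 1); ...
       13000 + 2500 * randn(7000, 1)];

% Fit mixtures with 1, 2, and 3 Gaussian components and record the BIC of each.
maxK = 3;
bic = zeros(1, maxK);
models = cell(1, maxK);
opts = statset('MaxIter', 1000);
for k = 1:maxK
    models{k} = fitgmdist(raw, k, 'Replicates', 5, 'Options', opts);
    bic(k) = models{k}.BIC;
end

% A lower BIC indicates a better trade-off between fit quality and complexity.
[~, bestK] = min(bic);
best = models{bestK};

fprintf('BIC for k = 1..%d: %s\n', maxK, mat2str(round(bic)));
fprintf('Preferred number of components: %d\n', bestK);
disp('Component means:');               disp(best.mu');
disp('Component standard deviations:'); disp(sqrt(squeeze(best.Sigma))');
disp('Component proportions:');         disp(best.ComponentProportion);
```

If a single Gaussian really described the raw data, the one-component model would be expected to have the lowest (or nearly the lowest) BIC.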


Since your data didn't look like one Gaussian to me, I fit it to the sum of two Gaussians with the attached m-file. I got this:
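
The attached m-file is not reproduced here. For anyone without the Statistics and Machine Learning Toolbox, a two-Gaussian least-squares fit can be sketched in base MATLAB with fminsearch. This is not the attached file: the data are synthetic, the x-axis is assumed to be in thousands to keep the parameters on similar scales, and all starting guesses are made up.

```matlab
% Sum of two Gaussians: p = [amp1, center1, width1, amp2, center2, width2].
twoGauss = @(p, x) p(1) * exp(-((x - p(2)) / p(3)).^2) + ...
                   p(4) * exp(-((x - p(5)) / p(6)).^2);

% Synthetic stand-in for the binned data (assumed values, x in thousands).
rng(1);                                          % Reproducible noise.
x = linspace(0, 40, 81)';
pTrue = [3, 6, 2, 18, 13, 2.5];
y = twoGauss(pTrue, x) + 0.2 * randn(size(x));

% Objective: sum of squared residuals between data and model.
sse = @(p) sum((y - twoGauss(p, x)).^2);

% Starting guess; just make your best guess from a plot of the data.
p0 = [2, 5, 1.5, 15, 12, 3];
opts = optimset('MaxFunEvals', 5000, 'MaxIter', 5000);
pFit = fminsearch(sse, p0, opts);

% Plot the data and the fitted curve on a finer grid.
xFine = linspace(min(x), max(x), 500)';
plot(x, y, 'b.', xFine, twoGauss(pFit, xFine), 'r-', 'LineWidth', 2);
grid on;
legend('Data', 'Two-Gaussian fit', 'Location', 'northeast');
fprintf('SSE at start: %.2f, SSE after fit: %.2f\n', sse(p0), sse(pFit));
```

fminsearch finds a local minimum only, so a poor starting guess can give a poor fit; it also returns no confidence intervals, which fitnlm does.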
