Thread Subject: Age distributions, distance and shape

Subject: Age distributions, distance and shape

From: Oleg Komarov

Date: 4 Nov, 2009 10:31:02

Message: 1 of 10

Dear all,
I have age distributions by district for the entire population and for a sample.
I want to compare the population's age distribution with the sample's.

I can't do it by graphical representation because I have 110 districts and my data covers 5 periods which would require plotting more than 500 distribution comparisons.

I would like to calculate a measure of distance between the two distributions and rank the districts from most to least different.
If you have any advice on how to calculate this measure of distance, or any scientific reference on the matter (age distributions, distribution comparisons, etc.), please post it.

The simplest way would be an area difference between the two distributions, but here comes a problem. Please look at the picture in the link (the x-axis is age and the y-axis is the number of people):
http://i36.tinypic.com/2a9r22q.png
In case "A" I would obtain an area difference significantly different from 0. The same would hold for case "B".
Suppose that case A and case B yield the same difference: what makes the two examples totally different is the shapes!

So, how can I calculate the difference between the age distribution of the population and that of the sample while "weighting" the result for the shape of the distributions (with similar shapes reducing the final distance, making case A of the picture "acceptable")?

Any comment would be greatly appreciated!

Subject: Age distributions, distance and shape

From: Oleg Komarov

Date: 4 Nov, 2009 13:21:02

Message: 2 of 10

I have a simpler question: how can I test whether one distribution is equal to another at a given significance level (let's say 5%)?

Subject: Age distributions, distance and shape

From: Peter Perkins

Date: 4 Nov, 2009 13:48:18

Message: 3 of 10

Oleg Komarov wrote:
> Dear all,
> i have age distributions by district of the entire population and of a sample.
> I want to compare the age distribution of the population with the sample one.
>
> I can't do it by graphical representation because I have 110 districts and my data covers 5 periods which would require plotting more than 500 distribution comparisons.

Are you comparing all 550 dist'ns against each other? Or comparing sample to population in 550 cases? Depending on what form you have these distributions in, you might consider using BOXPLOT, with a pair of boxes within each district/period stratum. Boxplots are very easy to read, and you'll be able to visually pick out cases that are very different.

> I should be able to calculate somehow a measure of distance between the two distributions and rank each district from most-to-least different.
> If you have any advice on how to calculate this measure of distance, any scientific reference on the matter (ages distributions, distributions comparisons etc...) plz post it.

The simplest would be the Kolmogorov-Smirnov statistic, as in KSTEST or KSTEST2. But blindly computing a statistic without first considering what exactly you are trying to detect is pointless.
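As a rough illustration of using the K-S statistic purely as a ranking distance, the sketch below works directly on tabulated counts. It assumes every district is tabulated on the same age grid; the counts are made up, and the variable names (`popCounts`, `smpCounts`, `ksDist`) are hypothetical. It needs no toolbox functions and does not compute p-values.

```matlab
% Hypothetical data: rows are ages 18..22, columns are three districts.
% All counts are invented for illustration.
popCounts = [200 120  90;
             300 180 110;
             250 160 100;
             150 100  60;
             100  40  40];
smpCounts = [ 30  12  20;
              20  18   5;
              25  16   5;
              15  10   5;
              10   4   5];

% Empirical CDF per district: cumulative counts divided by column totals.
% bsxfun keeps this usable on releases without implicit expansion.
popCdf = bsxfun(@rdivide, cumsum(popCounts, 1), sum(popCounts, 1));
smpCdf = bsxfun(@rdivide, cumsum(smpCounts, 1), sum(smpCounts, 1));

% Two-sample K-S statistic: largest vertical gap between the two ECDFs.
ksDist = max(abs(popCdf - smpCdf), [], 1);

% Rank districts from most to least different.
[sortedDist, districtOrder] = sort(ksDist, 'descend');
disp([districtOrder(:), sortedDist(:)]);
```

The ranking only says which pairs differ most by this one measure. Whether the largest ECDF gap is the kind of difference that matters is a separate question.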

> The simplest way would be an area difference between the population distributions but here comes a problem, plz look at picture in the link (On the "x-axis" is the age and on the "y-axis" is the number of people):
> http://i36.tinypic.com/2a9r22q.png

Case A shows two curves, one of which is entirely above the other. Unless the tails do something very funny, one of those cannot be a probability density function, because they both must integrate to 1.
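To make the normalization point concrete, here is a small sketch with made-up counts on a shared age grid. It shows that an area difference between raw count curves mostly reflects the different totals, whereas the same computation on proportions compares shapes. The halved absolute difference is usually called the total variation distance; that name and all variable names are additions for this illustration.

```matlab
% Hypothetical count tables on the same age grid (values invented).
ages      = (18:22)';
popCounts = [200; 300; 250; 150; 100];
smpCounts = [ 30;  20;  25;  15;  10];

% Convert counts to proportions so that each curve sums to 1.
popProb = popCounts / sum(popCounts);
smpProb = smpCounts / sum(smpCounts);

% Area between the raw count curves: dominated by the difference in totals.
rawArea = sum(abs(popCounts - smpCounts));

% Area between the normalized curves, halved so the result lies in [0, 1].
tvDist = 0.5 * sum(abs(popProb - smpProb));

fprintf('raw area difference: %d, normalized distance: %.3f\n', rawArea, tvDist);
```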


> In the case "A" I would obtain an area difference, significantly different from 0. The same would be for the case "B".
> Suppose that case A and case B would yield the same difference, the thing that makes the two examples totally different are the shapes!
> So, how can i calculate the difference between the age distribution of the population against that of the sample "weighting" the result for the shape of the distributions (with similar shapes reducing the final distance making the case A of the picture "acceptable")

You need to decide what kind of differences you are looking for. Then worry about a statistic. You might be looking to detect a shape difference without worrying about location. Or you might care only about location of the mode. And so on.

Hope this helps.

Subject: Age distributions, distance and shape

From: Oleg Komarov

Date: 4 Nov, 2009 14:38:03

Message: 4 of 10

To Peter Perkins:
Thanks!
The picture was made carelessly, in a draw-an-example fashion.
I'm actually performing the Kolmogorov-Smirnov and the Wilcoxon tests on each pair of distributions (population vs sample, per district per year).

Subject: Age distributions, distance and shape

From: Oleg Komarov

Date: 4 Nov, 2009 16:17:19

Message: 5 of 10

To Peter Perkins:
I'm now facing the problem of practical versus statistical significance (I read a post where you discussed it).

Since I run "kstest2" on two datasets that are large (e.g. for district "x" in year "0" the population totals 1,788,122 units, while the sample totals 693,058), I obtain a pValue of 0.

My datasets are organized as follows:
| age | # people |
| 18  | 20012    |
| 19  | 67238    |
| 20  | etc...   |
Both the population and the sample datasets cover the same discrete age range [18-100]. Therefore,
can I supply "kstest2" with the probabilities that a certain age appears in the dataset instead of the entire dataset of replicated ages?

Thanks in advance

Subject: Age distributions, distance and shape

From: Matt Fetterman

Date: 4 Nov, 2009 16:44:02

Message: 6 of 10

"Oleg Komarov" <oleg.komarov@hotmail.it> wrote in message <hcs9ef$bvl$1@fred.mathworks.com>...
> TO: Peter Perkins <Peter.Perkins@MathRemoveThisWorks.com> wrote in message
> I'm facing now the problem which arises from the practical and statistical significance (I read a post where you discussed about it).
>
> Since i test with the "kstest2" two dataset which are large (ex: for district "x" on year "0" population totals 1,788,122 units, while sample totals 693,058) i obtain a pValue of 0.
>
> my datasets are organized as follows:
> | age | # people |
> 18 20012
> 19 67238
> 20 etc...
> Both population and sample dataset have the same discrete range of age [18-100], therefore
> can i supply the "kstest2" with the probabilities that a certain age appears in the dataset instead of the entire dataset of replicated ages?
>

Would it be useful to fit a Gaussian to each distribution and then compare the mean and width parameters?
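For what it's worth, the mean and width can be read off a count table without replicating the ages. The sketch below uses made-up counts and hypothetical helper names (`wMean`, `wStd`). It computes plain moments only and makes no claim that the data are Gaussian.

```matlab
% Hypothetical count tables on the same age grid (values invented).
ages      = (18:22)';
popCounts = [200; 300; 250; 150; 100];
smpCounts = [ 30;  20;  25;  15;  10];

% Count-weighted mean and standard deviation (dividing by N, not N-1).
wMean = @(a, c) sum(a .* c) / sum(c);
wStd  = @(a, c) sqrt(sum(c .* (a - wMean(a, c)).^2) / sum(c));

popMean = wMean(ages, popCounts);
smpMean = wMean(ages, smpCounts);
popStd  = wStd(ages, popCounts);
smpStd  = wStd(ages, smpCounts);

fprintf('population: mean %.2f, std %.3f\n', popMean, popStd);
fprintf('sample:     mean %.2f, std %.3f\n', smpMean, smpStd);
```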

Subject: Age distributions, distance and shape

From: Oleg Komarov

Date: 4 Nov, 2009 17:02:03

Message: 7 of 10

> Would it be useful to fit to a Gaussian and then you could compare the mean and the width parameters ?
That sounds like a weak procedure to me. Age data doesn't come from a Gaussian distribution... or am I wrong?
 [H0, pValue] = jbtest(Pop1y2002);
Warning: P is less than the smallest tabulated value, returning 0.001.

Well, it seems Jarque-Bera fails too in large samples...

Subject: Age distributions, distance and shape

From: Oleg Komarov

Date: 4 Nov, 2009 17:12:04

Message: 8 of 10

Please look at these plotted ECDFs:
http://i37.tinypic.com/mszkzm.png
There doesn't seem to me to be a huge difference between the two distributions. What do you think?

Subject: Age distributions, distance and shape

From: Peter Perkins

Date: 5 Nov, 2009 15:28:22

Message: 9 of 10

Oleg Komarov wrote:
> TO: Peter Perkins <Peter.Perkins@MathRemoveThisWorks.com> wrote in message
> I'm facing now the problem which arises from the practical and statistical significance (I read a post where you discussed about it).
>
> Since i test with the "kstest2" two dataset which are large (ex: for district "x" on year "0" population totals 1,788,122 units, while sample totals 693,058) i obtain a pValue of 0.
>
> my datasets are organized as follows:
> | age | # people |
> 18 20012
> 19 67238
> 20 etc...
> Both population and sample dataset have the same discrete range of age [18-100], therefore
> can i supply the "kstest2" with the probabilities that a certain age appears in the dataset instead of the entire dataset of replicated ages?

You can do that sort of thing with KSTEST, but not KSTEST2. It would be easy enough to write such code yourself using KSTEST2 as a starting point.
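A rough sketch of what a count-based two-sample statistic might look like, assuming both tables share the same age grid. The counts are made up and `ksStat`, `ksLambda` and `ksPval` are hypothetical helper names. The p-value uses the textbook asymptotic Kolmogorov series for continuous data; it is not a reproduction of KSTEST2's internals and is unreliable when the statistic is close to zero.

```matlab
% Hypothetical count tables on the same age grid (values invented).
popCounts = [200; 300; 250; 150; 100];
smpCounts = [ 30;  20;  25;  15;  10];

% K-S statistic from counts: largest gap between the two ECDFs.
ksStat = @(c1, c2) max(abs(cumsum(c1(:)) / sum(c1) - cumsum(c2(:)) / sum(c2)));

% Asymptotic p-value: 2 * sum_{j>=1} (-1)^(j-1) * exp(-2 * j^2 * lambda^2),
% with lambda = sqrt(n1*n2/(n1+n2)) * D, truncated at 100 terms and
% clamped to [0, 1].
j = (1:100)';
ksLambda = @(D, n1, n2) sqrt(n1 * n2 / (n1 + n2)) * D;
ksPval   = @(D, n1, n2) min(1, max(0, ...
    2 * sum((-1).^(j - 1) .* exp(-2 * (j * ksLambda(D, n1, n2)).^2))));

% Same proportions at two scales: the statistic stays put, the p-value does not.
dSmall = ksStat(popCounts, smpCounts);
pSmall = ksPval(dSmall, sum(popCounts), sum(smpCounts));

dLarge = ksStat(1000 * popCounts, 1000 * smpCounts);
pLarge = ksPval(dLarge, 1000 * sum(popCounts), 1000 * sum(smpCounts));

fprintf('small: D = %.3f, p = %.4g\n', dSmall, pSmall);
fprintf('large: D = %.3f, p = %.4g\n', dLarge, pLarge);
```

With totals in the hundreds of thousands, even a modest gap between the ECDFs drives this approximate p-value to zero, which is the practical-versus-statistical-significance issue raised earlier in the thread.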

Two things:

1) The K-S test is intended (as mentioned in the help) for continuous distributions. You will have to decide if your age distribution, defined on the discrete set of integers, is "continuous enough".

2) You should reread my previous post.


If you're using the K-S test just as a way to compute the statistic to find candidates for pairs that are different, then OK. But it sounds like you're planning on taking the K-S p-values quite seriously.

If you have paired samples (what you described as the "population" and the "sample"), and your goal is to see if each smaller sample could reasonably be thought of as a subsample of the larger population, you might compute Monte-Carlo p-values:

1) draw a subsample from the larger population, and compute the K-S stat.
2) do that, say, a thousand times
3) compute the K-S stat for the actual subsample.
4) compare (3) to the 1000 values from (2)
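The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the counts are made up, subsamples are drawn without replacement by expanding the count table to one entry per person, and the p-value uses the common (1 + exceedances) / (1 + replicates) convention, which is not spelled out in the steps. All variable names are hypothetical; `repelem` requires R2015a or later.

```matlab
rng(0);   % fixed seed so the illustration is repeatable

% Hypothetical count tables on the same age grid (values invented).
ages      = (18:22)';
popCounts = [200; 300; 250; 150; 100];
smpCounts = [ 30;  20;  25;  15;  10];
nSample   = sum(smpCounts);
nRep      = 1000;

% K-S statistic from counts: largest gap between the two ECDFs.
ksStat = @(c1, c2) max(abs(cumsum(c1) / sum(c1) - cumsum(c2) / sum(c2)));

% Expand the population table to one age per person.
popAges = repelem(ages, popCounts);

% Steps 1 and 2: draw subsamples of the same size as the real sample,
% tabulate them on the age grid, and record the K-S statistic each time.
dSim = zeros(nRep, 1);
for r = 1:nRep
    pick      = randperm(numel(popAges), nSample);
    simCounts = accumarray(popAges(pick(:)) - ages(1) + 1, 1, [numel(ages), 1]);
    dSim(r)   = ksStat(popCounts, simCounts);
end

% Step 3: statistic for the actual sample.
dObs = ksStat(popCounts, smpCounts);

% Step 4: share of simulated statistics at least as large as the observed one.
pMC = (1 + sum(dSim >= dObs)) / (nRep + 1);
fprintf('observed D = %.3f, Monte-Carlo p = %.4f\n', dObs, pMC);
```

For populations in the millions the expanded vector is still manageable, but the loop cost grows with the number of district and period pairs.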

But whether or not this is really appropriate for what you're doing I can't say. Hope it helps, but you're on your own.

Subject: Age distributions, distance and shape

From: Oleg Komarov

Date: 6 Nov, 2009 10:03:03

Message: 10 of 10

Thanks to everybody, especially to Peter.

TO: Peter Perkins.
I tried the boxplots and the paired histograms, and I concluded that the Monte Carlo simulation with subsampling (or rewriting KSTEST2 to accept a cdf-type input, which TMW may consider in a future release :) ) won't be worth the effort for my specific case, since 99% of the pairs (population age distribution vs sample age distribution) confirm that the sample isn't drawn from the population.
Thank you again for the very useful insights!

Oleg
