<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/264950</link>
    <title>MATLAB Central Newsreader - Age distributions, distance and shape</title>
    <description>Feed for thread: Age distributions, distance and shape</description>
    <language>en-us</language>
    <copyright>&#169; 1994-2012 by MathWorks, Inc.</copyright>
    <webmaster>webmaster@mathworks.com</webmaster>
    <generator>MATLAB Central Newsreader</generator>
    <docs>http://blogs.law.harvard.edu/tech/rss</docs>
    <ttl>60</ttl>
    <image>
      <title>MathWorks</title>
      <url>http://www.mathworks.com/images/membrane_icon.gif</url>
    </image>
    <item>
      <pubDate>Wed, 04 Nov 2009 10:31:02 -0500</pubDate>
      <title>Age distributions, distance and shape</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/264950#691986</link>
      <author>Oleg Komarov</author>
      <description>Dear all,&lt;br&gt;
I have age distributions by district for the entire population and for a sample.&lt;br&gt;
I want to compare the age distribution of the population with that of the sample.&lt;br&gt;
&lt;br&gt;
I can't do it graphically because I have 110 districts and my data covers 5 periods, which would require plotting more than 500 distribution comparisons.&lt;br&gt;
&lt;br&gt;
I should be able to somehow calculate a measure of distance between the two distributions and rank the districts from most to least different.&lt;br&gt;
If you have any advice on how to calculate this measure of distance, or any scientific reference on the matter (age distributions, distribution comparisons, etc.), please post it.&lt;br&gt;
&lt;br&gt;
The simplest way would be an area difference between the two distributions, but here comes a problem; please look at the picture in the link (the &quot;x-axis&quot; is age and the &quot;y-axis&quot; is the number of people):&lt;br&gt;
&lt;a href=&quot;http://i36.tinypic.com/2a9r22q.png&quot;&gt;http://i36.tinypic.com/2a9r22q.png&lt;/a&gt;&lt;br&gt;
In case &quot;A&quot; I would obtain an area difference significantly different from 0. The same would hold for case &quot;B&quot;. &lt;br&gt;
Suppose that case A and case B yield the same difference: what makes the two examples totally different is the shapes! &lt;br&gt;
&lt;br&gt;
So, how can I calculate the difference between the age distribution of the population and that of the sample while &quot;weighting&quot; the result for the shape of the distributions (with similar shapes reducing the final distance, making case A of the picture &quot;acceptable&quot;)?&lt;br&gt;
&lt;br&gt;
Any comment would be greatly appreciated!</description>
    </item>
    <item>
      <pubDate>Wed, 04 Nov 2009 13:21:02 -0500</pubDate>
      <title>Re: Age distributions, distance and shape</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/264950#692022</link>
      <author>Oleg Komarov</author>
      <description>I have a simpler question: how can I test whether one distribution is equal to another at a given significance level (let's say 5%)?</description>
    </item>
    <item>
      <pubDate>Wed, 04 Nov 2009 13:48:18 -0500</pubDate>
      <title>Re: Age distributions, distance and shape</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/264950#692034</link>
      <author>Peter Perkins</author>
      <description>Oleg Komarov wrote:&lt;br&gt;
&amp;gt; Dear all,&lt;br&gt;
&amp;gt; I have age distributions by district for the entire population and for a sample.&lt;br&gt;
&amp;gt; I want to compare the age distribution of the population with that of the sample.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; I can't do it graphically because I have 110 districts and my data covers 5 periods, which would require plotting more than 500 distribution comparisons.&lt;br&gt;
&lt;br&gt;
Are you comparing all 550 distributions against each other?  Or comparing sample to population in 550 cases?  Depending on what form you have these distributions in, you might consider using BOXPLOT, with a pair of boxes within each district/period stratum.  Boxplots are very easy to read, and you'll be able to visually pick out cases that are very different.&lt;br&gt;
&lt;br&gt;
&amp;gt; I should be able to somehow calculate a measure of distance between the two distributions and rank the districts from most to least different.&lt;br&gt;
&amp;gt; If you have any advice on how to calculate this measure of distance, or any scientific reference on the matter (age distributions, distribution comparisons, etc.), please post it.&lt;br&gt;
&lt;br&gt;
The simplest would be the Kolmogorov-Smirnov statistic, as in KSTEST or KSTEST2.  But blindly computing a statistic without first considering what exactly you are trying to detect is pointless.&lt;br&gt;
&lt;br&gt;
&amp;gt; The simplest way would be an area difference between the two distributions, but here comes a problem; please look at the picture in the link (the &quot;x-axis&quot; is age and the &quot;y-axis&quot; is the number of people):&lt;br&gt;
&amp;gt; &lt;a href=&quot;http://i36.tinypic.com/2a9r22q.png&quot;&gt;http://i36.tinypic.com/2a9r22q.png&lt;/a&gt;&lt;br&gt;
&lt;br&gt;
Case A shows two curves, one of which is entirely above the other.  Unless the tails do something very funny, they cannot both be probability density functions, because each must integrate to 1.&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
&amp;gt; In case &quot;A&quot; I would obtain an area difference significantly different from 0. The same would hold for case &quot;B&quot;. &lt;br&gt;
&amp;gt; Suppose that case A and case B yield the same difference: what makes the two examples totally different is the shapes! &lt;br&gt;
&amp;gt; So, how can I calculate the difference between the age distribution of the population and that of the sample while &quot;weighting&quot; the result for the shape of the distributions (with similar shapes reducing the final distance, making case A of the picture &quot;acceptable&quot;)?&lt;br&gt;
&lt;br&gt;
You need to decide what kind of differences you are looking for.  Then worry about a statistic.  You might be looking to detect a shape difference without worrying about location.  Or you might care only about location of the mode.  And so on.&lt;br&gt;
&lt;br&gt;
Hope this helps.</description>
    </item>
    <item>
      <pubDate>Wed, 04 Nov 2009 14:38:03 -0500</pubDate>
      <title>Re: Age distributions, distance and shape</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/264950#692054</link>
      <author>Oleg Komarov</author>
      <description>To Peter Perkins &amp;lt;Peter.Perkins@MathRemoveThisWorks.com&amp;gt;:&lt;br&gt;
Thanks!&lt;br&gt;
The picture was drawn carelessly, just as a quick example.&lt;br&gt;
I'm actually performing the Kolmogorov-Smirnov and Wilcoxon tests on each pair of distributions (population vs. sample, per district per year).</description>
    </item>
    <item>
      <pubDate>Wed, 04 Nov 2009 16:17:19 -0500</pubDate>
      <title>Re: Age distributions, distance and shape</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/264950#692100</link>
      <author>Oleg Komarov</author>
      <description>To Peter Perkins &amp;lt;Peter.Perkins@MathRemoveThisWorks.com&amp;gt;:&lt;br&gt;
I'm now facing the problem of practical versus statistical significance (I read a post where you discussed it).&lt;br&gt;
&lt;br&gt;
Since I am testing two large datasets with &quot;kstest2&quot; (e.g., for district &quot;x&quot; in year &quot;0&quot; the population totals 1,788,122 units, while the sample totals 693,058), I obtain a pValue of 0.&lt;br&gt;
&lt;br&gt;
My datasets are organized as follows:&lt;br&gt;
| age | # people |         &lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;18       20012&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;19       67238&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;20        etc...&lt;br&gt;
Both the population and sample datasets have the same discrete age range [18-100]; therefore,&lt;br&gt;
can I supply &quot;kstest2&quot; with the probabilities that each age appears in the dataset instead of the entire dataset of replicated ages?&lt;br&gt;
&lt;br&gt;
Thanks in advance</description>
    </item>
    <item>
      <pubDate>Wed, 04 Nov 2009 16:44:02 -0500</pubDate>
      <title>Re: Age distributions, distance and shape</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/264950#692108</link>
      <author>Matt Fetterman</author>
      <description>&quot;Oleg Komarov&quot; &amp;lt;oleg.komarov@hotmail.it&amp;gt; wrote in message &amp;lt;hcs9ef$bvl$1@fred.mathworks.com&amp;gt;...&lt;br&gt;
&amp;gt; To Peter Perkins &amp;lt;Peter.Perkins@MathRemoveThisWorks.com&amp;gt;:&lt;br&gt;
&amp;gt; I'm now facing the problem of practical versus statistical significance (I read a post where you discussed it).&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; Since I am testing two large datasets with &quot;kstest2&quot; (e.g., for district &quot;x&quot; in year &quot;0&quot; the population totals 1,788,122 units, while the sample totals 693,058), I obtain a pValue of 0.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; My datasets are organized as follows:&lt;br&gt;
&amp;gt; | age | # people |         &lt;br&gt;
&amp;gt;     18       20012&lt;br&gt;
&amp;gt;     19       67238&lt;br&gt;
&amp;gt;     20        etc...&lt;br&gt;
&amp;gt; Both the population and sample datasets have the same discrete age range [18-100]; therefore,&lt;br&gt;
&amp;gt; can I supply &quot;kstest2&quot; with the probabilities that each age appears in the dataset instead of the entire dataset of replicated ages?&lt;br&gt;
&amp;gt; &lt;br&gt;
&lt;br&gt;
Would it be useful to fit a Gaussian to each distribution and then compare the mean and width parameters?</description>
    </item>
    <item>
      <pubDate>Wed, 04 Nov 2009 17:02:03 -0500</pubDate>
      <title>Re: Age distributions, distance and shape</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/264950#692112</link>
      <author>Oleg Komarov</author>
      <description>&amp;gt; Would it be useful to fit a Gaussian to each distribution and then compare the mean and width parameters?&lt;br&gt;
It sounds like a weak procedure to me. Age data doesn't come from a Gaussian distribution... or am I wrong?&lt;br&gt;
&amp;nbsp;[H0, pValue] = jbtest(Pop1y2002);&lt;br&gt;
Warning: P is less than the smallest tabulated value, returning 0.001. &lt;br&gt;
&lt;br&gt;
Well, it seems Jarque-Bera fails too in large samples...</description>
    </item>
    <item>
      <pubDate>Wed, 04 Nov 2009 17:12:04 -0500</pubDate>
      <title>Re: Age distributions, distance and shape</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/264950#692117</link>
      <author>Oleg Komarov</author>
      <description>Please look at these plotted ECDFs: &lt;br&gt;
&lt;a href=&quot;http://i37.tinypic.com/mszkzm.png&quot;&gt;http://i37.tinypic.com/mszkzm.png&lt;/a&gt;&lt;br&gt;
There doesn't seem to me to be a huge difference between the two distributions. What do you think?</description>
    </item>
    <item>
      <pubDate>Thu, 05 Nov 2009 15:28:22 -0500</pubDate>
      <title>Re: Age distributions, distance and shape</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/264950#692436</link>
      <author>Peter Perkins</author>
      <description>Oleg Komarov wrote:&lt;br&gt;
&amp;gt; To Peter Perkins &amp;lt;Peter.Perkins@MathRemoveThisWorks.com&amp;gt;:&lt;br&gt;
&amp;gt; I'm now facing the problem of practical versus statistical significance (I read a post where you discussed it).&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; Since I am testing two large datasets with &quot;kstest2&quot; (e.g., for district &quot;x&quot; in year &quot;0&quot; the population totals 1,788,122 units, while the sample totals 693,058), I obtain a pValue of 0.&lt;br&gt;
&amp;gt; &lt;br&gt;
&amp;gt; My datasets are organized as follows:&lt;br&gt;
&amp;gt; | age | # people |         &lt;br&gt;
&amp;gt;     18       20012&lt;br&gt;
&amp;gt;     19       67238&lt;br&gt;
&amp;gt;     20        etc...&lt;br&gt;
&amp;gt; Both the population and sample datasets have the same discrete age range [18-100]; therefore,&lt;br&gt;
&amp;gt; can I supply &quot;kstest2&quot; with the probabilities that each age appears in the dataset instead of the entire dataset of replicated ages?&lt;br&gt;
&lt;br&gt;
You can do that sort of thing with KSTEST, but not KSTEST2.  It would be easy enough to write such code yourself using KSTEST2 as a starting point.&lt;br&gt;
&lt;br&gt;
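A minimal sketch of that idea, not code from this thread: when both tables share the same age grid, the two-sample K-S statistic can be read directly off the cumulative counts, with no need to replicate ages. The counts below are made up for illustration, and all variable names are assumptions.&lt;br&gt;
&lt;br&gt;
```matlab
% Hypothetical counts per age on a shared age grid (not data from the thread).
ages      = (18:22)';
popCount  = [10; 20; 30; 20; 20];   % population: number of people at each age
sampCount = [20; 20; 20; 20; 20];   % sample: number of people at each age

% Empirical CDFs evaluated just after the jump at each age.
Fpop  = cumsum(popCount)  / sum(popCount);
Fsamp = cumsum(sampCount) / sum(sampCount);

% Two-sample K-S statistic: largest vertical gap between the two ECDFs.
% Both ECDFs only jump at the listed ages, so checking the grid is enough.
[D, idx] = max(abs(Fpop - Fsamp));
fprintf('K-S statistic %.3f, largest gap at age %d\n', D, ages(idx));
```
&lt;br&gt;
Only the statistic is computed here. Sorting the districts by D would give a most-to-least-different ranking, but any p-value would still depend on the two sample sizes.&lt;br&gt;
&lt;br&gt;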
Two things:&lt;br&gt;
&lt;br&gt;
1) The K-S test is intended (as mentioned in the help) for continuous distributions.  You will have to decide if your age distribution, defined on the discrete set of integers, is &quot;continuous enough&quot;.&lt;br&gt;
&lt;br&gt;
2) You should reread my previous post.&lt;br&gt;
&lt;br&gt;
&lt;br&gt;
If you're using the K-S test just as a way to compute the statistic to find candidates for pairs that are different, then OK.  But it sounds like you're planning on taking the K-S p-values quite seriously.&lt;br&gt;
&lt;br&gt;
If you have paired samples (what you described as the &quot;population&quot; and the &quot;sample&quot;), and your goal is to see if each smaller sample could reasonably be thought of as a subsample of the larger population, you might compute Monte-Carlo p-values:&lt;br&gt;
&lt;br&gt;
1) draw a subsample from the larger population, and compute the K-S stat.&lt;br&gt;
2) do that, say, a thousand times&lt;br&gt;
3) compute the K-S stat for the actual subsample.&lt;br&gt;
4) compare (3) to the 1000 values from (2)&lt;br&gt;
&lt;br&gt;
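Steps 1 to 4 could be sketched roughly as follows. This is an illustration under assumptions, not code from this thread: the counts are made up, the subsample is drawn without replacement, each K-S statistic is taken against the population ECDF, and the variable names are hypothetical.&lt;br&gt;
&lt;br&gt;
```matlab
% Hypothetical counts per age, laid out like the | age | # people | table.
popCount  = [400; 300; 200; 100];   % larger "population"
sampCount = [ 30;  30;  25;  15];   % observed "sample"
nAges = numel(popCount);
nSamp = sum(sampCount);
nRep  = 1000;

% One age index per person in the population (repelem needs R2015a or later).
popIdx = repelem((1:nAges)', popCount);
Fpop   = cumsum(popCount) / sum(popCount);

% K-S statistic between a column of counts c and the population ECDF.
ksStat = @(c) max(abs(cumsum(c) / sum(c) - Fpop));

% Steps 1-2: draw subsamples of the same size, record the statistic each time.
Dsim = zeros(nRep, 1);
for r = 1:nRep
    pick = popIdx(randperm(numel(popIdx), nSamp));   % without replacement
    c    = accumarray(pick(:), 1, [nAges 1]);        % back to counts per age
    Dsim(r) = ksStat(c);
end

% Step 3: statistic for the actual sample.
Dobs = ksStat(sampCount(:));

% Step 4: share of simulated statistics at least as large as the observed one.
pMC = mean(ge(Dsim, Dobs));
fprintf('Observed D = %.4f, Monte Carlo p-value = %.3f\n', Dobs, pMC);
```
&lt;br&gt;
A small pMC would suggest the observed sample differs from the population more than a random subsample of that size typically does.&lt;br&gt;
&lt;br&gt;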
But whether or not this is really appropriate for what you're doing I can't say.  Hope it helps, but you're on your own.</description>
    </item>
    <item>
      <pubDate>Fri, 06 Nov 2009 10:03:03 -0500</pubDate>
      <title>Re: Age distributions, distance and shape</title>
      <link>http://www.mathworks.com/matlabcentral/newsreader/view_thread/264950#692637</link>
      <author>Oleg Komarov</author>
      <description>Thanks to everybody, especially to Peter.&lt;br&gt;
&lt;br&gt;
TO: Peter Perkins.&lt;br&gt;
I tried the boxplots and the paired histograms, and I concluded that the Monte Carlo simulation with subsampling (or rewriting kstest2 to accept a cdf-type input, which TMW may consider in a future release :) ) won't be worth the effort for my specific case, since 99% of the pairs (population age distribution vs. sample age distribution) confirm that the sample isn't drawn from the population. &lt;br&gt;
Thank you again for the very useful insights!&lt;br&gt;
&lt;br&gt;
Oleg</description>
    </item>
  </channel>
</rss>

