Path: news.mathworks.com!not-for-mail
From: "Theodor Zouk" <rebet4@hotmail.com>
Newsgroups: comp.soft-sys.matlab
Subject: Re: sort large amount of data
Date: Fri, 23 Jan 2009 09:55:04 +0000 (UTC)
Organization: Univ of Umea
Lines: 94
Message-ID: <glc45o$sff$1@fred.mathworks.com>
References: <gl6ojl$1ag$1@fred.mathworks.com> <muy63k8s6d3.fsf@G99-Boettcher.llan.ll.mit.edu>
Reply-To: "Theodor Zouk" <rebet4@hotmail.com>
NNTP-Posting-Host: webapp-02-blr.mathworks.com
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 8bit
X-Trace: fred.mathworks.com 1232704504 29167 172.30.248.37 (23 Jan 2009 09:55:04 GMT)
X-Complaints-To: news@mathworks.com
NNTP-Posting-Date: Fri, 23 Jan 2009 09:55:04 +0000 (UTC)
X-Newsreader: MATLAB Central Newsreader 1540208
Xref: news.mathworks.com comp.soft-sys.matlab:513358


Peter Boettcher <boettcher@ll.mit.edu> wrote in message <muy63k8s6d3.fsf@G99-Boettcher.llan.ll.mit.edu>...
> Rune Allnor <allnor@tele.ntnu.no> writes:
> 
> > On 21 Jan, 10:07, "Theodor Zouk" <reb...@hotmail.com> wrote:
> >> Hi I have very large amount of data (reell values of class double)
> >> that is registrered over the chronological appearence and saved in
> >> many .mat files. I want to sort this data in a ascending order. Now,
> >> SORT do this, but it can only do it per each .mat-file. The ideal
> >> would be if all the .mat files could be saved into one .mat file and
> >> then loaded into workspace and then sorted. But this can not be done
> >> when there is not enough memory. Any tips how i should 'loop' through
> >> all the mat files and then at the end have many mat files where the
> >> maximum value in one mat file is lower than the minimum value of the
> >> next consecutive mat file.
> >
> > Two ideas come to mind:
> >
> > 1) Loop through each file and build a *total* histogram
> >    of the values that appear in the total data set.
> >    Use this histogram to separate the data set by value
> >    into intermediate files. Sort these files, and merge.
> > 2) Find a copy of Knut's "The art of Computer Programming:
> >    Sorting and Searching" and read.
> 
> Adding a few thoughts:
> 
> Do you absolutely need a full sort?  Could you stop at a histogram like
> Rune suggests?
> 
> Do you have a priori knowledge of the range of the values, or even
> better the approximate distribution?  If you know the distribution, you
> can skip the empirical histogram step and just toss all the values into
> bins that you determine from your known distribution.
> 
> I would use raw binary files for the scratch space.  There will be much
> less overhead than MAT files, and seemless appending of new values.
> 
> 
> % Skip this if you can say something about your distribution
> for n=1:num_files
>   % read file n
>   % build histogram, accumulate into "running" histogram.  bins must
>   %  match
> end
> 
> % choose number M of scratch files, set "edges" that separate files
> %  equally
> %  (see "histogram equalization", an image processing technique, for
> %  help here)
> 
> for m=1:M
>   % fopen scratch files
> end
> 
> for n=1:num_files
>   % read file n
>   % separate into file bins (use histc)
>   % append data sets to all M scratch files
> end
> 
> for m=1:M
>   % rewind and read full file m
>   % sort
>   % output
>   % close file m
> end
> 
> 
> 
> -Peter

Thank you all for the tips and information!

Answers to some question:

> Do you absolutely need a full sort?
yes I do.

> Do you have a priori knowledge of the range of the values, or even
> better the approximate distribution? 
Yes and no. I can get the min and max value for the entire data. I assume the distribution is unknown. Though it can be approximated (but it takes time).

My solution is this:

Instead of saving the data to many .mat files Im saving them into one binary file (FOPEN,FWRITE). 
Im using MEMMAPFILE to be able to index this large file (~5GB). I will then take the median from the data range/intervall and set it as an threshold, loop through the data and save the sorted data into two new files. Now, I now for instance that SORT  function can be applied through MEMMAPFILE function at an certain size of x number of MB of data (depending on the platform). So when making these new files Im also testing them if they are less or equal of x MB. if so the file is sorted. If not, the file is searched for its min and max value and a new threshold is set and two new files are made and tested and so on.

I dont know if it is faster to first approximate the distribution and then select the threshold(s)? 

Thanks again for your tips!
Best regards
Theo