|
Peter Boettcher <boettcher@ll.mit.edu> wrote in message <muy63k8s6d3.fsf@G99-Boettcher.llan.ll.mit.edu>...
> Rune Allnor <allnor@tele.ntnu.no> writes:
>
> > On 21 Jan, 10:07, "Theodor Zouk" <reb...@hotmail.com> wrote:
> >> Hi I have very large amount of data (reell values of class double)
> >> that is registrered over the chronological appearence and saved in
> >> many .mat files. I want to sort this data in a ascending order. Now,
> >> SORT do this, but it can only do it per each .mat-file. The ideal
> >> would be if all the .mat files could be saved into one .mat file and
> >> then loaded into workspace and then sorted. But this can not be done
> >> when there is not enough memory. Any tips how i should 'loop' through
> >> all the mat files and then at the end have many mat files where the
> >> maximum value in one mat file is lower than the minimum value of the
> >> next consecutive mat file.
> >
> > Two ideas come to mind:
> >
> > 1) Loop through each file and build a *total* histogram
> > of the values that appear in the total data set.
> > Use this histogram to separate the data set by value
> > into intermediate files. Sort these files, and merge.
> > 2) Find a copy of Knut's "The art of Computer Programming:
> > Sorting and Searching" and read.
>
> Adding a few thoughts:
>
> Do you absolutely need a full sort? Could you stop at a histogram like
> Rune suggests?
>
> Do you have a priori knowledge of the range of the values, or even
> better the approximate distribution? If you know the distribution, you
> can skip the empirical histogram step and just toss all the values into
> bins that you determine from your known distribution.
>
> I would use raw binary files for the scratch space. There will be much
> less overhead than MAT files, and seemless appending of new values.
>
>
> % Skip this if you can say something about your distribution
> for n=1:num_files
> % read file n
> % build histogram, accumulate into "running" histogram. bins must
> % match
> end
>
> % choose number M of scratch files, set "edges" that separate files
> % equally
> % (see "histogram equalization", an image processing technique, for
> % help here)
>
> for m=1:M
> % fopen scratch files
> end
>
> for n=1:num_files
> % read file n
> % separate into file bins (use histc)
> % append data sets to all M scratch files
> end
>
> for m=1:M
> % rewind and read full file m
> % sort
> % output
> % close file m
> end
>
>
>
> -Peter
Thank you all for the tips and information!
Answers to some question:
> Do you absolutely need a full sort?
yes I do.
> Do you have a priori knowledge of the range of the values, or even
> better the approximate distribution?
Yes and no. I can get the min and max value for the entire data. I assume the distribution is unknown. Though it can be approximated (but it takes time).
My solution is this:
Instead of saving the data to many .mat files Im saving them into one binary file (FOPEN,FWRITE).
Im using MEMMAPFILE to be able to index this large file (~5GB). I will then take the median from the data range/intervall and set it as an threshold, loop through the data and save the sorted data into two new files. Now, I now for instance that SORT function can be applied through MEMMAPFILE function at an certain size of x number of MB of data (depending on the platform). So when making these new files Im also testing them if they are less or equal of x MB. if so the file is sorted. If not, the file is searched for its min and max value and a new threshold is set and two new files are made and tested and so on.
I dont know if it is faster to first approximate the distribution and then select the threshold(s)?
Thanks again for your tips!
Best regards
Theo
|