http://www.mathworks.com/matlabcentral/newsreader/view_thread/242806
MATLAB Central Newsreader: sort large amount of data
©1994–2015 by MathWorks, Inc.

Wed, 21 Jan 2009 09:07:01 +0000
sort large amount of data
http://www.mathworks.com/matlabcentral/newsreader/view_thread/242806#622896
Theodor Zouk
Hi <br>
I have a very large amount of data (real values of class double) that is registered in chronological order of appearance and saved in many .mat files. I want to sort this data in ascending order. SORT can do this, but only within each .mat file. Ideally, all the .mat files could be combined into one .mat file, loaded into the workspace, and sorted, but that cannot be done when there is not enough memory. Any tips on how I should loop through all the .mat files so that at the end I have many .mat files where the maximum value in one file is lower than the minimum value of the next consecutive file?<br>
<br>
Best regards <br>
Theo

Wed, 21 Jan 2009 10:27:10 +0000
Re: sort large amount of data
http://www.mathworks.com/matlabcentral/newsreader/view_thread/242806#622908
Rune Allnor
On 21 Jan, 10:07, "Theodor Zouk" <reb...@hotmail.com> wrote:<br>
> Hi<br>
> I have a very large amount of data (real values of class double) that is<br>
> registered in chronological order of appearance and saved in many .mat<br>
> files. I want to sort this data in ascending order. SORT can do this, but<br>
> only within each .mat file. Ideally, all the .mat files could be combined<br>
> into one .mat file, loaded into the workspace, and sorted, but that cannot<br>
> be done when there is not enough memory. Any tips on how I should loop<br>
> through all the .mat files so that at the end I have many .mat files where<br>
> the maximum value in one file is lower than the minimum value of the next?<br>
<br>
Two ideas come to mind:<br>
<br>
1) Loop through each file and build a *total* histogram<br>
of the values that appear in the total data set.<br>
Use this histogram to separate the data set by value<br>
into intermediate files. Sort these files, and merge.<br>
2) Find a copy of Knuth's "The Art of Computer Programming,<br>
Vol. 3: Sorting and Searching" and read it.<br>
<br>
Rune

Wed, 21 Jan 2009 18:23:02 +0000
Re: sort large amount of data
http://www.mathworks.com/matlabcentral/newsreader/view_thread/242806#623035
Roger Stafford
"Theodor Zouk" <rebet4@hotmail.com> wrote in message <gl6ojl$1ag$1@fred.mathworks.com>...<br>
> Hi <br>
> I have very large amount of data (reell values of class double) that is registrered over the chronological appearence and saved in many .mat files. I want to sort this data in a ascending order. Now, SORT do this, but it can only do it per each .matfile. The ideal would be if all the .mat files could be saved into one .mat file and then loaded into workspace and then sorted. But this can not be done when there is not enough memory. Any tips how i should 'loop' through all the mat files and then at the end have many mat files where the maximum value in one mat file is lower than the minimum value of the next consecutive mat file.<br>
> <br>
> Best regards <br>
> Theo<br>
<br>
In merge sorting, for example, the basic step is a comparison between the next elements of two arrays to see which is smaller; the smaller one is selected for insertion into another array and removed (in a sense) from its original array. Assuming the two original arrays were already sorted, the result is a single combined array that is sorted. Performed repeatedly, starting from single elements and working upward, this yields one entire large sorted array. It is an efficient O(n*log2(n)) algorithm.<br>
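The basic merge step can be written out concretely; the vectors a and b below are small illustrative inputs, not Theo's data:<br>

```matlab
% Merge two already-sorted vectors a and b into one sorted vector c.
a = [1 4 7];  b = [2 3 9];            % small illustrative inputs
c = zeros(1, numel(a) + numel(b));
i = 1;  j = 1;
for k = 1:numel(c)
    % take from a when b is exhausted or a's next element is smaller
    if j > numel(b) || (i <= numel(a) && a(i) <= b(j))
        c(k) = a(i);  i = i + 1;
    else
        c(k) = b(j);  j = j + 1;
    end
end
% c is now [1 2 3 4 7 9]
```

Each element is touched once per merge pass, which is what makes the overall method O(n*log2(n)).<br>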
<br>
It seems to me that you can emulate such a procedure with your .mat files considered as analogous to individual elements within "superarrays" which are not stored in memory simultaneously. Starting with all the .mat files sorted within themselves, you can bring in two sorted .mat files and merge them into two equally large .mat files which are mutually sorted, in the sense that each element of one is less than or equal to each element of the other. Now you have the analogy to two sorted elements. By continuing this you eventually have all the .mat files paired off into mutually sorted pairs. The next stage is to do the same kind of "supermerging" to obtain sets of four .mat files which are mutually sorted, always taking care not to have more than four .mat files in computer memory at any one time. Eventually all the .mat files together will represent one entire sorted array, though of course it will be too large to be stored in memory.<br>
<br>
The sizes of these .mat files are dictated by the requirement that no more than four of them will ever be allowed to reside in memory at the same time. Obviously there is the extra overhead of reading and writing numerous .mat files, but hopefully this will be no more time-consuming, and maybe a lot less, than internally sorting very long .mat file pairs.<br>
<br>
I hope all that makes sense to you, Theo. I can envision it more easily than I can explain it.<br>
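One round of this pairwise "supermerging" might be sketched as follows; the file names A.mat and B.mat and the variable name x inside them are assumptions, not Theo's actual layout:<br>

```matlab
% Supermerge: A.mat and B.mat each hold an internally sorted vector x.
% Afterwards both are still sorted internally, and every element of
% A.mat is <= every element of B.mat.
S = load('A.mat');  a = S.x(:);       % assumed variable name 'x'
S = load('B.mat');  b = S.x(:);
merged = sort([a; b]);                % at most two files in memory
half = numel(a);                      % keep the original file sizes
x = merged(1:half);      save('A.mat', 'x');
x = merged(half+1:end);  save('B.mat', 'x');
```

Using sort on the concatenation is simpler than a true linear merge at the cost of an extra log factor; for file-sized chunks the disk I/O will likely dominate anyway.<br>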
<br>
Roger Stafford

Wed, 21 Jan 2009 21:21:12 +0000
Re: sort large amount of data
http://www.mathworks.com/matlabcentral/newsreader/view_thread/242806#623082
Peter Boettcher
Rune Allnor <allnor@tele.ntnu.no> writes:<br>
<br>
> On 21 Jan, 10:07, "Theodor Zouk" <reb...@hotmail.com> wrote:<br>
>> [...]<br>
><br>
> Two ideas come to mind:<br>
><br>
> 1) Loop through each file and build a *total* histogram<br>
> of the values that appear in the total data set.<br>
> Use this histogram to separate the data set by value<br>
> into intermediate files. Sort these files, and merge.<br>
> 2) Find a copy of Knuth's "The Art of Computer Programming,<br>
> Vol. 3: Sorting and Searching" and read it.<br>
<br>
Adding a few thoughts:<br>
<br>
Do you absolutely need a full sort? Could you stop at a histogram like<br>
Rune suggests?<br>
<br>
Do you have a priori knowledge of the range of the values, or even<br>
better the approximate distribution? If you know the distribution, you<br>
can skip the empirical histogram step and just toss all the values into<br>
bins that you determine from your known distribution.<br>
<br>
I would use raw binary files for the scratch space. There will be much<br>
less overhead than with MAT files, and seamless appending of new values.<br>
<br>
<br>
% Skip this if you can say something about your distribution<br>
for n = 1:num_files<br>
    % read file n<br>
    % build histogram, accumulate into "running" histogram; bins must match<br>
end<br>
<br>
% choose number M of scratch files, set "edges" that separate files equally<br>
% (see "histogram equalization", an image processing technique, for help here)<br>
<br>
for m = 1:M<br>
    % fopen scratch files<br>
end<br>
<br>
for n = 1:num_files<br>
    % read file n<br>
    % separate into file bins (use histc)<br>
    % append data sets to all M scratch files<br>
end<br>
<br>
for m = 1:M<br>
    % rewind and read full file m<br>
    % sort<br>
    % output<br>
    % close file m<br>
end<br>
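Filled in concretely, the skeleton might look like the sketch below. The input file names, the bin count M, and the assumed data range are all placeholders; histc's -inf/inf end edges catch every value in between:<br>

```matlab
% Hypothetical setup: data_001.bin .. data_010.bin hold raw doubles
% written with fwrite; all names, counts, and the data range below are
% placeholders, not known details of Theo's data.
num_files = 10;                          % assumed number of input files
M = 8;                                   % assumed number of scratch bins
lo = 0;  hi = 1;                         % assumed overall data range

% Pass 1: accumulate one running histogram over bins shared by all files
edges  = linspace(lo, hi, 1001);
counts = zeros(numel(edges), 1);
for n = 1:num_files
    fid = fopen(sprintf('data_%03d.bin', n), 'r');
    x   = fread(fid, inf, 'double');
    fclose(fid);
    counts = counts + histc(x, edges);
end

% Choose M-1 interior thresholds that split the data roughly equally
% (a crude form of histogram equalization)
c = cumsum(counts) / sum(counts);
binEdges = [-inf, zeros(1, M-1), inf];
for m = 1:M-1
    binEdges(m+1) = edges(find(c >= m/M, 1));
end

% Pass 2: scatter every value into its scratch file
scratch = zeros(1, M);
for m = 1:M
    scratch(m) = fopen(sprintf('scratch_%02d.bin', m), 'w+');
end
for n = 1:num_files
    fid = fopen(sprintf('data_%03d.bin', n), 'r');
    x   = fread(fid, inf, 'double');
    fclose(fid);
    [cnt, bin] = histc(x, binEdges);     % bin index for each value
    for m = 1:M
        fwrite(scratch(m), x(bin == m), 'double');
    end
end

% Pass 3: each scratch file should now fit in memory; sort and append
out = fopen('sorted.bin', 'w');
for m = 1:M
    frewind(scratch(m));
    fwrite(out, sort(fread(scratch(m), inf, 'double')), 'double');
    fclose(scratch(m));
end
fclose(out);
```

The equalized thresholds keep the scratch files roughly equal in size, so the in-memory sort in pass 3 sees predictable chunks regardless of how skewed the data is.<br>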
<br>
<br>
<br>
Peter

Fri, 23 Jan 2009 09:55:04 +0000
Re: sort large amount of data
http://www.mathworks.com/matlabcentral/newsreader/view_thread/242806#623398
Theodor Zouk
Peter Boettcher <boettcher@ll.mit.edu> wrote in message <muy63k8s6d3.fsf@G99Boettcher.llan.ll.mit.edu>...<br>
> [...]<br>
<br>
Thank you all for the tips and information!<br>
<br>
Answers to some question:<br>
<br>
> Do you absolutely need a full sort?<br>
Yes, I do.<br>
<br>
> Do you have a priori knowledge of the range of the values, or even<br>
> better the approximate distribution? <br>
Yes and no. I can get the min and max values for the entire data set. I assume the distribution is unknown, though it can be approximated (but that takes time).<br>
<br>
My solution is this:<br>
<br>
Instead of saving the data to many .mat files, I'm saving it all into one binary file (FOPEN, FWRITE).<br>
I'm using MEMMAPFILE to be able to index this large file (~5 GB). I will take the midpoint of the data range/interval and set it as a threshold, loop through the data, and save the partitioned data into two new files. Now, I know that the SORT function can be applied through MEMMAPFILE up to a certain size of x MB of data (depending on the platform). So when making these new files I also test whether they are less than or equal to x MB. If so, the file is sorted. If not, the file is searched for its min and max value, a new threshold is set, two new files are made and tested, and so on.<br>
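This recursive scheme might be sketched like the function below; the function name, file naming, and the chunked fread in place of MEMMAPFILE are placeholders for the actual setup:<br>

```matlab
function external_sort(fname, maxBytes)
% Recursively partition FNAME (raw doubles) until each piece fits in
% memory, then sort each piece in place.  Reading the pieces back in
% partition order ('.lo' before '.hi') yields the fully sorted data.
info = dir(fname);
if info.bytes <= maxBytes                % small enough: sort in memory
    fid = fopen(fname, 'r');
    x = sort(fread(fid, inf, 'double'));
    fclose(fid);
    fid = fopen(fname, 'w');
    fwrite(fid, x, 'double');
    fclose(fid);
    return
end
chunk = 1e6;                             % values per read; tune to RAM
% Pass 1: find the range one chunk at a time
lo = inf;  hi = -inf;
fid = fopen(fname, 'r');
while true
    x = fread(fid, chunk, 'double');
    if isempty(x), break, end
    lo = min(lo, min(x));  hi = max(hi, max(x));
end
if lo == hi                              % all values equal: already sorted
    fclose(fid);
    return
end
% Pass 2: split on the midpoint of the range into two new files
thresh = (lo + hi) / 2;
frewind(fid);
loF = fopen([fname '.lo'], 'w');
hiF = fopen([fname '.hi'], 'w');
while true
    x = fread(fid, chunk, 'double');
    if isempty(x), break, end
    fwrite(loF, x(x <= thresh), 'double');
    fwrite(hiF, x(x >  thresh), 'double');
end
fclose(fid);  fclose(loF);  fclose(hiF);
external_sort([fname '.lo'], maxBytes);
external_sort([fname '.hi'], maxBytes);
```

Because lo goes to the '.lo' file and hi to the '.hi' file, both pieces are strictly smaller and the recursion terminates; a skewed distribution just makes it deeper, which is exactly why estimating the distribution first could pay off.<br>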
<br>
I don't know whether it would be faster to first approximate the distribution and then select the threshold(s).<br>
<br>
Thanks again for your tips!<br>
Best regards<br>
Theo