Path: news.mathworks.com!not-for-mail
From: <HIDDEN>
Newsgroups: comp.soft-sys.matlab
Subject: Re: sort large amount of data
Date: Wed, 21 Jan 2009 18:23:02 +0000 (UTC)
Organization: The MathWorks, Inc.
Lines: 17
Message-ID: <gl7p66$8t7$1@fred.mathworks.com>
References: <gl6ojl$1ag$1@fred.mathworks.com>
Reply-To: <HIDDEN>
NNTP-Posting-Host: webapp-02-blr.mathworks.com
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 8bit
X-Trace: fred.mathworks.com 1232562182 9127 172.30.248.37 (21 Jan 2009 18:23:02 GMT)
X-Complaints-To: news@mathworks.com
NNTP-Posting-Date: Wed, 21 Jan 2009 18:23:02 +0000 (UTC)
X-Newsreader: MATLAB Central Newsreader 1187260
Xref: news.mathworks.com comp.soft-sys.matlab:512995

"Theodor Zouk" <rebet4@hotmail.com> wrote in message <gl6ojl$1ag$1@fred.mathworks.com>...
> Hi 
> I have very large amount of data (reell values of class double) that is registrered over the chronological appearence and saved in many .mat files. I want to sort this data in a ascending order. Now, SORT do this, but it can only do it per each .mat-file. The ideal would be if all the .mat files could be saved into one .mat file and then loaded into workspace and then sorted. But this can not be done when there is not enough memory. Any tips how i should 'loop' through all the mat files and then at the end have many mat files where the maximum value in one mat file is lower than the minimum value of the next consecutive mat file.
> 
> Best regards 
> Theo

  In merge sorting for example, the basic step is a comparison between the next two elements of two arrays to see which is smaller, with the smaller being selected for insertion in another array and being removed (in a sense) from the original array.  Assuming the two original arrays were already sorted, that results in a single combined array that is sorted.  If this is repeatedly performed, sorted arrays starting from single elements and working up to an entire large sorted array can be obtained.  It is an efficient order n*log2(n) algorithm.

  It seems to me that you can emulate such a procedure with your .mat files being considered as analogous to individual elements within "superarrays" which are not stored in memory simultaneously.  Starting with all the .mat files being sorted within themselves, you can bring in two .mat sorted files, merge them into two equally large .mat files which are mutually sorted in the sense that each element of one is less than each element of the other.  Now you have the analogy to two sorted elements.  By continuing this you eventually have all .mat files paired off into mutually sorted pairs.  The next stage is to do the same kind of "supermerging" to obtain sets of four .mat files which are mutually sorted, always taking care not to have more than four .mat files in computer memory at any one time.  Eventually all the .mat files will represent one entire sorted array, though of course it will be too large to be stored in memory.

  The sizes of these .mat files are dictated by the requirement that no more than four of them will ever be allowed to reside in memory at the same time.  Obviously there is the extra overhead of reading and writing numerous .mat files, but hopefully this should be no more time-consuming and maybe a lot less than the internal sorting of very long .mat file pairs.

  I hope all that makes sense to you, Theo.  I can envision it more easily than I can explain it.

Roger Stafford