Path: news.mathworks.com!newsfeed-00.mathworks.com!panix!bloom-beacon.mit.edu!llnews!53ab2750!not-for-mail
From: Peter Boettcher <boettcher@ll.mit.edu>
Newsgroups: comp.soft-sys.matlab
Subject: Re: sort large amount of data
References: <gl6ojl$1ag$1@fred.mathworks.com>
	<0d6ff837-2778-4524-a1af-5753c05b9b99@r37g2000prr.googlegroups.com>
Message-ID: <muy63k8s6d3.fsf@G99-Boettcher.llan.ll.mit.edu>
Organization: MIT Lincoln Laboratory
User-Agent: Gnus/5.110006 (No Gnus v0.6) Emacs/23.0.0 (gnu/linux)
Cancel-Lock: sha1:ski7DvN3h/k1xDBpItB0LJcNinY=
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Lines: 69
Date: Wed, 21 Jan 2009 16:21:12 -0500
NNTP-Posting-Host: 155.34.163.93
X-Complaints-To: news@ll.mit.edu
X-Trace: llnews 1232572872 155.34.163.93 (Wed, 21 Jan 2009 16:21:12 EST)
NNTP-Posting-Date: Wed, 21 Jan 2009 16:21:12 EST
Xref: news.mathworks.com comp.soft-sys.matlab:513042


Rune Allnor <allnor@tele.ntnu.no> writes:

> On 21 Jan, 10:07, "Theodor Zouk" <reb...@hotmail.com> wrote:
>> Hi I have very large amount of data (reell values of class double)
>> that is registrered over the chronological appearence and saved in
>> many .mat files. I want to sort this data in a ascending order. Now,
>> SORT do this, but it can only do it per each .mat-file. The ideal
>> would be if all the .mat files could be saved into one .mat file and
>> then loaded into workspace and then sorted. But this can not be done
>> when there is not enough memory. Any tips how i should 'loop' through
>> all the mat files and then at the end have many mat files where the
>> maximum value in one mat file is lower than the minimum value of the
>> next consecutive mat file.
>
> Two ideas come to mind:
>
> 1) Loop through each file and build a *total* histogram
>    of the values that appear in the total data set.
>    Use this histogram to separate the data set by value
>    into intermediate files. Sort these files, and merge.
> 2) Find a copy of Knut's "The art of Computer Programming:
>    Sorting and Searching" and read.

Adding a few thoughts:

Do you absolutely need a full sort?  Could you stop at a histogram like
Rune suggests?

Do you have a priori knowledge of the range of the values, or even
better the approximate distribution?  If you know the distribution, you
can skip the empirical histogram step and just toss all the values into
bins that you determine from your known distribution.

I would use raw binary files for the scratch space.  There will be much
less overhead than MAT files, and seemless appending of new values.


% Skip this if you can say something about your distribution
for n=1:num_files
  % read file n
  % build histogram, accumulate into "running" histogram.  bins must
  %  match
end

% choose number M of scratch files, set "edges" that separate files
%  equally
%  (see "histogram equalization", an image processing technique, for
%  help here)

for m=1:M
  % fopen scratch files
end

for n=1:num_files
  % read file n
  % separate into file bins (use histc)
  % append data sets to all M scratch files
end

for m=1:M
  % rewind and read full file m
  % sort
  % output
  % close file m
end



-Peter