
Thread Subject:
sort large amount of data

Subject: sort large amount of data

From: Theodor Zouk

Date: 21 Jan, 2009 09:07:01

Message: 1 of 5

Hi
I have a very large amount of data (real values of class double) that was recorded in order of chronological appearance and saved across many .mat files. I want to sort this data in ascending order. SORT can do this, but only within each .mat file. Ideally, all the .mat files would be combined into one .mat file, loaded into the workspace, and sorted, but that cannot be done because there is not enough memory. Any tips on how I should loop through all the .mat files and end up with many .mat files where the maximum value in one file is lower than the minimum value of the next consecutive file?

Best regards
Theo

Subject: sort large amount of data

From: Rune Allnor

Date: 21 Jan, 2009 10:27:10

Message: 2 of 5

On 21 Jan, 10:07, "Theodor Zouk" <reb...@hotmail.com> wrote:
> Hi
> I have a very large amount of data (real values of class double) that was
> recorded in order of chronological appearance and saved across many .mat
> files. I want to sort this data in ascending order. SORT can do this, but
> only within each .mat file. Ideally, all the .mat files would be combined
> into one .mat file, loaded into the workspace, and sorted, but that cannot
> be done because there is not enough memory. Any tips on how I should loop
> through all the .mat files and end up with many .mat files where the
> maximum value in one file is lower than the minimum value of the next
> consecutive file?

Two ideas come to mind:

1) Loop through each file and build a *total* histogram
   of the values that appear in the total data set.
   Use this histogram to separate the data set by value
   into intermediate files. Sort these files, and merge.
2) Find a copy of Knuth's "The Art of Computer Programming,
   Vol. 3: Sorting and Searching" and read it.

Rune

Subject: sort large amount of data

From: Roger Stafford

Date: 21 Jan, 2009 18:23:02

Message: 3 of 5

"Theodor Zouk" <rebet4@hotmail.com> wrote in message <gl6ojl$1ag$1@fred.mathworks.com>...
> Hi
> I have very large amount of data (reell values of class double) that is registrered over the chronological appearence and saved in many .mat files. I want to sort this data in a ascending order. Now, SORT do this, but it can only do it per each .mat-file. The ideal would be if all the .mat files could be saved into one .mat file and then loaded into workspace and then sorted. But this can not be done when there is not enough memory. Any tips how i should 'loop' through all the mat files and then at the end have many mat files where the maximum value in one mat file is lower than the minimum value of the next consecutive mat file.
>
> Best regards
> Theo

  In merge sorting, for example, the basic step is a comparison between the next elements of two arrays: the smaller of the two is appended to an output array and removed (in effect) from its source array. If the two source arrays are already sorted, the result is a single combined array that is also sorted. Performing this repeatedly, starting from single-element arrays and working upward, eventually yields one entire large sorted array. It is an efficient order n*log2(n) algorithm.
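
  Something like this captures that basic step (just an untested sketch, not meant as production code):

function c = mergesorted(a, b)
% Merge two already-sorted vectors into one sorted vector in a single pass.
c = zeros(numel(a) + numel(b), 1);
i = 1; j = 1;
for k = 1:numel(c)
  if j > numel(b) || (i <= numel(a) && a(i) <= b(j))
    c(k) = a(i); i = i + 1;   % a has the smaller head element (or b is used up)
  else
    c(k) = b(j); j = j + 1;   % b has the smaller head element (or a is used up)
  end
end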

  It seems to me that you can emulate such a procedure with your .mat files playing the role of individual elements within "superarrays" that are never stored in memory all at once. Start with each .mat file sorted within itself. Then bring in two sorted .mat files and merge them into two equally large .mat files that are mutually sorted, in the sense that every element of one is less than every element of the other. That is the analogue of two sorted elements. Continuing this way, all the .mat files are eventually paired off into mutually sorted pairs. The next stage applies the same kind of "supermerging" to obtain sets of four mutually sorted .mat files, always taking care never to have more than four .mat files in computer memory at any one time. Eventually all the .mat files together will represent one entire sorted array, though of course it will be too large to be stored in memory.
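
  Concretely, one pairwise "supermerge" step might look like this (again a rough sketch; the file names are made up, each file is assumed to hold a sorted vector x, and two files are assumed to fit in memory at once):

a = load('part1.mat');  b = load('part2.mat');    % each holds a sorted vector x
merged = mergesorted(a.x(:), b.x(:));             % or simply sort([a.x(:); b.x(:)])
half = floor(numel(merged)/2);
x = merged(1:half);      save('part1.mat', 'x');  % lower half back to the first file
x = merged(half+1:end);  save('part2.mat', 'x');  % upper half back to the second file
% Afterwards max(part1) <= min(part2), and both files remain sorted.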

  The sizes of these .mat files are dictated by the requirement that no more than four of them ever reside in memory at the same time. Obviously there is extra overhead in reading and writing numerous .mat files, but hopefully it is no more time-consuming, and perhaps much less, than internally sorting very long .mat file pairs.

  I hope all that makes sense to you, Theo. I can envision it more easily than I can explain it.

Roger Stafford

Subject: sort large amount of data

From: Peter Boettcher

Date: 21 Jan, 2009 21:21:12

Message: 4 of 5

Rune Allnor <allnor@tele.ntnu.no> writes:

> On 21 Jan, 10:07, "Theodor Zouk" <reb...@hotmail.com> wrote:
> > [...]
>
> Two ideas come to mind:
>
> 1) Loop through each file and build a *total* histogram
> of the values that appear in the total data set.
> Use this histogram to separate the data set by value
> into intermediate files. Sort these files, and merge.
> 2) Find a copy of Knuth's "The Art of Computer Programming,
> Vol. 3: Sorting and Searching" and read it.

Adding a few thoughts:

Do you absolutely need a full sort? Could you stop at a histogram like
Rune suggests?

Do you have a priori knowledge of the range of the values, or even
better the approximate distribution? If you know the distribution, you
can skip the empirical histogram step and just toss all the values into
bins that you determine from your known distribution.

I would use raw binary files for the scratch space. There is much
less overhead than with MAT files, and appending new values is seamless.


% Skip this first pass if you can say something about your distribution.
% (File and variable names below are illustrative.)
edges = linspace(xmin, xmax, 1024);        % common bin edges for every file
counts = zeros(length(edges), 1);
for n = 1:num_files
  s = load(sprintf('data%d.mat', n));      % read file n
  counts = counts + histc(s.x(:), edges);  % accumulate into "running"
end                                        % histogram; bins must match

% Choose the number M of scratch files and set M+1 "edges" that separate
% the files equally: scratch_edges(1) = -inf, scratch_edges(end) = inf.
% (See "histogram equalization", an image processing technique, for
% help here.)

fid = zeros(M, 1);
for m = 1:M
  fid(m) = fopen(sprintf('scratch%d.bin', m), 'w');  % fopen scratch files
end

for n = 1:num_files
  s = load(sprintf('data%d.mat', n));       % read file n
  v = s.x(:);
  [cnt, bin] = histc(v, scratch_edges);     % separate into file bins
  for m = 1:M
    fwrite(fid(m), v(bin == m), 'double');  % append data to scratch file m
  end
end

for m = 1:M
  fclose(fid(m));
  f = fopen(sprintf('scratch%d.bin', m), 'r');  % rewind and read full file m
  x = sort(fread(f, inf, 'double'));            % sort
  fclose(f);                                    % close file m
  save(sprintf('sorted%d.mat', m), 'x');        % output chunk m of the answer
end



-Peter

Subject: sort large amount of data

From: Theodor Zouk

Date: 23 Jan, 2009 09:55:04

Message: 5 of 5

Peter Boettcher <boettcher@ll.mit.edu> wrote in message <muy63k8s6d3.fsf@G99-Boettcher.llan.ll.mit.edu>...
> [...]

Thank you all for the tips and information!

Answers to some questions:

> Do you absolutely need a full sort?
Yes, I do.

> Do you have a priori knowledge of the range of the values, or even
> better the approximate distribution?
Yes and no. I can get the min and max values for the entire data set. I assume the distribution is unknown, though it can be approximated (but that takes time).

My solution is this:

Instead of saving the data to many .mat files, I am saving it all into one binary file (FOPEN, FWRITE), and I am using MEMMAPFILE to be able to index this large file (~5 GB). I then take the midpoint of the data range/interval, set it as a threshold, loop through the data, and split it into two new files (values below the threshold and values above it).

Now, I know that the SORT function can be applied through MEMMAPFILE up to a certain size of x MB of data (depending on the platform). So when making these new files I also test whether they are less than or equal to x MB; if so, the file is sorted. If not, the file is searched for its min and max values, a new threshold is set, two new files are made and tested, and so on.
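
In code, the idea is roughly this (an untested sketch; the function name, file names, and block size are made up, and a real version needs a guard against pieces that stop shrinking, e.g. when many values are equal):

function splitfile(fname, maxbytes)
% Recursively split a raw binary file of doubles about the midpoint of
% its value range until each piece is small enough for an in-memory SORT.
d = dir(fname);
m = memmapfile(fname, 'Format', 'double', 'Writable', true);
N = numel(m.Data);
if d.bytes <= maxbytes
  m.Data(:) = sort(m.Data(:));    % small enough: sort in place and stop
  clear m
  return
end
blk = 1e6;                        % scan in blocks to limit memory use
lo = inf; hi = -inf;
for k = 1:blk:N                   % pass 1: min and max of the whole file
  v = m.Data(k:min(k+blk-1, N));
  lo = min(lo, min(v)); hi = max(hi, max(v));
end
thr = (lo + hi)/2;                % threshold: midpoint of the data range
f1 = fopen([fname '.lo'], 'w');
f2 = fopen([fname '.hi'], 'w');
for k = 1:blk:N                   % pass 2: partition about the threshold
  v = m.Data(k:min(k+blk-1, N));
  fwrite(f1, v(v <= thr), 'double');
  fwrite(f2, v(v >  thr), 'double');
end
fclose(f1); fclose(f2); clear m
splitfile([fname '.lo'], maxbytes);   % recurse on each half until sorted
splitfile([fname '.hi'], maxbytes);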

I don't know whether it would be faster to first approximate the distribution and then select the threshold(s).

Thanks again for your tips!
Best regards
Theo
