
Thread Subject:
very large array

Subject: very large array

From: Lorenzo Quadri

Date: 24 Jun, 2013 13:26:08

Message: 1 of 19

Hi, I'm a newbie in MATLAB. I have a very large array (circa 400 million rows and 7 columns of uint8 type), and I have to delete about 100 million rows. I tried this kind of operation, but it's very, very slow.

for i=1:length(dati)
    if (( int ( sum(dati(i,:) ))<355) & range(dati(i,:))>20)
        dati(i,:) = [];
    end
end

I tried copying the elements into another array, but that is slow too.
Thanks

Subject: very large array

From: dpb

Date: 24 Jun, 2013 13:48:31

Message: 2 of 19

On 6/24/2013 8:26 AM, Lorenzo Quadri wrote:
> Hi, I'm a newbie in MATLAB. I have a very large array (circa 400 million
> rows and 7 columns of uint8 type), and I have to delete about 100 million
> rows. I tried this kind of operation, but it's very, very slow.
>
> for i=1:length(dati)
>     if (( int ( sum(dati(i,:) ))<355) & range(dati(i,:))>20)
>         dati(i,:) = [];
>     end
> end
>
> I tried copying the elements into another array, but that is slow too.
> Thanks

% Mark rows with one vectorized test, then delete them all at once.
% Sum as double: summing uint8 natively saturates at 255.
ix = sum(dati, 2, 'double') < 355 & range(dati, 2) > 20;
dati(ix, :) = [];

--

Subject: very large array

From: dpb

Date: 24 Jun, 2013 14:11:21

Message: 3 of 19

On 6/24/2013 8:26 AM, Lorenzo Quadri wrote:
> Hi, I'm a newbie in MATLAB. I have a very large array (circa 400 million
> rows and 7 columns of uint8 type), and I have to delete about 100 million
...

To do it w/o a loop see the other posting; I'll just note that the above is
~2.6 GB, so it's likely going to be slow no matter what... I presume you're
running a 64-bit OS and version of ML? What does the

memory

command at the command line indicate when you have this puppy loaded,
out of curiosity?

I'm wondering if you'd get better performance if you loaded only 10% or so
at a time and did the operation on that.

What do you propose to do w/ such a large dataset once you're done,
anyway? Information-wise, what's in the last 0.1 GB that wasn't already
available in the first?

--

Subject: very large array

From: Steven_Lord

Date: 24 Jun, 2013 14:19:40

Message: 4 of 19



"Lorenzo Quadri" <quadrilo_sub_r@gmail.com> wrote in message
news:kq9hdf$bsa$1@newscl01ah.mathworks.com...
> Hi, I'm a newbie in MATLAB. I have a very large array (circa 400 million
> rows and 7 columns of uint8 type), and I have to delete about 100 million
> rows. I tried this kind of operation, but it's very, very slow.
>
> for i=1:length(dati)
> if (( int ( sum(dati(i,:) ))<355) & range(dati(i,:))>20)
> dati(i,:) = [];
> end
> end

Not only will this be slow, it will also error. If you have a 10 row array:

xv = (1:10).';
X = [xv, xv.^2]
size(X)

and you delete one row, you now have a 9 row array:

X(3, :) = []
size(X)

In your code, length(dati) is NOT evaluated each time the loop body executes
but is fixed when the loop STARTS executing. Thus you'd walk off the end of
the array if any of the rows are deleted.

So you want to eliminate rows whose sum is less than 355 and whose maximum
and minimum elements are more than 20 apart? Use logical indexing on the
whole array at once rather than row-by-row. Compute along the rows by
specifying a dimension input argument to SUM and RANGE.

rowsums = sum(dati, 2, 'double');
rowranges = range(dati, 2);
dati(rowsums < 355 & rowranges > 20, :) = [];

While you could do this all on one line, I broke the two conditions out so
you could experiment with a smaller dati to prove to yourself that it works
and that you understand what the code is doing.
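
A minimal sketch of that experiment, assuming a small random uint8 array
stands in for dati (the 10-by-7 size and the random values are illustrative,
not from the thread):

dati = uint8(randi(255, 10, 7));    % small stand-in for the real data
rowsums = sum(dati, 2, 'double');   % 'double' avoids uint8 saturation at 255
rowranges = range(dati, 2);         % max - min of each row
dati(rowsums < 355 & rowranges > 20, :) = []   % delete rows meeting both tests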

--
Steve Lord
slord@mathworks.com
To contact Technical Support use the Contact Us link on
http://www.mathworks.com

Subject: very large array

From: Lorenzo Quadri

Date: 24 Jun, 2013 15:02:07

Message: 5 of 19

"Steven_Lord" <slord@mathworks.com> wrote in message
>
> Not only will this be slow, it will also error. If you have a 10 row array:
>
> xv = (1:10).';
> X = [xv, xv.^2]
> size(X)
>
> and you delete one row, you now have a 9 row array:
>
> X(3, :) = []
> size(X)
>
> In your code, length(dati) is NOT evaluated each time the loop body executes
> but is fixed when the loop STARTS executing. Thus you'd walk off the end of
> the array if any of the rows are deleted.
>
> So you want to eliminate rows whose sum is less than 355 and whose maximum
> and minimum elements are more than 20 apart? Use logical indexing on the
> whole array at once rather than row-by-row. Compute along the rows by
> specifying a dimension input argument to SUM and RANGE.
>
> rowsums = sum(dati, 2, 'double');
> rowranges = range(dati, 2);
> dati(rowsums < 355 & rowranges > 20, :) = [];
>
> While you could do this all on one line, I broke the two conditions out so
> you could experiment with a smaller dati to prove to yourself that it works
> and that you understand what the code is doing.
>
> --
> Steve Lord
> slord@mathworks.com
> To contact Technical Support use the Contact Us link on
> http://www.mathworks.com

Yes, thank you, your comment is right.
I replaced the for loop with a while loop,
like:

a=length(dati)

while (a>0)
    if((sum(dati(i,:))<355) & range(dati(i,:))>20)
        a = a - 1;
        dati(i,:) = [];
    end
end

I'll try it with logical indexing as you suggested.
Copying the array into another one takes about 1.5 hrs every 1,000,000 iterations,
so the whole operation is estimated at about 300 hrs, or 12.5 days (too much; conditions change in the meantime).


Thank you very much

Subject: very large array

From: Lorenzo Quadri

Date: 24 Jun, 2013 15:11:07

Message: 6 of 19

"Lorenzo Quadri" <quadrilo_sub_r@gmail.com> wrote in message <kq9n1f$smu$1@newscl01ah.mathworks.com>...
> "Steven_Lord" <slord@mathworks.com> wrote in message
> >
> > Not only will this be slow, it will also error. If you have a 10 row array:
> >
> > xv = (1:10).';
> > X = [xv, xv.^2]
> > size(X)
> >
> > and you delete one row, you now have a 9 row array:
> >
> > X(3, :) = []
> > size(X)
> >
> > In your code, length(dati) is NOT evaluated each time the loop body executes
> > but is fixed when the loop STARTS executing. Thus you'd walk off the end of
> > the array if any of the rows are deleted.
> >
> > So you want to eliminate rows whose sum is less than 355 and whose maximum
> > and minimum elements are more than 20 apart? Use logical indexing on the
> > whole array at once rather than row-by-row. Compute along the rows by
> > specifying a dimension input argument to SUM and RANGE.
> >
> > rowsums = sum(dati, 2, 'double');
> > rowranges = range(dati, 2);
> > dati(rowsums < 355 & rowranges > 20, :) = [];
> >
> > While you could do this all on one line, I broke the two conditions out so
> > you could experiment with a smaller dati to prove to yourself that it works
> > and that you understand what the code is doing.
> >
> > --
> > Steve Lord
> > slord@mathworks.com
> > To contact Technical Support use the Contact Us link on
> > http://www.mathworks.com
>
> Yes, thank you, your comment is right,
> I replaced the for loop with a while one
> like:
>
> a=length(dati)
>
> while (a>0)
> if((sum(dati(i,:))<355) & range(dati(i,:))>20)
> a = a - 1;
> dati(i,:) = [];
> end
> end
>
> I'll try evaluate it with logical indexing as you suggested.
> Copy the array in an other one take about 1,5 hrs every 1000000 iterations
> so the whole operation is exstimated in about 300 hrs or 12,5 days (too much, conditions change meanwhile).
>
>
> thank you very much

Oops, sorry:

while (a>0)
    if((sum(dati(i,:))<355) & range(dati(i,:))>20)
        dati(i,:) = [];
    end
    a = a - 1;
end

Subject: very large array

From: dpb

Date: 24 Jun, 2013 16:13:24

Message: 7 of 19

On 6/24/2013 10:11 AM, Lorenzo Quadri wrote:
...
> Oops, sorry:
>
> while (a>0)
>     if((sum(dati(i,:))<355) & range(dati(i,:))>20)
>         dati(i,:) = [];
>     end
>     a = a - 1;
> end

Now you're not incrementing i so you'll process the same row over and
over and over...

While doing it with a loop is _NOT_ the way for large cases, sometimes it
is handy and time isn't an issue for small array sizes. The general way
to do such things is to start at the end and work backwards--that way the
lower indices aren't affected by the deleted rows...

for i = size(dati,1):-1:1
   if (sum(dati(i,:), 'double') < 355) && (range(dati(i,:)) > 20)
     dati(i,:) = [];
   end
end

Now it won't run off the end (but it will still run a _loooong_ time,
methinks)...

See the other comments on "why", and perhaps a quicker way of breaking it
into chunks if you're causing memory paging w/ the full array in (virtual)
memory.

--

Subject: very large array

From: Lorenzo Quadri

Date: 24 Jun, 2013 16:27:25

Message: 8 of 19

dpb <none@non.net> wrote in message <kq9r5n$21t$1@speranza.aioe.org>...

> Now you're not incrementing i so you'll process the same row over and
> over and over...
...
>

Thanks for your answer :). I'm actually decrementing the variable i (just not in the piece of code I posted as an example; excuse me for that :)). It seems that shrinking the array really does take a long time, as if MATLAB rebuilds the array, or some sort of index into it, every time.

Thanks a lot

Subject: very large array

From: Lorenzo Quadri

Date: 24 Jun, 2013 17:47:14

Message: 9 of 19

dpb <none@non.net> wrote in message <kq9k0t$bg6$1@speranza.aioe.org>...



> To do it w/o a loop see the other posting; I'll just note that the above is
> ~2.6 GB, so it's likely going to be slow no matter what... I presume you're
> running a 64-bit OS and version of ML? What does the
>

Yes
 
> memory
>

I've 16 GB physical memory and 50 GB virtual,
but deleting some rows of the 600,000,000x5 array is a very painful experience
(at least once the 16 GB of physical memory is occupied).

How do I programmatically split the array into 60 chunks of 10 million rows each?

Subject: very large array

From: dpb

Date: 24 Jun, 2013 21:14:47

Message: 10 of 19

On 6/24/2013 11:27 AM, Lorenzo Quadri wrote:
...

> Thanks for your answer :). I'm actually decrementing the variable i
> (just not in the piece of code I posted as an example; excuse me for
> that :))

Oh...ok. Can only comment on what I can see... :)

> It seems that shrinking the array
> really does take a long time, as if MATLAB rebuilds the array, or some
> sort of index into it, every time.

That's a consequence of MATLAB's memory allocation and of being interpreted
rather than compiled. Even then you'd have to do things carefully, or the
compiler would have to be able to "see" and recognize that the final memory
reallocation could wait until the end instead of "squeezing" the empty
memory out on each pass through the loop.
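
A minimal sketch of that cost, timed on a modest stand-in array (the size
and the delete-every-other-row test here are illustrative):

A = uint8(randi(255, 2e5, 7));          % modest stand-in for the real data
B = A;
tic                                      % row-by-row deletion: A is recopied on every hit
for i = size(A,1):-1:1
    if mod(i, 2) == 0
        A(i,:) = [];
    end
end
toc
tic                                      % one logical deletion: the data moves once
B(mod(1:size(B,1), 2) == 0, :) = [];
toc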

As noted, the way to go is the logical indexing instead of the loop,
although that again comes back to how well the interpreter can avoid
unneeded moves.

--

Subject: very large array

From: dpb

Date: 24 Jun, 2013 21:20:39

Message: 11 of 19

On 6/24/2013 12:47 PM, Lorenzo Quadri wrote:
...
> I've 16 GB physical memory and 50 GB virtual, but deleting some rows of
> the 600,000,000x5 array is a very painful experience (at least once the
> 16 GB of physical memory is occupied).
>
> How do I programmatically split the array into 60 chunks of 10 million
> rows each?

Is the file formatted or unformatted? It would process faster unformatted,
if you can generate it in that fashion.

The basic idea is to read in some number N of lines/records and do the
reduction on them, where N is, say, 100k or so. Do you need all of these
data in memory at once, or what are you doing with them?
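
A minimal sketch of that idea for an array already in the workspace, using
the thread's names and thresholds (the chunk size N is illustrative):

N = 100e3;                            % rows per chunk
nrows = size(dati, 1);
keep = true(nrows, 1);                % survival mask, built chunk by chunk
for k = 1:N:nrows
    r = k:min(k+N-1, nrows);          % row range of this chunk
    s = sum(dati(r,:), 2, 'double');  % 'double' avoids uint8 saturation
    keep(r) = ~(s < 355 & range(dati(r,:), 2) > 20);
end
dati = dati(keep, :);                 % one copy, at the end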

Oh...another question--can you do the logic discrimination test when the
data are first created/read/whatever it is that gets them, so that you
don't have to post-process the humongous file to eliminate them later?
The fastest code is that which is never executed... :)

--

Subject: very large array

From: James Tursa

Date: 24 Jun, 2013 23:26:08

Message: 12 of 19

dpb <none@non.net> wrote in message <kq9r5n$21t$1@speranza.aioe.org>...
>
> While doing it with a loop is _NOT_ the way for large cases, sometimes it
> is handy and time isn't an issue for small array sizes. The general way
> to do such things is to start at the end and work backwards--that way the
> lower indices aren't affected by the deleted rows...
>
> for i = size(dati,1):-1:1
>    if (sum(dati(i,:), 'double') < 355) && (range(dati(i,:)) > 20)
>      dati(i,:) = [];
>    end
> end

I understand what you are saying about using a loop on small datasets, but IMO this is never a good way to program in MATLAB, even on small datasets. The statement dati(i,:) = [] causes the entire contents of dati (minus one row) to be copied to a brand new block of memory. Doing so in a loop can easily dominate the running time. For small sizes you don't notice it, of course, but for large cases it can easily consume 99.99% of the running time.

I would always opt for logical indexing to mark the data you want deleted, then delete it all in one fell swoop. That way the data is only copied once, not a gazillion times. Even if you know the dataset is small, there is the chance you would reuse the code later on for a large dataset and *forget* you had that inefficient loop in there.

Bottom line: I would always avoid the loop construct... it is the same problem as growing an array inside a loop. Use a loop to mark the rows if you have to (or use a one-liner if you can), but don't do the data deletion itself in the loop.
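
A minimal sketch of that mark-then-delete pattern, using the thread's
variable names (the explicit loop is only for illustration; dpb's one-liner
does the same marking):

kill = false(size(dati, 1), 1);       % rows to delete, marked one at a time
for i = 1:size(dati, 1)
    kill(i) = (sum(dati(i,:), 'double') < 355) && (range(dati(i,:)) > 20);
end
dati(kill, :) = [];                   % the data moves only once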

James Tursa

Subject: very large array

From: dpb

Date: 25 Jun, 2013 02:19:51

Message: 13 of 19

On 6/24/2013 6:26 PM, James Tursa wrote:
...
> I understand what you are saying about using a loop on small datasets,
> but IMO this is never a good way to program in MATLAB, even on small
> datasets....

I don't recall saying it was a "good" way... just that it won't be
terribly noticeable on small datasets.

And, yes, I'd never program that way in MATLAB unless I had some
ulterior motive, as in this posting, where it was a pedagogical tool...

--

Subject: very large array

From: Lorenzo Quadri

Date: 25 Jun, 2013 06:18:18

Message: 14 of 19

dpb <none@non.net> wrote in message <kqaumh$4ko$1@speranza.aioe.org>...
...
> I don't recall saying it was a "good" way... just that it won't be
> terribly noticeable on small datasets.

I can't do the logic discrimination on the dataset while loading, because it is created by another program as a complete array (MATLAB workspace format).
I can iterate over the whole dataset and mark the rows I want to delete, but then I have to delete those rows as fast as possible.

Subject: very large array

From: James Tursa

Date: 25 Jun, 2013 06:32:07

Message: 15 of 19

"Lorenzo Quadri" <quadrilo_sub_r@gmail.com> wrote in message <kqbcna$93q$1@newscl01ah.mathworks.com>...
> dpb <none@non.net> wrote in message <kqaumh$4ko$1@speranza.aioe.org>...
> > On 6/24/2013 6:26 PM, James Tursa wrote:
> > > dpb <none@non.net> wrote in message <kq9r5n$21t$1@speranza.aioe.org>...
> > >>
> > >> While doing it by a loop is _NOT_ the way for large cases, sometimes
> > >> it is handy and time isn't an issue for small array sizes. The way in
> > >> general to do such things is to start at the end and progress
> > >> forwards--that way the lower indices aren't affected by the deleted
> > >> rows...
> > >>
> > >> for i=length(dat):-1:1
> > >> if((sum(dati(i,:))<355) & range(dati(i,:))>20)
> > >> dati(i,:) = [];
> > >> end
> > >> end
> > >
> > > I understand what you are saying about using a loop on small datasets,
> > > but IMO this is never a good way to program in MATLAB, even on small
> > > datasets....
> >
> > I don't recall saying it was a "good" way...just that it won't be
> > terribly noticeable on small datasets.
> >
> > And, yes, no, I'd never program that way in Matlab unless I had some
> > ulterior motive as in this posting as a pedagogical tool...
> >
> > --
>
> I can't do logic discrimination on dataset on loading because is created by an other program as complete array (matlab workspace format).
> I can iterate the whole dataset and mark zeros the rows I want to delete, but then I've to delete the rows in possible fast way.

The fastest way is likely what dpb already posted: the vectorized logical indexing method.

ix = sum(dati, 2, 'double') < 355 & range(dati, 2) > 20;
dati(ix, :) = [];

James Tursa

Subject: very large array

From: dpb

Date: 25 Jun, 2013 12:35:59

Message: 16 of 19

On 6/25/2013 1:18 AM, Lorenzo Quadri wrote:
...

> I can't do the logic discrimination on the dataset while loading, because
> it is created by another program as a complete array (MATLAB workspace
> format). I can iterate over the whole dataset and mark the rows I want to
> delete, but then I have to delete those rows as fast as possible.

Can you not change the other program, or run it in smaller sections, or
have it output intermediate results on occasion, or pipe the results
during execution, or...???

If you're insistent on trying to do such an operation on such a large
dataset in one swell foop, it's bound to be slow...

I've no idea whether ML, w/ some assistance, could manage to make use of
multiple cores somehow or not--

--

Subject: very large array

From: Lorenzo Quadri

Date: 25 Jun, 2013 17:26:08

Message: 17 of 19

"James Tursa" wrote in message <kqbdh7$amk$1@newscl01ah.mathworks.com>...
> "Lorenzo Quadri" <quadrilo_sub_r@gmail.com> wrote in message <kqbcna$93q$1@newscl01ah.mathworks.com>...
> > dpb <none@non.net> wrote in message <kqaumh$4ko$1@speranza.aioe.org>...
> > > On 6/24/2013 6:26 PM, James Tursa wrote:
> > > > dpb <none@non.net> wrote in message <kq9r5n$21t$1@speranza.aioe.org>...
> > > >>
> > > >> While doing it by a loop is _NOT_ the way for large cases, sometimes
> > > >> it is handy and time isn't an issue for small array sizes. The way in
> > > >> general to do such things is to start at the end and progress
> > > >> forwards--that way the lower indices aren't affected by the deleted
> > > >> rows...
> > > >>
> > > >> for i=length(dat):-1:1
> > > >> if((sum(dati(i,:))<355) & range(dati(i,:))>20)
> > > >> dati(i,:) = [];
> > > >> end
> > > >> end
> > > >
> > > > I understand what you are saying about using a loop on small datasets,
> > > > but IMO this is never a good way to program in MATLAB, even on small
> > > > datasets....
> > >
> > > I don't recall saying it was a "good" way...just that it won't be
> > > terribly noticeable on small datasets.
> > >
> > > And, yes, no, I'd never program that way in Matlab unless I had some
> > > ulterior motive as in this posting as a pedagogical tool...
> > >
> > > --
> >
> > I can't do logic discrimination on dataset on loading because is created by an other program as complete array (matlab workspace format).
> > I can iterate the whole dataset and mark zeros the rows I want to delete, but then I've to delete the rows in possible fast way.
>
> The fastest way is likely what dpb already posted. The vectorized logical index method:
>
> ix=sum(dat,2)<355 & range(dat,2)>20;
> dat(ix,:)=[];
>
> James Tursa

Thanks a lot, it works great: about 20 seconds to delete 300,000,000 rows.
Thanks dpb and Tursa

Subject: very large array

From: dpb

Date: 25 Jun, 2013 20:13:33

Message: 18 of 19

On 6/25/2013 12:26 PM, Lorenzo Quadri wrote:
...
> Thanks a lot, it works great: about 20 seconds to delete 300,000,000 rows.
> Thanks dpb and Tursa

I'm surprised it's that fast...but I've an old, slow 32-bit machine by
today's standards...

--

Subject: very large array

From: Lorenzo Quadri

Date: 26 Jun, 2013 06:53:07

Message: 19 of 19

dpb <none@non.net> wrote in message <kqctlb$kas$1@speranza.aioe.org>...

> I'm surprised it's that fast...but I've an old, slow 32-bit machine by
> today's standards...
>
> --

I believe it's an operation that requires much less memory. I have 16 GB of physical memory,
and the problems start when you have to use virtual memory. I noticed that the processor is never stressed very much (and ML does not seem to use multiple cores efficiently).
Thanks
