Asked by angainor
on 22 Nov 2012

This is a rather lengthy question, previously asked on http://stackoverflow.com/questions/13382155/is-indexing-vectors-in-matlab-inefficient. Since I did not get any definitive answer, I decided to ask it here, where it is more likely that MATLAB engineers will look.

**Background**

My question is motivated by simple observations, which somewhat undermine the beliefs/assumptions often held/made by experienced MATLAB users:

- MATLAB is very well optimized when it comes to the built-in functions and the fundamental language features, such as indexing vectors and matrices.
- Loops in MATLAB are slow (despite the JIT) and should generally be avoided if the algorithm can be expressed in a native, 'vectorized' manner.

The bottom line: core MATLAB functionality is efficient and trying to outperform it using MATLAB code is hard, if not impossible.

**Investigating performance of vector indexing**

The example codes shown below are as fundamental as it gets: I assign a scalar value to all vector entries. First, I allocate an empty vector `x`:

tic; x = zeros(1e8,1); toc Elapsed time is 0.260525 seconds.

Having `x` I would like to set all its entries to the same value. In practice you would do it differently, e.g., `x = value*ones(1e8,1)`, but the point here is to investigate the performance of vector indexing. The simplest way is to write:

tic; x(:) = 1; toc Elapsed time is 0.094316 seconds.

Let's call it method 1 (from the value assigned to `x`). It seems to be very fast (faster at least than memory allocation). Because the only thing I do here is operate on memory, I can estimate the efficiency of this code by calculating the obtained **effective memory bandwidth** and comparing it to the **hardware memory bandwidth** of my computer:

eff_bandwidth = numel(x) * 8 bytes per double * 2 / time

In the above, I multiply by `2` because unless SSE streaming is used, setting values in memory requires that the vector is both read from and written to the memory. In the above example:

eff_bandwidth(1) = 1e8*8*2/0.094316 = 17 Gb/s

STREAM-benchmarked memory bandwidth ( https://www.cs.virginia.edu/stream/ ) of my computer is around 17.9 Gb/s, so indeed - MATLAB delivers close to peak performance in this case! So far, so good.

Method 1 is suitable if you want to set **all** vector elements to some value. But if you want to access elements every `step` entries, you need to substitute the `:` with e.g., `1:step:end`. Below is a direct speed comparison with method 1:

tic; x(1:end) = 2; toc Elapsed time is 0.496476 seconds.

While you would not expect it to perform any different, method 2 is clearly big trouble: factor 5 slowdown for no reason. My suspicion is that in this case MATLAB explicitly allocates the index vector (|1:end|). This is somewhat confirmed by using explicit vector size instead of `end`:

tic; x(1:1e8) = 3; toc Elapsed time is 0.482083 seconds.

Methods 2 and 3 perform equally bad.

Another possibility is to explicitly create an index vector `id` and use it to index `x`. This gives you the most flexible indexing capabilities. In our case:

tic; id = 1:1e8; % colon(1,1e8); x(id) = 4; toc Elapsed time is 1.208419 seconds.

Now that is really something - 12 times slowdown compared to method 1! I understand it should perform worse than method 1 because of the additional memory used for `id`, but why is it so much worse than methods 2 and 3?

Let's give the loops a try - as hopeless as it may sound.

tic; for i=1:numel(x) x(i) = 5; end toc Elapsed time is 0.788944 seconds.

A big surprise - a loop beats a `vectorized` method 4, but is still slower than methods 1, 2 and 3. It turns out that in this particular case you can do it better:

tic; for i=1:1e8 x(i) = 6; end toc Elapsed time is 0.321246 seconds.

And that is the probably the most bizarre outcome of this study - **a MATLAB-written loop significantly outperforms native vector indexing**. That should certainly not be so. Note that the JIT'ed loop is still 3 times slower than the theoretical peak almost obtained by method 1. So there is still plenty of room for improvement. It is just surprising (a stronger word would be more suitable) that usual 'vectorized' indexing (`1:end`) is even slower.

**My attempt to get more insights into the problem**

I do not have an answer to all the problems, but I do have some refined speculations on methods 2, 3 and 4.

Regarding methods 2 and 3. It does indeed seem that MATLAB allocates memory for the vector indices and fills it with values from `1` to `1e8`. To understand it, lets see what is going on. By default, MATLAB uses `double` as its data type. Allocating the index array takes the same time as allocating `x`

tic; x = zeros(1e8,1); toc Elapsed time is 0.260525 seconds.

For now, the index array contains only zeros. Assigning values to the `x` vector in an optimal way, as in method 1, takes `0.094316` seconds. Now, the index vector has to be read from the memory so that it can be used in indexing. That is additional `0.094316/2` seconds. Recall that in `x(:)=1` vector `x` has to be both read from and written to the memory. So only reading it takes half the time. Assuming this is all that is done in `x(1:end)=value`, the total time of methods 2 and 3 should be

t = 0.260525+0.094316+0.094316/2 = 0.402

It is almost correct, but not quite. I can only speculate, but filling the index vector with values is probably done as an additional step and takes additional 0.094316 seconds. Hence, `t=0.4963`, which more or less fits with the time of methods 2 and 3.

These are only speculations, but they do seem to confirm that MATLAB **explicitly** creates index vectors when doing native vector indexing. **Personally, I consider this to be a performance bug**. MATLABs JIT compiler should be smart enough to understand this trivial construct and convert it to a call to a correct internal function. As it is now, on the todays memory bandwidth bounded architectures indexing performs at around 20% theoretical peak.

Regarding method 4, one can try to analyze the performance of the individual steps as follows:

tic; id = 1:1e8; % colon(1,1e8); toc tic x(id) = 4; toc

Elapsed time is 0.475243 seconds. Elapsed time is 0.763450 seconds.

The first step, allocation and filling the values of the index vector takes the same time as methods 2 and 3 alone. It seems that it is way too much - it should take at most the time needed to allocate the memory and to set the values ( `0.260525s+0.094316s = 0.3548s` ), so there is an additional overhead of `0.12` seconds somewhere, which I can not understand. The second part ( `x(id) = 4` ) looks also very inefficient: it should take the time needed to set the values of `x`, and to read the `id` vector ( `0.094316s+0.094316/2s = 0.1415s` ) plus some error checks on the `id` values. Programed in C, the two steps take:

id = 1:n 0.214259 x(id) = 4 0.219768

The code used checks that a `double` index in fact represents an integer, and that it fits the size of `x`:

#include <stdlib.h> #include <stdio.h> #include <string.h>

static struct timeval tb, te;

void tic() { gettimeofday(&tb, NULL); }

void toc(const char *idtxt) { long s,u; gettimeofday(&te, NULL); s=te.tv_sec-tb.tv_sec; u=te.tv_usec-tb.tv_usec; printf("%-30s%10li.%.6li\n", idtxt, (s*1000000+u)/1000000, (s*1000000+u)%1000000); }

int main(int argc, char *argv[]) { double *x = NULL; double *id = NULL; int i, n;

// vector size is a command line parameter n = atoi(argv[1]); printf("x size %i\n", n);

// not included in timing in MATLAB x = calloc(sizeof(double),n); memset(x, 0, sizeof(double)*n);

// create index vector tic(); id = malloc(sizeof(double)*n); for(i=0; i<n; i++) id[i] = i; toc("id = 1:n");

// use id to index x and set all entries to 4 tic(); for(i=0; i<n; i++) { long iid = (long)id[i]; if(iid>=0 && iid<n && (double)iid==id[i]){ x[iid] = 4; } else break; } toc("x(id) = 4"); }

The second step takes longer than the expected `0.1415s` - that is due to the necessity of error checks on `id` values. The overhead of the checks seems too large to me - most likely my code is NOT optimal and it could be written better. Even so, the time required is `0.4340s` , not `1.208419s`. What MATLAB does under the hood - I have no idea. Maybe it is necessary to do it, I just don't see it.

**Questions**

- is simple indexing in MATLAB
**very**inefficient (methods 2, 3, and 4 are slower than method 1), or did I miss something? - why is method 4
**(so much) slower**than methods 2 and 3? - why does using
`1e8`instead of`numel(x)`as a loop bound speed up the code by factor 2?

As I see it now, we can not do anything to improve the situation working within the MATLAB framework. Can we hope the issues will be fixed in some new versions of MATLAB? Or are the described performance problems not considered issues?

**Test code**

function test

tic; x = zeros(1,1e8); toc

tic; x(:) = 1; toc tic; x(1:end) = 2; toc tic; x(1:1e8) = 3; toc

tic; id = 1:1e8; % colon(1,1e8); x(id) = 4; toc

tic; for i=1:numel(x) x(i) = 5; end toc

tic; for i=1:1e8 x(i) = 6; end toc

end

**Edit** The following tests (modifications to methods 1, 2, and 3 using `subsasgn`) have been suggested by Daniel

S = struct('type', '()', 'subs', {{':'}}); tic; x = subsasgn(x, S, 1); toc

S = struct('type', '()', 'subs', {{'1:end'}}); tic; x = subsasgn(x, S, 2); toc

S = struct('type', '()', 'subs', {{'1:1e8'}}); tic; x = subsasgn(x, S, 3); toc

Elapsed time is 0.451859 seconds. Elapsed time is 0.363623 seconds. Elapsed time is 0.359140 seconds.

Now I am even more lost. Definitely, `x(:)` works bad now. But interestingly, methods 2 and 3 work faster than the direct implementation, which suggests that `subasgn` is not used directly when running `x(1:end)` and `x(1:1e8)`. Since my results differ from what Daniel observed, I should mention that I use Matlab 2011b on Linux/Ubuntu.

*No products are associated with this question.*

Answer by Matt J
on 22 Nov 2012

Edited by Matt J
on 22 Nov 2012

Never mind any of that. Try explaining this:

x = zeros(1e8,1);

id = 1:1e8; tic; for i=id x(i) = 4; end toc

%Elapsed time is 101.974125 seconds.

I think the bottom line is this. Vectorized methods are generally going to be slow because they always require you to generate and allocate memory for a dedicated index array. Some vectorized situations are accelerated more than others, however, like

x(1:1e8)=2

because MATLAB recognizes that 1:1e8 is only temporarily required and does a better job of caching it.

For-loops can in some cases do better than vectorized methods because the JIT can recognize certain patterns and situations where the loop is jumping in regular increments and doesn't need to allocate memory for a dedicated index array.

As for why numel(x) slows a for-loop down, I imagine the situation is similar to C/C++. At every iteration of the loop, i must be compared to numel(x), which is stored in some temporary variable, to check if the loop is supposed to terminate. Variables are not cached as well as constants like 1e8.

Show 8 older comments

angainor
on 24 Nov 2012

@Jan @Daniel **: I confirm this. x(1:1e8) does not create 1:1e8 explicitly.**

How did you check that? I **do not confirm that**. I just executed the statements while looking at MATLABs memory consumption and see that the memory usage **clearly goes up** by almost 800Mb. What matlab version are you using?

Matt J
on 24 Nov 2012

**I just executed the statements while looking at MATLABs memory consumption and see that the memory usage clearly goes up by almost 800Mb.**

Confirmed! (in R2012a 64-bit)

Well I guess the plot thickens.

**performance is optimal, hence it is possible and there is no inherent problem with MATLAB - it can do things well.**

Yes, it's possible and maybe it's on their TODO list, but the question is how much development work it would take TMW to make the optimization you're talking about. Basically what you seem to be proving is that all indexing operations, even on native matrices/vectors, generate index vectors and pass those vectors to SUBSASGN, except of course in the particular case where it passes only the string ':'. The reason it is possible to optimize x(:) is because when the ':' string gets passed to SUBSASGN, it can be easily recognized as a special case. What you're now asking is that the JIT recognize colon expressions separately and pass a new/different kind of symbol to subsasgn to encode that. Not sure how much development effort that would require or how much extra burden that would put on the JIT.

There are also many other sub-optimalities you could point out to TMW. What about the following performance difference, for example? Does it make sense that something so slightly different than x(:) should be about a factor of 2 slower?

tic; x(:)=2; toc; %Elapsed time is 0.054764 seconds.

tic; x(:,1)=3; toc; %Elapsed time is 0.098193 seconds.

angainor
on 24 Nov 2012

@Matt J **Does it make sense that something so slightly different than x(:) should be about a factor of 2 slower?**

No :) Hence my concern - indexing in MATLAB should in my opinion be re-designed and re-implemented. My question tried to point it out on the simplest possible example. I hope it is not on a todo list, but already in their working SVN branch ;)

About effort - why not put it here instead of re-writing the user interface, which was not that bad as it was? My right to complain as the customer :)

But really, I do think that indexing is so fundamental, that it needs some serious attention.

Answer by Jan Simon
on 23 Nov 2012

Edited by Jan Simon
on 23 Nov 2012

Loops can be faster than vectorized solutions, when the allocation of temporary arrays is more expensive thant the calculations.

Another difference between `id=1:1e8; x(id)=4` and `x(1:1e8)=4` is the range check: While in the in the first case Matlab checks 1e8 times if the index is inside the boundaries, for the 2nd case only two tests are performed.

Unfortunately this concerns logical indexing also, at least it looks like this. A naive C-Mex function, which checks the length of the logical index ones, is by a fixed noticable factor faster than doing this in Matlab directly. I'm going to post a version in the FEX.

**[EDITED]**

**Version 1:**

id = 1:1e8; x(id) = 4;

This creates 1:1e8 explicitly. In the assignment the elements pf x and the index vector are stored in the processor caches. In addition a boundary check is performed for each index (and this requires much more time with my Win64/MSVC2008 or MSVC2010 or OWC1.8 setup than in your measurements). ==> This is slow.

**Version 2:**

for i = 1:numel(x) x(i) = 5; end

Matlab's JIT accelerates this and the loops avoid the explicit creation of the index vector. In addition the (not existing) index vector does not pollute the caches. It is documented, that "numel(x)" is evaluated once only when the loop starts.

**Version 3:**

for i = 1:1e8 x(i) = 5; end

Obviously the JIT can use the fixed width to improve the speed compared to the version 2. Fine. Perhaps the boundary checks are omitted.

**Version 4:**

x(:) = 5;

Now Matlab can show its power: Because this cannot be affected by any side effects, the JIT calls an efficient method.

For an interpreted language it is not surprising, that the other methods do not have the theoretical memory bandwidth. The power of Matlab is **not** the performance, but the fast prototyping. This is useful, when the time for programming and debugging exceeds the run-time.

Therefore I do not see a performance **bug**. The JIT is not as powerful as it could theoretically be. But Matlab is a very conservative programming environment. Improving the JIT mean modifying it, and this can cause bugs due to the complexity of this task. This concerns other optimizers also, e.g. in C++ compilers: The optimization flags influence the results of large numerical programs. As long as the effects are "small", this is not seen as a *bug*. But Matlab's JIT tries not to change the results. And therefore it has less power.

Btw.:

tic; id = 1:1e6; x(id) = 4; toc % Elapsed time is 0.042795 seconds.

tic; id = uint32(1):uint32(1e6); x(id) = 4; toc % Elapsed time is 0.033133 seconds.

tic; id = true(1,1e6); % Logical indexing x(id) = 4; toc & Elapsed time is 0.025935 seconds.

Show 10 older comments

Matt J
on 26 Nov 2012

@Jan: I haven't seen angainor join you and Daniel in that claim. In fact, we mentioned tests (in Comments to Daniel under my Answer) that seem to refute it. I definitely see x(1:1e8)=2 producing a memory allocation spike in the Task Manager, even when x is already pre-allocated.

And I'm still not seeing the connection between the undocumentedmatlab material and this issue. Some operations are not done inplace, but is there a theory that subsasgn operations are not?

Jan Simon
on 26 Nov 2012

@Matt J: I've copied the wrong link. Actually I meant http://undocumentedmatlab.com/blog/matrix-processing-performance/.

I see that there are arguments for both theories about the (avoided) explicit allocation in "x(1:1e8)". All I've heard about the JIT let me think, that it depends on the specific code. E.g. it can matter if the code runs in the command line or in a function. Therefore I still think, that this discussion is theoretical only, not to say fruitless. We cannot influence the JIT directly, the JIT differs from release to release, TMW has to consider thousnads of different aspects when modifying it. And finally filling the memory with a scalar is most likely not a bottleneck in any real-world program, because this redundant information could be handles much more efficiently.

@angainor: Please lead me back to the actual point: How can we help you or what is the core of your message?

angainor
on 26 Nov 2012

@MattJ @Jan To clarify, I do see memory being allocated when running `x(1:1e8)`

**Please lead me back to the actual point: How can we help you or what is the core of your message?**

My initial intention was to (maybe) get an answer from an engineer at TMW on the issue. I think I have managed to show that indexing is inefficient (5 times overhead over a C loop is in my opinion not reasonable). What I wanted to know is:

- is MATLAB indeed implementing loops by explicitly generating integer indices and storing them as doubles?
- is it a bug that using
`numel(x)`as a loop bound instead of 1e8 slows down the code by factor 2? Should I report it? - how should I report the indexing problem as a whole to TMW so that they seriously consider it?

I have some prior experience with the tech support answering my questions, and let's just say sometimes the questions were too technical to be answered just by any person. This one definitely is a tough one, both technically and from the point of view of future MATLAB design decisions, so I suspect I will have trouble getting it through the technical support. I guess I will do that anyway, now that it is well documented here.

Answer by Daniel
on 23 Nov 2012

MATLAB is not as simple as you suggest. You were looking for the extra read and write overhead you observe with `x(1:end)` and `x(1:1e8)`. I would suggest that this overhead is from `subsasgn` and I would add two additional tests to your list. According to the documentation `x(:)` should be equivalent to

S = struct('type', '()', 'subs', {{':'}}); tic; x = subsasgn(x, S, 4); toc

and the other three examples should be equivalent to

S = struct('type', '()', 'subs', {{1:1e8}}); tic; x = subsasgn(x, S, 5); toc

On my machine

S = struct('type', '()', 'subs', {{':'}}); tic; x = subsasgn(x, S, 4); toc

is slower than

x(:) = 1;

and

S = struct('type', '()', 'subs', {{1:1e8}}); tic; x = subsasgn(x, S, 5); toc

is slower than

x(1:end) = 2;

and

x(1:1e8) = 3;

but the same speed as

id = 1:1e8; % colon(1,1e8); x(id) = 4;

This suggests that MATLAB speeds up `x(:)`, `x(1:end)` and `x(1:1e8)`, possibly by avoiding the overhead of `subsasgn`. I think this is probably the case of `x(:)`. For `x(1:end)` and `x(1:1e8)`, however, I think `subsasgn` is being used since

S = struct('type', '()', 'subs', {{':'}}); tic; x = subsasgn(x, S, 4); toc

takes about the same time as both

x(1:end) = 2;

and

x(1:1e8) = 3;

suggesting some sort of minimum overhead with the `subsasgn` route. This to me suggests that MATLAB is not calling `subsasgn` like the documentation says. This is not surprising since `subsasgn` is a tricky beast in that it cannot be overloaded and the JIT helps where it can.

As for the loops who knows what the JIT is doing. Some things are optimized and some things are not. Some things that were optimized in one version become un-optimized in another version.

Matt J
on 23 Nov 2012

If x(1:1e8) is not creating a temporary index array 1:1e8, as you argued in your Comment to me, then I don't see how it could be calling subsasgn. If it's calling subsasgn, what 'subs' data is being passed to it? You mean it's translating x(1:1e8) into x(:) and then calling SUBSASGN with subs=':'? In my timing results it's taking a really long time to do that translation, on the order of .01 sec.

Incidentally, I'm not clear what you meant by "subsasgn is a tricky beast in that it cannot be overloaded", since people overload it all the time when writing classdefs.

Daniel
on 23 Nov 2012

Yes, subsasgn can be (and often is) overload for custom classes, but it cannot be overloaded for builtin classes. According to the documentation: "Therefore, if A is an array of class double, and there is an @double/subsasgn method on your MATLAB path, the statement A(I) = B calls the MATLAB built-in subsasgn function."

As for x(1:1e8) creating a temporary variable, I am looking into that now. I don't know what the input to `subsasgn` is, but it doesn't seem to require much memory and it appears to be able to handle so transformations (e.g., 1:M:N) without creating a temporary variable ...

angainor
on 24 Nov 2012

Daniel, thanks a lot for the new tests and your observation. I have updated my question with new results. As you will see, they are not like yours. In my case `subsasgn` was faster than `x(1:end)` and `x(1:1e8)`, but much slower than `x(:)` - here the time is the same as original methods 2 and 2 (|x(1:end)|=).

More confused.

About **not calling subsasgn**. I would love that. That is all I want. The matrix/vector indexing to be efficient as in the original

As a side note. Looking at the memory usage of MATLAB in system monitor I see that the memory consumption goes up by ~800MB when I execute `x(1:end)=val`.

Answer by Daniel
on 26 Nov 2012

The fact that different means of indexing take different amount of time is interesting, but probably not that useful to understand. The timings appear to depend on the MATLAB version and possibly OS. In my opinion optimizing MATLAB for

x(1:n) = a;

would be a big mistake. I would much rather have MATLAB optimized for something like

x(randi(n, n/100, 1)) = randn(n, 1);

which I don't think its the same problem.

angainor
on 26 Nov 2012

You are of course right in that it is a simplistic example, maybe not that useful. But I often use constructs like those:

x(1:end-1,:) x(:,1:2:end,:)-x(:,2:2:end,:) x(1:5,:) x(x>0) = val

and so on. And apparently I should be very careful not to use this

x(1:end)

which is silly, but I will not complain about it too much.

Your example with random indices is also not very realistic. But apart from the overhead of `rand` and terribly unlocalized memory access that kills the performance due to cache misses (that is users fault, not TMW), the indexing itself is **exactly** what I tested in method 4:

id = 1:numel(x); x(id) = 1;

and that turned out to be (at least comparing to my C code) 4 times too slow.

I only complain about it, because indexing is a fundamental feature I am not able to improve upon while staying within the MATLAB framework. I do not complain about an inefficient `sparse` function - I go an implement it in my own MEX file. I can not do it here, and so I am not able to deliver the performance I am expected to deliver with MATLAB. And I would like to.

Answer by Jan Simon
on 13 Apr 2015

Some tests revealed, that even logical indexing can be implemented more efficiently: Fex: CopyMask . This C-Mex function is about 2 to 3 times faster than Matlab's built-in method:

function test X = rand(8000,8000); tic for k = 1:10 Y = CopyMask(X, X > 0.2); end toc

tic for k = 1:10 Y = X(X > 0.2); end toc

Opportunities for recent engineering grads.

## 3 Comments

## Daniel (view profile)

Direct link to this comment:http://www.mathworks.com/matlabcentral/answers/54522#comment_112929

Were all of these tests done in functions or at the command line?

## angainor (view profile)

Direct link to this comment:http://www.mathworks.com/matlabcentral/answers/54522#comment_112944

Daniel, the code used is included in the question.

## Malcolm Lidierth (view profile)

Direct link to this comment:http://www.mathworks.com/matlabcentral/answers/54522#comment_113016

See also Yair's blog http://undocumentedmatlab.com/blog/preallocation-performance/