Asked by Jan Simon
on 3 Sep 2012

I have created a multi-threaded version of the C-Mex function of FilterM (you do not need to download it to answer the question). The speed-up scales well with the number of processors for large problems. For small problems, however, a *smart* method is required to choose the optimal number of threads, because starting a thread costs a remarkable amount of time. E.g. starting a 2nd thread for filtering the columns of a [1000 x 2] matrix is 50% slower than performing the job in the single main thread. In my example the columns of the input matrix are distributed to the different threads.

The best choice depends on the length of the signal, the number of channels, the order and FIR/IIR type of the filter, the number of available cores, and the current system load.

Matlab uses magic limits for some multi-threaded functions:

- SUM starts one thread per core for a [1 x n] vector with n >= 89000 (consequently there is a slow-down on a single-core CPU)
- FILTER starts one thread per core for matrices of >= 16 columns

A better strategy is to start nCore-1 threads and to process one chunk of the data in the main thread itself.

I know how to solve a multi-parameter optimization problem, but even with that I would obtain optimal parameters for my own processor only. And solving it on the individual client computer (and for the current processor load...) is clearly overkill.

What are standard and smart methods to choose the number of threads for a specific problem?

*No products are associated with this question.*

Answer by Jan Simon
on 5 Mar 2013

Edited by Jan Simon
on 16 Dec 2014

Although multi-threading is an essential topic in the times of multi-core processors, the obvious lack of answers is a clear sign:

There is no sufficiently general strategy for deciding on the number of threads.

Too many factors influence the efficiency of distributing the work to threads:

- Hyper-threading can increase or decrease the processing time.
- The (dynamically changing) number of not busy cores is unknown.
- Turbo-boost can slow down the processing when more cores are active. (Btw., "turbo-boost" is a funny name for a feature which slows down the processor when all cores are busy. It is amusing that the processor manufacturers chose the opposite view, that it runs *faster* when some cores are sleeping.)
- It cannot be estimated whether the caches are exhausted.
- The time to start a new thread differs widely between different processors.

Therefore the following might be a fair solution:

1. Let N be the number of real (or virtual) processors.
2. Split the work into M independent chunks.
3. While there are unprocessed chunks and the number of started threads <= N:
4. Start a new thread, which fetches new chunks autonomously until all data are processed.
5. Back to 3.
6. Close the threads.

Then starting a new but not needed thread wastes time, but only on a core which is not yet working on the problem. A small problem will be solved before all threads are started.

Of course this is not an optimal strategy either, but I think it is more efficient than any fixed relation between the number of independent chunks and the number of cores.

