In a parfor loop, can I call a multi-threaded MEX function and get a speed-up?

I learned about multi-threaded MEX functions from undocumentedmatlab. (That website seems to be inaccessible now ...)
I am wondering whether I can call a multi-threaded MEX function inside a parfor loop.
My current code looks like this:
parfor k = 1:1e6
    result(k) = mex_wrapper(data(k));
end
mex_wrapper.c looks like
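#include "mex.h"

double calculate(double x)
{
    int N = 50;
    double result = 0.0;
    for (int i = 0; i < N; i++)
    {
        //...
    }
    return result;
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    /* data(k) arrives as prhs[0]; the scalar result goes back in plhs[0] */
    plhs[0] = mxCreateDoubleScalar(calculate(mxGetScalar(prhs[0])));
}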
The iterations inside calculate() are independent, so I want to make the subroutine calculate() multi-threaded.
Although I am running parfor in a process-based environment, I am not sure whether a multi-threaded MEX function would conflict with parfor.
So can I use a multi-threaded MEX function in parfor? And would I get a speed-up by doing so?

 Accepted Answer

You should be able to run a multi-threaded MEX file correctly inside a parfor loop. However, you will be oversubscribing your machine. For example, if your machine has 6 cores, your parfor loop will run 6 copies of your MEX function simultaneously. If each of those uses 6 threads, you will have 36 threads active on your machine. This should work, but it will probably be less efficient than having a single-threaded MEX function. (A multithreaded MEX function inside parfor can be more useful when you have a cluster of machines: there, you might run a single worker process per machine, and have each machine run the multithreaded MEX function.)
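The arithmetic in the example above can be written out as a small C sketch. The function names total_threads and oversubscription_factor are hypothetical, chosen only for illustration, and the sketch assumes every worker process runs the same number of threads.

```c
#include <assert.h>

/* Hypothetical helper: number of compute threads active on one machine
   when each of `workers` processes runs `threads_per_worker` threads. */
int total_threads(int workers, int threads_per_worker)
{
    assert(workers > 0 && threads_per_worker > 0);
    return workers * threads_per_worker;
}

/* Hypothetical helper: ratio of active threads to physical cores.
   A value above 1 means the machine is oversubscribed. */
double oversubscription_factor(int workers, int threads_per_worker, int cores)
{
    assert(cores > 0);
    return (double)total_threads(workers, threads_per_worker) / cores;
}
```

With 6 workers and 6 threads each on a 6-core machine, total_threads(6, 6) gives 36 and the factor is 6. With a single-threaded MEX function the factor is 1, meaning the cores are already fully used.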

3 Comments

I am a little confused about "run a single worker process per machine". First, I do not know how to set this up. Second, I think doing this would slow down the whole task.
Here is some more detail about my task. Currently, I am running it on a cluster that has several nodes: node01, node02, etc. I think the term "node" is equivalent to your word "machine". Each node has 2 CPUs, each CPU has 14 cores, and with hyper-threading, that is 56 workers that I can use in a parfor loop. Generally, I run a parfor loop on a single node only, because the communication overhead between different nodes is somewhat expensive. When a single iteration runs really fast, e.g. in 1 second, this communication overhead may not be negligible.
If I understand you correctly and I manage to run a single worker per node, I would run my task over several nodes. Within each node, I would use only 1 worker while leaving the other 55 workers idle. This is inefficient, since the other workers would do nothing.
I am also confused about the word "oversubscribing". I want to do the multi-threading in pure C, i.e. inside the C subroutine calculate(). Why would this multi-threading have anything to do with the MATLAB worker processes? I do not have a background in computer science; I think all I am doing is calling a subroutine.
By the way, I tried maxNumCompThreads() on my local machine, and it returns 6, as you said. Can I use more threads, e.g. 50, in C? As I said above, in my view the threads in C have nothing to do with maxNumCompThreads().
Yes, "node" and "machine" are the same thing.
Generally, hyperthreading doesn't actually offer much practical benefit for most MATLAB computations, which is why maxNumCompThreads returns the number of physical cores. This is the number of computational threads MATLAB uses for operations like fft. It's also the default number of processes parpool('local') will launch.
"Oversubscription" is the key concept here. Essentially, this is about how many operations you're asking a given node to perform simultaneously (i.e. how many threads are running on the node). If you ask a node to perform more operations simultaneously than it has hardware cores, then some of those operations must wait. So, if you run parpool('local'), each worker process will (by default) have a single computational thread. With one worker per physical core, this will fully occupy the CPU of that node. If your MEX file runs multiple threads, then you will have more active threads on the node than the node can support simultaneously, and so those threads will have to share the cores on the node. As I said, this should work, but it will not get you any additional performance, because the node's hardware was already being fully occupied by the single-threaded version.
Thanks, Edric, I got your points. I tried calling maxNumCompThreads() inside a parfor loop, and it returned nothing. So this is infeasible.
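To make the oversubscription point from the comments above concrete, here is a minimal C sketch, not a definitive implementation, of how calculate() might be threaded while keeping workers times threads within the core count. It assumes OpenMP is available to the MEX compiler; threads_per_worker and per_iteration are hypothetical names, and per_iteration merely stands in for the loop body that the question elides. Without OpenMP enabled, the loop simply runs serially.

```c
#include <assert.h>

/* Hypothetical helper: threads each worker may use so that
   workers * threads does not exceed cores. Always at least 1. */
int threads_per_worker(int cores, int workers)
{
    assert(cores > 0 && workers > 0);
    int t = cores / workers;
    return t < 1 ? 1 : t;
}

/* Hypothetical stand-in for the independent per-iteration work
   (the original loop body is not shown in the question). */
static double per_iteration(int i, double x)
{
    return x * (double)i;
}

/* calculate() with its independent iterations split across nthreads.
   The pragma only takes effect when compiled with OpenMP enabled;
   otherwise the loop runs serially and gives the same sum, up to
   floating-point rounding. */
double calculate(double x, int nthreads)
{
    const int N = 50;
    double sum = 0.0;
    assert(nthreads > 0);
#ifdef _OPENMP
#pragma omp parallel for reduction(+:sum) num_threads(nthreads)
#endif
    for (int i = 0; i < N; i++) {
        sum += per_iteration(i, x);
    }
    return sum;
}
```

With a default local pool of 6 workers on a 6-core machine, threads_per_worker(6, 6) is 1, so the threaded version has no spare cores to use. With one worker on a 28-core node, threads_per_worker(28, 1) is 28, which is the cluster case where a multithreaded MEX function can pay off.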
