As @Jan pointed out, it is very important that you run the profiler to find out where your time is being spent and focus on those areas; and remove recursion if it is a major CPU hog. Additionally, if you are doing I/O in the middle of a loop, pay special attention to it to keep the core out of a wait state.
"Tip: Always run the outermost loop in parallel, because you reduce parallel overhead."
>> I have 180 subjects on a NAS, each PC computes 20
OK, I think I understand your 9 (= 180/20) computers better now. They are solving independent problems, where independence is defined by Subject at the top level. Your 20 subjects per PC should be highly parallelizable, since there is no interaction or memory sharing among the subjects. I assume that after the 9 computers complete their tasks, you bring the data together.
>> Separating the computations in this way makes not necessary the use of Parallel Server, does it?
These 9 computers are not talking to each other while the 180 subjects are being processed, right? If so, then you do not need Parallel Server. Parallel Server is to computer clusters (or cloud server clusters) what Parallel Computing Toolbox is to CPU cores or GPU cores. That is, both make your programming job easier if you follow some rules (albeit sometimes tricky rules whose violations are not easy to diagnose, as most of us have experienced; if you have a maintenance contract with MathWorks, they can help via screen-sharing).
If you purchase 9 more computers, you should roughly halve this part of your workflow time, since each computer would then process only 10 subjects instead of 20. Of course, this assumes little contention on the NAS.
If you purchase CPUs with more cores, that could also bring an improvement, provided that you increase your memory by a proportional amount. But, surprisingly, our 32-core server outperformed our 70+-core server; the likely reason is that the latter uses a NUMA architecture.
As an aside, you should know that parfor can produce slightly different results when using single- or double-precision floating point, because the order in which partial results are combined varies between runs. Even a single built-in MATLAB function can return different results. I demonstrated this to MathWorks with a short script; they said this is not a bug and there would be no fix. I admit that solving this problem is not easy, but it is annoying, and I consider it a bug. (It is somewhat analogous to Intel saying that their floating-point operations were slightly off, but that this should not concern most users because the errors were very small. Their stock went down 5% the next day, and then they said they would fix the problem.) (If you use only integer or fixed-point arithmetic, your parfor results should be reproducible.)
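A minimal sketch of why this happens (the data here are random, purely for illustration): floating-point addition is not associative, so summing the same numbers in a different order can give a slightly different answer.

```matlab
% Illustrative only: floating-point addition is not associative,
% so combining partial sums in a different order can change the result.
x = rand(1, 1e6, 'single');

s1 = sum(x);               % one summation order
s2 = sum(x(end:-1:1));     % same data, reversed order

fprintf('difference = %g\n', s1 - s2);   % often a small nonzero value
```

A parfor reduction reorders partial sums in the same spirit, which is why the serial and parallel sums of identical single-precision data can differ in the last bits.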
>> I use the parfor in the last one (nn = Q) otherwise, I get an error.
It is important that you fix that error rather than work around it. Please review parfor variable classification to better understand the parfor errors you get when you apply parfor at a higher level. Each core can then handle one subject nicely.
From that link comes a notable quote:
“If you run into variable classification problems, consider these approaches before you resort to the more difficult method of converting the body of a parfor-loop into a function.”
It was not difficult for me to do this conversion, and maybe it will not be difficult for you either. Give it a try:
parfor ii = 1:length(subject)
    processSubject( subject(ii) );
end
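A hypothetical skeleton of such a function (the field names and the analysis call are placeholders, not from your code) shows the point: with all temporaries local to the function, the parfor body contains nothing for the variable classifier to complain about.

```matlab
function processSubject(s)
% Hypothetical per-subject worker; 'inFile', 'outFile', and
% 'someAnalysis' are illustrative placeholders.
% All temporaries are local to this function, which sidesteps the
% parfor variable-classification rules for the loop body.
d = load(s.inFile);       % each worker loads its own subject data
r = someAnalysis(d);      % placeholder for your actual computation
save(s.outFile, 'r');     % each worker writes its own result file
end
```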
Keep in mind that since your code is already highly vectorized, parfor will not have as dramatic an effect as it would on unoptimized code; but it will still help, maybe 3x-5x rather than 20x.
In C/C++ HPC environments, cache misses are a major concern; your vectorization is very likely already doing a good job there. (In the past, the Windows task monitor could show 100% usage even though a CPU was in a wait state, waiting for a word to be retrieved after a cache or page miss.) On Linux, there are tools (e.g., pahole, cachegrind) that can identify cache problems. As you know, you pay a lot for higher-level cache.
Which brings up C/C++. As you know, MATLAB can automatically convert your MATLAB code to C/C++. (There are two optional packages for that, and the claim is that you can get a 3-5x improvement; I believe trial versions are available.) But not all functions can be converted automatically, so learning MEX may be required. And to get proper C++ vectorization, you need the libraries and APIs for MKL and IPP. (I got a 15x improvement by changing only a few lines of code in the innermost 10-line function, switching to IPP primitives that take advantage of SIMD.) Profile first to find out where your time is being spent, and then decide whether C/C++ MEX is worthwhile. (The latest versions of the C++ valarray library do use SIMD, and valarray makes it easier to convert MATLAB code into C++.)
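As a starting point, here is a hedged sketch of what generating code for one hot function might look like with MATLAB Coder; 'hotFunction' and the input size are hypothetical placeholders for whatever your profile identifies:

```matlab
% Hypothetical MATLAB Coder usage; 'hotFunction' and the example
% input size are placeholders, not from the original post.
% hotFunction.m must use only codegen-supported constructs.
codegen hotFunction -args { zeros(1, 4096) } -report

% The generated MEX can then be called in place of the original:
y = hotFunction_mex( zeros(1, 4096) );
```

The -report flag produces a code-generation report that flags any unsupported functions, which tells you early whether hand-written MEX will be needed instead.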
A word on GPUs: "Gathering back to the CPU can be costly, and is generally not necessary unless you need to use your result with functions that do not support gpuArray." Based on your notes, you will have to do a careful analysis after profiling your program to see whether the GPU proves beneficial.
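A minimal sketch of the pattern that quote recommends (the matrix size here is arbitrary): transfer the data to the GPU once, chain the operations there, and gather only the small final result.

```matlab
% Illustrative only: keep intermediate results on the GPU and
% gather a single small result at the end.
A = gpuArray(rand(4000));     % one host-to-device transfer
B = A * A.' + 2*A;            % all of this stays on the GPU
r = gather(sum(B(:)));        % one scalar comes back to the CPU
```

If instead you gathered B itself before the sum, you would pay for transferring the whole intermediate matrix, which is exactly the cost the quote warns about.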
"However, people have been reporting that current versions of MATLAB are not able to use the full power of MKL (Math Kernel Library) equivalents, possibly due to the way that Intel wrote some tests of CPU capabilities into the code."
Walter gives a charitable explanation of why MKL did not boost the AMD CPU as much as expected, but the story is darker. Intel was sued because they hid the fact that they deliberately put code into MKL to detect AMD products and ensure that it would not run as well as on an Intel CPU. I forget the outcome of the lawsuit, but I think Intel had to disclose up front that purchasing MKL would not deliver on AMD CPUs the benefits reported on Intel CPUs. There are many articles on this subject; here is one that I quickly found, and there may even be a work-around.
If you determine that GPU is not going to help you, then consider this question:
When is an i9 processor worth the money?
“They are generally not really worth the money at all.”
- AMD Threadripper 16-core 32 threads $880
- Intel i9-7960X 16-core 32 threads $1725