I am developing an application that MUST take advantage of parallelization, and ideally offer real-time updates after each iteration, which makes use of parfeval prefarable. I believe the algorithm that I have developed is highly parallelizable (see attached for performance of 'WT_Ex_2_b' as a function of number of cores used in parfeval function). From 1 to 8 cores, the speedup factor agrees with theoretical expectation (Amdahl's Law with p=0.95), however, performance of my application saturates at 8 cores. This led me to create a dummy function (see attached script) to compare the performance of using parfor and parfeval as a function of number of cores. I discovered that the parfor version behaves quite similarly to theoretical expectation (Ahmdal's Law, also with p=0.95), however the parfeval version continues to show strange saturation behavior, even for the dummy function. Notice how the Speedup factor improves with core number upto 12 cores, then suddenly no further improvement is observed. I have attached the script in case you want to reproduce this behavior on your end.
Is there a fundamental limitation to the number of cores the parfeval function can leverage? Or is there an obvious mistake I am making in the way I am using the parfeval function? Why does the performance behavior of the dummy algorithm suddenly saturate at 12 cores? Any recommendation how to use the parfeval function to perform as well as parfor?
I would like to emphasize that I have already developed my application to use parfeval, so converting to parfor would be time-consuming and prevent me from utilizing the update-after-iteration feature of parfeval.
Thank you for your help on this critical matter.