Does Matlab automatically vectorize for loops?

I currently have Python code that takes 1.5 hours to calculate 450+ million Levenshtein distances between 30,011 text labels of no more than 20 characters each. Currently migrating it to Matlab to see the JiT compilation allows optimizations of flow control. The Python implementation uses Anaconda's CPython, so no JiT, and there is a distinct possibility that Matlab could be much faster.
Years ago, I recall reading that Matlab automatically vectorizes for loops. I cannot find any such mentions now. Is this just assumed to be part of the JiT compilation?
There will be 2 levels of for-loops, one nested within the other, to iterate over matrix rows and columns. Each iteration will call editDistance, so I don't know if that will prevent vectorization, even if automated vectorization was available.

 Accepted Answer

Years ago, I recall reading that Matlab automatically vectorizes for loops. I cannot find any such mentions now. Is this just assumed to be part of the JiT compilation?
That is not done. (Or if it is done, it is in fairly narrow circumstances.)

More Answers (1)

In general we don't discuss what the JIT does internally as we don't want users to try to code against that always-moving target. [What if you write your code in a somewhat "unnatural" way to avoid some JIT limitations and the next release we remove those limitations and make the more "natural" implementation much faster? Would you ever notice the change?]
Searching the online documentation for the term "Levenshtein" found four functions, two each in Text Analytics Toolbox and Computer Vision Toolbox, whose reference pages include that term. One or more of those may be useful to you. There are also a few hits in MATLAB Answers and in the File Exchange, if you look at the search results without the scoping to just functions in MathWorks products.

4 Comments

FM
FM on 12 Jun 2024
Edited: FM on 13 Jun 2024
Thanks, Steven. The OCR functions don't seem to return the distances themselves.
The MEX file for the C++ implementation from your 2nd set of search results may help, but at the moment, I am exploring the potential for Matlab JiT to ovecome the control flow overhead of the nested loops. If that is not sufficient, I can look at other avenues, including whether the MEX file is vectorized (it says it compares between 2 strings). Due to our computing environment, there may be additional challenges arising from the fact that it is not part of Matlab, and I'm trying to avoid that set of unknowns if possible.
Update: Based on initial progress metrics, the Matlab implementation using editDistance will take about 69 hours. Before looking further more for an improvement over Python, I will see if I can get the insight I need from the existing distance calculations.
Note that the Matlab time is on a different machine from the Python time. Each machine has access to the latest version of a different app, so I can't compare on the same machine.
To get an idea of how much of the 69 hours is due to loop flow control versus Levenshtein distance calculation, I replaced editDistance(String1,String2) with the much faster min( strlength(String1), strlength(String2) ), which reduced the projected time to 1 hour. So most of the time is the Levensthein calculation. This is assuming that the aforementioned much simpler minimum string length expression above isn't vectorized (I can see how vectorization can be automated, at least for the inner loop).
Depending on what you actually want to do with the resulting values, you may find editDistanceSearcher of interest: In many applications, you are not interested in all 450+ million values, but only in nearest neighbors less than some (small-ish) threshold away, and always checking against some predetermined set of strings. editDistanceSearcher makes that operation considerably faster, paying for it with some memory and precomputation on the set of strings.
Side note: Like many MATLAB functions, editDistance itself is vectorized, meaning you can call editDistance(["str1" "string2"],"str3") and get both edit distances at the same time. That does not necessarily mean it is faster than your loop, but it makes your code easier to read and should not be slower. (Ignoring special cases like having a break command in the loop.)
editDistance is vectorized from a user interface perspective, but it's not really. As best I can tell, it basically uses arrayfun under the hood to call itself in a loop for all of the pairs of scalars in the input. And on each call it does argument validation and various other checks that really only need to be done once. I haven't done any testing, but I suspect editDistance could be modified to be (much?) more efficient.
FM
FM on 14 Jun 2024
Edited: FM on 14 Jun 2024
@Paul: I suspected as much. Thanks.
@Christopher: I'm actually trying to explore the clustering of the 30,011 labels to see which can be considered typos or embellishments of each other. At the outset, I don't have specific words for which I'm searching for possible matches witin a certain edit distance. After exploring the clustering, I might be able to set such thresholds.

Sign in to comment.

Categories

Products

Release

R2023b

Asked:

FM
on 12 Jun 2024

Edited:

FM
on 14 Jun 2024

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!