counting pattern frequency in a string
Show older comments
Looking for ideas on the fastest way to count the number of occurrences of letter patterns in a string. For example, for the test string 'ZXCVBNMZXCVBAS' with a pattern length of 4, it would generate the following table: ZXCV=2, XCVB=2, CVBN=1, etc... The brute force way to do it is to start with an empty list of 4-letter strings, and pull 4-letters blocks from the test string, moving over one letter at a time. For each block, check if already on the list, if so increment the count; if not, add to the end of the list.
The problem with this is that if you have very long test strings (say 10^8 and higher) and move to higher pattern 'word' lengths, say 8 or 10 letters, the number of unique patterns is huge and so the thing slows down as each 8 letter block is compared against a huge list.
Another idea would be to assign each possible pattern with an index: AAAA=1, AAAB=2, so that you can just calculate the index using base 26 conversion for any given string, and increment the value at that index of a master vector. (eg AABA=27, etc., so increment vector(27) by one). But again, with longer pattern lengths, there is not enough memory to store a vector with that many indices, covering all the words (26^8 or 26^10, etc.).
So, is there some way to do this efficiently when the strings and pattern lengths get big? Something with sparsity, or smart sorting, or indexing?
Thank you
2 Comments
Jos (10584)
on 22 Dec 2017
Basic questions first: do you really need all patterns? If so, why?
Accepted Answer
More Answers (0)
Categories
Find more on Characters and Strings in Help Center and File Exchange
Products
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!