Allow fastawrite() to accept a string array, or speed it up as-is for many sequences?

Hello, thank you for your time in reading this question.
I currently need to generate structure predictions for 1368*1368 sequences using the AlphaFold batch version colab notebook. To do so, my sequences of interest need to be saved as .fasta files in my Google Drive.
I initially had the sequences and their corresponding deletions (which generate the sequences) together in a cell array. I began by using fastawrite() in a for loop to create each file, and realized very quickly that it would take a long time (~40 fasta files generated per minute, which would still take nearly 28 days straight to generate them all...). I am aware that vectorizing my code is a more optimal way to perform such a task, but I ran into trouble attempting to do so, since I can't index into an array I've already indexed into (that is what I took away from the errors I was getting).
I thought that if I moved away from the cell array and worked with a string array, I might have a better time, but I ran into the issue that fastawrite() only accepts a character vector or string scalar as input. I've tried modifying the fastawrite() script to see how I could repurpose it to take in a string array, but I'm not having any luck with that either. How can I make this faster? Will a parfor loop instead of a for loop be any faster in this particular scenario?
% the actual sequences, reshaped into one column of strings
A = convertCharsToStrings(reshape(matrixofsequences, 1368*1368, 1));
% the deletion coordinates will be the file names
A_coordinates = convertCharsToStrings(reshape(aa_deletion_num, 1368*1368, 1));
B = [A, A_coordinates];
B(:, 2) = strcat(B(:, 2), '.fasta');
% what I want to do (errors: fastawrite only accepts a char vector or string scalar file name):
% fastawrite(B(:, 2), B(:, 1));
% Maybe I can try this?:
parfor k = 1:length(A)
    fastawrite(B(k, 2), B(k, 1));
end
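If I go the parfor route, I am guessing I should also cap the pool size myself rather than let MATLAB decide, since everything is being written to one local drive. A rough sketch of what I mean before running the loop above (the pool size of 4 is an arbitrary guess on my part):
if isempty(gcp('nocreate'))   % reuse an existing pool if one is already open
    parpool(4);               % guess only: a handful of workers writing to one drive
end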

Accepted Answer

Walter Roberson on 6 May 2023
Edited: Walter Roberson on 6 May 2023
parfor or parfeval (maybe with a background pool) could potentially be faster, but not necessarily; a rough sketch follows after these two points.
  • if each fastawrite involves computation preparing the output, and that preparation is not very vectorized, then the preparation could potentially be overlapped. If this works, it pushes the problem to the second issue.
  • there is I/O to write the files. Multiple writes directly to Google Drive will not be faster: prepare the files locally and copy them over. For local hardware, the general guideline is to have two to three simultaneous writes per channel per controller (if you have multiple USB devices on the same controller, the math gets more difficult.)
Having too many outstanding write requests slows everything down. If everything is going to the same controller channel (for example, all to the same drive), then you will probably not get an I/O performance increase past 3 or 4 processes.
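To make the parfeval idea concrete, here is a rough, untested sketch. It reuses the N-by-2 string array B from the question (column 2 is the file name, column 1 the sequence); the pool choice and the preallocation pattern are just assumptions, not the only way to do it.
% Rough sketch only: queue the writes with parfeval so that preparation and
% file I/O can overlap. Assumes B is the N-by-2 string array from the question.
pool = gcp();                          % process pool; parpool(4) would cap the worker count
N = size(B, 1);
futures(1:N) = parallel.FevalFuture;   % preallocate the array of futures
for k = 1:N
    % 0 requested outputs, since fastawrite returns nothing
    futures(k) = parfeval(pool, @fastawrite, 0, B(k, 2), B(k, 1));
end
wait(futures);                         % block until every queued write has finished
In practice you would probably submit the writes in batches of a few thousand rather than queueing all 1368*1368 futures at once, and a thread-based background pool is only an option if fastawrite happens to be supported there.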
  2 Comments
Emmanuel Osikpa on 7 May 2023
Thank you very much for your comments.
Okay, I see. To your first point: I don't think(?) there is computation involved; it is more just checking whether various conditions are met and placing the inputs into the newly created file, or giving an error. And since it only operates on one string scalar or char vector, I don't think it is necessarily vectorized either. This is just my interpretation; I could be wrong. To your second point, I should have clarified: yes, I am preparing the files locally and will copy them to Google Drive later.
I initially saved the fasta files with the wrong file type extension, so I had to start over, and in the meantime I've switched to using a parfor instead of the initial for loop I was using. It's currently running, and I will say it's slightly faster than before (based on the number of files that have been written so far in this time span; I didn't think to check actual time measurements before I restarted).
Walter Roberson on 7 May 2023
It looks like FASTA format is quite simple -- not much more involved than breaking the sequence up into lines of appropriate length. That would not take much computation at all. As the different string() entries could be different lengths, there would be little to be gained by processing multiple strings at the same time (vectorizing would not help much at all, and could possibly even slow things down.)
So you are probably being limited by file I/O. As I indicated above, if the outputs are all going to the same drive, then it is easy to saturate the channels with only a few simultaneous writes.
On the other hand, 1368 formatted characters would fit easily within one file buffer (buffers are typically 4096 bytes), so the limit might turn out to be directory services.
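To illustrate how little the format itself costs, here is a rough sketch of writing one record per file directly with fprintf (the function name is made up, and the 60-character line width is just a common FASTA convention):
function write_one_fasta(fname, header, seq)
% Rough sketch: write a single FASTA record (a header line plus the sequence
% broken into 60-character lines) to the file fname.
seq = char(seq);              % accept either a string scalar or a char vector
fid = fopen(fname, 'w');
if fid == -1
    error('Could not open %s for writing.', fname);
end
fprintf(fid, '>%s\n', header);
for s = 1:60:numel(seq)
    fprintf(fid, '%s\n', seq(s:min(s+59, numel(seq))));
end
fclose(fid);
end
Called from the same loop in place of fastawrite, this avoids fastawrite's per-call input checking, but the writes still all land on the same drive, so the I/O and directory-service limits above still apply.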
