- If each fastawrite call involves computation to prepare the output, and that preparation is not well vectorized, then the preparation could potentially be overlapped across workers. If that works, it pushes the bottleneck to the second issue.
- There is I/O to write the files. Multiple simultaneous writes directly to Google Drive will not be faster: prepare the files locally and then copy them. For local hardware, the general guideline is two to three simultaneous writes per channel per controller (if you have multiple USB devices on the same controller, the math gets more difficult).
Allow fastawrite() to pass in a string array or speed it up as is for many sequences?
Emmanuel Osikpa
on 6 May 2023
Commented: Walter Roberson
on 7 May 2023
Hello, thank you for your time in reading this question.
I currently need to generate structure predictions for 1368*1368 sequences using the AlphaFold batch version Colab notebook. To do so, I need my sequences of interest to be saved as .fasta files in my Google Drive.
I initially had my data, containing the sequences and the corresponding deletions that generate them, together in a cell array. I began by using fastawrite() in a for loop to create each file, and realized very quickly that it will take a long time (~40 FASTA files generated per minute, which would still take nearly 28 days straight to generate all of the .fasta files...). I am aware that vectorizing my code would be a better way to perform such a task, but I ran into trouble attempting to do so, since I can't index into an array I've already indexed into (that is what I took away from the errors I was getting).
I thought that if I moved away from the cell array and worked with a string array I might have a better time, but I ran into the issue that fastawrite() only accepts a character vector or string scalar as input. I've tried to modify the fastawrite() script to see how I can repurpose it to take a string array, but am not having any luck with that either. How can I make this faster? Will a parfor loop instead of a for loop be any faster in this particular scenario?
% The actual sequences, flattened into a column string array
A = convertCharsToStrings(reshape(matrixofsequences, [1368*1368, 1]));
% The deletion coordinates will be the file names
A_coordinates = convertCharsToStrings(reshape(aa_deletion_num, [1368*1368, 1]));
B = [A, A_coordinates];
B(:, 2) = strcat(B(:, 2), '.fasta');
% What I want to do (not supported: fastawrite does not accept string arrays):
fastawrite(B(:, 2), B(:, 1));
% Maybe I can try this?
parfor k = 1:numel(A)
    fastawrite(B(k, 2), B(k, 1));
end
Accepted Answer
Walter Roberson
on 6 May 2023
Edited: Walter Roberson
on 6 May 2023
parfor or parfeval (maybe with a background pool) could potentially be faster, but not necessarily.
Having too many outstanding write requests slows everything down. If everything is going to the same controller channel (for example, all to the same drive), then you will probably not get an I/O performance increase past 3 or 4 processes.
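As a rough sketch of the "prepare locally, then copy" idea with a capped number of writers: the code below assumes the Bioinformatics Toolbox (for fastawrite) is available, uses placeholder sequences and file names, and writes to a scratch folder under tempdir. The limit of 3 workers is only a starting point taken from the guideline above, and the Drive path in the final comment is hypothetical. Without Parallel Computing Toolbox the parfor loop simply runs serially.

```matlab
% Sketch: write FASTA files to a local staging folder with a few workers,
% then copy the folder to the synced Drive location in one step.
% The sequences, names, and paths below are placeholders.
seqs  = ["MKTAYIAKQR"; "MKTAYIAKQ"; "MKTAYIAKR"];   % placeholder sequences
names = ["del_1_1"; "del_1_2"; "del_1_3"];          % placeholder file stems

localDir = fullfile(tempdir, 'fasta_staging');      % local scratch folder
if ~isfolder(localDir)
    mkdir(localDir);
end

% Cap the loop at 3 workers: more writers on one drive rarely helps.
parfor (k = 1:numel(seqs), 3)
    outFile = char(fullfile(localDir, names(k) + ".fasta"));
    if isfile(outFile)
        delete(outFile);    % fastawrite appends to an existing file
    end
    fastawrite(outFile, char(names(k)), char(seqs(k)));
end

% Copy everything in one go once the local writes are done.
% driveDir = 'G:\My Drive\fasta';   % hypothetical Drive folder
% copyfile(localDir, driveDir);
```

Timing this on a few hundred sequences with different worker limits should show quickly whether extra workers help at all on your hardware.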
2 Comments
Walter Roberson
on 7 May 2023
It looks like the FASTA format is quite simple: not much more involved than breaking the sequence up into lines of appropriate length. That would not take much computation at all. As the different string() entries could be different lengths, there would be little to gain by processing multiple strings at the same time (vectorizing would not help much at all, and would possibly even slow things down).
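To illustrate how little work the formatting is, here is a minimal single-record writer using only low-level file I/O. writeFastaSimple is a hypothetical helper, not a toolbox function, and the default line width of 70 is an assumption you can change; it does none of the input validation that fastawrite performs.

```matlab
% Hypothetical helper (not part of any toolbox): write one single-record
% FASTA file. Save as writeFastaSimple.m.
function writeFastaSimple(filename, header, sequence, lineWidth)
    if nargin < 4
        lineWidth = 70;                   % assumed line width
    end
    filename = char(filename);            % accept string scalar or char vector
    sequence = char(sequence);
    fid = fopen(filename, 'w');           % overwrites any existing file
    if fid == -1
        error('writeFastaSimple:openFailed', 'Could not open %s', filename);
    end
    cleanup = onCleanup(@() fclose(fid)); % close the file even on error
    fprintf(fid, '>%s\n', char(header));  % header line
    % Emit the sequence in chunks of at most lineWidth characters.
    for startIdx = 1:lineWidth:numel(sequence)
        stopIdx = min(startIdx + lineWidth - 1, numel(sequence));
        fprintf(fid, '%s\n', sequence(startIdx:stopIdx));
    end
end
```

A call would look like writeFastaSimple("del_12_40.fasta", "del_12_40", seq), where seq is a character vector or string scalar.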
So you are probably being limited by file I/O. As I indicated above, if the outputs are all going to the same drive, then it is easy to saturate the channels with only a few simultaneous writes.
On the other hand, a 1368-character sequence, once formatted, would fit easily within one file buffer (buffers are typically 4096 bytes), so the limit might turn out to be directory services.
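One way to check where the time actually goes is to time the formatting alone against formatting plus one file create/write/close per record. The following is only a diagnostic sketch: the sample size, the 1368-character length, the random placeholder sequence, and the 70-character line width are all assumptions, and it writes to a scratch folder that is deleted at the end.

```matlab
% Sketch: compare in-memory formatting cost with per-file write cost.
nFiles = 200;                                % small sample, not the full set
seqLen = 1368;                               % assumed sequence length
seq    = char(randi([65 90], 1, seqLen));    % random placeholder letters A-Z
outDir = tempname;                           % scratch folder
mkdir(outDir);

% Build one FASTA record: header line plus 70-character sequence lines.
formatRecord = @(k) [sprintf('>seq%d\n', k), ...
    regexprep(seq, '(.{1,70})', ['$1' newline])];

% 1) Formatting only.
tic
for k = 1:nFiles
    txt = formatRecord(k);
end
tFormat = toc;

% 2) Formatting plus one file create/write/close per record.
tic
for k = 1:nFiles
    txt = formatRecord(k);
    fid = fopen(fullfile(outDir, sprintf('seq%d.fasta', k)), 'w');
    fprintf(fid, '%s', txt);
    fclose(fid);
end
tWrite = toc;

fprintf('format only: %.4f s, format + write: %.4f s, bytes per record: %d\n', ...
    tFormat, tWrite, numel(txt));
rmdir(outDir, 's');                          % clean up the scratch folder
```

If the second number is far larger than the first while each record stays well under 4096 bytes, the per-file overhead rather than the formatting is the likely bottleneck.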