Thread Subject: Matlab runtime for Bioinformatics applications

Subject: Matlab runtime for Bioinformatics applications

From: Steve Hung

Date: 23 Aug, 2011 19:01:26

Message: 1 of 4

Hi everyone,

My name is Steve. I am a post-bac intern working in a lab that analyzes patterns in where a transposon (mobile DNA) integrates into the genome.

I was wondering if you can help me with some general questions I have about Matlab. My PI recently asked me to work on code that can screen a large set of DNA sequences (20 million) for ones that meet certain requirements (contain a subsequence, have a certain length, etc).

A post-doc in our lab has already written Perl code that does this, and it finishes the job in less than 5 minutes. I tried to do the same thing in Matlab, but it takes more than 12 hours! To speed up the code, I have tried everything I know and everything I could find in the Matlab Help pages.

I was wondering whether such a large difference in processing time is typical. The Matlab version I used is R2011a, which I have access to through my university's virtual computer lab (could this affect the performance?).

Any response would be much appreciated. My PI is interested in seeing if Matlab is worth the investment.

Best, Steve

Subject: Matlab runtime for Bioinformatics applications

From: Kirill

Date: 23 Aug, 2011 21:42:14

Message: 2 of 4

Steve -- you should run a profiler on your Matlab code and check where
it spends most of its time. From there you can see whether its
efficiency can be improved. Five minutes versus 12 hours is a big
difference; I would not expect Matlab to do so badly. You are not
reloading all 20M sequences in a loop, are you?
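
A minimal sketch of the profiling step, for illustration only. The workload here is a stand-in (random sequences and the motif 'TTAGG' are assumptions, not Steve's data or code); in practice the loop between "profile on" and "profile off" would be replaced by a call to the actual screening script.

```matlab
% Stand-in workload: scan a few thousand random sequences one at a time.
% In practice, replace the timed loop with a call to your own screening code.
bases = 'ACGT';
seqs  = cell(5000, 1);
for k = 1:numel(seqs)
    seqs{k} = bases(randi(4, 1, 50));    % random 50-base sequence
end

profile on                               % start collecting timing data
hits = 0;
for k = 1:numel(seqs)
    if ~isempty(strfind(seqs{k}, 'TTAGG'))   % hypothetical motif
        hits = hits + 1;
    end
end
profile off                              % stop collecting

% Summarize the most expensive functions programmatically.
p = profile('info');
[~, order] = sort([p.FunctionTable.TotalTime], 'descend');
for i = order(1:min(5, numel(order)))
    fprintf('%-40s %8.3f s\n', ...
        p.FunctionTable(i).FunctionName, p.FunctionTable(i).TotalTime);
end
% "profile viewer" opens the same data as an interactive, line-by-line report.
```

The interactive report from "profile viewer" is usually the quicker way to spot a hot line such as a file read inside a loop; the programmatic summary is handy when running without a desktop.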

Kirill

Subject: Matlab runtime for Bioinformatics applications

From: Nasser M. Abbasi

Date: 23 Aug, 2011 22:52:37

Message: 3 of 4

On 8/23/2011 12:01 PM, Steve Hung wrote:

> I was wondering if you knew if this large difference in processing time is typical.

No. It is not typical. Your code may well be inefficient Matlab.

If I want, I can write something in assembler and make it run 100 times
slower than some Perl or Matlab code which does the same thing. It does not
mean anything, other than that the way I wrote the assembler code was bad.

hth,

--Nasser

Subject: Matlab runtime for Bioinformatics applications

From: Steve Hung

Date: 30 Aug, 2011 01:35:27

Message: 4 of 4

Hi everyone,

Thanks for the replies, and a BIG thanks to a MathWorks software scientist who looked at my code, found where it was inefficient, and gave me a better solution. I am very grateful.
I didn't realize that for and while loops can be so inefficient in Matlab. One of the bottlenecks in the code was reading in the sequences, which I was doing line by line with 'fgets' in a while loop. A much better way is to use textscan, specifying in the arguments how many lines to read in at once, and then process that block. Along with some other fixes on the same theme (using repmat instead of building a character array line by line), the runtime dropped from 12 hours to less than 2 hours. I bet there is room for more improvement, and it can run even faster in the future!
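
A rough sketch of the block-reading idea, not the actual code from this thread. It assumes one sequence per line; the file contents, the motif 'TTAGG', the minimum length, and the block size are all made up for the demo. Each block is filtered with whole-block calls (cellfun, strfind on a cell array) rather than one line at a time.

```matlab
% Write a tiny stand-in input file, one sequence per line.
% Real data would already exist on disk.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', 'ACGTTTAGGC', 'TTAGG', 'CCCCCCCCCCCC', 'GATTAGGATCCA');
fclose(fid);

motif     = 'TTAGG';   % hypothetical required subsequence
minLen    = 10;        % hypothetical minimum length
blockSize = 2;         % tiny for the demo; far larger for real data

kept = {};
fid = fopen(fname, 'r');
while ~feof(fid)
    % Read up to blockSize lines in a single call.
    C = textscan(fid, '%s', blockSize, 'Delimiter', '\n');
    seqs = C{1};
    if isempty(seqs), break; end

    % Screen the whole block at once instead of looping over lines.
    longEnough = cellfun('length', seqs) >= minLen;
    hasMotif   = ~cellfun('isempty', strfind(seqs, motif));

    % Growing "kept" is acceptable here: it happens once per block, not per line.
    kept = [kept; seqs(longEnough & hasMotif)]; %#ok<AGROW>
end
fclose(fid);
delete(fname);
disp(kept)
```

The block size trades memory for fewer calls into textscan; a suitable value depends on sequence length and available RAM.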
Sincerely,
Steve
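
For the character-array fix mentioned above, a small illustrative comparison under assumed sizes (5000 rows of 12 characters, filled with placeholder bases). The first loop grows the array on every iteration, which forces repeated copying; the second creates the full array once with repmat and then fills rows in place.

```matlab
n     = 5000;    % number of rows (assumed for the demo)
width = 12;      % characters per row (assumed for the demo)
bases = 'ACGT';

% Slow pattern: the array grows by one row per iteration.
tic
slow = '';
for k = 1:n
    row  = repmat(bases(mod(k, 4) + 1), 1, width);
    slow = [slow; row]; %#ok<AGROW>
end
tSlow = toc;

% Faster pattern: allocate the whole array once, then assign rows.
tic
fast = repmat(' ', n, width);
for k = 1:n
    fast(k, :) = repmat(bases(mod(k, 4) + 1), 1, width);
end
tFast = toc;

fprintf('growing: %.3f s, preallocated: %.3f s\n', tSlow, tFast);
```

Both loops produce the same array; only the allocation strategy differs. The gap between the two timings generally widens as the number of rows increases.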
