"Chaos" <rothko.fan@gmail.com> wrote in message <grspb1$ds8$1@fred.mathworks.com>...
> "Steve Amphlett" <Firstname.Lastname@Where-I-Work.com> wrote in message <grsn3p$n9l$1@fred.mathworks.com>...
> > "Steve Amphlett" <Firstname.Lastname@Where-I-Work.com> wrote in message <grsmqd$6e3$1@fred.mathworks.com>...
> > > "Chaos" <rothko.fan@gmail.com> wrote in message <grpt1t$i1e$1@fred.mathworks.com>...
> > > > "Sky " <theskyishigh@yahoo.com> wrote in message <grokia$rcq$1@fred.mathworks.com>...
> > > > > I have ported a code from FORTRAN95 to MATLAB. Replicated it almost exactly. About 500 lines in length, it contains a lot of double precision arithmetic and nested iterations. Very little linear algebra.
> > > > >
> > > > > Under MATLAB R2008b it executes in 7.3 seconds. Compiled under Compaq Visual Fortran 6 it takes 375 milliseconds. Under the Intel Fortran 11 compiler, it takes 473 MICROseconds. These times are for a WinXP system, Core 2 Duo 2 GHz (4MB L2 Cache), 2GB RAM...
> > > > >
> > > > > Is this possible? Is MATLAB this slow or am I doing something wrong?
> > > >
> > > > CVF -> no OpenMP, no SSE2, no SSE3, no parallelization, no Real(16), old version of IMSL
> > >
> > > True, but it still doesn't add up to a 1000x speed multiplier. OpenMP for a 2 CPU, maybe 1.7x, SSE2 and/or SSE3, maybe 2x more.
> > >
> > > If I saw these ratios, I'd be wondering if we were comparing debug with mega-compiled code.
> > >
> > > The outputs of some profiling tests would be interesting.
> >
> > ... My work PC might still have DVF/CVF on it (I still have the media) as well as ifort (v10 I think, possibly v11). I may run some speed tests using some trivial FORTRAN code with FP loops. If there really is a 1000x speedup I'd like to know how to get it.
>
> If his routine is using a quad CPU, he'll get almost 3.8x speed; if he's linked to the MKL, FFTW or ACML libs, at least another 2 to 4x per CPU.
With all respect, do the multiplications. OP has 2 cores, 4x speedup per core is still only 8x (assuming OMP). Add in another (unlikely) 2x for compiler tech and (optimistically) 2x for chip instruction set optimization and you're still only at 32x. Need to find a reason for the additional 30x.
A bigger problem, one that takes many tens of seconds, needs to be benchmarked.
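
For anyone who wants to check the multiplications, here is the same arithmetic as a rough Python sketch. The timings are the ones the OP quoted; the individual speedup factors are only the generous guesses made in this thread, not measurements, and the variable names are made up for illustration.

```python
# Timings quoted by the OP, converted to seconds.
matlab_s = 7.3       # MATLAB R2008b
cvf_s = 375e-3       # Compaq Visual Fortran 6
ifort_s = 473e-6     # Intel Fortran 11

# Observed ratios between the three runs.
cvf_to_ifort = cvf_s / ifort_s
matlab_to_cvf = matlab_s / cvf_s

# Generous speedup guesses from this thread (assumptions, not measurements).
cores = 2              # OP has a Core 2 Duo, assuming perfect OpenMP scaling
per_core_libs = 4      # upper end of the "2 to 4x per CPU" library claim
compiler_tech = 2      # newer compiler (unlikely to be this much)
instruction_set = 2    # SSE2/SSE3 (optimistic)

explained = cores * per_core_libs * compiler_tech * instruction_set
unexplained = cvf_to_ifort / explained

print(f"MATLAB vs CVF: {matlab_to_cvf:.1f}x")
print(f"CVF vs ifort:  {cvf_to_ifort:.0f}x")
print(f"Explained by the guesses above: {explained}x")
print(f"Still unexplained: {unexplained:.0f}x")
```

With the OP's exact numbers the CVF-to-ifort ratio is a bit under 1000x, so the unexplained factor comes out somewhat below the round 30x, but the conclusion is the same: the guesses account for only a small part of the gap.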