Got Questions? Get Answers.
Discover MakerZone

MATLAB and Simulink resources for Arduino, LEGO, and Raspberry Pi

Learn more

Discover what MATLAB® can do for your career.

Opportunities for recent engineering grads.

Apply Today

Thread Subject:
Which fft package does matlab use?

Subject: Which fft package does matlab use?

From: Nick

Date: 28 Mar, 2013 23:27:06

Message: 1 of 13

I am trying to write (or find code for) a very high performance fft routine (I am specifically interested in the N=256 case).

After benchmarking MATLAB's fft routine I noticed that it runs slower than I would have expected. Using the methodology of fftw webpage (http://www.fftw.org/speed/), and counting flops as 5*N*log2(N) for a complex fft of length power of 2, I benchmarked matlab's fft vs matlab's matrix multiplication as follows:

N=64, GFlopsMM=1.21, GFlopsFFT=2.82
N=128, GFlopsMM=3.99, GFlopsFFT=2.76
N=256, GFlopsMM=7.11, GFlopsFFT=2.89
N=512, GFlopsMM=7.05, GFlopsFFT=2.04
N=1024, GFlopsMM=7.61, GFlopsFFT=2.16
N=2048, GFlopsMM=7.64, GFlopsFFT=2.25

For N>=256, the fft is about 3x slower than matrix multiplication. Does anyone know what package MATLAB uses for fft? does it use fftw, intel mkl or an internal package? Has anyone gotten noticable (factor of 2 or more) improvements using other fft packages within a mex file? - it seems like this should be possible given the above experiment (my cpu is 2.25ghz with support for sse instructions (2 flops in parallel) and 2fpu ports; I calculate the theoretical optimum performance at 2.25*2*2 = 9GFlops).

% code for generating benchmark of fft vs matrix multiplication
Ns=2.^(6:11);
GFlopsMM = nan(1,numel(Ns));
GFlopsFFT = nan(1,numel(Ns));
for s=1:numel(Ns)
    N = Ns(s);
    A=rand(N,N); B=rand(N,N);
    
    NIter = Ns(end)/N; % run more iterations for smaller N for more accurate timing
    tic; for j=1:NIter, A'*B; end; t = toc;
    
    A = A+sqrt(-1)*B; % benchmark complex fft
    tic; for j=1:NIter fft(A); end; t2=toc;
    
    GFlopsMM(s) = NIter*2*size(A,1)^3/1e9/t;
    GFlopsFFT(s) = NIter*5*N*log2(N)*N/1e9/t2; % cost of N ffts of length N
    fprintf('N=%d, GFlopsMM=%3.3g, GFlopsFFT=%3.3g\n',N,GFlopsMM(s),GFlopsFFT(s));
end

Subject: Which fft package does matlab use?

From: Yair Altman

Date: 29 Mar, 2013 08:11:08

Message: 2 of 13

"Nick" wrote in message <kj2jka$o5p$1@newscl01ah.mathworks.com>...
> I am trying to write (or find code for) a very high performance fft routine (I am specifically interested in the N=256 case).
>
> After benchmarking MATLAB's fft routine I noticed that it runs slower than I would have expected. Using the methodology of fftw webpage (http://www.fftw.org/speed/), and counting flops as 5*N*log2(N) for a complex fft of length power of 2, I benchmarked matlab's fft vs matlab's matrix multiplication as follows:
>
> N=64, GFlopsMM=1.21, GFlopsFFT=2.82
> N=128, GFlopsMM=3.99, GFlopsFFT=2.76
> N=256, GFlopsMM=7.11, GFlopsFFT=2.89
> N=512, GFlopsMM=7.05, GFlopsFFT=2.04
> N=1024, GFlopsMM=7.61, GFlopsFFT=2.16
> N=2048, GFlopsMM=7.64, GFlopsFFT=2.25
>
> For N>=256, the fft is about 3x slower than matrix multiplication. Does anyone know what package MATLAB uses for fft? does it use fftw, intel mkl or an internal package? Has anyone gotten noticable (factor of 2 or more) improvements using other fft packages within a mex file? - it seems like this should be possible given the above experiment (my cpu is 2.25ghz with support for sse instructions (2 flops in parallel) and 2fpu ports; I calculate the theoretical optimum performance at 2.25*2*2 = 9GFlops).
>
> % code for generating benchmark of fft vs matrix multiplication
> Ns=2.^(6:11);
> GFlopsMM = nan(1,numel(Ns));
> GFlopsFFT = nan(1,numel(Ns));
> for s=1:numel(Ns)
> N = Ns(s);
> A=rand(N,N); B=rand(N,N);
>
> NIter = Ns(end)/N; % run more iterations for smaller N for more accurate timing
> tic; for j=1:NIter, A'*B; end; t = toc;
>
> A = A+sqrt(-1)*B; % benchmark complex fft
> tic; for j=1:NIter fft(A); end; t2=toc;
>
> GFlopsMM(s) = NIter*2*size(A,1)^3/1e9/t;
> GFlopsFFT(s) = NIter*5*N*log2(N)*N/1e9/t2; % cost of N ffts of length N
> fprintf('N=%d, GFlopsMM=%3.3g, GFlopsFFT=%3.3g\n',N,GFlopsMM(s),GFlopsFFT(s));
> end



Matlab uses FFTW - http://www.mathworks.com/help/matlab/ref/fftw.html

The specific FFTW version changes with each release; in R2013a it is:

>> internal.matlab.language.versionPlugins.fftwVersionInfo
ans =
FFTW-3.3.1-sse2-avx

Yair Altman
http://UndocumentedMatlab.com

Subject: Which fft package does matlab use?

From: dpb

Date: 29 Mar, 2013 17:01:32

Message: 3 of 13

On 3/28/2013 6:27 PM, Nick wrote:
> I am trying to write (or find code for) a very high performance fft
> routine (I am specifically interested in the N=256 case).
>
> After benchmarking MATLAB's fft routine I noticed that it runs slower
> than I would have expected. ...
...
> % code for generating benchmark of fft vs matrix multiplication
> Ns=2.^(6:11);
...
> for s=1:numel(Ns)
> N = Ns(s);
> A=rand(N,N); B=rand(N,N);
     ...
> A = A+sqrt(-1)*B; % benchmark complex fft
> tic; for j=1:NIter fft(A); end; t2=toc;
...

You're comparing N complex FFT's of length N instead of 1...

doc fft % on what it does w/ an array.

--

Subject: Which fft package does matlab use?

From: dpb

Date: 29 Mar, 2013 17:03:28

Message: 4 of 13

On 3/29/2013 12:01 PM, dpb wrote:
> On 3/28/2013 6:27 PM, Nick wrote:
>> I am trying to write (or find code for) a very high performance fft
>> routine (I am specifically interested in the N=256 case).
>>
>> After benchmarking MATLAB's fft routine I noticed that it runs slower
>> than I would have expected. ...
> ...
>> % code for generating benchmark of fft vs matrix multiplication
>> Ns=2.^(6:11);
> ...
>> for s=1:numel(Ns)
>> N = Ns(s);
>> A=rand(N,N); B=rand(N,N);
> ...
>> A = A+sqrt(-1)*B; % benchmark complex fft
>> tic; for j=1:NIter fft(A); end; t2=toc;
> ...
>
> You're comparing N complex FFT's of length N instead of 1...
...

Dangit...type too fast.

It's N complex FFTs of length N/2 instead of one...

--

Subject: Which fft package does matlab use?

From: Nick

Date: 29 Mar, 2013 17:16:06

Message: 5 of 13

Hi dph,
I read the documentation but still dont understand what you mean.
I want to benchmark complex ffts of length N, which should require 5*N*log2(N) flops.


dpb <none@non.net> wrote in message <kj4hh1$nml$2@speranza.aioe.org>...
> On 3/29/2013 12:01 PM, dpb wrote:
> > On 3/28/2013 6:27 PM, Nick wrote:
> >> I am trying to write (or find code for) a very high performance fft
> >> routine (I am specifically interested in the N=256 case).
> >>
> >> After benchmarking MATLAB's fft routine I noticed that it runs slower
> >> than I would have expected. ...
> > ...
> >> % code for generating benchmark of fft vs matrix multiplication
> >> Ns=2.^(6:11);
> > ...
> >> for s=1:numel(Ns)
> >> N = Ns(s);
> >> A=rand(N,N); B=rand(N,N);
> > ...
> >> A = A+sqrt(-1)*B; % benchmark complex fft
> >> tic; for j=1:NIter fft(A); end; t2=toc;
> > ...
> >
> > You're comparing N complex FFT's of length N instead of 1...
> ...
>
> Dangit...type too fast.
>
> It's N complex FFTs of length N/2 instead of one...
>
> --

Subject: Which fft package does matlab use?

From: dpb

Date: 29 Mar, 2013 17:40:24

Message: 6 of 13

On 3/29/2013 12:16 PM, Nick wrote:

...[please don't toppost--hard conversation follow makes]...

... repaired ...

> dpb <none@non.net> wrote in message <kj4hh1$nml$2@speranza.aioe.org>...
>> On 3/29/2013 12:01 PM, dpb wrote:
>> > On 3/28/2013 6:27 PM, Nick wrote:
...

>> >> After benchmarking MATLAB's fft routine I noticed that it runs slower
>> >> than I would have expected. ...
>> > ...
...

>> >> A=rand(N,N); B=rand(N,N);
>> > ...
>> >> A = A+sqrt(-1)*B; % benchmark complex fft
>> >> tic; for j=1:NIter fft(A); end; t2=toc;
>> > ...
>> >
>> > You're comparing N complex FFT's of length N instead of 1...
>> ...
>>
>> Dangit...type too fast.
>>
>> It's N complex FFTs of length N/2 instead of one...

...

> I read the documentation but still dont understand what you mean.
> I want to benchmark complex ffts of length N, ...

Your above A array on which you compute the fft is a 2D array of NxN
which is an N/2-length column-wise complex by N columns.

The ML vectorized FFT routine computes a separate FFT for each column
when passed an array. So, you have computed N complex FFTs each of
length N/2 complex values.

You want a complex _vector_ of length N complex values (which is 2N in
total memory, granted) but only one column to compute the same thing as
does the benchmark.

--

Subject: Which fft package does matlab use?

From: dpb

Date: 29 Mar, 2013 18:33:15

Message: 7 of 13

On 3/29/2013 12:40 PM, dpb wrote:
...

> Your above A array on which you compute the fft is a 2D array of NxN
> which is an N/2-length column-wise complex by N columns.
>
> The ML vectorized FFT routine computes a separate FFT for each column
> when passed an array. So, you have computed N complex FFTs each of
> length N/2 complex values.
...

Take it back...wasn't thinking clearly. After the sum it _is_ N complex
in length but is still N wide. You want only

A=complex(A(:,1),B(:,1));

to be of the proper size for FFT to be consistent w/ the link timings.

Although I didn't look at the actual source code of the benchmark, their
verbiage indicates they actually compute on a zero-ed out array every
time, though.

--

Subject: Which fft package does matlab use?

From: dpb

Date: 30 Mar, 2013 14:32:57

Message: 8 of 13

On 3/29/2013 1:33 PM, dpb wrote:
...

> Although I didn't look at the actual source code of the benchmark, their
> verbiage indicates they actually compute on a zero-ed out array every
> time, though.

Not that the last matters, really, of course--the FFT(0)-->0 so there is
no need to reinitialize between trials.

So now what does your timing look like? Better, I hope? :)

--

Subject: Which fft package does matlab use?

From: Nick

Date: 1 Apr, 2013 10:03:09

Message: 9 of 13

I am taking the fft of each colomn of the NxN matrix rather than a single Nx1 vector to get more accurate timings.

Calling Nx1 fft N times would be slow in matlab (calling fft with an NxN matrix does this internally). Furthermore, if there are optimizations that can be done because we are computing the same length fft N times (pre-computing the twiddle factors for example), passing in the NxN matrix allows fftw to take advantage of that if it chooses.

I am taking this factor of N into account in my computations of the number of flops (see code).

I wrote my own C++ code to do the fft on powers of 2 (highly optimized using sse instructions and efficient memory access patterns) and noticed about a 1.5-1.7x speedup over MATLAB's fft (for N in 128 to 1024 range), achieving about 4.5GFlops on my cpu (compare to above experiment where matlabs fft achieves 3GFlops at peak). I am working on an assembly implementation for x64 windows right now. It seems that fftw isn't as optimized as it could be (unless the interfacing with MATLAB is innefficient). I also noticed that fftw was written in c++ while most good matrix multiplication routines are written in assembly and achieve much closer to the theoretical optimum flop count.

Does anyone have experience with another fft library (like intel MKL) that shows significant improvements over fftw?


dpb <none@non.net> wrote in message <kj4mpe$7me$1@speranza.aioe.org>...
> On 3/29/2013 12:40 PM, dpb wrote:
> ...
>
> > Your above A array on which you compute the fft is a 2D array of NxN
> > which is an N/2-length column-wise complex by N columns.
> >
> > The ML vectorized FFT routine computes a separate FFT for each column
> > when passed an array. So, you have computed N complex FFTs each of
> > length N/2 complex values.
> ...
>
> Take it back...wasn't thinking clearly. After the sum it _is_ N complex
> in length but is still N wide. You want only
>
> A=complex(A(:,1),B(:,1));
>
> to be of the proper size for FFT to be consistent w/ the link timings.
>
> Although I didn't look at the actual source code of the benchmark, their
> verbiage indicates they actually compute on a zero-ed out array every
> time, though.
>
> --

Subject: Which fft package does matlab use?

From: Bruno Luong

Date: 1 Apr, 2013 10:46:10

Message: 10 of 13

> GFlopsMM(s) = NIter*2*size(A,1)^3/1e9/t;
> GFlopsFFT(s) = NIter*5*N*log2(N)*N/1e9/t2;

So you assume that:

(1) the computer does not use any multi-threading during MM or FFT?
(2) MATLAB matrix multiplication is O(n^3) (which is not the optimal)?
(3) memory access data are negligible (usually not right)?

Bruno

Subject: Which fft package does matlab use?

From: Bruno Luong

Date: 1 Apr, 2013 11:03:10

Message: 11 of 13

Also if you have gpu hardware, the fft performs very well.

Bruno

Subject: Which fft package does matlab use?

From: Steven_Lord

Date: 1 Apr, 2013 15:22:01

Message: 12 of 13



"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message
news:kjbohi$n1$1@newscl01ah.mathworks.com...
>> GFlopsMM(s) = NIter*2*size(A,1)^3/1e9/t;
>> GFlopsFFT(s) = NIter*5*N*log2(N)*N/1e9/t2;
>
> So you assume that:
>
> (1) the computer does not use any multi-threading during MM or FFT?
> (2) MATLAB matrix multiplication is O(n^3) (which is not the optimal)?
> (3) memory access data are negligible (usually not right)?

(4) Performance is the only goal or the main goal behind FFT
implementations.

Getting the correct answer quickly is best. [*]
Getting the correct answer slowly is good.
Getting the wrong answer slowly is bad (because you're likely to investigate
the code for performance improvements, which may lead you to detect that the
code returns the wrong answer.)
Getting the wrong answer quickly is worst.

Taking this beyond the logical extreme: MATLAB could be very, very fast if
it simply returned [] for all operations. Of course, that wouldn't be very
useful now would it?


[*] Different applications have different definitions of "the correct
answer" -- for instance, this wouldn't satisfy the definition used by
MATLAB:

http://en.wikipedia.org/wiki/Fast_inverse_square_root

But for Quake 3's graphics calculations, it was good enough.

--
Steve Lord
slord@mathworks.com
To contact Technical Support use the Contact Us link on
http://www.mathworks.com

Subject: Which fft package does matlab use?

From: Nick

Date: 2 Apr, 2013 03:09:11

Message: 13 of 13

(1) benchmarking without multithreading - in windows task manager right click matlab and set processor affinity to only 1 cpu
(2) what does matlab use if not N^3 matrix multiplication?
(3) I am benchmarking number of useful operations (flops for the actual computation). Memory access and programming inneficiencies lead to deterioration from the optimum - I am paying close attention to memory access and L0 (register level caching) in my implementation. The large gap between the theoretical optimum and the empirical runtime suggested to me that improvements in the implementation can be made.

"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <kjbohi$n1$1@newscl01ah.mathworks.com>...
> > GFlopsMM(s) = NIter*2*size(A,1)^3/1e9/t;
> > GFlopsFFT(s) = NIter*5*N*log2(N)*N/1e9/t2;
>
> So you assume that:
>
> (1) the computer does not use any multi-threading during MM or FFT?
> (2) MATLAB matrix multiplication is O(n^3) (which is not the optimal)?
> (3) memory access data are negligible (usually not right)?
>
> Bruno

Tags for this Thread

What are tags?

A tag is like a keyword or category label associated with each thread. Tags make it easier for you to find threads of interest.

Anyone can tag a thread. Tags are public and visible to everyone.

Contact us