Convert files into a matrix

huda nawaf on 7 Nov 2011
Hi,
I have 177,000 files and need to build a single matrix containing all the values in them.
Each file is split with textscan into
c{1}, c{2}, ...
and the result is converted into a matrix.
These per-file matrices are then merged into one matrix.
The problem is that the files share some values, so I have to find each repeated value and append the other values from its row to the row that already holds that value.
I timed a run on 100 files and found that the running time is very long even for just 100 files.
I think a function that compares c{1} across all files, c{2} across all files, and so on would save time. I'm having trouble with this code:
targetdir = 'd:\social net\dataset\netflix\training_set';
targetfiles = '*.txt';
fileinfo = dir(fullfile(targetdir, targetfiles));
k=0;arr(:,:)=0; inc=0;k=0;y=1;
for i = 1: length(fileinfo)
    thisfilename = fullfile(targetdir, fileinfo(i).name);
    f=fopen(thisfilename,'r'); f1=fscanf(f,'%c'); f1(1:2)=[];
    f2=fopen(thisfilename,'w'); fprintf(f2,'%c',f1);
    f3=fopen(thisfilename,'r');
    c = textscan(f,'%f %f %s','Delimiter',',','headerLines',1);
    c1=c{1};c2=c{2}; c3=c{3};z=1;z1=1;z2=1;z3=0;
    for k=1+k:length(c1)+inc
        no=c1(z); arr1=arr(:,1); p=find(arr1==no);
        if isempty(p)
            j=1;
            arr(y,j)=c1(z); arr(y,j+1)=i; arr(y,j+2)=c2(z);j=j+3;y=y+1;
        else
            ind(i,z1)=p;
            L=arr(p,:);len=0;
            for h=1:length(L)
                if L(h)~=0
                    len=len+1;
                end
            end
            len;
            arr(p,len+1)=i;
            arr(p,len+2)=c2(z);
            z1=z1+1;
        end
        z=z+1;
    end
    inc=inc+length(c1);
    [u,u1] =size(arr);
end
f4=fopen('netfile.txt','w');
for i=1:u
    for j=1:u1
        fprintf(f4,'%d ',arr(i,j));
    end
    fprintf(f4,'\n');
end
fclose all;
thanks
  1 Comment
huda nawaf on 8 Nov 2011
I would appreciate advice on the code above.
Can anyone suggest improvements to make it run faster?
Why does it run so slowly,
and how can I improve it?
Thanks in advance.

Accepted Answer

Daniel Shub on 9 Nov 2011
What version of MATLAB are you using? It looks like arr is growing inside your loop. Prior to R2011a (???), preallocating a variable can speed things up. If you do not know the final size, reallocating in large chunks can also help.
Where are the files saved (locally, network drive, flash drive, external hard drive)? A fast internal hard drive will give you the fastest read times.
Have you tried using the profiler to find bottlenecks in the code?
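As a rough illustration of the "reallocate in large chunks" idea, here is a minimal sketch. The chunk size, the three-column layout, and the placeholder loop are assumptions for demonstration only, not taken from the original code.

```matlab
% Sketch: grow a result array in large chunks instead of one row at a time.
chunk = 10000;                 % hypothetical chunk size, tune to the data
arr = zeros(chunk, 3);         % preallocate the first chunk
n = 0;                         % number of rows actually filled
for k = 1:25000                % stand-in for the real loop over records
    if n == size(arr, 1)       % buffer is full: append another chunk
        arr = [arr; zeros(chunk, size(arr, 2))];
    end
    n = n + 1;
    arr(n, :) = [k, 2*k, 3*k]; % placeholder values
end
arr = arr(1:n, :);             % trim the unused tail
```

With this pattern the array is resized only a handful of times rather than once per row, which is where the saving comes from.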
  3 Comments
Daniel Shub on 9 Nov 2011
Is this a regression from R2011a to R2011b, or are the improvements in R2011a not as great as I thought? See http://blogs.mathworks.com/steve/2011/05/16/automatic-array-growth-gets-a-lot-faster-in-r2011a/
huda nawaf on 9 Nov 2011
Thanks Daniel,
"What version of MATLAB are you using?"
MATLAB 7.
"It looks like arr is growing in your loop."
Yes.
"Prior to r2011a (???) preallocating a variable can speed things up."
How do I preallocate and reallocate?
"Where are the files saved (locally, network drive, flash drive, external harddrive)? A fast internal harddrive will give you the fastest read times."
My files are stored on partition D:\ of my computer.
"Have you tried using the profiler to find bottlenecks in the code."
Please tell me how to use the profiler.
This code is very important to me.
Thanks
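For the profiler question, a minimal sketch of the usual workflow follows. The loop is only a placeholder workload; in practice it would be replaced by a call to the function that processes the files.

```matlab
profile on                      % start collecting timing data
total = 0;
for k = 1:100000                % placeholder workload; replace with a call
    total = total + sqrt(k);    % to the function that processes the files
end
profile off                     % stop collecting
stats = profile('info');        % struct holding the collected timings
```

Typing profile viewer afterwards opens a report that shows the time spent per function and per line. For a quick overall timing, wrapping the code in tic and toc is often enough.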

More Answers (1)

Jan on 9 Nov 2011
Some general advice for improving the speed:
  • One command per line only - otherwise the JIT acceleration loses its power.
  • Avoid no-op statements such as "len;" - they waste time.
  • Deleting the first two bytes from the file takes a lot of time. It is better to open the file, read two bytes, and call TEXTSCAN afterwards.
  • Close every file properly with fclose(fid) as soon as possible. Do not leave all files open until the final fclose('all'). Open files consume resources.
  • Use the vectorization of fprintf. Instead of for j=1:u1, fprintf(f4,'%d ',arr(i,j)); end prefer fprintf(f4, '%d ', arr(i, :)).
  • Counting the number of non-zero elements in L does not need a loop. Faster: len = sum(L ~= 0);.
  • arr(:, :) = 0 is not useful, because it is equivalent to arr = 0. Also, k is defined twice.
I cannot insert a pre-allocation because I do not know the maximum possible size of "arr", but this should already be faster:
function wwq
targetdir = 'd:\social net\dataset\netflix\training_set';
targetfiles = '*.txt';
fileinfo = dir(fullfile(targetdir, targetfiles));
arr = 0; % Better pre-allocate
inc = 0;
kk = 0;
y = 1;
for i = 1:length(fileinfo)
    thisfilename = fullfile(targetdir, fileinfo(i).name);
    f = fopen(thisfilename,'r');
    fread(f, 2, 'uint8'); % Skip two bytes
    c = textscan(f, '%f %f %s', 'Delimiter', ',', 'headerLines', 1);
    fclose(f);
    c1 = c{1};
    c2 = c{2};
    % c3=c{3}; % Not used
    z = 1;
    % z1 = 1; % Not used
    % z2 = 1; % Not used
    % z3 = 0; % Not used
    kknew = length(c1) + inc;
    for k = (1 + kk):kknew % Avoid k as counter *and* in loop index
        no = c1(z);
        p = find(arr(:, 1) == no);
        if isempty(p)
            arr(y, 1) = c1(z);
            arr(y, 2) = i;
            arr(y, 3) = c2(z);
            % j = j+3; % Not used
            y = y + 1;
        else
            % ind(i,z1) = p; % Not used
            L = arr(p, :);
            len = sum(L ~= 0);
            arr(p, len + 1) = i;
            arr(p, len + 2) = c2(z);
            % z1 = z1 + 1; % Not used
        end
        z = z + 1;
    end
    kk = kknew;
    inc = inc + length(c1);
    u = size(arr, 1);
end
f = fopen('netfile.txt','w');
for i = 1:u
    fprintf(f, '%d ', arr(i, :));
    fprintf(f,'\n');
end
fclose(f);
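The output loop could possibly be reduced further to a single fprintf call. This is only a sketch with placeholder data and a hypothetical output path in the temporary folder. It relies on fprintf reusing the format string until all elements are consumed, which is why the matrix is transposed first.

```matlab
arr = [1 2 3; 4 5 6];                              % placeholder data
outname = fullfile(tempdir, 'netfile_demo.txt');   % hypothetical output path
fmt = [repmat('%d ', 1, size(arr, 2)), '\n'];      % format for one full row
fid = fopen(outname, 'w');
fprintf(fid, fmt, arr.');   % transpose: fprintf walks the data column by column
fclose(fid);
```

The file then contains one matrix row per line, the same layout the loop produces.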
  2 Comments
huda nawaf on 10 Nov 2011
Thanks Jan,
I tried to run the code you wrote,
but in this part
fread(f, 2, 'uint8'); % Skip two bytes
c = textscan(f, '%f %f %s', 'Delimiter', ',', 'headerLines', 1);
c returns just the second line. I need to read from the second line to the last line.
Thanks
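One way to avoid counting bytes is to read the whole first line with fgetl and let textscan parse everything that follows, without the headerLines option. The sketch below assumes each file starts with a single header line (here "1:") followed by comma-separated rows; that layout and the demo values are assumptions, not confirmed in this thread.

```matlab
% Build a small demo file with hypothetical contents.
fname = fullfile(tempdir, 'mv_demo.txt');
fid = fopen(fname, 'w');
fprintf(fid, '1:\n101,3,2005-01-01\n102,5,2005-01-02\n103,4,2005-01-03\n');
fclose(fid);

fid = fopen(fname, 'r');
header = fgetl(fid);                               % consume the whole first line
c = textscan(fid, '%f %f %s', 'Delimiter', ',');   % parse all remaining lines
fclose(fid);

ids = c{1};       % first column of every data row
ratings = c{2};   % second column of every data row
```

Because the header is read as a complete line, this does not depend on the header being exactly two characters long.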
huda nawaf on 16 Nov 2011
Hi Jan,
I tried your code, but the problem remains.
I ran it on just 1000 files and it took a very long time, maybe 45 minutes. What will happen if I run it on all 177,000 files?
A sparse matrix might solve this problem.
Thanks
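The sparse-matrix idea could be sketched as follows: collect the (value, file index, second column) triplets from all files first, then build the matrix with a single call to sparse, so that no find is needed inside the loop. The variable names and the placeholder triplets are hypothetical, and the sketch assumes that zero never occurs as a stored value.

```matlab
% Placeholder triplets, as they might be collected from all files:
cust  = [101; 205; 101; 309; 205];   % value from column 1 of each file
movie = [1;   1;   2;   2;   3];     % index of the file the row came from
rate  = [3;   5;   4;   1;   2];     % value from column 2 of each file

% Map arbitrary ids to compact row numbers 1..numel(ids).
% (dummy is unused; older MATLAB versions do not support ~ as a placeholder.)
[ids, dummy, row] = unique(cust);

% One call builds the whole ids-by-files matrix.
R = sparse(row, movie, rate, numel(ids), max(movie));

% Values stored for id 101, one column per file:
r101 = full(R(ids == 101, :));
```

Note that sparse adds up entries that share the same row and column, so this only matches the original layout if each id appears at most once per file.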
