Remove rows/text at the bottom of a csv file

11 views (last 30 days)
Hi,
I have over 2000 csv files and I can read the csv files and store in a cell array. But, all csv files has some text written at the end of the rows (the text is same in all files). How can I delete the text from all files.
Please see the images below. My MATLAB code is shown below.
clear all
cd ('C:\Users\Desktop\')
myFolder = 'C:\Users\Desktop\Q_gte_10';
if ~isdir(myFolder)
errorMessage = sprintf('Error: The following folder does not exist:\n%s', myFolder);
uiwait(warndlg(errorMessage));
return;
end
filePattern = fullfile(myFolder, '*.csv');
csvFiles = dir(filePattern);
for k = 1:length(csvFiles)
fid(k) = fopen(fullfile(myFolder,csvFiles(k).name));
out{k} = textscan(fid(k),'%s %s %s %s %*[^\n]','delimiter',',','headerlines',1);
fclose(fid(k));
end

Accepted Answer

dpb
dpb on 13 Nov 2015
Use
find(strcmp(out{1},'DISCLAIMER'))
to find the location in the cell array where the DISCLAIMER is located and then delete all lines following. Note you'll have to address the CONTENT of the cell array to get the cell string content.
  9 Comments
dpb
dpb on 17 Nov 2015
Well, as I think I've said in the past, I'd chunk the cell arrays altogether. What do you want to do when you've gotten the data for a given file?
But to read these files as you describe above, simply write the formatting string needed something like --
>> fmt='%*s %d %4d-%*2d-%*2d %d %*[^\n]';
>> [id year flow]=textread(d(1).name,fmt,'headerlines',68);
The first in the list was
>> d(1).name
ans =
0204382800-PeakFlow-uptoWY2015.txt
>> [id year flow]
ans =
204382800 1996 302
204382800 1996 350
204382800 1998 592
204382800 1999 1330
204382800 1999 1250
204382800 2001 293
204382800 2002 230
204382800 2003 706
204382800 2004 702
204382800 2005 301
204382800 2006 628
204382800 2006 719
204382800 2008 384
204382800 2009 319
204382800 2009 736
204382800 2010 970
204382800 2012 461
204382800 2012 375
204382800 2014 559
>>
To get 'em all and output these data , simply
fmt='%*s %d %4d-%*2d-%*2d %d %*[^\n]';
for i=1:length:d
[id year flow]=textread(d(i).name,fmt,'headerlines',68);
[path,name,ext]=fileparts(d(i).name);
csvwrite([fullfile(path,name) '.csv'],[id year flow])
end
will leave you with a csv file of the same name containing just the above pieces of data for each with the same root name as the original.
You could do something very similar with the base read as well; if you were to use the above format string with textscan it would abort automagically at the first non-matching line after the data of interest which you could then clean up much more easily than your current gyrations. I added the word "DISCLAIMER" back at the bottom of the first sample file and then did the following --
>> fid=fopen(d(1).name);
>> c=cell2mat(textscan(fid,fmt,'headerlines',68,'collectoutput',1));
>> c(all(c==0,2),:)=[]
c =
204382800 1996 302
204382800 1996 350
204382800 1998 592
204382800 1999 1330
204382800 1999 1250
204382800 2001 293
204382800 2002 230
204382800 2003 706
204382800 2004 702
204382800 2005 301
204382800 2006 628
204382800 2006 719
204382800 2008 384
204382800 2009 319
204382800 2009 736
204382800 2010 970
204382800 2012 461
204382800 2012 375
204382800 2014 559
>> fid=fclose(fid);
>>
The 'collectoutput' argument returns the values in a single cell array and forces an empty cell into all columns; otherwise the first two cell arrays will end up with a zero while the last doesn't owing to the behavior on (the expected) error when hits the trailing text. cell2mat turns it into an "ordinary" double array instead of cell array so indexing is simpler and since there's no need for mixed types here it's much easier (and faster and less memory intensive to boot). Then the last fixup simply removes that line of all zeros making sure if there are zero flow data values (unlikely, yes, but...) don't remove any actual data by check the whole row is 0. You could be a little more brute-force'ish and just write
c(end,:)=[];
counting on only the one row being bogus at the end which should work just as well.
dpb
dpb on 17 Nov 2015
PS: Looks like you should retain at least the month, too, however; there are multiple data points in a given year in some cases that are thus aliased keeping only the year.

Sign in to comment.

More Answers (2)

Image Analyst
Image Analyst on 17 Nov 2015
I'd simply use fgetl(), strfind() and fprintf(), something like
fid = fopen('foo.csv');
fOutput = fopen('outFoo.csv');
tline = fgetl(fid);
while ischar(tline)
disp(tline)
tline = fgetl(fid);
if ~isempty(strfind(tline, 'DISCLAIMER'))
break;
end
fprintf(fOutput, '%s\n', tline);
end
fclose(fid);
fclose(fOutput);
% If you want back in the same file
delete('foo.csv'); % Delete old/input file
movefile('outFoo.csv', 'foo.csv'); % Rename file.

dpb
dpb on 17 Nov 2015
Edited: dpb on 18 Nov 2015
OK, with the depth of the conversation under the original and that now have access to real data file I'm moving the last comment previous and turning it into "that's my answer and I'm stickin' to it!" :)
You can apply the following to either the cleaned-up versions you attached or to the originals--
fmt='%*s %d %4d-%*2d-%*2d %d %*[^\n]';
for i=1:length(d)
fid=fopen(d(i).name);
c=cell2mat(textscan(fid,fmt,'headerlines',68,'collectoutput',1,'delimiter','\t'));
c(all(c==0,2),:)=[]
[path,name,ext]=fileparts(d(i).name);
csvwrite([fullfile(path,name) '.csv'],c)
fid=fclose(fid);
end
will leave you with a csv file of the same name containing just the above pieces of data for each with the same root name as the original.
With textscan it will abort automagically at the first non-matching line after the data of interest which will clean up the input much more easily than your current gyrations.
The 'collectoutput' argument returns the values in a single cell array and forces an empty cell into all columns; otherwise the first two cell arrays will end up with a zero while the last doesn't owing to the behavior on (the expected) error when hits the trailing text.
cell2mat turns it into an "ordinary" double array instead of cell array so indexing is simpler and since there's no need for mixed types here it's much easier (and faster and less memory intensive to boot). Then the last fixup simply removes that line of all zeros making sure if there are zero flow data values (unlikely, yes, but...) don't remove any actual data by the check the whole row is 0.
NB: The tab delimiter is mandatory to account for the missing/empty fields in some files; otherwise by default it'll fail with one of the characters being read where a numeric value is expected. If you chose, you could use the 'EmptyValue' field and return NaN instead of zero to make it obvious where this is occurring.
NB 2: Ran the above on the full directory to make sure nothing unexpected occurred. Looks ok other than the fact that there are sometimes multiple readings in a given year so that it would appear should keep the month as well to avoid aliasing.
  15 Comments
dpb
dpb on 21 Nov 2015
Edited: dpb on 23 Nov 2015
Read both files as cellstring arrays...
U=textread('US_R_05_01_Q.txt','%s','delimiter','\n','whitespace','','headerlines',1);
S=textread('stationList-CurrConditions-9351Stn-2015-06-20.txt','%s','delimiter','\n','whitespace','');
Then locate tabs around second field and parse the field ID from cell array. These locations are the first two tabs per line. Must do this as (I discovered the hard way) that the ID field varies in width significantly from 8 to 14 digits. Thus first return the 2-column array of first and second tab locations in each cellstring that delimits the ID field; the second converts to numeric value...
hdr=S(1); S(1)=[]; % save header line for 'States' file for output later
tab=cellfun(@(x) find(x==9,2),S,'uniform',0);
Ssta=cellfun(@(x,y) str2num(x(y(1)+1:y(2)-1)),S,tab);
tab=cellfun(@(x) find(x==9,2),U,'uniform',0);
Usta=cellfun(@(x,y) str2num(x(y(1)+1:y(2)-1)),U,tab);
[~,istn]=intersect(Ssta,Usta); % find the intersection positions in S
S=S(istn); % and save
fid=fopen('stationsInUS.txt','wt'); % make a new file, name as wish
fprintf(fid,'%s\n',char(hdr)); % write the header line
for i=1:length(S) % have to do each cell string as can't use cell
fprintf(fid,'%s\n',S{i});
end
fid=fclose(fid);
I got
>> length(istn)
ans =
758
>>
intersecting stations. I did it two ways; the above that shows some useful ways to read and manipulate cell arrays and interpret them in memory and by the previously demonstrated technique of parsing the field directly on a read with textread or textscan to make sure had the tab spacing logic correct.
You can use char to convert either U or S above to a 2D character array but in doing so because arrays must be regular the shorter lines are all padded with blanks which doesn't reflect the original file structure exactly. It's also possible to use fread and "suck up" the whole file as a character array, internal control characters and all, including newline but the difficulty in indexing to find the array locations to delete is, while doable, more effort for small files such as these that the time that could be saved by the more efficient read/write operations isn't enough to be worth the coding effort.
Damith
Damith on 2 Dec 2015
So sorry for being late to comment on this. Was sick really bad. I had a a look at this and this seems to be working fine. I checked the outputs and it looks identical to the original file. Thanks so much again.

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!