Remove rows/text at the bottom of a csv file

Question

Damith on 13 Nov 2015

0
Link

Direct link to this question

https://www.mathworks.com/matlabcentral/answers/254712-remove-rows-text-at-the-bottom-of-a-csv-file

Commented: Tunechi on 8 May 2021

Hi,

I have over 2000 csv files and I can read the csv files and store in a cell array. But, all csv files has some text written at the end of the rows (the text is same in all files). How can I delete the text from all files.

Please see the images below. My MATLAB code is shown below.

clear all
cd ('C:\Users\Desktop\')
myFolder = 'C:\Users\Desktop\Q_gte_10';
if ~isdir(myFolder)
  errorMessage = sprintf('Error: The following folder does not exist:\n%s', myFolder);
  uiwait(warndlg(errorMessage));
  return;
end
filePattern = fullfile(myFolder, '*.csv');
csvFiles = dir(filePattern);
for k = 1:length(csvFiles)
  fid(k) = fopen(fullfile(myFolder,csvFiles(k).name));
  out{k} = textscan(fid(k),'%s %s %s %s %*[^\n]','delimiter',',','headerlines',1);
  fclose(fid(k));
end

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Sign in to answer this question.

Answer 1

dpb on 13 Nov 2015

0
Link

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/254712-remove-rows-text-at-the-bottom-of-a-csv-file#answer_199745

Open in MATLAB Online

Use

find(strcmp(out{1},'DISCLAIMER'))

to find the location in the cell array where the DISCLAIMER is located and then delete all lines following. Note you'll have to address the CONTENT of the cell array to get the cell string content.

9 Comments
Show 7 older commentsHide 7 older comments

dpb on 17 Nov 2015

Open in MATLAB Online

As the doc linked to shows, use fprintf for the specific data as you have it and wish it output.

The data as you have it is an array of cellstr 65x1, the "columns"are internal embedded blanks/tabs it appears. You'll have to parse that data either by modifying your reading routine to actually return the values in the file as variables of the proper type instead as a string or, alternatively, pass an expression such as the above to textscan and parse it from memory instead of from disk once you do have the character cellstr array as shown. That would likely look something like

fmt='%s %s %s %d %*[^\n]';
c=textscan(char(out{1,3}{1}),fmt);

There appears to be some "issues" with that data, however as there's a missing value in col 4 for the first row it appears at a minimum and there's a bunch of irregularities going further after that. Those will undoubtedly cause problems, but that's the general step. You could alternatively it looks like have a chance to select the columns desired by using fixed-width field widths and column indexing if you convert to a character string array with char(). If there's a case where the fourth column is less than three digits and has a trailing value on that line similar to the '410 2' about halfway down, however, then that trailing '2' would show up in the last column position of the rest of the column and you'd be SOL (to use the technical term :) ).

I couldn't infer what your original data source really looked like to get a handle on why you were going thru the machinations you were/are on the input scanning, but there's undoubtedly a cleaner way. Can you attach a small sample file that illustrates the structure?

Damith on 17 Nov 2015

Edited: Damith on 17 Nov 2015

Open in MATLAB Online

Region-03.zip

Hi,

Thanks for your reply.

I have attached all the txt files (315) and the structure is the same but data lengths are different. My MATLAB code showing below reads the 315 txt files and removes the comment lines (#1 to #66) in each cell.

What I want is to separate the stationID, Year and flow into 3 columns?

output= 
02053200   1929   4050
02053200   1940   7000
.
.
.

That's why I am trying to convert the cell contents into columns.

Hope this helps.

Thanks in advance.

clear all
cd ('C:\Users\Desktop\')
myFolder = 'C:\Users\Desktop\flow_raw\Region-03';
if ~isdir(myFolder)
  errorMessage = sprintf('Error: The following folder does not exist:\n%s', myFolder);
  uiwait(warndlg(errorMessage));
  return;
end
filePattern = fullfile(myFolder, '*.txt');
csvFiles = dir(filePattern);
for k = 1:length(csvFiles)
  fid(k) = fopen(fullfile(myFolder,csvFiles(k).name));
  out{k} = textscan(fid(k),'%s %*[^\n]','delimiter',',','headerlines',2);
  fclose(fid(k));
end
% Remove rows from 1 to 66 in each cell
for jj=1:length(csvFiles)
out{jj}{1,1}(1:66)=[];
end

dpb on 17 Nov 2015

Open in MATLAB Online

Well, as I think I've said in the past, I'd chunk the cell arrays altogether. What do you want to do when you've gotten the data for a given file?

But to read these files as you describe above, simply write the formatting string needed something like --

>> fmt='%*s %d %4d-%*2d-%*2d %d %*[^\n]';
>> [id year flow]=textread(d(1).name,fmt,'headerlines',68);

The first in the list was

>> d(1).name
ans =
0204382800-PeakFlow-uptoWY2015.txt
>> [id year flow]
ans =
 204382800        1996         302
 204382800        1996         350
 204382800        1998         592
 204382800        1999        1330
 204382800        1999        1250
 204382800        2001         293
 204382800        2002         230
 204382800        2003         706
 204382800        2004         702
 204382800        2005         301
 204382800        2006         628
 204382800        2006         719
 204382800        2008         384
 204382800        2009         319
 204382800        2009         736
 204382800        2010         970
 204382800        2012         461
 204382800        2012         375
 204382800        2014         559
>>

To get 'em all and output these data , simply

fmt='%*s %d %4d-%*2d-%*2d %d %*[^\n]';
for i=1:length:d
  [id year flow]=textread(d(i).name,fmt,'headerlines',68);
  [path,name,ext]=fileparts(d(i).name);
  csvwrite([fullfile(path,name) '.csv'],[id year flow])
end

will leave you with a csv file of the same name containing just the above pieces of data for each with the same root name as the original.

You could do something very similar with the base read as well; if you were to use the above format string with textscan it would abort automagically at the first non-matching line after the data of interest which you could then clean up much more easily than your current gyrations. I added the word "DISCLAIMER" back at the bottom of the first sample file and then did the following --

>> fid=fopen(d(1).name);
>> c=cell2mat(textscan(fid,fmt,'headerlines',68,'collectoutput',1));
>> c(all(c==0,2),:)=[]
c =
 204382800        1996         302
 204382800        1996         350
 204382800        1998         592
 204382800        1999        1330
 204382800        1999        1250
 204382800        2001         293
 204382800        2002         230
 204382800        2003         706
 204382800        2004         702
 204382800        2005         301
 204382800        2006         628
 204382800        2006         719
 204382800        2008         384
 204382800        2009         319
 204382800        2009         736
 204382800        2010         970
 204382800        2012         461
 204382800        2012         375
 204382800        2014         559
>> fid=fclose(fid);
>>

The 'collectoutput' argument returns the values in a single cell array and forces an empty cell into all columns; otherwise the first two cell arrays will end up with a zero while the last doesn't owing to the behavior on (the expected) error when hits the trailing text. cell2mat turns it into an "ordinary" double array instead of cell array so indexing is simpler and since there's no need for mixed types here it's much easier (and faster and less memory intensive to boot). Then the last fixup simply removes that line of all zeros making sure if there are zero flow data values (unlikely, yes, but...) don't remove any actual data by check the whole row is 0. You could be a little more brute-force'ish and just write

c(end,:)=[];

counting on only the one row being bogus at the end which should work just as well.

dpb on 17 Nov 2015

PS: Looks like you should retain at least the month, too, however; there are multiple data points in a given year in some cases that are thus aliased keeping only the year.

Sign in to comment.

Answer 2

Image Analyst on 17 Nov 2015

1
Link

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/254712-remove-rows-text-at-the-bottom-of-a-csv-file#answer_200090

Open in MATLAB Online

I'd simply use fgetl(), strfind() and fprintf(), something like

fid = fopen('foo.csv');
fOutput = fopen('outFoo.csv');
tline = fgetl(fid);
while ischar(tline)
    disp(tline)
    tline = fgetl(fid);
    if ~isempty(strfind(tline, 'DISCLAIMER'))
        break;
    end
    fprintf(fOutput, '%s\n', tline);
end
fclose(fid);
fclose(fOutput);
% If you want back in the same file
delete('foo.csv'); % Delete old/input file
movefile('outFoo.csv', 'foo.csv'); % Rename file.

1 Comment
Show -1 older commentsHide -1 older comments

Tunechi on 8 May 2021

Thanks for this @Image Analyst

Sign in to comment.

Answer 3

dpb on 17 Nov 2015

0
Link

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/254712-remove-rows-text-at-the-bottom-of-a-csv-file#answer_200166

Edited: dpb on 18 Nov 2015

Open in MATLAB Online

OK, with the depth of the conversation under the original and that now have access to real data file I'm moving the last comment previous and turning it into "that's my answer and I'm stickin' to it!" :)

You can apply the following to either the cleaned-up versions you attached or to the originals--

fmt='%*s %d %4d-%*2d-%*2d %d %*[^\n]';
for i=1:length(d)
  fid=fopen(d(i).name);
  c=cell2mat(textscan(fid,fmt,'headerlines',68,'collectoutput',1,'delimiter','\t'));
  c(all(c==0,2),:)=[]
  [path,name,ext]=fileparts(d(i).name);
  csvwrite([fullfile(path,name) '.csv'],c)
  fid=fclose(fid);
end

will leave you with a csv file of the same name containing just the above pieces of data for each with the same root name as the original.

With textscan it will abort automagically at the first non-matching line after the data of interest which will clean up the input much more easily than your current gyrations.

The 'collectoutput' argument returns the values in a single cell array and forces an empty cell into all columns; otherwise the first two cell arrays will end up with a zero while the last doesn't owing to the behavior on (the expected) error when hits the trailing text.

cell2mat turns it into an "ordinary" double array instead of cell array so indexing is simpler and since there's no need for mixed types here it's much easier (and faster and less memory intensive to boot). Then the last fixup simply removes that line of all zeros making sure if there are zero flow data values (unlikely, yes, but...) don't remove any actual data by the check the whole row is 0.

NB: The tab delimiter is mandatory to account for the missing/empty fields in some files; otherwise by default it'll fail with one of the characters being read where a numeric value is expected. If you chose, you could use the 'EmptyValue' field and return NaN instead of zero to make it obvious where this is occurring.

NB 2: Ran the above on the full directory to make sure nothing unexpected occurred. Looks ok other than the fact that there are sometimes multiple readings in a given year so that it would appear should keep the month as well to avoid aliasing.

15 Comments
Show 13 older commentsHide 13 older comments

Damith on 18 Nov 2015

Edited: Damith on 18 Nov 2015

Open in MATLAB Online

Hi,

Thanks again for your detailed explanation and reply. I learnt few things which I did not earlier from your answer.

So, I ran your code with minor modifications. But, when I ran your code for

i=1:1

it ran perfect but when I change

i=1:2  or 
i=1:length(csvFiles)

I got an error.

Error using textscan
Invalid file identifier.  Use fopen to generate a valid file identifier.

Cannot understand why? appreciate if you can guide me to figure this out.

Thanks a lot again.

clear all
cd ('C:\Users\Desktop\flow_raw\Region-03')
myFolder = 'C:\Users\Desktop\flow_raw\Region-03';
if ~isdir(myFolder)
  errorMessage = sprintf('Error: The following folder does not exist:\n%s', myFolder);
  uiwait(warndlg(errorMessage));
  return;
end
filePattern = fullfile(myFolder, '*.txt');
csvFiles = dir(filePattern);
fmt='%*s %d %4d-%*2d-%*2d %*s %d %*[^\n]';
for i=1:length(csvFiles)
  fid(i) = fopen(fullfile(myFolder,csvFiles(i).name));
  c=cell2mat(textscan(fid,fmt,'headerlines',68,'collectoutput',1,'delimiter','\t'));
  c(all(c==0,2),:)=[];
  [path,name,ext]=fileparts(csvFiles(i).name);
  csvwrite([fullfile(path,csvFiles(i).name) '.csv'],c)
end

Damith on 20 Nov 2015

Open in MATLAB Online

Yes, tried with the link you provided earlier but not successful with 'writetable' function.

Please see my MATLAB code below.

clear all
% Read the original text file
% This reads StationList-CurrConditions-9351Stn-2015-06-20.txt file
% which has 9351 rows
myFolder = 'C:\Users\Desktop\US_PeakQ_Download';
if ~isdir(myFolder)
  errorMessage = sprintf('Error: The following folder does not exist:\n%s', myFolder);
  uiwait(warndlg(errorMessage));
  return;
end
filePattern = fullfile(myFolder, '*.txt');
csvFiles = dir(filePattern);
for k = 1:length(csvFiles)
  fid(k) = fopen(fullfile(myFolder,csvFiles(k).name));
  out{k} = textscan(fid(k),'%s %d %s %s %s %s %s %s %d %d %s %d %d %d %s %s %d %*[^\n]','delimiter','\t','headerlines',1);
  fclose(fid(k));
end
% Read GIS layer regional station IDs
% This reads US_R_05_01_Q.txt file attached
myFolder = 'C:\Users\Desktop\US_PeakQ_Download\US_R_05_01_Q';
if ~isdir(myFolder)
  errorMessage = sprintf('Error: The following folder does not exist:\n%s', myFolder);
  uiwait(warndlg(errorMessage));
  return;
end
filePattern = fullfile(myFolder, '*.txt');
csvFiles = dir(filePattern);
  fmt2='%*s %d %*[^\n]';
  fid2 = fopen(fullfile(myFolder,csvFiles.name));
  d=cell2mat(textscan(fid2,fmt2,'headerlines',1,'collectoutput',1,'delimiter','\t'));
  [path,name,ext]=fileparts(csvFiles.name);  
  fid2=fclose(fid2);
    % Identify the common station IDs
    com=intersect(d,out{1,1}{1,2},'rows');
    I = ~ismember(out{1,1}{1,2}, com);
for jj=1:1:cellfun(@(x) numel(x),out) 
    out{1,1}{1,jj}(I,:) = [];
end

dpb on 21 Nov 2015

Edited: dpb on 23 Nov 2015

Open in MATLAB Online

Read both files as cellstring arrays...

U=textread('US_R_05_01_Q.txt','%s','delimiter','\n','whitespace','','headerlines',1);
S=textread('stationList-CurrConditions-9351Stn-2015-06-20.txt','%s','delimiter','\n','whitespace','');

Then locate tabs around second field and parse the field ID from cell array. These locations are the first two tabs per line. Must do this as (I discovered the hard way) that the ID field varies in width significantly from 8 to 14 digits. Thus first return the 2-column array of first and second tab locations in each cellstring that delimits the ID field; the second converts to numeric value...

hdr=S(1); S(1)=[];  % save header line for 'States' file for output later
tab=cellfun(@(x) find(x==9,2),S,'uniform',0);
Ssta=cellfun(@(x,y) str2num(x(y(1)+1:y(2)-1)),S,tab);
tab=cellfun(@(x) find(x==9,2),U,'uniform',0);
Usta=cellfun(@(x,y) str2num(x(y(1)+1:y(2)-1)),U,tab);
[~,istn]=intersect(Ssta,Usta);  % find the intersection positions in S
S=S(istn);  % and save
fid=fopen('stationsInUS.txt','wt');  % make a new file, name as wish
fprintf(fid,'%s\n',char(hdr));       % write the header line
for i=1:length(S)  % have to do each cell string as can't use cell
  fprintf(fid,'%s\n',S{i});
end
fid=fclose(fid);

I got

>> length(istn)
ans =
 758
>>

intersecting stations. I did it two ways; the above that shows some useful ways to read and manipulate cell arrays and interpret them in memory and by the previously demonstrated technique of parsing the field directly on a read with textread or textscan to make sure had the tab spacing logic correct.

You can use char to convert either U or S above to a 2D character array but in doing so because arrays must be regular the shorter lines are all padded with blanks which doesn't reflect the original file structure exactly. It's also possible to use fread and "suck up" the whole file as a character array, internal control characters and all, including newline but the difficulty in indexing to find the array locations to delete is, while doable, more effort for small files such as these that the time that could be saved by the more efficient read/write operations isn't enough to be worth the coding effort.

Damith on 2 Dec 2015

So sorry for being late to comment on this. Was sick really bad. I had a a look at this and this seems to be working fine. I checked the outputs and it looks identical to the original file. Thanks so much again.

Sign in to comment.

Remove rows/text at the bottom of a csv file

0 Comments
Show -2 older commentsHide -2 older comments

Accepted Answer

9 Comments
Show 7 older commentsHide 7 older comments

More Answers (2)

1 Comment
Show -1 older commentsHide -1 older comments

15 Comments
Show 13 older commentsHide 13 older comments

See Also

Categories

Tags

Community Treasure Hunt

Remove rows/text at the bottom of a csv file

0 Comments Show -2 older commentsHide -2 older comments

Accepted Answer

9 Comments Show 7 older commentsHide 7 older comments

More Answers (2)

1 Comment Show -1 older commentsHide -1 older comments

15 Comments Show 13 older commentsHide 13 older comments

See Also

Categories

Tags

Community Treasure Hunt

0 Comments
Show -2 older commentsHide -2 older comments

9 Comments
Show 7 older commentsHide 7 older comments

1 Comment
Show -1 older commentsHide -1 older comments

15 Comments
Show 13 older commentsHide 13 older comments