How do I determine the number of data in a 'messy' CSV and then import it?

Question

0 votes

I have a CSV file that looks like the following:

"","JOHN DOE","p0123TP01, Approved"
"","Rectum (4)(Volume: 57.77)","CTV (21)(Volume: 32.39)"
"0.0","100.0","100.0"
"0.1","78.01999","100.0"
"0.2","76.2245","100.0"
"0.3","75.21035","100.0"
[skipping to the end of the file]
"58.1","","1.88277"
"58.2","","0.21338"
"58.3","","0.0"

but the number of data sets can vary, so I must write a flexible script to accomodate however many data rows and columns exist:

"","DOE, JOHN","revised blurring for Case 16, ANON, original; calculating correlation between blurred dose and toxicity","","","","","","","","","","","","","",""
"","patient (1)(Volume: 18578.06)","Prostate (2)(Volume: 19.80)","Proximal-SV (3)(Volume: 6.40)","Rectum (4)(Volume: 70.44)","Bladder (5)(Volume: 143.70)","Intestine (6)(Volume: 8.40)","CutLine (7)(Volume: 415.77)","Air (8)(Volume: 5.06)","InsideRectum (9)(Volume: 36.17)","PTV1 (10)(Volume: 76.77)","PTVoriginal (11)(Volume: 80.89)","PTV2 (12)(Volume: 56.52)","Guide1 (13)(Volume: 81.79)","Guide2 (14)(Volume: 68.50)","guide (15)(Volume: 66.91)","CTV-operator (16)(Volume: 26.15)"
"0.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.1","12.88921","100.0","100.0","69.70608","73.36377","2.54878","82.87909","16.40558","73.91824","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.2","12.06455","100.0","100.0","67.50839","69.03589","0.88936","79.92729","16.40558","71.92567","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.3","11.56745","100.0","100.0","65.82671","67.72518","0.32793","77.7594","16.40558","70.44129","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.4","11.09599","100.0","100.0","64.40629","66.56731","0.15624","75.78776","16.40558","69.10236","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.5","10.1524","100.0","100.0","63.25727","65.87648","0.07211","73.65644","16.40558","68.12462","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.6","9.48733","100.0","100.0","61.64318","64.92351","0.0103","67.57963","16.40558","65.96868","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.7","9.1257","100.0","100.0","60.35885","62.43443","0.0","59.25954","16.40558","65.12147","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
[... skipping to end of file ...]
"58.1","0.00498","4.57181","0.23601","","","","","","","1.20431","1.143","1.63577","1.13037","1.34968","1.38174","3.51704"
"58.2","4.9E-4","0.46111","0.0","","","","","","","0.1189","0.11285","0.1615","0.1116","0.13326","0.13642","0.34903"
"58.3","0.0","0.0","","","","","","","","0.0","0.0","0.0","0.0","0.0","0.0","0.0"

What textscan formatSpec do I need to store this data? The following code results in a 1x1 cell array with element 0x2 empty double:

function [StudyIDcol,DistributionType,OrganSpecification,Doses,Volumes,DoseUnit,VolumeUnit] = CreateTableVariables(folderpath,type,StudyID)
% First we read the file. Code courtesy of Walter Roberson of the MathWorks community
   fid = fopen([folderpath,type,'/',StudyID,'.csv'],'rt'); % rt = 'read in text mode'
   if fid == -1 % added for debugging at the suggestion of Jan Hansen of the MathWorks community
      error('Author:Function:OpenFile', 'Cannot open file');
   end
   num = 0;
 %  headers = cell % unfinished
   while true %endless loop requiring 'break' command to exit!
     H1 = fgetl(fid) ;
     if feof(fid); break; end 
     H2 = fgetl(fid) ;
     if feof(fid); break; end 
     datacell = textscan(fid, '%f%f', 'Delimiter', ',', 'CollectOutput', true) ; % originally '%f%f'
     if isempty(datacell) || isempty(datacell{1}); break; end 
   %  if any(isnan(datacell{1}(end,:))); datacell{1}(end,:) = []; end % commented out to verify the above command quits the loop
     num = num + 1;
     headers(num,:) = {H1, H2} ;
     data(num) = datacell; % We store it to the data cell array only after checking that it contains data.
     %fgetl(fid);  %the empty line between organs
   end
   fclose(fid);

I am reading the textscan formatSpec documentation and more trying to solve this problem. Please help me understand how to use textscan.

I am now trying to write a script to flexibly import such files; I must learn how to scan it to count the number of rows and columns for the data beginning at "0.0","100.0","100.0" (not necessarily this number each time) ...

%function data = ReadMIMDVH(filepath)
filepath = '/home/sony/Documents/research/data/MIM Cumulative DVH/planned/1.csv';
% We write a script specifically to import MIM data.
% First we scan the file to see how many rows of data it has.
% We construct the format spec based on this number of rows and columns.
formatSpec = ['%*s %*s %*s %*s %q %q',repmat('%f',[rows,columns])];
% We read the file using textscan.
fileID = fopen(filepath);
DVH = textscan(fileID,formatSpec,'Delimiter',{',' '/n'});
%end

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

Guillaume on 22 Jan 2018

Open in MATLAB Online

0 votes

I'd use readtable which should be able to work out the formatspec on its own. So just:

t = readtable(filepath); %all done file is read and parsed.

5 Comments
Show 3 older comments Hide 3 older comments

Guillaume on 23 Jan 2018

My next task is to learn how to parse the Volume variable names to strip the underscore and all characters after it.

A regexprep should do that very easily. Explain exactly what you want.

Daniel Bridges on 23 Jan 2018

Edited: Daniel Bridges on 23 Jan 2018

Open in MATLAB Online

Please see the question, " How do I parse and erase from a string while importing CSV with tableread?" The goal is to most efficiently import the volume structure names, e.g.

"Rectum (4)(Volume: 70.44)","Bladder (5)(Volume: 143.70)"

keeping only the names 'Rectum', 'Bladder', etc. I was able to find one method, but it took four lines of code, and I suspect there is a way to do it with one. If you would post a better method, I would unaccept my answer and accept yours.

Sign in to comment.

Answer 2

Walter Roberson on 22 Jan 2018

Open in MATLAB Online

1 vote

opts = detectImportOptions(filepath);
nvar = length(opts.VariableNames);
opts.VariableNames = sprintfc('Var%d', 1:nvar);
test = readtable(filepath, opts);

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

How do I determine the number of data in a 'messy' CSV and then import it?

0 Comments
Show -2 older comments Hide -2 older comments

Accepted Answer

5 Comments
Show 3 older comments Hide 3 older comments

More Answers (1)

0 Comments
Show -2 older comments Hide -2 older comments

Categories

Products

Tags

Community Treasure Hunt

How do I determine the number of data in a 'messy' CSV and then import it?

0 Comments Show -2 older comments Hide -2 older comments

Accepted Answer

5 Comments Show 3 older comments Hide 3 older comments

More Answers (1)

0 Comments Show -2 older comments Hide -2 older comments

Categories

Products

Tags

See Also

Community Treasure Hunt

0 Comments
Show -2 older comments Hide -2 older comments

5 Comments
Show 3 older comments Hide 3 older comments

0 Comments
Show -2 older comments Hide -2 older comments