How do I determine the number of data in a 'messy' CSV and then import it?

I have a CSV file that looks like the following:
"","JOHN DOE","p0123TP01, Approved"
"","Rectum (4)(Volume: 57.77)","CTV (21)(Volume: 32.39)"
"0.0","100.0","100.0"
"0.1","78.01999","100.0"
"0.2","76.2245","100.0"
"0.3","75.21035","100.0"
[skipping to the end of the file]
"58.1","","1.88277"
"58.2","","0.21338"
"58.3","","0.0"
but the number of data sets can vary, so I must write a flexible script to accomodate however many data rows and columns exist:
"","DOE, JOHN","revised blurring for Case 16, ANON, original; calculating correlation between blurred dose and toxicity","","","","","","","","","","","","","",""
"","patient (1)(Volume: 18578.06)","Prostate (2)(Volume: 19.80)","Proximal-SV (3)(Volume: 6.40)","Rectum (4)(Volume: 70.44)","Bladder (5)(Volume: 143.70)","Intestine (6)(Volume: 8.40)","CutLine (7)(Volume: 415.77)","Air (8)(Volume: 5.06)","InsideRectum (9)(Volume: 36.17)","PTV1 (10)(Volume: 76.77)","PTVoriginal (11)(Volume: 80.89)","PTV2 (12)(Volume: 56.52)","Guide1 (13)(Volume: 81.79)","Guide2 (14)(Volume: 68.50)","guide (15)(Volume: 66.91)","CTV-operator (16)(Volume: 26.15)"
"0.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.1","12.88921","100.0","100.0","69.70608","73.36377","2.54878","82.87909","16.40558","73.91824","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.2","12.06455","100.0","100.0","67.50839","69.03589","0.88936","79.92729","16.40558","71.92567","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.3","11.56745","100.0","100.0","65.82671","67.72518","0.32793","77.7594","16.40558","70.44129","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.4","11.09599","100.0","100.0","64.40629","66.56731","0.15624","75.78776","16.40558","69.10236","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.5","10.1524","100.0","100.0","63.25727","65.87648","0.07211","73.65644","16.40558","68.12462","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.6","9.48733","100.0","100.0","61.64318","64.92351","0.0103","67.57963","16.40558","65.96868","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.7","9.1257","100.0","100.0","60.35885","62.43443","0.0","59.25954","16.40558","65.12147","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
[... skipping to end of file ...]
"58.1","0.00498","4.57181","0.23601","","","","","","","1.20431","1.143","1.63577","1.13037","1.34968","1.38174","3.51704"
"58.2","4.9E-4","0.46111","0.0","","","","","","","0.1189","0.11285","0.1615","0.1116","0.13326","0.13642","0.34903"
"58.3","0.0","0.0","","","","","","","","0.0","0.0","0.0","0.0","0.0","0.0","0.0"
What textscan formatSpec do I need to store this data? The following code results in a 1x1 cell array with element 0x2 empty double:
function [StudyIDcol,DistributionType,OrganSpecification,Doses,Volumes,DoseUnit,VolumeUnit] = CreateTableVariables(folderpath,type,StudyID)
% First we read the file. Code courtesy of Walter Roberson of the MathWorks community
fid = fopen([folderpath,type,'/',StudyID,'.csv'],'rt'); % rt = 'read in text mode'
if fid == -1 % added for debugging at the suggestion of Jan Hansen of the MathWorks community
error('Author:Function:OpenFile', 'Cannot open file');
end
num = 0;
% headers = cell % unfinished
while true %endless loop requiring 'break' command to exit!
H1 = fgetl(fid) ;
if feof(fid); break; end
H2 = fgetl(fid) ;
if feof(fid); break; end
datacell = textscan(fid, '%f%f', 'Delimiter', ',', 'CollectOutput', true) ; % originally '%f%f'
if isempty(datacell) || isempty(datacell{1}); break; end
% if any(isnan(datacell{1}(end,:))); datacell{1}(end,:) = []; end % commented out to verify the above command quits the loop
num = num + 1;
headers(num,:) = {H1, H2} ;
data(num) = datacell; % We store it to the data cell array only after checking that it contains data.
%fgetl(fid); %the empty line between organs
end
fclose(fid);
I am reading the textscan formatSpec documentation and more trying to solve this problem. Please help me understand how to use textscan.
I am now trying to write a script to flexibly import such files; I must learn how to scan it to count the number of rows and columns for the data beginning at "0.0","100.0","100.0" (not necessarily this number each time) ...
%function data = ReadMIMDVH(filepath)
filepath = '/home/sony/Documents/research/data/MIM Cumulative DVH/planned/1.csv';
% We write a script specifically to import MIM data.
% First we scan the file to see how many rows of data it has.
% We construct the format spec based on this number of rows and columns.
formatSpec = ['%*s %*s %*s %*s %q %q',repmat('%f',[rows,columns])];
% We read the file using textscan.
fileID = fopen(filepath);
DVH = textscan(fileID,formatSpec,'Delimiter',{',' '/n'});
%end

 Accepted Answer

I'd use readtable which should be able to work out the formatspec on its own. So just:
t = readtable(filepath); %all done file is read and parsed.

5 Comments

Thank you for this suggestion. That might work after a bit more work: I must tell it variable names and deal with the header content. The following screenshots show the problems that result without doing so -- incorrect table heading and the table contents are cell arrays rather than double arrays. I am checking the documentation to fine-tune this solution now, to see if I can get it to work.
It's easy to skip the first line and use the second as header:
t = readtable(filepath, 'HeaderLines', 1);
or just skip the first two lines and do not have any header:
t = readtable(filepath, 'HeaderLines', 2, 'ReadVariableNames', false);
Either way, the numbers should then be read properly as numerical data.
edit: No they're not, because they're quoted in the file, indicating they should be read as text. This complicates things a bit:
opts = detectimportoptions(filepath, 'NumHeaderLines', 1); %detect format
opts = setvartype(opts, opts.VariableNames, 'double'); %override type of all columns to double instead of char
t = readtable(filepath, opts) %read file
Here is working code as a solution to the above problem for a messy CSV file named 'larger.csv'.
filepath = 'larger.csv';
opts = detectImportOptions(filepath, 'NumHeaderLines', 1);
opts = setvartype(opts,opts.VariableNames, 'double');
opts.VariableNames(1)={'Dose'};
DVH = readtable(filepath,opts);
My next task is to learn how to parse the Volume variable names to strip the underscore and all characters after it.
My next task is to learn how to parse the Volume variable names to strip the underscore and all characters after it.
A regexprep should do that very easily. Explain exactly what you want.
Please see the question, " How do I parse and erase from a string while importing CSV with tableread?" The goal is to most efficiently import the volume structure names, e.g.
"Rectum (4)(Volume: 70.44)","Bladder (5)(Volume: 143.70)"
keeping only the names 'Rectum', 'Bladder', etc. I was able to find one method, but it took four lines of code, and I suspect there is a way to do it with one. If you would post a better method, I would unaccept my answer and accept yours.

Sign in to comment.

More Answers (1)

opts = detectImportOptions(filepath);
nvar = length(opts.VariableNames);
opts.VariableNames = sprintfc('Var%d', 1:nvar);
test = readtable(filepath, opts);

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!