How do I extract sections of data from a csv file?

I am having difficulty extracting the data I require from a csv file. I have been provided with a csv file which has the outputs of a number of simulations. However I have hit a dead-end on how to extract the data without doing it individually, i.e. opening in the import window and selecting the range of cells I wish to import by altering the range selected and doing this for each simulation. The format of the csv is attached, the output comes in the form of a table where one simulation is produced with the given headings along the top and the simulated results in the columns below. The next simulation is then produced below in the same form. Ideally I would want to input all the data directly into vectors, e.g. the year data into an 11x10 vector, with each column holding the relevant data for each simulation, i.e the first column of the vector holding A3:A13, the second column holding A17:A27 and so on. Any advice on how to extract the data would be greatly appreciated. Thanks in advance.

3 Comments

I cannot guess if the pile of commas have a meaning and in the number of terms in this line has any meaning: "
END,END,END,END,END,END,END,END,END,END,END,END,END,END
Perhaps the order of characters in this line can change or is not important:
Year,Q ,I ,W ,J ,Y ,DM ,K ,D ,P ,CM ,C ,B ,R
There are too many details which must be guessed such that posting a matching answer is very unlikely based on the currently provided information. What exactly does this mean: "without doing it individually"?
Please define the underlying pattern of the file accurately. Why do you assume, that the "Year" can be extracted as a 10x10 matrix - when you have 10 "measurements" with 11 lines?
Sorry, I've tried to update it too be more clear. I just attached one of the files I had and didn't realise that the format you viewed it would be different from the excel format I was viewing it with. The 10x10 was a mistake. I hope this helps and thank you for taking the time to look at it.
See Answer below--there's really nothing to worry about regarding the format/interpretation as far as I can see--looks like a machine-generated csv file with a blank line after the last set of values for each year and the explicit commas for each field irrespective of data in the field or not for the title line.
There is something a little funky as is often the case in me experience using textscan--I had to insert an extra fgetl to get the file pointer to the next line after the initial section read; trying the loop w/o got off somehow...

Sign in to comment.

 Accepted Answer

dpb
dpb on 18 Jun 2015
Edited: dpb on 18 Jun 2015
d=[]; % initialize an array for the data
fmt=repmat('%f',1,14); % format string to match file
fid=fopen('filename');
for i=1:10
d=[d;cell2mat(textscan(fid,fmt,12,'headerlines',2, ...
'collectoutput',1 ...
'delimiter',','))]; % read each section;concatenate
fgetl(fid) % had to do this to get synchronized again...find it often
end
fid=fclose(fid);
The above returns the data in one array; if you instead want each simulation separately, instead of the concatenation above use a cell array to store each read section--
ERRATUM
fmt=[repmat('%f',1,14) '\n']; % format string to match line of file data
fid=fopen('filename');
for i=1:10 % repeat for all sections in file
d(i)=textscan(fid,fmt,12,'headerlines',2, ...
'collectoutput',1 ...
'delimiter',',')]; % read each section into cell array
end
fid=fclose(fid);
NB: The \n (newline) in the format string solves the position in the file it seems. I guess it's one of those cases where it's probably technically wrong without it but the scanning routines skip over it transparently in the vectorized portion but not when the scan is started over. Again, "why" is indecipherable as far as I can tell, it "just is".

8 Comments

Thanks for the advice. I believe the second section where you are putting the data into individual arrays is more what I am after. However, I am having some issues getting it to run. When I run it j=1, fid =0 and d is a 1x1 cell which is empty. I have attempted to read and understand what you have done, and now I feel that I understand what all the sections are doing but I am no closer to extracting the data. Some things I have tried include changing the permission of the fopen since there was issues with the textscan and I tried adding a frewind since I thought that it wasn't reading. Anything that you could add to explain where I am going wrong with your code would be great.
That's not going to solve your problem, but I wouldn't use:
fid = fclose(fid);
because that's going to set fid to 0 if the fclose is successful. I was puzzled at first why your fid would be 0 as while that's a valid file identifier (standard input) it's not one you should be reading from. Use
state = fclose(fid); %or just plain: fclose(fid);
instead.
If d is just an empty cell, then textscan did not succesfully parse the file.
dpb
dpb on 18 Jun 2015
Edited: dpb on 19 Jun 2015
@G -- I use the paradigm fid=fclose(fid); so that I can perhaps later test its status and if simply use fclose(fid) then fid is still also a valid handle value (and worse, besides being not a reserved handle value but one that looks like it could still be a useful one). I think it's a weakness in Matlab design; there isn't a really a good, elegant way to handle it.
But, it has nothing to do w/ OP's problem; that's long after everything else has already happened. --dpb
@Andrew -- I don't know where you're going wrong; seems to work ok for me here...
>> fmt=repmat('%f',1,14);
>> fid=fopen('10sims10yrs.csv','rt');
>> for i=1:10
d(i)=textscan(fid,fmt,12,'collectoutput',1,'delimiter',',','headerlines',2);
fgetl(fid);
end
>> d
d =
Columns 1 through 5
[12x14 double] [12x14 double] [12x14 double] [12x14 double] [12x14 double]
Columns 6 through 10
[12x14 double] [12x14 double] [12x14 double] [12x14 double] [12x14 double]
>> d{end}(end-1,end)
ans =
0.0152
>> fid=fclose(fid);
>>
That looks to be the correct value when I look at the file in Excel or a text editor for the last non-blank line of the last data set (to default display precision, of course)
I can't answer the question of why a fgetl is needed; I guess because the last read is satisfied before the newline but you'd think if that were an issue it would be needed for the previous 11 lines as well in the repeat count of 12. I never will fully comprehend the arcane nether depths of C i/o; it is, as far as I can tell, impenetrable.
@dpb, Yes it's off topic here but my first thought when Andrew said that his fid was 0 is he somehow managed to get fopen to open the standard input. Only afterward did I realise that you overwrote the fid on close. To me that makes no sense, fclose does not return a file identifier, just success or failure.
To get back slightly on topic, a good practice when using fopen (in a function) is to also use onCleanup which does the fclose. This ensures that the file is closed even if the code errors out at some point:
fid = fopen('somefile');
if fid > -1
cleanup = OnCleanup(@() fclose(fid));
end
%no need for fclose anymore, the file will be close when the function terminates (or when you clear cleanup)
dpb
dpb on 18 Jun 2015
Edited: dpb on 21 Jun 2015
"...fclose does not return a file identifier, just success or failure." Indeed, but fid is the variable used as a file handle and there's no way in Matlab parlance to tell whether it's a valid one or not. There's ishandle for graphics objects but no similar for file handles. Hence, I use the expedient of that it's not >2 as a workaround for the lacking ability otherwise.
ADDENDUM
I'll note that when I first began using Matlab lo! those many years ago I used to write fid=fclose(fid)-1; instead. This prevented the value ever being indistinguishable from a valid handle including the predefined ones but in practice over time it was so rarely a case that I ever actually cared for anything other than simply whether the handle was still associated with an external file or not that I dropped back to the idiom of just stuffing the status value into the variable. Rarely does it come into play that logic is written that really uses it but it just became a habit not leaving a value around that looks like it could still be ok for an external file but isn't that. And that's my story and I'm stickin' to it! :)
END ADDENDUM
The cleanup route is one way and is good for production code but is a lot of extra effort for simple scripts and "throwaway" quick code for such purposes as here. And, of course, when working at the command line the cleanup routine never gets called.
Thank you very much for the help guys, I've got it working this morning. It has been a great help. Definitely learnt some new commands for MatLab.
dpb
dpb on 19 Jun 2015
Edited: dpb on 19 Jun 2015
No problem....if I had to guess formatted input for particular file structure is quite possibly the number one question...there are so many apparent possibilities and such a number of nuances to almost every one of them as to make it almost overwhelming to the initiate...

Sign in to comment.

More Answers (0)

Asked:

on 18 Jun 2015

Edited:

dpb
on 21 Jun 2015

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!