MATLAB Answers

How to read rows of data in .csv file and turn them into arrays?

I have some data in a csv file that consists of some "delays"; for each delay there is a row of "wavelengths" and a row of corresponding "intensities". It looks like this in Excel:
I want the code to take all the delays and make them into one vector, take the wavelengths (all the wavelength rows are identical) and make them into one vector, and take the intensities and make an array of intensities.
I tried textscan but that seems column oriented. I need MATLAB to read line by line (row by row), extract the data, put it into the relevant array, and then repeat the procedure for every delay.

  3 Comments

dpb on 8 Jan 2020
Much easier for folks if you'll attach a small section of the text file...nobody "can't do nothing" with an image.
the cyclist on 8 Jan 2020
It's better to upload the file (or a small, representative sample of the file). The image helps, but we cannot test code on it.


Accepted Answer

Stephen Cobeldick on 9 Jan 2020
Edited: Stephen Cobeldick on 9 Jan 2020
Notes to consider when importing a large file:
  • Importing as character and then doing any kind of text manipulation before converting to numeric will be very inefficient for a large file.
  • Importing the file in blocks is only efficient for a relatively small number of very large data blocks. This is not the case with your data: your file has many small blocks, forcing you to access the file repeatedly and expand the output array many times.
  • Regardless of what method you use to import it, the entire imported file data must fit into MATLAB memory. If this is not the case then you will need to use tall arrays or some other approach to processing your data.
  • Regardless of what method you use to import it, importing data will take some time.
  • If you want to import a large file efficiently then you need to minimize any handling of the imported file data; the optimal solution is to have all of the importing and conversion to numeric performed in one call to one reasonably low-level MATLAB command (i.e. avoid readtable, readmatrix, et al., which are just complex wrappers around textscan).
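For completeness, the tall-array route mentioned above would look roughly like this. This is a hypothetical sketch: it assumes the blocks had first been rewritten as a plain rectangular CSV (which is not your current file format), and the file name and variable name are invented:

```matlab
ds = tabularTextDatastore('blocks_as_rows.csv'); % assumed pre-converted file
T  = tall(ds);              % lazy: nothing is loaded into memory yet
m  = gather(mean(T.Var1));  % computation only happens at gather()
```

tall arrays process the datastore in chunks, so the whole file never has to fit in memory at once.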
Method one: textscan. With suitable options to read that file format (note how ':' is defined as the EOL character, so each data block becomes one line of the imported matrix), you only need to call textscan once and it will import the entire file:
[fid,msg] = fopen('20200107LaAlO3.csv','rt');
assert(fid>=3,msg)
%% Detect the number of columns:
fgetl(fid); % ignore first line.
nmc = 1+nnz(fgetl(fid)==','); % number of columns.
fmt = ['%f%f',repmat('%f',1,2*nmc),'%*s'];
%% Import the file data:
opt = {'HeaderLines',1, 'EndOfLine',':', 'MultipleDelimsAsOne',true,...
'Whitespace',', \n\r\t', 'CollectOutput',true}; % COLLECTOUTPUT is optional
frewind(fid)
C = textscan(fid,fmt,opt{:});
fclose(fid);
%% Optional: split into separate arrays:
D = C{1}(:,1:2); % Delays
W = C{1}(:,3:2+nmc); % Wavelengths
I = C{1}(:,3+nmc:end); % Intensities
And checking the output:
>> D % Delays
D =
-20 -19.999
-19.985 -19.982
-19.97 -19.967
-19.955 -19.952
-19.94 -19.937
-19.925 -19.922
-19.91 -19.907
-19.895 -19.892
-19.88 -19.877
-19.865 -19.862
>> W(:,[1:5,end-5:end]) % Wavelengths (first and last few columns only)
ans =
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
>> I(:,[1:5,end-5:end]) % Intensities (first and last few columns only)
ans =
3802 2566 2801 2749 2918 2938 2931 2756 2760 2758 2756
3805 2570 2806 2748 2922 2944 2917 2754 2755 2755 2754
3806 2569 2800 2751 2925 2960 2918 2757 2760 2758 2756
3805 2564 2802 2747 2924 2946 2915 2754 2757 2753 2757
3804 2568 2801 2744 2928 2937 2914 2754 2759 2758 2760
3805 2568 2807 2748 2935 2939 2928 2752 2762 2756 2756
3802 2569 2802 2752 2933 2936 2910 2755 2758 2755 2757
3803 2568 2804 2747 2927 2953 2920 2754 2761 2752 2758
3803 2571 2802 2746 2923 2936 2927 2752 2759 2757 2762
3810 2567 2807 2749 2931 2945 2918 2757 2755 2758 2760
>>
It might be more efficient to not use the 'CollectOutput' option and concatenate the vectors afterwards; I leave that up to you to experiment with (hint: use comma-separated lists or cell2mat).
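That concatenation step could look like this (a minimal sketch reusing the fid, fmt, and nmc from the code above; untested against the actual file):

```matlab
opt = {'HeaderLines',1, 'EndOfLine',':', 'MultipleDelimsAsOne',true,...
    'Whitespace',', \n\r\t'};     % same options, but without 'CollectOutput'
C = textscan(fid,fmt,opt{:});     % C is a 1-by-(2+2*nmc) cell of column vectors
M = [C{:}];                       % comma-separated list -> one numeric matrix
% or equivalently:
M = cell2mat(C);
D = M(:,1:2);                     % then split exactly as before
```

Both forms produce the same matrix; whether they beat 'CollectOutput' is what you would be timing.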
Method two: sscanf. This is by far the fastest method that I found to import that file data, helped by the fact that sscanf is as low-level as it gets. Note that the imported character data is not manipulated, split, or altered in any way before being converted by sscanf:
str = fileread('20200107LaAlO3.csv');
nmb = nnz(str==':'); % number of blocks
nmc = 1+(nnz(str==',')/nmb-1)/2; % number of columns
fmt = repmat(',%f',1,nmc);
fmt = ['%*[\n\r d]elay:%f,%f',fmt(2:end),fmt(2:end)];
mat = sscanf(str,fmt,[2+2*nmc,nmb]).';
The output matrix is identical to the output of Method one:
>> isequal(mat,C{1})
ans =
1
But it is faster:
Elapsed time is 0.599085 seconds. % Method one
Elapsed time is 0.111845 seconds. % Method two
Caveat: as per isakson noted at the end of their answer, importing this file data will lose information, because the file contains more digits of precision than can be stored in a double-precision floating-point number (but as this only affects the 18th/19th digit it is very unlikely to be a problem).

  5 Comments

per isakson on 9 Jan 2020
@Stephen
"Importing as character and then doing any kind of text manipulation before converting to numeric will be very inefficient for a large file." I don't think that's always true with MATLAB. (By the way: how large is a large file?)
I've compared the elapsed times of your method two to my method with the same 0.2 GB text file, which I used in my answer. There is hardly any difference; both take about 4 seconds. The text file is stored on a local spinning HD, but during this test I assume that much of the file is in the cache system.
A test with a 1 GB (9700 blocks) text file gave the same result; both methods take close to 20 seconds.
Pro and cons
  • Your method two uses (I assume) only a fraction of the RAM compared to my method. Thus yours can be used with much larger text files.
  • I find it easier to locate and handle "anomalies" in the text file with my method.
  • My code is more readable - YMMV.
>> tic, sas = cssm( '20200107LaAlO3.csv' ); toc
Elapsed time is 3.926282 seconds.
>> tic, sas = cssm( '20200107LaAlO3.csv' ); toc
Elapsed time is 3.978297 seconds.
>> tic, mat = SC_method2( '20200107LaAlO3.csv' ); toc
Elapsed time is 4.018032 seconds.
>> tic, mat = SC_method2( '20200107LaAlO3.csv' ); toc
Elapsed time is 4.053618 seconds.
>> tic, mat = SC_method2( 'large_file_1GB.csv' ); toc
Elapsed time is 19.364099 seconds.
>> tic, sas = cssm( 'large_file_1GB.csv' ); toc
Elapsed time is 19.049979 seconds.
>> tic, mat = SC_method2( 'large_file_1GB.csv' ); toc
Elapsed time is 19.719841 seconds.
where
function mat = SC_method2( ffs )
str = fileread( ffs );
nmb = nnz(str==':'); % number of blocks
nmc = 1+(nnz(str==',')/nmb-1)/2; % number of columns
fmt = repmat(',%f',1,nmc);
fmt = ['%*[\n\r d]elay:%f,%f',fmt(2:end),fmt(2:end)];
mat = sscanf(str,fmt,[2+2*nmc,nmb]).';
end
Stephen Cobeldick on 9 Jan 2020
"how large is a large file?"
That is the crux of it: my own personal definition would be "a file that is big enough that naive code written without consideration of its size will fail when importing or processing", or something like that. I do not consider 0.2 GB large, nor is 2 GB particularly large these days (clearly the definition is entirely system dependent). It is much the same as any other MATLAB processing: when the data gets large enough that care must be taken to avoid data duplication, large intermediate arrays, etc., in order to prevent "out of memory" woes and the like, then that is a "large" file.
I have no idea if the OP's actual file meets this criterion.
Your tests probably mostly measure disc access times, which will not change significantly.
per isakson on 9 Jan 2020
"Your tests probably mostly measure disc access times" No, not according to the profiler. fileread() takes less than 20% of the total time.


More Answers (2)

per isakson on 8 Jan 2020
Edited: per isakson on 9 Jan 2020
You write "very very long time" in a comment without mentioning the elapsed time or the size of the text file.
Assumptions
  • Speed is important -"very very long time"
  • The text file fits in memory - there must be room for two copies of the text file.
Approach
  • Read the entire file into a character array.
  • Split the text into a cell array of character vectors, with one block in each cell.
  • Pre-allocate the output variable based on the size of the cell array.
  • Loop over all blocks and parse one block at a time.
Try
>> tic, sas = cssm( 'd:\m\cssm\20200107LaAlO3.csv' ); toc
Elapsed time is 3.898923 seconds.
>> tic,delay=cat(1,sas.delays);toc
Elapsed time is 0.001138 seconds.
>> tic,wavelength=cat(1,sas.wavelengths);toc
Elapsed time is 0.027321 seconds.
>> tic,intensity=cat(1,sas.intensities);toc
Elapsed time is 0.027349 seconds.
>> whos delay wavelength intensity
Name Size Bytes Class Attributes
delay 1940x2 31040 double
intensity 1940x2068 32095360 double
wavelength 1940x2068 32095360 double
where
function sas = cssm( ffs )
    chr = fileread( ffs );
    blocks = strsplit( chr, 'delay:' );
    blocks(1) = [];
    len = length( blocks );
    sas(1,len) = struct( 'delays',[], 'wavelengths',[], 'intensities',[] );
    for jj = 1 : len % loop over all blocks
        cac = strsplit( blocks{jj}, newline );
        sas(jj).delays = sscanf( cac{1}, '%f,', [1,inf] );
        sas(jj).wavelengths = sscanf( cac{2}, '%f,', [1,inf] );
        sas(jj).intensities = sscanf( cac{3}, '%f,', [1,inf] );
    end
end
and where d:\m\cssm\20200107LaAlO3.csv contains 194 copies of your uploaded file. Its size is 0.2 GB (200649932/1e9).
>> d = dir('d:\m\cssm\20200107LaAlO3.csv')
struct with fields:
name: '20200107LaAlO3.csv'
folder: 'd:\m\cssm'
bytes: 200649932
...
Caveat: double isn't enough to exactly represent the numbers in the text file; a rounding error is introduced.
1.892451199999999858e+02,1.897140509254856511e+02, ...
>> format long
>> wavelength(1,1:2)
ans =
1.0e+02 *
1.892451200000000 1.897140509254857
/ Dell OptiPlex 7050 (a new desktop), R2018b, Win10



dpb on 8 Jan 2020
textscan is record oriented. Your file has sections; you have to read a section, rinse and repeat.
Something like
fmt1='delay: %f %f';   % delay record
fmt2=repmat('%f',1,N); % data records
fid=fopen('yourfile.csv');
delay=[];
inten=[];
rec=1;
while ~feof(fid)
  delay=[delay; cell2mat(textscan(fid,fmt1,'collectoutput',1))];
  if rec==1
    wlen=cell2mat(textscan(fid,fmt2,'collectoutput',1));
    rec=rec+1;
  else
    tmp=cell2mat(textscan(fid,fmt2,'collectoutput',1)); % just trash the duplicated records
  end
  inten=[inten; cell2mat(textscan(fid,fmt2,'collectoutput',1))];
end
fclose(fid);
delay=delay(:);
wlen=wlen(:);
inten=inten(:);
NB: Air code, untested, since there's no data file and at least temporarily I have no access, so I couldn't test it myself anyway...somebody else is bound to be along, but this should get you started.

  2 Comments

dpb on 9 Jan 2020
That's probably owing to the dynamic reallocation in the above...since it's an uncounted file, I made the assumption the file size would not be particularly large and took the easy route.
Preallocate for some large N of records and populate the array, reallocating only if that large number is exceeded.
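A minimal sketch of that preallocation pattern (Nguess and the doubling growth are my own assumptions, not part of the answer above; the wavelength/intensity reads would follow the loop body already shown):

```matlab
Nguess = 10000;              % assumed upper bound on the number of records
delay  = zeros(Nguess,2);    % preallocate once, up front
k = 0;
while ~feof(fid)
    k = k + 1;
    if k > size(delay,1)     % only reallocate if the guess is exceeded
        delay(2*size(delay,1),end) = 0; % indexing past the end grows the array
    end
    delay(k,:) = cell2mat(textscan(fid,fmt1,'collectoutput',1));
    % ...read the wavelength and intensity records as in the loop above...
end
delay = delay(1:k,:);        % trim the unused preallocated rows
```

Growing by doubling keeps the number of reallocations logarithmic in the record count instead of one per record.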

