
Reading very large, sparse, mixed format CSV

Asked by Iain on 16 May 2013

I need to be able to read csv files which:

  • contain rectangular arrays of values;
  • contain more than 65536 rows;
  • vary in format from file to file (number and content of columns);
  • contain mixed-format values (strings AND numbers), and;
  • contain many empty cells.

I then need to be able to distinguish between valid 0's and blanks.

xlsread does the job for fewer than 65536 rows. csvread does the job for fully populated files. textscan cannot be given a suitable format argument.
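(For reference, one way to hand textscan a format when the columns are mixed is an all-string spec with conversion afterwards; a sketch, not tested against the actual files, where nCol and the file name are assumptions:)

 nCol  = 6 ;                               % hypothetical column count
 fmt   = repmat('%s', 1, nCol) ;           % read every column as a string
 fid   = fopen('data.csv', 'r') ;
 C     = textscan(fid, fmt, 'Delimiter', ',', 'CollectOutput', true) ;
 fclose(fid) ;
 raw   = C{1} ;                            % nRows-by-nCol cell array of strings
 num   = str2double(raw) ;                 % numbers where possible, NaN elsewhere
 blank = cellfun(@isempty, raw) ;          % distinguishes valid 0's from blanks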

I have written some code to do the job; however, it takes excessively long to read the files. Any ideas?

3 Comments

Cedric Wannaz on 16 May 2013

Do you have any control on the file format? If you want to use CSVs for storing very large sparse matrices and you can define the structure of the files, the best way to go is to store triplets rowID,colID,value. This is efficient (at least for a CSV-based solution), easy to read, and perfectly well adapted to building the sparse matrix afterwards.
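(To illustrate the triplet idea; a sketch, with the file name assumed:)

 T = csvread('triplets.csv') ;          % file of rowID,colID,value lines
 S = sparse(T(:,1), T(:,2), T(:,3)) ;   % build the sparse matrix directly
 % spconvert(T) does the same job in one call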

How do you define "very large" in your context, and how many nonzero elements do you have, typically?

Finally, if you have no control on how the files are created initially, could you provide a small example of what you get in these files?

Iain on 17 May 2013

The only control I have is whether or not it has 0, 1 or 2 header rows.

The array is sparsely populated, not stored in the sparse format.

Very large in my context means CSV files of approximately 60 MB, with >>65536 rows.

I have some examples of these files with >60 columns, >65536 rows, and a group of 40 columns which are populated fewer than a dozen times. Many have 5 columns, and on any given row, 2 columns are unpopulated.

Cedric Wannaz on 22 May 2013

Have you tried a solution based on REGEXP/split (mentioned by Per) and a reshape? You could even read the first line with FGETL, count the number of columns, append the rest of the file content that you get with FREAD, REGEXP/split, and reshape the outcome with e.g. reshape(content, nCols, []).'

Let me know if you want to discuss this more in depth. To illustrate (not tested, it's almost pseudo-code):

 fid  = fopen('myFile.csv', 'r') ;
 line = fgetl(fid) ;
 nCol = sum(line == ',') + 1 ;

or

 col  = regexp(line, ',', 'split') ;
 nCol = length(col) ;
 buffer = str2double(col) ;
 isnum  = ~isnan(buffer) ;                 % If relevant for later.

then

 rest    = fread(fid, Inf, '*char').' ;             % read remainder as a char row vector
 fclose(fid) ;
 content = [line, sprintf('\n'), rest] ;            % fgetl stripped the first newline
 content = strrep(content, sprintf('\r'), '') ;     % eliminate \r
 content = strrep(content, sprintf('\n'), ',') ;    % turn line breaks into separators
 split   = regexp(content, ',', 'split') ;
 split   = split(1:floor(numel(split)/nCol)*nCol) ; % drop a trailing partial row
 data    = reshape(split, nCol, []).' ;


1 Answer

Answer by per isakson on 16 May 2013
Edited by per isakson on 16 May 2013

Excel help says

    In Excel 2010, the maximum worksheet size is 1,048,576 rows by 16,384 columns. 

I believe that in 64-bit MATLAB xlsread is not limited to 65536 rows. However, I fail to find this in the MATLAB documentation.

  • If the files have some structure, which follow some rules, I'm convinced that textscan can read them piecewise.
  • Does the file fit in the system cache?
  • Use the code you have and transfer the data to readable files in an overnight job.
  • There is no generic solution to your problem. If you want help you need to provide some detail.

5 Comments

per isakson on 16 May 2013

The first problem is that CSV is not well-defined (which is one of the reasons why there is no implementation of a CSV standard in MATLAB). The second is that "excessively long" might mean different things to you and me.

A tentative list of requirements to read the file efficiently:

  • space must not be used as list separator; the columns must be separated by comma, tab, ...
  • the files consist of blocks of "column-data" ("rectangular arrays of values"), separated by "header-data".
  • the header-data includes a few words, which mark beginning of "header-data"
  • there is a limited number of "types" of blocks of columns data

I assume the files are written by a computer program.

To help you we need one or more of

  • a detailed description of the files
  • some examples together with some description
  • your code, which someone might be able to improve
Iain on 17 May 2013

The files contain a single CSV table of fixed width (different widths in different files). Each row in the table contains at least two populated columns, and each column can be populated with text or numbers. Each row is populated in response to an event; the first column in the table is a timestamp (directly related to when the event occurred).

The files often look something like:

 154654,,,true,,
 164821,,false,,,
 178195,45,,,,
 179195,47,,,,
 180195,32,,,,
 194821,,maybe,,,
 614821,,en?0,,,
 981516,,,,t,t

My code goes through this process:

I read in the entire file in a single step (much better than repeated disc accesses over the fairly slow network), then use strfind to get the indices of each end (and start) of line.

I then loop through the lines, one at a time. In each line I find the commas using strfind, then loop through the commas, getting each value, determining whether it is a string or a number, and recording it in a cell of a cell array and as a number in a numeric array (NaN when it is empty or a string).
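(The per-line loop above can be collapsed into a single split and reshape; a sketch, not the asker's actual code, with nCol assumed:)

 nCol    = 6 ;                                       % assumed fixed column count
 raw     = fileread('myFile.csv') ;                  % one read, as above
 raw     = strrep(raw, sprintf('\r'), '') ;          % normalize line endings
 vals    = regexp(raw, '[,\n]', 'split') ;           % one split instead of a loop
 vals    = vals(1:floor(numel(vals)/nCol)*nCol) ;    % drop a trailing partial row
 cellArr = reshape(vals, nCol, []).' ;               % nRows-by-nCol cell array
 numArr  = str2double(cellArr) ;                     % NaN for strings and blanks
 blank   = cellfun(@isempty, cellArr) ;              % valid 0's vs. empty cells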

per isakson on 17 May 2013
  • "Each column can be populated with text or numbers." makes it hard to use textscan effectively. Are there a limited number of "event-types" and a cheap way to identify the different types? Your example indicates that the answer is no.
  • "Read in the entire file in a single step". A pattern, which often results in fast reading of "mixed" text files, contains the following steps: a) read the entire file with fread, b) find the position of the column-data blocks, c) find the format to use with textscan, d) loop over all blocks and do formatted reading of the file with textscan. In my experience this ( textscan( fid, format, N, ... ) ) is faster than reading the string variable.
  • "find the commas using strfind" . regexp( str, ',', 'split' ) is fast - might be worth a test.
  • I assume you profiled your code - what does it tell?
  • what remains is to write a MEX-file
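(The a)-d) pattern can be sketched as follows; the format string and block size are assumptions, not taken from the actual files:)

 fid    = fopen('myFile.csv', 'r') ;
 fmt    = ['%f', repmat('%s', 1, 5)] ;  % hypothetical: timestamp + 5 string columns
 N      = 10000 ;                       % rows per block
 blocks = {} ;
 while ~feof(fid)
     C = textscan(fid, fmt, N, 'Delimiter', ',') ;
     blocks{end+1} = C ;                % collect each block for later assembly
 end
 fclose(fid) ;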
