Reading very large, sparse, mixed-format CSV files

Iain
Iain on 16 May 2013
I need to be able to read csv files which:
  • contain rectangular arrays of values;
  • contain more than 65536 rows;
  • vary in format from file to file (number and content of columns);
  • contain mixed-format values (strings AND numbers); and
  • contain many empty cells.
I then need to be able to distinguish between valid 0's and blanks.
xlsread does the job for fewer than 65536 rows. csvread does the job for fully populated files. textscan cannot be given a suitable format argument.
I have written some code to do the job; however, it takes excessively long to read the files. Any ideas?
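One workaround sketch (not the asker's code; file name and header count are placeholders): read every field as a string with textscan, using a format built from the first line's column count, and convert afterwards. Empty cells then stay distinguishable from genuine zeros.

```matlab
% Sketch: read every field as text, then convert numerically afterwards.
% 'myFile.csv' is a placeholder; assumes 0 header rows.
fid   = fopen('myFile.csv', 'r') ;
first = fgetl(fid) ;
frewind(fid) ;
nCol  = sum(first == ',') + 1 ;        % count columns from the first line
fmt   = repmat('%s', 1, nCol) ;        % every column read as a string
C     = textscan(fid, fmt, 'Delimiter', ',') ;
fclose(fid) ;

raw     = [C{:}] ;                     % nRows-by-nCol cell array of strings
num     = str2double(raw) ;            % NaN for strings AND for blanks ...
isBlank = cellfun('isempty', raw) ;    % ... so test emptiness separately
```

A valid 0 is then a cell where num == 0 and isBlank is false, while a blank has isBlank true.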
  3 Comments
Iain
Iain on 17 May 2013
The only control I have is whether or not it has 0, 1 or 2 header rows.
The array is sparsely populated, not stored in the sparse format.
Very large in my context means csv files with approx 60MB, and >>65536 rows.
I have some examples of these files with >60 columns, >65536 rows, and a group of 40 columns which are populated fewer than a dozen times. Many files have 5 columns, and on any given row 2 columns are unpopulated.
Cedric
Cedric on 22 May 2013
Edited: Cedric on 22 May 2013
Have you tried a solution based on REGEXP/split (mentioned by Per) and a reshape? You could even read the first line with FGETL, count the number of columns, append the rest of the file content that you get with FREAD, REGEXP/split, and reshape the outcome with e.g. reshape(content, nCols, []).'
Let me know if you want to discuss this more in depth. To illustrate (not tested, it's almost pseudo-code):
fid = fopen('myFile.csv', 'r') ;
line = fgetl(fid) ;
nCol = sum(line == ',') + 1 ;
or
col = regexp(line, ',', 'split') ;
nCol = numel(col) ;
buffer = str2double(col) ;
isnum = ~isnan(buffer) ;                % If relevant for later.
then
rest = fread(fid, Inf, '*char').' ;     % read the remainder as one char row
fclose(fid) ;
rest = strrep(rest, sprintf('\r'), '') ;                  % eliminate \r
content = [line, ',', strrep(rest, sprintf('\n'), ',')] ; % \n becomes a delimiter
split = regexp(content, ',', 'split') ;
split = split(1:floor(numel(split) / nCol) * nCol) ;      % size adjustment for reshape
data = reshape(split, nCol, []).' ;


Answers (1)

per isakson
per isakson on 16 May 2013
Edited: per isakson on 16 May 2013
Excel help says
In Excel 2010, the maximum worksheet size is 1,048,576 rows by 16,384 columns.
I believe that in 64-bit MATLAB xlsread is not limited to 65536 rows. However, I fail to find this in the MATLAB documentation.
  • If the files have some structure which follows some rules, I'm convinced that textscan can read them piecewise.
  • Does the file fit in the system cache?
  • Use the code you have and transfer the data to readable files in an overnight job.
  • There is no generic solution to your problem. If you want help you need to provide some detail.
  5 Comments
Iain
Iain on 17 May 2013
The files contain a single CSV table of fixed width (different widths in different files). Each row in the table contains at least two populated columns. Each column can be populated with text or numbers. Each row in the table is populated in response to an event. The first column in the table is a timestamp (directly related to when the event occurred).
The files often look something like:
154654,,,true,,
164821,,false,,,
178195,45,,,,
179195,47,,,,
180195,32,,,,
194821,,maybe,,,
614821,,en?0,,,
981516,,,,t,t
My code goes through this process:
Read in the entire file in a single step (much better than repeated disc accesses over the fairly slow network). I use strfind to get the indices of each end (and start) of line.
I then loop through the lines, one line at a time. In each line I find the commas using strfind, then I loop through the commas, getting each value, determining whether it is a string or a number, and recording it in a cell of a cell array and as a number in a numeric array (NaN when it is empty or a string).
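The per-line loop described above can be sketched roughly like this (an untested outline, not the actual code; it assumes nCol is already known and uses illustrative names):

```matlab
% Rough outline of the described per-line loop (nCol assumed known).
txt    = fileread('myFile.csv') ;             % whole file in one network read
ends   = strfind(txt, sprintf('\n')) ;        % end-of-line positions
starts = [1, ends(1:end-1) + 1] ;             % start-of-line positions
nRow   = numel(ends) ;

cellData = cell(nRow, nCol) ;
numData  = nan(nRow, nCol) ;
for r = 1:nRow
    line   = strtrim(txt(starts(r):ends(r))) ;    % also drops a trailing \r
    fields = regexp(line, ',', 'split') ;
    for c = 1:numel(fields)
        cellData{r, c} = fields{c} ;
        numData(r, c)  = str2double(fields{c}) ;  % NaN if empty or a string
    end
end
```

The inner str2double call per field is a likely hotspot; calling str2double once per line (or once on the whole split) tends to be cheaper than once per cell.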
per isakson
per isakson on 17 May 2013
Edited: per isakson on 18 May 2013
  • "Each column can be populated with text or numbers." makes it hard to use textscan effectively. Are there a limited number of "event-types" and a cheap way to identify the different types? Your example indicates that the answer is no.
  • "Read in the entire file in a single step". A pattern, which often results in fast reading of "mixed" text files, contains the following steps: a) read the entire file with fread, b) find the position of the column-data blocks, c) find the format to use with textscan, d) loop over all blocks and do formatted reading of the file with textscan. In my experience this ( textscan( fid, format, N, ... ) ) is faster than reading the string variable.
  • "find the commas using strfind" . regexp( str, ',', 'split' ) is fast - might be worth a test.
  • I assume you profiled your code - what does it tell?
  • It remains to make a MEX-file.
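The block-wise textscan pattern from the second bullet might look like the following sketch (not the author's actual code; it assumes the format string can be built per file, e.g. a numeric timestamp followed by text columns, and nCol is known):

```matlab
% Sketch of block-wise formatted reading with textscan (illustrative names).
fid = fopen('myFile.csv', 'r') ;
fmt = ['%f', repmat('%s', 1, nCol - 1)] ;   % assumed: timestamp + text columns
N   = 10000 ;                               % rows per block
blocks = {} ;
while ~feof(fid)
    blocks{end+1} = textscan(fid, fmt, N, 'Delimiter', ',') ; %#ok<AGROW>
end
fclose(fid) ;
```

Reading N rows at a time keeps memory bounded and, as noted above, textscan(fid, format, N, ...) reading from the file handle is often faster than re-scanning a string variable.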

