Read csv file containing text and numbers for very large dataset (beyond xls limit)

I’m trying to import a very large dataset into matlab from a csv file. The file contains a mixture of numerical and string data. An example of the rows is below:
-15.37 32.83 408.08 1064 -2.35 2.913 -2.31E-05 1E+11
-15.19 -3.624 409.38 1083 -9.81 3.480 4.23E-05 undefined
-15.95 8.534 291.05 993 -133.1 6.866 -2.42E-03 undefined
-15.41 6.975 697.38 686 102.9 8.746 6.24E-03 2.4E+09
I want to import all the data and either replace undefined values with NaNs or remove the row completely containg undefined values. The csvread function in Matlab expects only numerical values so doesn’t work with this dataset and xlsread will only read a finite number of rows and this dataset is beyond that excel row limit so means some datapoints are not read. I’ve tried using importdata but this stops reading after the first row with an undefined value.
I have been able to find a workaround using readtable and then table2array and str2double however this is proving to be very time consuming. For a dataset of around 1.1 million rows it takes 4/5 minutes to read the data into Matlab compared to csvread which will take seconds (if all data is numeric). I’m wondering if anyone knows of a faster way to read in the csv mixed datafile as a matrix.

7 Comments

Because you have the mixed format I think you're going to be stuck reading your data in as either a string, or cells, and then converting.
Out of curiosity, how long does it take to do a textscan of the data? This will at least skip the table2array step.
I've not used textscan beofre but just from looking online I've tried to read the data in using the code below but it doesn't seem to be working. It just outputs D as a 1x8 cell. Any guidance?
fileID = fopen('results.csv');
D = textscan(fileID, '%d %d %d %d %d %d %d %s');
fclose(fileID);
" It just outputs D as a 1x8 cell. "
That is the expected output: you defined a format string with 8 fields, so the cell array will have 8 cells. Have a look inside the cells and see how many rows the data arrays have.
Oh right I see, thank you.
For this specific dataset I know the number of rows is 1,116,650. However I'm trying to create a generic code which doesn't need human input to run but it will have several different dataset inputs and the size of these will vary. I don't want to have to specify the number of rows each time a new dataset needs to be run.
Guessing this means textscan won't be useful for my script then?
@Vicki: The posted textscan command imports 8 columns, the first 7 as doubles, the last as char vector. At no place you limit the code to import a specific number of columns, so the code is generic already. Because it is working already, textscan is useful, obviously.
The only detail you need to solve is to convert the 'undefined' to NaNs. Maybe:
D{8} = strrep(D{8}, 'undefined', 'NaN');
Str = sprintf('%s ', D{8}{:});
N = sscanf(Str, '%g ', Inf);
Now you can use cat to join the columns:
cat(2, D{1:7}, N(:))
This is just for demonstration. Please adjust the command to your exact needs. Or use the code I've posted in my answer.
The output D is an empty 1x8 cell, there's nothing in it when I view the variable so somehting is not working along the way. Sorry I think I may have confused you earlier.
Realised I'd missed out the delimiter term in the fopen function so added this in. It's working now but for some reason only seems to work when I added in a line using fgetl as below. Any ideas why this is?
fileID = fopen('results.csv');
str = fgetl(fileID); % not sure why I need this line but doesn't seem to work without?
D = textscan(fileID, '%f %f %f %f %f %f %f %s', 'Delimiter', ',');
fclose(fileID);

Sign in to comment.

Answers (1)

Please mention, what "very large" means. Some people call a file with 1MB "large" already, because they cannot read it anymore, others consider files under 500TB as small, because this is the size of their scratch disk. Does your file have some 100 MB? Then:
C = fileread(FileName);
C = strrep(C, 'undefined', 'NaN');
D = sscanf(C, '%g ', Inf);
By the way, text files are useful, if they are read or edited by human. Therefore huge text files are a design fail.

2 Comments

My file is 136 MB and contains 1116650 rows and 8 columns.
I've not used fileread before and when I do now using your code, it's producing a 1x134850053 char and I don't know how to convert this to the matrix I want??
If you have not used fileread before, this is the first time, you use it.
While fileread replies a char vector, the sscanf should create a double array. If you want to reshape it, either use the reshape command:
D = reshape(C, [], 8);
or do this inside sscanf already:
D = sscanf(C, '%g ', [8, Inf]);

Sign in to comment.

Categories

Asked:

on 9 Apr 2019

Commented:

on 10 Apr 2019

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!