How to import specific columns from a huge CSV file?

I have a huge CSV file of about 17 GB. The format is as follows: the first two columns are strings (to be ignored) and the remaining columns are integers. I have to do a column-wise operation on those columns and so need to extract them one by one.
I tried csvread(), but it cannot read the whole file at once. I do not know the number of rows and columns, so I cannot specify any range. How can I access the data column-wise?

Answers (1)

Jan on 20 Jun 2017
Edited: Jan on 20 Jun 2017
Accessing a text file column-wise is a very tedious job, even for a computer. Creating huge text files such as CSV is therefore a bad idea; a binary file would be much smarter here. Anyway, you have this file now.
If you want to process the file multiple times, think about changing the file format. Split the file and store each column in its own binary file:
inFID = fopen('YourTextFile.csv');
if inFID == -1, error('Cannot open file for reading'); end
% Use the first line to determine the number of numerical columns:
Line  = fgetl(inFID);
Comma = strfind(Line, ',');  % It is a comma separated file, isn't it?
Dummy = sscanf(Line(Comma(2) + 1:end), '%g,', [1, Inf]);  % Skip the 2 string columns
nCol  = numel(Dummy);
outFID = zeros(1, nCol);
for iOut = 1:nCol
  outFID(iOut) = fopen(sprintf('Column%03d.bin', iOut), 'w');
end
if any(outFID == -1), error('Cannot open file for writing'); end
fseek(inFID, 0, 'bof');  % Return to the start of the file
while ~feof(inFID)
  Line = fgetl(inFID);
  if ischar(Line)
    Comma = strfind(Line, ',');
    Data  = sscanf(Line(Comma(2) + 1:end), '%g,', [1, nCol]);
    for iOut = 1:nCol
      fwrite(outFID(iOut), Data(iOut), 'double');
    end
  end
end
fclose(inFID);
for iOut = 1:nCol
  fclose(outFID(iOut));
end
Now processing the columns is much easier.
Unfortunately this fails for more than about 250 columns, because the operating system limits the number of simultaneously open files. If this is the case, please mention it in a comment.
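Reading one of these per-column files back later is then a single fread. A minimal sketch, using the Column%03d.bin names produced above:

```matlab
% Read one complete column back from its binary file (sketch).
% 'Column017.bin' follows the Column%03d.bin naming used above.
fid = fopen('Column017.bin', 'r');
if fid == -1, error('Cannot open column file'); end
Col = fread(fid, Inf, 'double');  % the whole column as a vector
fclose(fid);
nRows = numel(Col);               % the number of rows falls out for free
```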
Processing the CSV file column-wise has a massive problem: you cannot store results inside the file, because inserting a value requires rewriting the complete file if the number of characters changes. Therefore I'm convinced that converting the text file to a more useful format is the best way.
Rule: text files are useful only if they are read and modified by humans. A 17 GB file cannot be read by a human, because everybody will lose the overview soon.

2 Comments

The number of columns is easily more than 250. Actually this was given to me by a senior in a project I am doing. I performed the operation on a smaller file where I could use csvread(); that file contained 60,000 columns. This file is much larger. I don't need to process this file multiple times, just perform a calculation once per column. Any way to extract each column as an array and to know the number of rows and columns would help. I need to perform the operations on each column and will store the result in a separate file, so the rewriting problem does not occur. I just found out that the file has 100,000 rows and 60,000 columns.
60,000 columns? You still need to import the complete file to obtain each column; this is a drawback of text files. If you cannot import the complete file, but do this column by column, the operation takes 60,000 times longer than processing one column. Reading 17 GB * 6e4 would need about 24 days, if you have a fast SSD with 500 MB/s read access and the parsing of the lines could be neglected. You see, such huge text files are a graveyard of data.
The next idea is to convert the file to one binary file first:
inFID = fopen('YourTextFile.csv');
if inFID == -1, error('Cannot open file for reading'); end
% Use the first line to determine the number of numerical columns:
Line  = fgetl(inFID);
Comma = strfind(Line, ',');  % It is a comma separated file, isn't it?
Dummy = sscanf(Line(Comma(2) + 1:end), '%g,', [1, Inf]);  % Skip the 2 string columns
nCol  = numel(Dummy);
outFID = fopen('Data.bin', 'w');
if outFID == -1, error('Cannot open file for writing'); end
fseek(inFID, 0, 'bof');  % Return to the start of the file
while ~feof(inFID)
  Line = fgetl(inFID);
  if ischar(Line)
    Comma = strfind(Line, ',');
    Data  = sscanf(Line(Comma(2) + 1:end), '%g,');
    fwrite(outFID, Data, 'double');
  end
end
fclose(inFID);
fclose(outFID);
After this conversion, importing a single column is easy:
nCol    = 60000;
wantCol = 17;
inFID = fopen('Data.bin');
if inFID == -1, error('Cannot open file for reading'); end
fseek(inFID, (wantCol - 1) * 8, 'bof');              % Skip the columns before wantCol in row 1
Value = fread(inFID, Inf, 'double', (nCol - 1) * 8); % Read 1 value, then skip the rest of the row
fclose(inFID);
But fortunately modern MATLAB versions have the datastore command: https://www.mathworks.com/help/matlab/ref/datastore.html. See also https://www.mathworks.com/help/matlab/import_export/what-is-a-datastore.html and https://www.mathworks.com/help/matlab/import_export/tall-arrays.html. Nevertheless, a conversion to a binary format is a good idea in general.
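A sketch of the datastore approach: with ReadVariableNames disabled, MATLAB auto-names the columns Var1, Var2, ..., so the 17th numerical column (after the two string columns) is Var19. The column index and the summing operation below are just illustrative assumptions:

```matlab
% Sketch: read only one selected column of the huge CSV via a datastore.
ds = tabularTextDatastore('YourTextFile.csv', 'ReadVariableNames', false);
ds.SelectedVariableNames = ds.VariableNames(19);  % 17th data column (2 string columns first)
ds.SelectedFormats       = {'%f'};                % read it as a double
T = tall(ds);             % tall array: processed chunk-wise, not all in memory
s = gather(sum(T{:, 1})); % example column-wise operation; gather() triggers the work
```

The point of tall arrays is that the operations are deferred and evaluated chunk by chunk when gather() is called, so the 17 GB file never has to fit into memory at once.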



Asked: on 20 Jun 2017
Commented: Jan on 21 Jun 2017
