How to import specific columns from huge csv file?
Show older comments
I have a huge CSV file of about 17 GB. The format is as follows:- The first two columns are strings(to be ignored) and the other columns are integers. I have to do columnwise operation on the other columns and so need to extract them one by one.
I tried csvread() but it can not read the whole file at once.I do not know the number of rows and columns so I cannot specify any range How can I access the data columnwise?
Answers (1)
Accessing a text file columnwise is a very tedious job, even for a computer. Therefore creating huge text files as e.g. CSV is a very bad idea. A binary file would be ways smarter here. Anyway, you have this file now.
If you want to process the file multiple times, think of changing the file format. Split the file and store each column in an own binary file:
inFID = fopen('YourTextFile.csv');
if inFID == -1, error('Cannot open file for reading'); end
Line = fgetl(inFID);
Comma = strfind(Line, ','); % It is a comma separated file, isn't it?
Dummy = sscanf(Line(Comma(1) + 1:end), '%g,', [1, Inf]);
nCol = numel(Dummy);
outFID = zeros(1, nCol);
for iOut = 1:nCol
outFID(iOut) = fopen(sprintf('Column%03d.bin', iOut), 'w');
end
if any(outFID == -1), error('Cannot open file for writing'); end
fseek(inFID, 0, -1); % Return to start
while ~feof(inFID)
Line = fgetl(inFID);
if ischar(Line)
Comma = strfind(Line, ','); % It is a comma separated file, isn't it?
Data = sscanf(Line(Comma(1) + 1:end), '%g,', [1, nCol]);
for iOut = 1:nCol
fwrite(outFID(iOut), Data(iOut), 'double');
end
end
end
fclose(inFID);
for iOut = 1:nOut
fclose(outFID(iOut));
end
Now processing the colums is much easier.
Unfortunately this fails for more than about 250 columns, because the operating system limits the number of simultaneously open files. If this is the case, please explain this as a comment.
Processing the CSV file columnwise has a massive problem: You cannot store the results inside the file, because inserting a value requires to rewrite the complete file, if the number of characters is not the same. Therefore I'm, convinced that converting the text file to a more useful format is the best way.
Rule: Text files are useful only, if they are read and modified by human. A 17GB file cannot be read by human, because everybody will loose the overview soon.
2 Comments
Ronit Samaddar
on 20 Jun 2017
Edited: Ronit Samaddar
on 21 Jun 2017
Jan
on 21 Jun 2017
60'000 columns? You still need to import the complete file to obtain each column. This is a drawback of text files. If you can't import the complete file, but do this column by column, this operation will take 60'000 times longer than processing one column. Reading 17GB * 6e4 will need about 24 days, if you have a fast SSD with 500MB/s read access and the parsing of the line could be neglected. You see, such huge text files are a graveyard of data.
Next idea is to convert the file to one binary file at first:
inFID = fopen('YourTextFile.csv');
if inFID == -1, error('Cannot open file for reading'); end
Line = fgetl(inFID);
Comma = strfind(Line, ','); % It is a comma separated file, isn't it?
Dummy = sscanf(Line(Comma(1) + 1:end), '%g,', [1, Inf]);
nCol = numel(Dummy);
outFID = fopen('Data.bin', 'w');
if outFID == -1, error('Cannot open file for writing'); end
fseek(inFID, 0, -1); % Return to start
while ~feof(inFID)
Line = fgetl(inFID);
if ischar(Line)
Comma = strfind(Line, ','); % It is a comma separated file, isn't it?
Data = sscanf(Line(Comma(1) + 1:end), '%g,');
fwrite(outFID, Data, 'double');
end
end
fclose(inFID);
fclose(outFID);
After this conversion, importing a column is easy:
nCol = 60000;
wantCol = 17;
inFID = fopen('Data.bin');
if inFID == -1, error('Cannot open file for reading'); end
fread(inFID, (wantCol - 1), 'double'); % Skip initial columns
Value = fread(inFID, Inf, 'double', nCol*8);
fclose(fid);
But fortunately modern Matlab version have the command https://www.mathworks.com/help/matlab/ref/datastore.html. See https://www.mathworks.com/help/matlab/import_export/what-is-a-datastore.html and https://www.mathworks.com/help/matlab/import_export/tall-arrays.html. Nevertheless, a conversion to a binary format is a good idea in general.
Categories
Find more on Data Import and Analysis in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!