How to import specific columns from a huge CSV file?

I have a huge CSV file of about 17 GB. The format is as follows: the first two columns are strings (to be ignored) and the remaining columns are integers. I have to do a column-wise operation on those columns and so need to extract them one by one.
I tried csvread(), but it cannot read the whole file at once. I do not know the number of rows and columns, so I cannot specify any range. How can I access the data column-wise?

Answers (1)

Jan on 20 Jun 2017
Edited: Jan on 20 Jun 2017
Accessing a text file column-wise is a very tedious job, even for a computer. Creating huge text files such as CSV is therefore a bad idea; a binary file would be much smarter here. Anyway, you have this file now.
If you want to process the file multiple times, think about changing the file format. Split the file and store each column in its own binary file:
inFID = fopen('YourTextFile.csv');
if inFID == -1, error('Cannot open file for reading'); end
% Use the first line to determine the number of numerical columns:
Line  = fgetl(inFID);
Comma = strfind(Line, ',');  % It is a comma separated file, isn't it?
Dummy = sscanf(Line(Comma(2) + 1:end), '%g,', [1, Inf]);  % Skip the 2 string columns
nCol  = numel(Dummy);
outFID = zeros(1, nCol);
for iOut = 1:nCol
  outFID(iOut) = fopen(sprintf('Column%03d.bin', iOut), 'w');
end
if any(outFID == -1), error('Cannot open file for writing'); end
fseek(inFID, 0, 'bof');  % Return to the start of the file
while ~feof(inFID)
  Line = fgetl(inFID);
  if ischar(Line)
    Comma = strfind(Line, ',');
    Data  = sscanf(Line(Comma(2) + 1:end), '%g,', [1, nCol]);
    for iOut = 1:nCol
      fwrite(outFID(iOut), Data(iOut), 'double');
    end
  end
end
fclose(inFID);
for iOut = 1:nCol
  fclose(outFID(iOut));
end
Now processing the columns is much easier.
Unfortunately this fails for more than about 250 columns, because the operating system limits the number of simultaneously open files. If this is the case, please mention it in a comment.
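Reading one of these per-column files back later is then a single fread. A minimal sketch, using the Column%03d.bin names produced above:

```matlab
% Read one complete column back from its binary file (sketch).
% 'Column017.bin' follows the Column%03d.bin naming used above.
fid = fopen('Column017.bin', 'r');
if fid == -1, error('Cannot open column file'); end
Col = fread(fid, Inf, 'double');  % the whole column as a vector
fclose(fid);
nRows = numel(Col);               % the number of rows falls out for free
```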
Processing the CSV file column-wise has a massive problem: you cannot store results inside the file, because inserting a value requires rewriting the complete file if the number of characters changes. Therefore I'm convinced that converting the text file to a more useful format is the best way.
Rule: text files are useful only if they are read and modified by humans. A 17 GB file cannot be read by a human, because everybody will lose the overview soon.

2 Comments

The number of columns is easily more than 250. Actually this was given to me by a senior in a project I am doing. I performed the operation on a smaller file where I could use csvread(); that file contained 60,000 columns. This file is much larger. I don't need to process this file multiple times, just perform a calculation once per column. Any way to extract each column as an array and to know the number of rows and columns would help. I need to perform the operations on each column and will store the result in a separate file, so the rewriting problem does not occur. I just found out that the file has 100,000 rows and 60,000 columns.
60,000 columns? You still need to import the complete file to obtain each column; this is a drawback of text files. If you cannot import the complete file, but do this column by column, the operation takes 60,000 times longer than processing one column. Reading 17 GB * 6e4 would need about 24 days, if you have a fast SSD with 500 MB/s read access and the parsing of the lines could be neglected. You see, such huge text files are a graveyard of data.
The next idea is to convert the file to one binary file first:
inFID = fopen('YourTextFile.csv');
if inFID == -1, error('Cannot open file for reading'); end
% Use the first line to determine the number of numerical columns:
Line  = fgetl(inFID);
Comma = strfind(Line, ',');  % It is a comma separated file, isn't it?
Dummy = sscanf(Line(Comma(2) + 1:end), '%g,', [1, Inf]);  % Skip the 2 string columns
nCol  = numel(Dummy);
outFID = fopen('Data.bin', 'w');
if outFID == -1, error('Cannot open file for writing'); end
fseek(inFID, 0, 'bof');  % Return to the start of the file
while ~feof(inFID)
  Line = fgetl(inFID);
  if ischar(Line)
    Comma = strfind(Line, ',');
    Data  = sscanf(Line(Comma(2) + 1:end), '%g,');
    fwrite(outFID, Data, 'double');
  end
end
fclose(inFID);
fclose(outFID);
After this conversion, importing a single column is easy:
nCol    = 60000;
wantCol = 17;
inFID = fopen('Data.bin');
if inFID == -1, error('Cannot open file for reading'); end
fseek(inFID, (wantCol - 1) * 8, 'bof');              % Skip the columns before wantCol in row 1
Value = fread(inFID, Inf, 'double', (nCol - 1) * 8); % Read 1 value, then skip the rest of the row
fclose(inFID);
But fortunately modern MATLAB versions have the datastore command: https://www.mathworks.com/help/matlab/ref/datastore.html. See also https://www.mathworks.com/help/matlab/import_export/what-is-a-datastore.html and https://www.mathworks.com/help/matlab/import_export/tall-arrays.html. Nevertheless, a conversion to a binary format is a good idea in general.
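A sketch of the datastore approach: with ReadVariableNames disabled, MATLAB auto-names the columns Var1, Var2, ..., so the 17th numerical column (after the two string columns) is Var19. The column index and the summing operation below are just illustrative assumptions:

```matlab
% Sketch: read only one selected column of the huge CSV via a datastore.
ds = tabularTextDatastore('YourTextFile.csv', 'ReadVariableNames', false);
ds.SelectedVariableNames = ds.VariableNames(19);  % 17th data column (2 string columns first)
ds.SelectedFormats       = {'%f'};                % read it as a double
T = tall(ds);             % tall array: processed chunk-wise, not all in memory
s = gather(sum(T{:, 1})); % example column-wise operation; gather() triggers the work
```

The point of tall arrays is that the operations are deferred and evaluated chunk by chunk when gather() is called, so the 17 GB file never has to fit into memory at once.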



Asked: on 20 Jun 2017
Commented: Jan on 21 Jun 2017
