Reading ASCII file in portions

Brian on 16 Sep 2013
I have a 4.5 GB ASCII file and would like to read it in portions.
For example, I would like to read 1 GB at a time and store the data in MATLAB, then append the next 1 GB to the data already stored.
Is this possible, and if so, what is the code?
I tried something like the following:
segsize = 1000000;
while ~feof(fid)
data=fread(fid,segsize,'*char');
end
But it does not read the entire file. I am guessing it stops after segsize characters. How do I make it read 1 GB and store it in MATLAB, then read another 1 GB? I'd like to conserve RAM, as I intend to read much larger files.
Thanks for the help!
Matt Kindig on 16 Sep 2013
Is "data" changing throughout the loop? In other words, is fread() reading correctly? How big is "data" after, say, one iteration of the while loop?
Brian on 16 Sep 2013
Well, if I intend to read through 1 GB of code, then data should be 1 GB big, right?


Answers (1)

Walter Roberson on 16 Sep 2013
feof(fid) does not predict that end of file is about to occur: feof() does not become true until a read has already hit end of file. You need to check how much data fread() returned, because you might get none (if the file was positioned right at end of file before the fread() call).
[data, count] = fread(fid, segsize, '*char');
Question: is the file definitely ASCII? As in the last printable character is decimal 126, the tilde ("~") character? Or is the file potentially UTF-8 or UTF-16 encoded due to having been created that way or edited using an editor that automatically saves to UTF-* ? If the file happens to contain bytes with value beyond 127, what do you want to happen? Should the fread() try to examine the byte sequence to see if it should decode the UTF-8 or UTF-16 into the Unicode that MATLAB uses internally? Or should the fread() return each byte of input as a distinct position in the string?
The code you have now is for the case where the file might possibly be UTF-* encoded and the fread() is to examine the bytestream to see if it can decode it. If you do not want that to happen, then instead of '*char' use 'uint8=>char'
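As a minimal sketch of the count check described above: the loop below reads fixed-size chunks and stops as soon as fread() returns nothing, instead of relying on feof(). The names demoFile, nChunks, and totalBytes are made up for illustration, and the small temporary file stands in for the real 4.5 GB file; for chunks of roughly 1 GB, segsize would be something like 2^30.

```matlab
% Create a small stand-in file (2500 bytes of ASCII digits). Illustrative only.
demoFile = [tempname '.txt'];
fid = fopen(demoFile, 'w');
fwrite(fid, uint8(repmat('0123456789', 1, 250)), 'uint8');
fclose(fid);

segsize = 1000;        % bytes per chunk (assumed value for the demo)
nChunks = 0;           % number of non-empty chunks read
totalBytes = 0;        % running total of bytes read

fid = fopen(demoFile, 'r');
while true
    % 'uint8=>char' returns each byte as one character, with no decoding
    [data, count] = fread(fid, segsize, 'uint8=>char');
    if count == 0
        break          % nothing was read: end of file has been reached
    end
    nChunks = nChunks + 1;
    totalBytes = totalBytes + count;
    % Process "data" here before the next read overwrites it.
end
fclose(fid);
delete(demoFile);

fprintf('%d chunks, %d bytes\n', nChunks, totalBytes);
```

Processing each chunk inside the loop keeps memory use bounded by segsize. Appending every chunk to one growing variable would eventually hold the whole file in memory, and MATLAB stores each char in two bytes, so that approach probably defeats the goal of conserving RAM.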
Brian on 16 Sep 2013
I am not sure I quite follow, but the type of file is ASCII. I don't think the printable characters are in ASCII format, though. The last character is not a "~". Sorry, I do not know what UTF is either. As you mentioned earlier, how would I make my code read from the beginning, since you said the end-of-file has already occurred?
Walter Roberson on 16 Sep 2013
Does the file contain any characters other than A-Z a-z 0-9 ~!@#$%^&*()_+`-=[]{}\| ;':",.<>/? and spaces and end-of-line characters ?
For information on UTF-8 see http://en.wikipedia.org/wiki/UTF-8
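One way to answer that question without opening the file in an editor is to scan it as raw bytes and count any values above 127, which plain ASCII never contains. This is only a sketch under assumptions: demoFile and nNonAscii are illustrative names, the tiny temporary file (with one UTF-8 encoded "é") stands in for the real data, and the check does not look for unusual control characters.

```matlab
% Stand-in file: two lines, the second containing "é" as UTF-8 bytes 195 169.
demoFile = [tempname '.txt'];
fid = fopen(demoFile, 'w');
fwrite(fid, uint8([double('abc 123') 10 double('caf') 195 169 10]), 'uint8');
fclose(fid);

segsize = 4;           % tiny chunk size so the demo needs several reads
nNonAscii = 0;         % count of bytes with value above 127

fid = fopen(demoFile, 'r');
while true
    [bytes, count] = fread(fid, segsize, '*uint8');   % raw bytes, no decoding
    if count == 0
        break          % end of file reached
    end
    nNonAscii = nNonAscii + sum(bytes > 127);
end
fclose(fid);
delete(demoFile);

fprintf('Bytes above 127: %d\n', nNonAscii);
```

A result of zero suggests the file is plain 7-bit ASCII, in which case '*char' and 'uint8=>char' should give the same characters.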
