read file with ascii characters and binary data on one single line

Hi,
I need to read data from a file which consists of approximately 1000 lines of ASCII header, followed by a line containing the delimiter "#!" and then the data itself, written as binary (Int32). I can read the header using fgetl(fid) and also identify the delimiter, but when I try to continue reading from there using fread(fid,3,'int32') the values are nonsense. Do I somehow lose the reading position?
If I cut the delimiter and the header of the file using a text editor, I can read the purely binary file with fread(fid,3,'int32'). But I would like to speed it up without manually manipulating the file. Any suggestions? Thanks a lot!
So far I'm trying this:
fn='test.nid';
fid = fopen(fn, 'r', 'ieee-be');
stillheader=1;
k=0; % line counter
while stillheader
    % store the current position in file
    headerendpos_ascii=ftell(fid);
    oneline=fgetl(fid);
    k=k+1;
    if isempty(strfind(oneline,'#!')) % it's not yet the delimiter
        % read relevant data from header ...
    else
        stillheader=0;
    end
end
% now read the delimiter and the binary data
fseek(fid,headerendpos_ascii,'bof'); % move to beginning of line with delimiter
test=textscan(fid,'%c',2) % read the delimiter, now we are behind the delimiter #!
binarydata=fread(fid,200,'int32'); % read the first 200 values of the binary data.
fclose(fid)
  8 Comments
Guillaume on 25 Mar 2019
The beginning of the data part should be the three numbers 48053388, 49088328, 50666668
None of these numbers can be found anywhere in your attached file, encoded as int32 or uint32, in little-endian or big-endian. I don't see how you could ever get these numbers from the file.
Note that if you were editing the file with a text editor to crop the text, then your editor may well have changed the actual binary values. It is not safe to edit a binary file in a text editor.
I'll repeat my request for actual documentation of the format. It'd be a lot easier to understand how data is encoded.
How did you find out about the number of bytes between #! and the end?
fid = fopen('20190322_KBr_tipC5_07356.nid'); %open file in binary mode
bytes = fread(fid, [1 Inf], '*uint8'); %read the whole lot as bytes
fclose(fid);
binstart = strfind(bytes, '#!') + 2; %find location of #!. skip these two characters
numel(bytes) - binstart %number of bytes after the #!
ans =
327679
Not an even number.
By the way:
>> strfind(bytes, typecast(int32(48053388), 'uint8')) %search 48053388 as int32, little endian
ans =
[]
>> strfind(bytes, typecast(uint32(48053388), 'uint8')) %search 48053388 as uint32, little endian
ans =
[]
>> strfind(bytes, fliplr(typecast(int32(48053388), 'uint8'))) %search 48053388 as int32, big endian
ans =
[]
>> strfind(bytes, fliplr(typecast(uint32(48053388), 'uint8'))) %search 48053388 as uint32, big endian
ans =
[]
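The warning above about text editors can be illustrated with a small sketch. It assumes an editor that normalises CRLF line endings to LF on save, which is only one of several ways an editor can alter binary content; the byte values are made up for the example.

```matlab
% Eight bytes holding two little-endian int32 values: 10 and 2573 (0x0A0D).
original = uint8([10 0 0 0 13 10 0 0]);

% Simulate an editor that rewrites every CRLF pair (13 10) as a single LF (10).
edited = uint8(strrep(char(original), char([13 10]), char(10)));

% One byte has disappeared, so the data no longer splits into 4-byte integers
% and every value after the change is shifted.
fprintf('original: %d bytes, edited: %d bytes, remainder mod 4: %d\n', ...
    numel(original), numel(edited), mod(numel(edited), 4));
```

Any byte pair in the binary section that happens to look like a line ending is at risk, which is why the values recovered from an edited copy may not match the untouched file.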
Beat Braem on 25 Mar 2019
Unfortunately, I cannot provide any documentation about the data file, I don't have any.
I agree, that working on such a file with the text editor is questionable. But on the other hand, like this I could import the data, plot it, and check that it is correct. The 3 numbers I provided above are part of the data set, which I checked like this.
Thanks a lot for your efforts and for the idea that the operation with the text editor could also have caused an error. I will use the byte-counting operations you provided above to investigate how the file changes its length during the whole operation.
I don't have time for this today, but I'll do it in the next days. Thanks a lot for your effort, I'll post it once I know more!


Accepted Answer

dpb on 25 Mar 2019
Edited: dpb on 25 Mar 2019
As noted in the comment, it's a bizarre way to have written the file, but the following seems to work...
>> l=fgetl(fid);while ~contains(l,'#!'), l=fgetl(fid);end % get to the delimiter line...
>> fseek(fid,-numel(l)+1,'cof'); % back up to just past the #!
>> fread(fid,3,'*int32')
ans =
3×1 int32 column vector
47465896
49587076
45505136
>>
These aren't quite the same values as OP says, probably he's looking at a different file than the one he posted.
NB: The position after the while loop will be dependent on the data content -- fgetl won't terminate until it finds a byte sequence that qualifies as a line terminator, and where that occurs depends on the actual data values. So, back up by the number of bytes in the last read, allowing for the two terminator bytes, which offset the two indicator characters, and add one. This depends on the Windows line-ending convention, which the file appears to follow.
It would be more robust to do the read on a character-by-character basis or (as I think Guillaume suggested) suck it all up as char() and do a text search for the magic character string and then decode the rest from that point.
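A variant of the line-by-line approach that does not depend on where fgetl happens to stop can be sketched as follows: remember the file offset before each fgetl call, then seek to an absolute position computed from the index of '#!' within the delimiter line. The file written here is synthetic, and the sketch assumes a single-byte header encoding (forced with 'ISO-8859-1') and little-endian int32 data.

```matlab
% Build a small synthetic file: ASCII header, '#!', then three int32 values.
% The value 10 puts a line-feed byte into the binary part on purpose.
fn = [tempname '.nid'];
fid = fopen(fn, 'w', 'ieee-le');
fprintf(fid, 'Version=1\r\nPoints=3\r\n#!');
fwrite(fid, int32([100 10 -200]), 'int32');
fclose(fid);

% Read it back, one byte per character so indices equal byte offsets.
fid = fopen(fn, 'r', 'ieee-le', 'ISO-8859-1');
idx = [];
while isempty(idx)
    linestart = ftell(fid);          % 0-based offset of the line about to be read
    oneline = fgetl(fid);
    if ~ischar(oneline)
        fclose(fid);
        error('Delimiter #! not found.');
    end
    idx = strfind(oneline, '#!');    % 1-based index of '#' within the line
end
% '#' sits at offset linestart + idx(1) - 1, so the data starts two bytes later.
fseek(fid, linestart + idx(1) + 1, 'bof');
values = fread(fid, 3, '*int32');
fclose(fid);
delete(fn);
```

Because the seek target is absolute, it does not matter how many binary bytes fgetl consumed before it met something that looked like a line terminator.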
  1 Comment
Beat Braem on 25 Mar 2019
Great, this works! thanks a lot!
You're right, apparently I sent the three numbers of the scan file previous to the one, which I attached. Sorry for that!


More Answers (2)

Guillaume on 25 Mar 2019
Edited: Guillaume on 25 Mar 2019
Right, now that we've resolved that the published int32 values were incorrect, here is how I would parse the file.
fid = fopen('20190322_KBr_tipC5_07356.nid'); %open file in binary mode
bytes = fread(fid, [1 Inf], '*uint8'); %read the whole lot as bytes
fclose(fid);
binstart = strfind(bytes, '#!') + 2; %find location of #!. skip these two characters
text = char(bytes(1:binstart-2));
bindata = typecast(bytes(binstart:numel(bytes)), 'int32');
Note that the above should be a lot faster than reading the file line by line.
Things to take into account:
  • It is assumed the text encoding is the same as the one used by MATLAB. Strange characters will appear if this is not the case. The file format specification would tell you what encoding is used. Modern software would most likely use Unicode.
  • It is assumed that numbers are encoded as 32-bit signed integers in little endian. Since the text header seems to specify the encoding, I assume it can vary. My method (and any of the other solutions) would completely break down the day you come across a file with a different encoding. You could at least check that with:
encodings = regexp(text, 'SaveMode=(\w+)\s+SaveBits=(\d+)\s+SaveSign=(\w+)\s+SaveOrder=(\w+)', 'tokens');
encodings = vertcat(encodings{:});
assert(all(strcmp(encodings(:, 1), 'Binary')), 'At least one of the channels is not encoded as binary');
assert(all(str2double(encodings(:, 2)) == 32), 'At least one of the channels is not encoded on 32 bits');
assert(all(strcmp(encodings(:, 3), 'Signed')), 'At least one of the channels is not signed');
assert(all(strcmp(encodings(:, 4), 'Intel')), 'At least one of the channels is not little endian');
  • To properly parse the file, you really need to parse the [Dataset] portions of the code and decode the binary according to the encodings above.
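As a rough sketch of what decoding according to the header could look like, the snippet below picks the integer class from the bit depth and signedness and swaps bytes when the file's byte order differs from the machine's. The thread only shows the values 'Signed' and 'Intel'; treating any other SaveSign as unsigned and any other SaveOrder as big endian is an assumption, as are the sample bytes.

```matlab
% Hypothetical header values for one channel.
saveBits  = 32;
saveSign  = 'Signed';   % assumption: anything else means unsigned
saveOrder = 'Intel';    % assumption: anything else means big endian

raw = uint8([1 0 0 0 254 255 255 255]);   % 1 and -2 as little-endian int32

% Build the MATLAB class name, e.g. 'int32' or 'uint16'.
if strcmp(saveSign, 'Signed')
    className = sprintf('int%d', saveBits);
else
    className = sprintf('uint%d', saveBits);
end

% typecast interprets the bytes in the machine's native byte order...
values = typecast(raw, className);

% ...so swap only when the file order differs from the machine order.
[~, ~, machineEndian] = computer;          % 'L' or 'B'
fileIsLittleEndian = strcmp(saveOrder, 'Intel');
if fileIsLittleEndian ~= (machineEndian == 'L')
    values = swapbytes(values);
end
```

With this in place, an unexpected SaveBits or SaveOrder changes how the bytes are interpreted instead of tripping an assertion, though each combination should still be checked against real files.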
edit: actually here is a better parser, based on the notes above, but still assuming fixed encoding
It is assumed that the order of the fields in the [Dataset] portion of the text is fixed (otherwise more complex parsing of the text is required).
fid = fopen('20190322_KBr_tipC5_07356.nid'); %open file in binary mode
bytes = fread(fid, [1 Inf], '*uint8'); %read the whole lot as bytes
fclose(fid);
binstart = strfind(bytes, '#!') + 2; %find location of #!. skip these two characters
text = char(bytes(1:binstart-2));
encodings = regexp(text, 'Points=(\d+)\s+Lines=(\d+).+?SaveMode=(\w+)\s+SaveBits=(\d+)\s+SaveSign=(\w+)\s+SaveOrder=(\w+)', 'tokens');
encodings = vertcat(encodings{:});
assert(all(strcmp(encodings(:, 3), 'Binary')), 'At least one of the channels is not encoded as binary');
assert(all(str2double(encodings(:, 4)) == 32), 'At least one of the channels is not encoded on 32 bits');
assert(all(strcmp(encodings(:, 5), 'Signed')), 'At least one of the channels is not signed');
assert(all(strcmp(encodings(:, 6), 'Intel')), 'At least one of the channels is not little endian');
datasetsizes = str2double(encodings(:, [1 2]));
assert(sum(prod(datasetsizes, 2)) == (numel(bytes) - binstart + 1)/4, 'File does not have the right length');
bindata = typecast(bytes(binstart:numel(bytes)), 'int32');
datasets = mat2cell(bindata, 1, prod(datasetsizes, 2))';
datasets = cellfun(@(v, s) reshape(v, s), datasets, num2cell(datasetsizes, 2), 'UniformOutput', false);
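If the order of the fields inside each section turns out not to be fixed, one possible approach is to split the header into sections and collect the key=value pairs of each section into a map, then look fields up by name. The section names and the header text below are invented for illustration; the real files may name their sections differently.

```matlab
% Synthetic header with two sections whose fields appear in different orders.
crlf = sprintf('\r\n');
header = ['[Dataset-0]' crlf 'Lines=2' crlf 'Points=4' crlf ...
          '[Dataset-1]' crlf 'Points=3' crlf 'Lines=5' crlf];

% Split on the [section] lines; 'split' also returns the text before the
% first section, which is dropped here.
[names, bodies] = regexp(header, '\[([^\]\r\n]+)\]', 'tokens', 'split');
bodies = bodies(2:end);

sizes = zeros(numel(names), 2);
for k = 1:numel(names)
    % Collect every key=value pair of this section, whatever the order.
    pairs = regexp(bodies{k}, '(\w+)=([^\r\n]*)', 'tokens');
    fields = containers.Map();
    for p = 1:numel(pairs)
        fields(pairs{p}{1}) = pairs{p}{2};
    end
    sizes(k, :) = [str2double(fields('Points')), str2double(fields('Lines'))];
end
```

Here sizes plays the same role as datasetsizes in the parser above, so the rest of that code could be reused unchanged.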
  2 Comments
dpb on 25 Mar 2019
  • To properly parse the file, you really need to parse the [Dataset] portions of the code and decode the binary according to the encodings above.
+1
Beat Braem on 26 Mar 2019
I tested the solution presented above on different files of the same type and it works great except for a minor detail: sometimes the binary data can coincidentally contain the same bytes as '#!', in which case binstart contains more than one element. Adding an extra line to discard all elements except the first one solves the issue.
fid = fopen('20190322_KBr_tipC5_07356.nid'); %open file in binary mode
bytes = fread(fid, [1 Inf], '*uint8'); %read the whole lot as bytes
fclose(fid);
binstart = strfind(bytes, '#!') + 2; %find location of #!. skip these two characters
binstart=binstart(1); % discard further elements in binstart in case #! also appears in the binary data
text = char(bytes(1:binstart-2));
encodings = regexp(text, 'Points=(\d+)\s+Lines=(\d+).+?SaveMode=(\w+)\s+SaveBits=(\d+)\s+SaveSign=(\w+)\s+SaveOrder=(\w+)', 'tokens');
encodings = vertcat(encodings{:});
assert(all(strcmp(encodings(:, 3), 'Binary')), 'At least one of the channels is not encoded as binary');
assert(all(str2double(encodings(:, 4)) == 32), 'At least one of the channels is not encoded on 32 bits');
assert(all(strcmp(encodings(:, 5), 'Signed')), 'At least one of the channels is not signed');
assert(all(strcmp(encodings(:, 6), 'Intel')), 'At least one of the channels is not little endian');
datasetsizes = str2double(encodings(:, [1 2]));
assert(sum(prod(datasetsizes, 2)) == (numel(bytes) - binstart + 1)/4, 'File does not have the right length');
bindata = typecast(bytes(binstart:numel(bytes)), 'int32');
datasets = mat2cell(bindata, 1, prod(datasetsizes, 2))';
datasets = cellfun(@(v, s) reshape(v, s), datasets, num2cell(datasetsizes, 2), 'UniformOutput', false);
Thanks a lot to all who contributed to the solution!



Rik on 25 Mar 2019
If you are on Windows and open the file with Notepad, a line feed (ASCII 10) is not enough to trigger a new line, as you will also need a carriage return (ASCII 13). Most other viewers (e.g. Notepad++) do not require CRLF and will also interpret a lone LF as a newline.
As for the rest of your question, it might be easier to read everything as uint8 and then convert everything up to the first occurrence of [35 33] to text (a simple call to char should work with CP1252; for UTF-8 it is less trivial). Then you need to reinterpret the rest from uint8 to int32 with typecast.
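The difference between the two text encodings mentioned above can be sketched with native2unicode, which decodes bytes using a named encoding. The header bytes here are invented; which encoding the real files use is not known from the thread.

```matlab
% The text 'µm' stored as UTF-8 takes three bytes: 194 181 for µ, 109 for m.
utf8Bytes = uint8([194 181 109]);

% A plain char() call treats every byte as one character, so the two-byte
% sequence for µ turns into two separate characters.
naive = char(utf8Bytes);

% Decoding with an explicit encoding recovers the intended two characters.
decoded = native2unicode(utf8Bytes, 'UTF-8');

% The same text stored as CP1252 is just two bytes, one per character.
cp1252Bytes = uint8([181 109]);
decodedCp = native2unicode(cp1252Bytes, 'windows-1252');
```

For a pure ASCII header the choice makes no difference, since all of these encodings agree on byte values below 128.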
  9 Comments
Guillaume on 25 Mar 2019
Edited: Guillaume on 25 Mar 2019
I'm of the opinion that unless the answer is completely off base, it should be preserved. There's still some useful discussion here (about 't' mode).
Totally off topic here, but 't' mode is sometimes useful if you're going to use regexp on the text. For regexp a newline (for the dotexceptnewline option and the $ match) is always just \n, so removing the \r makes the search easier.
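A short sketch of the regexp point: with the 'lineanchors' option, $ matches just before \n, so a \r left over from a CRLF ending sits between the last wanted character and the anchor. The header text is made up, and stripping \r with strrep stands in for what text mode would do on read.

```matlab
% Two header lines with Windows line endings.
txt = sprintf('Points=4\r\nLines=2\r\n');

% With the \r still present, the digit is followed by \r rather than by the
% end of the line, so the anchored pattern finds nothing.
rawMatches = regexp(txt, '^\w+=\d+$', 'match', 'lineanchors');

% After removing the carriage returns, the same pattern matches both lines.
cleanTxt = strrep(txt, sprintf('\r'), '');
cleanMatches = regexp(cleanTxt, '^\w+=\d+$', 'match', 'lineanchors');
```

An alternative that keeps the text untouched is to allow for the carriage return in the pattern itself, for example with \r?$ at the end.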


Release

R2013b
