custom byte swapping of binary file

Peter
Peter on 21 May 2012
As far as I know fread won't solve this problem (easily) because it applies a single remapping to the entire file. I have a binary file with fields of different byte lengths and types (2 & 4 byte integers as well as floats) that needs to be byte swapped.
My only idea right now is to use fread and set the position of each entry (breaking at each change in byte-swapping type). While this could work, it's not very desirable for the project because of the length of each data file and the number of file identifiers that would be floating around; plus they'd need to be created dynamically depending on the amount of data written into a given multiplexed file. The project is trying to extract SEGY data from a custom multiplexing structure written by a postdoc who has probably retired by now, and no one can find the description of the multiplexing structure. I'm most familiar with MATLAB (which is why I'm trying to do it here) but would like to port the end script to C. The alternative is Unix scripting via some package. I haven't found a nice package to do something like this, so if you know of something down this route I'm all ears.
Running 2011b on OS X 10.6.
Thanks, Peter
  1 Comment
Geoff
Geoff on 21 May 2012
Yep, I would do this in C... But you could do it in MATLAB if you wanted. Just read the entire file as binary and massage accordingly. When you say "byte swapped" I'm assuming you mean changing the "endianness".
The best way to infer the structure of a binary file format is by examining it with a hex editor. There ought to be a nice free hex editor for Mac about one Google away.
If you have real example data to match with your stored data, this makes the job a lot easier.
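To make the "endianness" point concrete, here is a minimal sketch using the built-in swapbytes. The value 0x1234 is just an illustrative assumption, not something from Peter's files.

```matlab
% A 2-byte integer whose bytes are 0x12 (high) and 0x34 (low).
x = uint16(hex2dec('1234'));

% swapbytes reverses the byte order within each element.
y = swapbytes(x);

% Show both values as 4-digit hex strings.
fprintf('%s -> %s\n', dec2hex(double(x), 4), dec2hex(double(y), 4));
```

The same 16 bits are reinterpreted with the two bytes exchanged, which is what happens when a big-endian field is read as little-endian or vice versa.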


Answers (2)

Geoff
Geoff on 21 May 2012
Hey Peter, do you actually know the format of your data? I was under the impression that you didn't know what the binary structure was. Look, if that's the case then this should be simpler than you think.
Don't do multiple file operations.. Just do one. It's called a slurp.
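FID = fopen( 'mydata.dat', 'rb' );
data = fread(FID, Inf, '*uint8');  % read every byte and keep it as uint8 (plain fread(FID) returns doubles, 8x the memory)
fclose(FID);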
Now, if you have a particular structure that repeats in the file, make a mapping of how the bytes in this structure should be modified:
bytemap = [1,3,2,7,6,5,4];
Then rearrange your data bytes to align them with this structure (assuming the number of bytes is a multiple of the structure size, otherwise reshape will error) -- each column will represent one instance of the struct:
data = reshape(data, numel(bytemap), []);
Now, remap the rows based on bytemap:
data = data(bytemap,:);
And, if you like, reshape it back to a vector... or whatever you wanna do:
data = data(:); % <-- optional, really...
FID = fopen( 'remapped.dat', 'wb' );
fwrite( FID, data );
fclose(FID);
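As a sanity check of the remapping above, here is a small in-memory sketch with no file I/O. It assumes each record is one int8, one int16 and one single (7 bytes), which is the layout that bytemap = [1,3,2,7,6,5,4] implies; the variable names and values are made up for illustration.

```matlab
% Field values for two hypothetical records.
a = int8([5; -3]);
b = int16([258; -2]);
c = single([1.5; -0.25]);

% Pack each record into 7 bytes in the host's native byte order.
raw = zeros(7, 2, 'uint8');            % one column per record
for k = 1:2
    bytes = [typecast(a(k), 'uint8'), ...
             typecast(b(k), 'uint8'), ...
             typecast(c(k), 'uint8')];
    raw(:, k) = bytes(:);
end

% Apply the byte map to every record at once.
bytemap = [1, 3,2, 7,6,5,4];
swapped = raw(bytemap, :);

% Decode the remapped bytes and compare with swapbytes on the originals.
b2 = typecast(reshape(swapped(2:3, :), 1, []), 'int16');
c2 = typecast(reshape(swapped(4:7, :), 1, []), 'uint32');
okInt   = isequal(b2(:), swapbytes(b));
okFloat = isequal(c2(:), swapbytes(typecast(c, 'uint32')));
fprintf('int16 field swapped: %d, single field swapped: %d\n', okInt, okFloat);
```

Because this particular bytemap only reverses bytes within each field, applying it twice returns the original bytes, which is a cheap check that a map is a pure byte swap.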
  3 Comments
Peter
Peter on 21 May 2012
Geoff, wow! That method will really help things out, and thanks for the generous offer. That gets me really far along, but there is still some sorting out to do that I think a pseudo matched filter will work out. A data file consists of:
[multiplex header | channel 1 | channel 2 | channel 3 | channel 4 | (probably some EndOfFile)]
The format of the multiplex header is unknown, as is any EoF stamp. If a "channel" has data (not all do, and I don't know what's there as a placeholder, if anything) it's in SEGY format, where a SEGY file is:
[file header | trace1 header | data1 | trace2 header | data2 |.....]
I know the byte structure of the SEGY file and trace headers and the number of samples per data set (from some scribbled notes on the magnetic tape reel). So I'm thinking I can create a bytemap (as you named it above) of the trace headers, space two of those the number of samples apart, and run that as a matched filter, looking for fields in the trace header that shouldn't change trace-to-trace. The SEGY file header will be directly before the 1st trace header, so I can grab and decode that to find how many traces the first file has and continue down the rest of the data looking for the next SEGY file. Once all the SEGY files are identified, along with any placeholders for empty channels, I'll have to tackle the multiplex header. Hopefully that thing is all one variable type that I can just do random byte swaps on and compare with the size and locations of the SEGY files, without resorting to a hex dump.
Getting the multiplexer header is really what this game is for, so that new code can be written to handle all the remaining files where I don't know the number of samples, or anything other than that it's multiplexed SEGY data.
Thanks again for your insight Geoff!!!
Geoff
Geoff on 21 May 2012
Cool. Well, I figured your problem would not be as simple as I described, but exploiting MATLAB's matrix functions is a reasonable foundation... Like you say, you just need to throw it at the right sections of the data... And MATLAB's helpful that way too. Find your channel start/end ranges and just index those bytes directly out. Searching for recognisable patterns is perfectly valid, and you can use regexp() or strfind() for that. If you get false hits it'll be pretty obvious.
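A rough sketch of the pattern search with strfind(), assuming the bytes are already in memory as a uint8 vector. The 4-byte marker and the filler data are invented for illustration; in practice the pattern would be a trace-header field that stays constant from trace to trace.

```matlab
% Hypothetical byte stream: constant filler with a marker at two offsets.
marker = uint8([0 0 255 1]);           % assumed pattern, illustration only
data   = repmat(uint8(7), 50, 1);      % column vector, as fread returns it
data(10:13) = marker;
data(30:33) = marker;

% strfind works on text, so view the bytes as a char row vector.
% char() of a uint8 value 0..255 keeps the value as the character code.
hits = strfind(char(data(:).'), char(marker));

% Spacing between hits; a constant spacing would suggest a fixed record size.
spacing = diff(hits);
fprintf('marker found at byte offsets: %s (spacing %d)\n', mat2str(hits), spacing);
```

The offsets in hits are 1-based positions into data, so they can be used directly to index out each header or channel range.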
The other way of course is to decipher the basic structure of the rest of the file... There's only a handful of options for storing arbitrary binary data... You either store the size of a structure/chunk as you go (eg AVI files), have a hard-coded structure size (eg BMP files), or have a specific pattern that denotes the end of data and/or beginning of new data. If you get stuck on reverse engineering parts of the file format, get in touch with me. I've had a bit of experience with this kind of thing and might be able to help.



Jan
Jan on 21 May 2012
I do not understand the question. Do you want to read an IEEE-LE ordered file on an IEEE-BE machine? Then you can specify the ordering in fopen. If you want to define the ordering for a specific element only, you can use:
FREAD(FID, SIZE, PRECISION, MACHINEFORMAT)
as explained in help fread.
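A small sketch of both options, using a scratch file created with tempname so that it is self-contained. The file contents (one int16 and one single, written big-endian) are an assumption for illustration.

```matlab
% Write a tiny big-endian test file: int16 258, then single 1.5.
fname = [tempname, '.bin'];            % hypothetical scratch file
fid = fopen(fname, 'w', 'ieee-be');    % byte order set once, in fopen
fwrite(fid, 258, 'int16');
fwrite(fid, 1.5, 'single');
fclose(fid);

% Option 1: give the byte order to fopen; every fread then uses it.
fid = fopen(fname, 'r', 'ieee-be');
i16 = fread(fid, 1, 'int16');
f32 = fread(fid, 1, 'single');
fclose(fid);

% Option 2: give the byte order to an individual fread call.
% Reading the same two bytes as little-endian shows the effect.
fid = fopen(fname, 'r');
wrongOrder = fread(fid, 1, 'int16', 0, 'ieee-le');
fclose(fid);

delete(fname);
fprintf('big-endian read: %d, %g; little-endian read of first field: %d\n', ...
        i16, f32, wrongOrder);
```

The bytes 01 02 decode to 258 as big-endian and to 513 as little-endian, so a wrong machine format shows up immediately in known fields.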
  2 Comments
Peter
Peter on 21 May 2012
Jan,
The problem is that the entire file is not one byte format or 'precision' in the case of fread. I'd have to do something like:
data=fread(FID,1:1,'int8','b');
temp=fread(FID,2:3,'int16','b');
data=[data temp];
temp=fread(FID,4:7,'single','b');
data=[data temp];
etc
This wouldn't be too bad except for my file sizes and how long MATLAB takes to load data with fread: on the order of a day for a single file (yeah, I've tried this sort of approach for a simpler file structure, so I just let it run over the weekend), and there are a lot of files to attend to.
Jan
Jan on 21 May 2012
1. I'd omit the 'b' in the FREAD and add it to the FOPEN.
2. I expect that FREAD is not what limits the speed, but rather "data=[data, temp]". Letting a variable grow repeatedly is a bad idea. You can search this forum for "pre-allocation" to learn more about this.
3. In "fread(FID, 4:7)" you read a [4x5x6x7] array. Is this intended?
4. I still do not understand why swapping the bytes is required.
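One way to avoid both the growing array and the many small FREAD calls is to decode whole fields at once from bytes that are already in memory. This is only a sketch: it assumes 7-byte big-endian records laid out as int8, int16, single, and the byte values below are hand-made test data rather than anything from a real file.

```matlab
% Two hypothetical big-endian records, as fread(FID, Inf, '*uint8') would return them.
bytes = uint8([  5   1   2  63 192 0 0, ...
               253 255 254 190 128 0 0]).';
rec = reshape(bytes, 7, []);           % one column per record

% Byte rows of each field, in file (big-endian) order.
idx16 = 2:3;
idx32 = 4:7;

% typecast uses the host byte order, so reverse the rows on a little-endian host.
[~, ~, hostEndian] = computer;
if hostEndian == 'L'
    idx16 = fliplr(idx16);
    idx32 = fliplr(idx32);
end

% Each field is decoded for all records in one call; nothing grows in a loop.
f1 = typecast(rec(1, :), 'int8');
f2 = typecast(reshape(rec(idx16, :), 1, []), 'int16');
f3 = typecast(reshape(rec(idx32, :), 1, []), 'single');
```

Each output has exactly one element per record, so its size is fixed by the reshape and no pre-allocation loop is needed.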

