MATLAB Answers

extract routine data from extra large (~20GB) text file

18 views (last 30 days)
Lande GU
Lande GU on 14 Apr 2018
Edited: dpb on 25 Oct 2018
Updated on 21:00 14/April/2018; clear some confusion dpb suggested. Updated part will be highlighted in bold.
Dear friends,
I am currently doing some molecule simulations and I need to analyze the movement of each individual atom.
I got a History file from the simulation which contains irregular format like follows ( now you can see the sample HISTORY in the attached file: HISTORY Trial.txt):
for those really willing to help, a complete short HISTORY file around 3.6GB:
ika9108
0 1 18720 10001 374477446
timestep 0 18720 0 1 0.000500 0.000000
57.9307218288 0.0000000000 0.0000000000
0.0000000000 57.9307218288 0.0000000000
0.0000000000 0.0000000000 57.9307218288
C 1 12.000000 1.123285 0.000000
-15.58443309 -26.16046542 -14.42223305
O 2 16.000000 -1.041095 0.000000
-14.69649899 -26.77784011 -13.67121262
O 3 16.000000 -1.041095 0.000000
-16.81540951 -26.13672909 -14.15028633
O 4 16.000000 -1.041095 0.000000
-15.19910302 -25.79374370 -15.56280780
C 5 12.000000 1.123285 0.000000
-16.61260265 -24.19749101 3.305309244
O 6 16.000000 -1.041095 0.000000
-15.74221447 -23.85292172 4.202348062
O 7 16.000000 -1.041095 0.000000
-16.55264265 -23.67515089 2.115672369
O 8 16.000000 -1.041095 0.000000
-17.54419044 -25.03709893 3.579123068
.
.
.
Ca 11521 40.080000 2.000000 0.000000
-18.93093222 17.98682377 -15.79782631
Ca 11522 40.080000 2.000000 0.000000
19.11661464 -20.33590673 -23.44339428
.
.
.
O_spc 14401 15.999400 -0.820000 0.000000
9.099065343 28.15242293 12.96874971
H_spc 14402 1.008000 0.410000 0.000000
10.11816248 28.04847873 12.79296953
H_spc 14403 1.008000 0.410000 0.000000
8.553146110 28.43604932 12.08437008
O_spc 14404 15.999400 -0.820000 0.000000
-20.67489325 -6.313716149 18.72893163
H_spc 14405 1.008000 0.410000 0.000000
-21.01831712 -6.604184870 19.66593064
H_spc 14406 1.008000 0.410000 0.000000
-20.45237732 -5.303498198 18.81546735
.
.
.
timestep 2000 18720 0 1 0.000500 1.000000
57.9125298023 0.0000000000 0.0000000000
0.0000000000 57.9125298023 0.0000000000
0.0000000000 0.0000000000 57.9125298023
C 1 12.000000 1.123285 0.194440
-15.59133022 -26.23975304 -14.24546014
O 2 16.000000 -1.041095 0.364883
-14.92875146 -26.92787946 -13.43967731
O 3 16.000000 -1.041095 0.064554
-16.84330237 -26.13216641 -14.09355464
O 4 16.000000 -1.041095 0.356997
-15.00432807 -25.74003158 -15.26105138
.
.
.
As shown, the first 2 lines represent the name of system (ika9108) and the number of atoms (18720 in total). Which should be ignored.
At line 3, it indicated current frame in the HISTORY, which in this case (frame 1), it start from timestep 0*0.5 [fs]. And the frame 2 shown next starts from timestep 2000*0.5 [fs]. Each timestep is 0.5 femtosecond; hence, 2000 timestep is 1000 [fs], equal to 1 [ps]. The simulation will simulate every timestep, but only record (1 frame) after every 2000 timesteps.
Line 4 to 6 can be ignored as it's about the volume of the system. Later lines have shown that:
C (atom Carboon), 1 (the number of this atom), 12.000000 and so on (atomic mass, something does not care about)
-15.584 (x axis), -26.160 (y axis), -14.422(z axis)
Likewise, O stand for oxygen atom, 2 represented it is the second atom and everything else goes for the same parameter.
After finishing all 18720 atoms in current frame, the HISTORY moves to the next frame. It will repeat 18720 atoms again, but this time with new x,y,z locations. And this process will repeat thousands of times until it reach 10 nanoseconds.
Hence, I need to extract the changing of locations of one particular atom that out of 18720 atoms and throughout the whole HISTORY file (which mean through all frames); for example:
timestep 0 (frame 1), C, 1, x1, y1, z1
timestep 2000 (frame 2), C, 1, x2, y2, z2
timestep 4000 (frame 3), C, 1, x3, y3, z3
and so on (frame x)
But I have no idea how you can do this with MATLAB. As I search the internet for few commands like textscan, it only read the formatSpec only in 1 line; however, I need to identify and locate the extraction in 2 lines. How can I manage this?
Later on I need to store these seperations and perform mean square displacement calclulation on them.
Thanks for helping. Any suggestion is welcome.
  39 Comments
dpb
dpb on 16 Apr 2018
"As I took another approach to separate all O_spc atoms at the same time, which the processing speed is slow by comparing to your result here."
Which is more convenient going forward; all O_spc in one big file or being able to handle them individually? What is to be done with the position data once have it?
It appears to me the function I wrote is fast enough that you can simply call it in a loop for an input list of atom IDs of interest and still be done before a python script actually reading the whole file can get more than a good start in all likelihood.
As noted, however, there's very little to add to this function to read a list instead of just one; then again, there's little "glue" code to build a combined output file from the results one-by-one as it is, either.
ADDENDUM
I hadn't really looked at the structure inside the model much; I see there are a bunch more water molecules in there than I had guessed, still
Ospc=14401:23038; % list of special O
res=readHist(filename,Ospc(1)); % do first to find out how long is
OspcPos=zeros(length(res),4*length(Ospc)-1); % preallocate
OspcPos(:,1:4)=res; % save first
for o=Ospc(2:end) % iterate over rest
OspcPos(:,o*4-3:o*4)=readHist(filename,Ospc(o));
end
ADDENDUM 2
Well, I tried the above to see; turns out that when one begins traversing the file from front to back so many times the speed does go way down; appears that should make the modification to read sequentially the atoms wanted in a single traverse...or make a combination of some of the items Per mentioned of removing pieces of the file that are known to be of no interest. For this model length(Ospc)/N = 0.125 or 87.5% of the file isn't of interest; that would reduce 3.6G to 0.45G. I don't know for sure just how much time would be required to do the decimation, though.
I'm disappointed; was a little naive in the presumption of speed scaling roughly linearly with numbers, I guess.

Sign in to comment.

Accepted Answer

dpb
dpb on 16 Apr 2018
Edited: dpb on 18 Apr 2018
ADDENDUM Attached updated version that incorporates reading multiple atoms in single run; run-time seems essentially linear with nAtoms now.
ORIGINAL BEGINS Approached it with the same general concept as outlined before--compute where in the file the wanted values are--except instead of trying to read the whole file or very large chunks into memory and then try to pick out the wanted sections; compute the location in the file and just read the desired records directly. I wasn't sure how this would work from a time standpoint; I thought it might just be a problem moving the file pointer but turns out to be blazingly fast...
>> tic;H23040res=readHist('HISTORY.txt',23040);toc
Elapsed time is 0.110633 seconds.
>>
>> [H23040res(end-3:end,1) H23040res(end-3:end,2:4)] % pasted together for easier viewing
ans =
12240000 3.866557282000000 19.186368170000001 6.278881618000000
12242000 4.177788492000000 19.270542259999999 6.410293005000000
12244000 4.425450305000000 19.523942850000001 6.673638392000000
12246000 4.609073503000000 19.099931819999998 5.917552555000000
>>
The function for one atom given by atom number in the simulation
function [res,hdr]=readHist(file,atom)
% return position data from simulation for a given list of atoms
% optionally return the header line for ID
fid=fopen(file,'r');
l=fgets(fid); % read the header line include \n
% define some constants from the file format
L=length(l); % get record length including \n
nTimeLines=4; % number lines in each time block
nAtomLines=2; % number lines each atom output block
hdr=strtrim(l); % save the header line for ID purposes
N=readN(fid); % get the number atoms in simulation
nLinesTimeStep=N*nAtomLines+nTimeLines;
nBytesTimeStep=nLinesTimeStep*L;
d=dir(file);
nTimeSteps=(d.bytes-2*L)/nBytesTimeStep;
res=nan(nTimeSteps,4);
idx=ftell(fid); % pointer begin of time section
for i=1:nTimeSteps
res(i,1)=readTime(fid,L);
res(i,2:4)=readAtomN(fid,L,atom);
idx=idx+nBytesTimeStep;
fseek(fid,idx,-1);
end
fid=fclose(fid);
end
...
Functions are in attached full m-file; just shows the outline at higher level walking through timestep by timestep. The fseek here positions file pointer to just before the next timestep record in the file after having read the previous timestep wanted atom position data array; since each timestep is a fixed size; is simple to just move to that location relative to beginning of file rather than calculate the difference from present position (although that's just the difference between the total length and how many bytes moved plus read). It wouldn't be hard to add the facility to read a list of atoms other than just the one.
I don't disagree with Per's assessment that perhaps some of the Matlab tools for large datasets could be brought to bear, but they seemed to have a pretty steep learning curve to try to deal with the structure of the file at hand being mostly set up for regular tabular data it appeared.
Here you're fortunate that the model is written in Fortran and uses fixed-width records so that one can compute the exact byte position in the file very easily as the multiple of record length. The advantage with this approach is it really doesn't make any difference how long the files are; it appears fast enough to be no problem. This is a moderately new but very mid/low-range 8GB Win7 machine.
  4 Comments
dpb
dpb on 18 Apr 2018
OK, my bad...sorry for the rant if misplaced! :)
Unfortunately, I don't believe the Answers forum is sophisticated-enough to let more than one Answer be accepted; you can always try it and see. What I know that you can do if you were to choose to reward both is that Accept followed by an UnAccept does not take away award points...we don't really do things for the points per se of course, but for the satisfaction of helping and the fun of the challenge.
On the multi-atom case, use the second of the two functions and pass it a vector of the desired atom numbers--it is MUCHO faster than the one-element version in a loop; the latter option is, indeed, unworkable in traversing the file N times.
The two attached files while of the same name aren't the same code; to reduce the confusion factor I'll remove the single-atom one since the multi-atom case reduces to that if only pass a single atom number.

Sign in to comment.

More Answers (2)

per isakson
per isakson on 14 Apr 2018
Edited: per isakson on 20 Apr 2018
First shot. I acknowledge that this might look cryptic, but it seems to work.
>> filespec ='h:\m\cssm\HISTORY_Trial.txt';
>> M = cssm( filespec, 'C', 1 )
M =
1.0e+03 *
0 -0.0156 -0.0262 -0.0144
2.0000 -0.0156 -0.0262 -0.0142
4.0000 -0.0155 -0.0263 -0.0143
6.0000 -0.0156 -0.0261 -0.0140
8.0000 -0.0157 -0.0262 -0.0142
and
>> M = cssm( filespec, 'O', 14 )
M =
1.0e+03 *
0 0.0054 0.0239 -0.0176
2.0000 0.0054 0.0238 -0.0177
4.0000 0.0052 0.0239 -0.0175
6.0000 0.0054 0.0237 -0.0176
8.0000 0.0055 0.0237 -0.0177
where
function M = cssm( filespec, kind, id )
str = fileread( filespec );
str = strrep( str, char([13,10]), char(10) );
fmt = '(?m)^timestep\\x20+(\\d+).+?^%s\\x20+%d\\x20[^\\n]+\\n([^\\n]+)\\n';
xpr = sprintf( fmt, kind, id );
cac = regexp( str, xpr, 'tokens' );
len = length( cac );
M = nan( len, 4 );
for jj = 1 : len
M( jj, 1 ) = sscanf( cac{jj}{1}, '%f' );
M( jj, 2:4 ) = sscanf( cac{jj}{2}, '%f%f%f' );
end
end
I really don't know whether this will work with your large files. However, I'm often surprised by how fast regular expressions are and thus I think it's worth a try.
.
Profiling of cssm
I created a 30MB text file by adding copies of HISTORY_Trial.txt. (I failed to download from your google drive.) Then I run
>> profile on
>> M = cssm( filespec, 'O', 14 );
>> profile viewer
.
I use R2016a/Win7 on an old vanilla desktop.
Total elapse time is <0.9s, more than half of which is used by fileread. (The file was already in system cache.)
.
2018-14-19: Release candidate, RC1
Now I have developed a function that works with the 3.6GB test file. I use a similar approach as in First shot. The main difference is that this functions reads and parses chunks of frames. I gave up on Matlabs Big Data tools.
In short the steps are
  1. Locate the positions of all frames in the file. A frame starts with "t" in the string, "timestep", and ends with the last character before the following string, "timestep". The result is stored in the persistent variable, ix_frame_start, and used in subsequent calls with the same text file.
  2. Calculate the positions of a series of chunks of frames. The result is stored in ix_frame_chunk_start and ix_frame_chunk_end. This calculation is cheep.
  3. Loop over all chunks of frames
  4. Read one chunk with fseek and fread. This is about three times faster than using memmapfile.
  5. Loop over the list of atoms, which was given in the call.
  6. Extract the values of "timestamp" and "X,Y,Z" with regexp.
  7. Return the result in a struct array.
Comments
Compared to the function, readHist by dpb this function
  • is two order of magnitude slower to read position data for one atom
  • is more robust to variations in the text file
I don't see any potential to dramatically improve the speed of this function.
Backlog
  1. Only the regular expression is well commented
  2. Cheep test to catch anomalies in the text file
Example
>> clear all
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 44.231721 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','X','O'}, {18,19,20} ); toc
Error using scan_CaCO3nH2O (line 36)
Cannot find atom: X, id: 19, in frame 1:3
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 30.618218 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 31.088037 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {5884,5890,6107} ); toc
Elapsed time is 45.095658 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O'}, {18,19,20,5884,5890,6107} ); toc
Elapsed time is 62.709259 seconds.
>>
>> plot( S(1).position(:,1), S(1).position(:,2:4), '.' )
>> S(1)
ans =
atom: 'O'
id: {[5884]}
position: [1110x4 double]
>>
.
where in one m-file
function S = scan_CaCO3nH2O( filespec, atom_list, id_list ) %
%
%{
S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {18,19,20} );
%}
narginchk(3,3)
persistent ix_frame_start previous_filespec
if isempty( ix_frame_start ) || not( strcmpi( filespec, previous_filespec ) )
fid = fopen( filespec, 'r' );
ix_frame_start = frame_start_position( fid );
previous_filespec = filespec;
else
fid = fopen( filespec, 'r' );
end
if isscalar( atom_list ) && not( isscalar( id_list ) )
atom_list = repmat( atom_list, 1, length(id_list) );
else
assert( length( atom_list ) == length( id_list )...
, 'scan_CaCO3nH2O:LengthMismatch' ...
, 'Length of "%s" and "%s" differs' ...
, inputname(2), inputname(3) )
end
ix1 = ix_frame_start(1);
ix2 = ix_frame_start(4)-1;
[~] = fseek( fid, ix1-1, 'bof' );
str = fread( fid, [1,(ix2-ix1+1)], '*char' );
for jj = 1 : length( atom_list )
xpr = regular_expression( atom_list{jj}, id_list{jj} );
cac = regexp( str, xpr, 'tokens' );
assert( length(cac) == 3 ...
, 'scan_CaCO3nH2O:CannotFindAtom' ...
, 'Cannot find atom: %s, id: %d, in frame 1:3' ...
, atom_list{jj}, id_list{jj} )
end
target_chunk_length = 1e8;
mean_frame_length = mean( diff( ix_frame_start ) );
number_of_frames_per_chunk = floor( target_chunk_length / mean_frame_length );
ix_frame_chunk_start = ix_frame_start( 1 : number_of_frames_per_chunk : end-2 );
ix_frame_chunk_end = [ ix_frame_chunk_start(2:end)-1, eof_position(fid) ];
S = struct( 'atom' , atom_list ...
, 'id' , num2cell( id_list ) ...
, 'position' , [] );
C = cell( length(ix_frame_chunk_start), length(atom_list) );
for jj = 1 : length( ix_frame_chunk_start ) % loop over all frame chunks
ix1 = ix_frame_chunk_start(jj);
ix2 = ix_frame_chunk_end(jj);
[~] = fseek( fid, ix1-1, 'bof' );
str = fread( fid, [1,(ix2-ix1+1)], '*char' );
C(jj,:) = scan_for_atoms_in_one_chunk_of_frames( str, atom_list, id_list );
end
for kk = 1 : length(atom_list)
S(kk).position = cell2mat( C(:,kk) );
end
end
% -------------------------------------------------------------------
function ix_start = frame_start_position( fid ) %
chunk_size = 3e8; % a larger chunk size increases speed a few per cent
eof_pos = eof_position( fid );
ix2 = 0;
jj = 0;
cac = cell( 1, 200 );
while ix2 < eof_pos
jj = jj + 1;
ix1 = 1 + (jj-1)*(chunk_size-100);
ix2 = min( eof_pos, ix1+chunk_size );
[~] = fseek( fid, ix1-1, 'bof' );
str = fread( fid, [1,(ix2-ix1+1)], '*char' );
cac{jj} = strfind( str, 'timestep' ) + ix1-1;
end
buffer = cat( 2, cac{:} );
ix_start = unique( buffer ); % beware of overlap
end
% -------------------------------------------------------------------
function eof_pos = eof_position( fid ) %
[~] = fseek( fid, 0, 'eof' );
eof_pos = ftell( fid );
end
% -------------------------------------------------------------------
function C = scan_for_atoms_in_one_chunk_of_frames( str, atom_list, id_list ) %
C = cell( 1, length( atom_list ) );
for jj = 1 : length( atom_list ) % loop over all atoms
C{jj} = parse_one_chunk_of_frames( str, atom_list{jj}, id_list{jj} );
end
end
% -------------------------------------------------------------------
function M = parse_one_chunk_of_frames( str, atom, id ) %
xpr = regular_expression( atom, id );
cac = regexp( str, xpr, 'tokens' );
len = length( cac );
M = nan( len, 4 );
for jj = 1 : len
M( jj, 1 ) = sscanf( cac{jj}{1}, '%f' );
M( jj, 2:4 ) = sscanf( cac{jj}{2}, '%f%f%f' );
end
end
% -------------------------------------------------------------------
function xpr = regular_expression( atom, id ) %
fmt = [
'(?m) ' ... % ^ and $ match at line breaks
'^ ' ... % beginning of line
'timestep ' ... % literal: timestep
'\\x20+ ' ... % one or more spaces
'( ' ... % start group to catch "timestamp"
' \\d+ ' ... % one or more digits
') ' ... %
'\\x20 ' ... % one space
'.+? ' ... % one or more chatacters incl. new line, up till
'^%s ' ... % format specifier for "atom" at beginning of line
'\\x20+ ' ... % one or more spaces
'%d ' ... % format specifier for "id"
'\\x20 ' ... % one space
'(?-s) ' ... % from here on dot matches anything except newline
'.+ ' ... % anything, but new line, i.e rest of line
'\\n ' ... % new line
'( ' ... % start group to catch "X,Y,Z"
' .+ ' ... % anything, but new line, i.e rest of line
') ' ... %
'$ ' ... % end of line
];
fmt( isspace(fmt) ) = [];
xpr = sprintf( fmt, atom, id );
end
.
2018-14-20: Release candidate 2.0, RC2
It's important to know when not to use a regular expression.
RC2 is three times as fast as RC1. I achieved this improvement by
  1. replacing consecutive spaces by one space and trim trailing spaces of the 3.6GB text file. The trimmed file is half the size of the original file.
  2. replacing the regular expression by a search with the function, strfind and a simple regular expression.
.
The local function, scan_for_atoms_in_one_chunk_of_frames, now reads
% --------------------------------------------------------------
function C = scan_for_atoms_in_one_chunk_of_frames( str, atom_list, id_list )
pos = strfind( str, 'timestep' );
len = length( pos );
stp = nan( len, 1 );
for jj = 1 : len
stp(jj) = sscanf( str(pos(jj):pos(jj)+24), '%*s%f', 1 );
end
len = length( atom_list );
C = cell( 1, len );
for jj = 1 : len % loop over all atoms
M = parse_one_chunk_of_frames( str, atom_list{jj}, id_list{jj} );
M(:,1) = stp; % this will error if the "atom" isn't found in all timesteps
C{jj} = M;
end
end
% --------------------------------------------------------------
function M = parse_one_chunk_of_frames( str, atom, id )
xpr = regular_expression( atom, id );
pos = regexp( str, xpr, 'start' );
len = length( pos );
M = nan( len, 4 );
for jj = 1 : len
[ ~, remain ] = strtok( str( pos(jj) : pos(jj)+120 ), char(10) );
M( jj, 2:4 ) = sscanf( remain, '%f%f%f', 3 );
end
end
% --------------------------------------------------------------
function xpr = regular_expression( atom, id )
fmt = '\\<%s %d '; % preceded by word boundary, \<, and succeeded by space
xpr = sprintf( fmt, atom, id );
end
Profiling of RC2
In both RC1 and RC2 two function calls totally dominate the execution time. They are
  • fread and
  • scan_for_atoms_in_one_chunk_of_frames, i.e. regexp and strfind, respectively.
With RC1 the execution time of reading and parsing where equal for scanning for one "atom". With RC2 reading is nearly twice as fast because of the smaller size of the test file. With RC2 parsing is several times faster than with RC1.
These three clips are from profiling of RC2 for scanning for one "atom"
.
.
.
Example
Restart of Matlab
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 19.662046 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 11.702999 seconds.
>>
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O'}, {18,19,20,5884,5890,6107} ); toc
Elapsed time is 16.044491 seconds.
  10 Comments
Lande GU
Lande GU on 21 Apr 2018
Yes, me too. Although spending my time on revise for exams now. But after I see your approach here, I have learned a lot of new knowledge. It is very exciting to see later on practice it by myself and even to transfer that idea into Python.
Thank you for your help again.

Sign in to comment.


Sarah Palfreyman
Sarah Palfreyman on 30 Apr 2018
Edited: per isakson on 30 Apr 2018
You can also use Text Analytics Toolbox for reading out of memory data. See TabularTextDatastore
  3 Comments
dpb
dpb on 1 May 2018
Yeah, you noticed... :) I'd like to see all the machinations it would take with any of these tools and then how overhead-intensive would actually be. I'm pretty sure they'd not come close for real-world case such as this. It's pretty straightforward if everything is regular and none of the TMW examples ever show a "dirty" case, they're always trivial or nearly so.

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!