MATLAB Answers


How do I split a single-column .txt file by line?

Asked by EL
on 28 Aug 2019
Latest activity Commented on by Adam Danz
on 8 Oct 2019 at 16:42
Accepted Answer by dpb
Hey Guys,
How would I split a .txt file into smaller files by the number of lines? This was simple to do in Linux, but I can't seem to do it here.
An example of a file is attached (testv2.txt)
EDIT: The .txt files I'm working with are very large, and I need to split them into files of 72,000,000 lines each. I can't split the files by size because, for some reason, some files are different sizes, and the script I'm using keeps time using the number of lines.
Thanks for the help guys!

  4 Comments

Absolutely. I have 1.4 billion lines of data, and I need to split them into manageable pieces with a precise number of lines so I can perform good statistics. Ideally, I'd like to split the .txt into new .txt files. So I'd have the original, unadulterated file (backup data), and new .txt files that are 72,000,000-line sections of the original data. I'm not too worried about the empty first column.
What version of MATLAB are you using?
2018a



2 Answers

Answer by dpb
on 28 Aug 2019
 Accepted Answer

Again, I'd suggest there's no need to actually create multiple text files to do this...several options exist in MATLAB; the simplest is probably to just process the file in chunks of whatever size you wish and calculate statistics or do whatever on each section...something like
fid = fopen('yourfile.txt','r');
NperSet = 72E6;  % number of elements to read per section
ix = 0;          % group index counter
while ~feof(fid)                                % go through the file until out of data
    ix = ix+1;                                  % increment counter
    data = cell2mat(textscan(fid,'%f',NperSet)); % read a chunk; leading tabs are skipped as whitespace
    stats(ix,:) = [mean(data) std(data)];       % compute and save the stats of interest
    % ...do whatever else is needed with this chunk here
end
fclose(fid);
You'll want to preallocate the stats array to some reasonable approximation of the expected size and check for overflow, but that's the basic idea...simpler than creating, and then having to traverse through, a bunch of files just to process the data in sequence.
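A minimal sketch of that preallocation, estimating the number of chunks from the file size (the bytes-per-line figure is an assumption you'd measure on a sample of your own data):
info = dir('yourfile.txt');                       % file size in bytes
bytesPerLine = 10;                                % assumed average line length; measure on your data
nSetsEst = ceil(info.bytes/(bytesPerLine*NperSet));
stats = nan(nSetsEst,2);                          % one row per chunk: [mean std]
% ...run the while-loop above, then trim any unused rows:
stats = stats(1:ix,:);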
The alternative is to use tall arrays or memmapfile or other features TMW has provided for large datasets. See the "Large Files and Big Data" section of the documentation.
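For reference, a minimal tall-array sketch, assuming the file parses as a single numeric column (Var1 is the default name a headerless datastore assigns; adjust to whatever yours reports):
ds = tabularTextDatastore('yourfile.txt','ReadVariableNames',false);
t  = tall(ds);                 % tall table; nothing is read yet
mu = mean(t.Var1);             % deferred computation
sd = std(t.Var1);
[mu,sd] = gather(mu,sd);       % triggers the actual chunked read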

  29 Comments

dpb
on 31 Aug 2019
""My intention is make this script automatically chop up 27 hours of data in 1 hour bins, and have those files loaded in order in the same way they're loaded here. ""
"That's exactly what my answer does if you wish to go down the route."
Or what mine does, just without actually making files, using the data from the full file one piece at a time...which is the same result without the intermediate step.
I had made a start towards the factorization, but life has intervened and prevents me from investing more time right now...I'll attach the beginnings of converting the script to functions, although I had just gotten to the point of considering the main calculations, so there is nothing there to report...
There is no optimization or reduction of superfluous intermediaries in the above as yet--strictly a factoring out of the initial portions into callable functions for an eventual script.
My vision/intent was to remove the reliance upon splitting files: have the user specify the actual experiment file(s) to be analyzed, and then process those piecewise by whatever amount of memory is available to read/hold the data at one time. Understanding the sequence of which files, and how those files were built, was the point behind the last question about what that initially-read list of files actually represents.
If one could also manage to reduce a bunch of the machinations on doubly-dimensioned cell arrays and such along the way, that would be gravy in reducing both memory overhead and runtime.
Yeah I (still) agree that there's no need to store the segmented data in text files and that dpb's approach is the better one.
dpb
on 31 Aug 2019
On the comment about hidden and accepted bugs -- just for the record, I did err in my earlier post regarding the comparison/subtraction of polynomial coefficients from observations; the code at that point does indeed correctly detrend the data for the x values selected.
I was, however, still trying to determine just why the x values are being selected as they are for the independent variable in the plots...it probably is OK if they have used this successfully for so long, but it still seems a peculiar way to have coded it if it is just piecing the time series back together/building a time vector from a fixed sample rate; I hadn't yet got my head around why it was done the way it is.
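For context, the standard detrend-by-polynomial pattern under discussion looks roughly like this (a sketch with made-up data; the polynomial order and the x selection depend on the actual script):
x = (0:999).'/100;               % example time vector
y = 0.5*x + randn(size(x));      % example data with a linear trend
p = polyfit(x,y,1);              % fit a first-order polynomial to the trend
yDetrended = y - polyval(p,x);   % subtract the evaluated trend, not the raw coefficients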



Answer by Adam Danz
on 28 Aug 2019
Edited by Adam Danz
on 29 Aug 2019

This solution is quite fast. It uses fgetl() to read in blocks of a text file and saves each block to a new text file. You can set the number of rows per block and other parameters at the top of the code. See the comments within the code for more detail.
% Set the max number of lines per file. The last file may have fewer rows.
nLinesPerFile = 10000;
% Set the path where the files should be saved
newFilePath = 'C:\Users\name\Documents\MATLAB\datafolder';
% Set the base filename of each new file. They will be appended with a file number.
% For example, 'data' will become 'data_1.txt', 'data_2.txt' etc.
newFileName = 'data';
% Set the file that will be read (better to include the full path)
basefile = 'testv2.txt';
% Open file for reading
fid = fopen(basefile);
fnum = 0;      % file number
done = false;  % flag that ends the while-loop
while ~done
    % Read in the next block; this assumes the data starts
    % at row 1 of the txt file. If that is not the case,
    % adapt this so that the header rows are skipped.
    tempVec = nan(nLinesPerFile,1);
    for i = 1:nLinesPerFile
        nextline = fgetl(fid);
        if ~ischar(nextline)            % fgetl returns -1 at end of file
            done = true;
            tempVec = tempVec(1:i-1);   % keep only the lines actually read
            break
        end
        tempVec(i) = str2double(nextline);
    end
    % Write the block to a new text file.
    if ~isempty(tempVec)
        fnum = fnum + 1;
        tempFilename = sprintf('%s_%d.txt',newFileName,fnum); % file number appended
        tempFile = fullfile(newFilePath,tempFilename);
        fid0 = fopen(tempFile,'wt');
        fprintf(fid0,'%.6f\n',tempVec);
        fclose(fid0);
        % (optional) display link to folder
        disp(['<a href="matlab: winopen(''',newFilePath,''') ">',tempFilename,'</a>',' saved.'])
    end
end
fclose(fid);
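A quick follow-up on consuming the output: since the files are numbered data_1.txt, data_2.txt, etc., they can be read back in the order they were written using the same sprintf pattern (a sketch, assuming the path and base name above):
for k = 1:fnum
    f = fullfile(newFilePath,sprintf('%s_%d.txt',newFileName,k));
    block = str2double(splitlines(strtrim(fileread(f))));  % one column of doubles
    % process block k here, in its original order
end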

  2 Comments

Dear Adam,
I have .txt files with 8 or 14 columns and thousands of continuous rows. I used this code to split a file into separate blocks of 2500 rows, and it ran to completion, but the output files were created with NaN in only one column (NaN written 2500 times). I have a comma-separated file.
Any suggestions please?
Adam Danz
on 8 Oct 2019 at 16:42
Hi Hamad,
I would use the debug feature.
Put a break point at the top of your code and step through each line, looking at the outputs. If tempVec ends up as a vector of NaNs, maybe those values are never being filled?
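A likely culprit here is that str2double is being handed an entire comma-separated line, which yields NaN. A hedged adaptation for multi-column, comma-separated rows (a sketch; nCols and the sample line are assumptions to match your file):
nCols = 8;                                       % assumption: number of columns in your file
nextline = '1.5,2.5,3.5,4.5,5.5,6.5,7.5,8.5';    % example comma-separated row from fgetl
rowVals = str2double(strsplit(nextline,','));    % 1-by-nCols numeric row
fmt = [repmat('%.6f,',1,nCols-1) '%.6f\n'];      % matching comma-separated output format
fprintf(1,fmt,rowVals);                          % in the loop, write to fid0 instead of 1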
