Deriving specific rows from a large text files

Hello! I am very new at MATLAB and am still learning. I have a 66 million row dataset in the form of a text file. I need to pull certain rows from the list to create a new list. Specifically, the data presents as:
Time,Pressure,Sea pressure,Depth
2018-08-25 16:10:26.000,10.2011833,0.0686833,0.0681232
The data isn't in separate columns, it's comma separated (except date and time, which are separated by a space). The new list needs to contain anything with a time stamp that ends in milliseconds of:
.000
.125
.375
.500
.625
.750
.875
I am not sure what the best way would be to address this. Any help would be greatly appreciated!

14 Comments

The best way to approach it would be to upload a sample of your textfile and let someone figure it out for you!
Perhaps you should create a timetable object and use retime()
Hi Jonas, Thank you for the advice, I am trying to learn right now how to extract just the first few rows to create upload since the file itself is too large.
Open it in the editor and select a few lines and paste them in another file...
Hi Walter, thank you for your advice, I am looking up what time timetable object and retime() is and how to use it. Thank you for the hint.
Are the asterisks actually in the file or is that a fignewton of editing the data done in posting?
Hi dpb, thank you for the suggestion, when I try to select a few lines via editor, I get the error of not having enough memory. I have been able to successfully retrieve lines by using:
firstline =fgetl(fid)
The asterisks in the original post is from having bold text, I'm still learning how to present comments on here, sorry about that.
TT1 = table2timetable(readtable('YourFilenameGoesHere.txt'));
TT2 = retime(TT1,'regular', 'TimeStep', seconds(1/8));
dpb
dpb on 28 Oct 2018
Edited: dpb on 28 Oct 2018
That's good thought to use the 1/8th second for the time vector, Walter. That's newer syntax than I have, however...only way w/ R2017b would be to generate the full-length newtime vector explicitly, hence the selection process.
I wasn't aware of the seconds(1/8) syntax, either.
I have attached a sample of the data! Thanks!
I am receiving the following error, could this because I am using MATLAB R2015a?
Warning: Variable names were modified to make them valid MATLAB identifiers.
Undefined function or variable 'table2timetable'.
That's quite common, it just means that the variable headers are not valid identifiers, i.e. they cannot be used as variable names. For example "Sea pressure" is not valid.
Thanks Jonas, I have tried adjusting the code to this:
>> TT1 = readtable('sample1.txt');
TT2 = etime(TT1,'regular', 'TimeStep', seconds(1/8));
Error using etime
Too many input arguments.
The function Walter used is retime not etime.

Sign in to comment.

 Accepted Answer

I would recommend updating your MATLAB to the latest version, since R2015a does not support many useful functions for your task, such as retime. If you have to do your task in old MATLAB for some reasons, the following would be one possible solution.
T = readtable('sample1.txt',...
'ReadVariableNames', false,...
'HeaderLines', 1);
T.Properties.VariableNames = {'Time','Pressure','SeaPressure','Depth'};
c = regexp(T.Time,'(000|125|250|375|500|625|750|875)$','match');
idx = cellfun(@isempty,c);
T = T(~idx,:);

1 Comment

Thank you Akira! This worked perfectly. I am taking your suggestion and will be updating our MATLAB. Thank you again for your help!

Sign in to comment.

More Answers (1)

Read as table or timetable; deal with the asterisks as needs must depending on whether they're real or not...
Two alternatives, retime to a specific vector means making a new time vector; the alternative would be to find the locations matching the desired times..
t=readtable('yourfile');
ix=ismember((t.Time-fix(t.Time))*1000,[0:7]*125)); % locate millisec multiples of 125
t=t(ix,:); % save those into the desired table

15 Comments

I would probably add a round() around the millisecond calculation.
dpb
dpb on 28 Oct 2018
Edited: dpb on 28 Oct 2018
If it weren't being read from a file that appears to have three significant fractional digits I'd've done so, Walter. Looks like (at least from the sample record posted) that there are only three digits in the time stamp fractional seconds so the output routine will have already done the rounding for us here...
Hi Walter and dpb, thank you both for helping me. I am going to try and post a sample of the data and then I am going to try both of your suggestions so I can learn.
Because datetimes are not internally represented in decimal, input in decimal to 3 digits might not match to representation, so better to round() to be sure.
I am receiving the following warning and error:
Warning: Variable names were modified to make them valid MATLAB identifiers.
Undefined function 'fix' for input arguments of type 'cell'.
I am now getting this error, is this because it's all in one single column:
t=readtable('sample1.txt');
ix=ismember((t.Time(t.Time))*1000,[0:7]*125); % locate millisec multiples of 125
t=t(ix,:); % save those into the desired table
Warning: Variable names were modified to make them valid MATLAB identifiers.
Unrecognized row name '2018-08-25 16:10:26.000'.
Oh. I presumed you had a header line like the one shown in the original question.
If there aren't, ML will use Var1, Var2, etc., ...
After readtable use\
t.Properties.VariableNames
to see what actually were imported as.
You can also do things like
t(1:4,:)
to just see the first lines of the table...
Poke around... :)
Hi dpb, the first line does have the headings (see .txt file in original post). I am going to try the tips you mentioned tomorrow. I really appreciate your help on this.
Use the optional 'ReadVariableNames' parameter if you need to...read the documentation; there's a lot of information there plus examples to guide...
"Because datetimes are not internally represented in decimal, ..."
But N*(1/8) is exactly representable and since that's the value that was wanted I used the fact... :)
In absolute general terms, there may be some value that isn't but I'd guess the rounding in the i/o routines from the internal datetime conversion process will be the same rounding so won't actually be an issue. But it couldn't hurt, granted, to use both the belt and the suspenders. :)
Hi dpb, Thanks for your help on this again. I will be updating my MATLAB and I think once that happens, the code you developed will work. As Akira mentioned, these codes aren't working because my MATLAB is outdated.
"But N*(1/8) is exactly representable and since that's the value that was wanted I used the fact... :)"
However, you are assuming that the f[r]action of a seconds are stored as a separate floating point field, which is not the case. The time is represented as a "double-double stored as a complex" counting seconds since some epoch.
Not exactly. I was just counting on the end decimal fraction of 1/2 being stored as exact in whatever form it is internally(*); how TMW had chosen to do it internally, specifically, I didn't know. That the input was coming from a fixed format being converted rather than some other possibly subject to other rounding floating point calculation made me believe it would have all the rounding it needed from the rtl i/o conversion routines.
"Assumptions"? Yes, but I think very reasonable ones in the case.
As noted, "belt and suspenders" can't hurt even if it doesn't make any difference in a given case.
(*) And I was comfortable there is more than sufficient precision allocated in that the documentation guarantees precision to six digits (last one rounded) beyond the milliseconds needed:
--S, SS,...,SSSSSSSSS Fractional second, using the number of digits specified
by the number of instances of 'S' (up to 9 digits).
in the format descriptors.
t = readtable('sample1.txt'); %up to R2018b still cannot autodetect times with fractions of a second, even if detectImportOptions is used
tt = datetime(t.Time, 'inputformat', 'yyyy-MM-dd HH:mm:ss.SSS');
s = mod(second(tt),1) * 1000;
mask = ismember(s, 0:125:875);
selected = t(mask,:);
However, with that particular data, this can be shorted to
t = readtable('sample1.txt');
selected = t(1:2:end, :);
Trudat if is complete dataset...

Sign in to comment.

Categories

Asked:

on 28 Oct 2018

Commented:

dpb
on 4 Nov 2018

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!