Remove elements appearing sequentially in a larger text.

Hello.
I just started working as an engineer and was recently given a tedious editing task. I figure this can be done in MATLAB, but the brief class I took during my studies left me with only the most basic skills (if that).
The reason I believe this should be fairly easy is that the data are arranged sequentially, with each sequence being about 2 pages long and identical in form. No rows are to be partially edited, so this would be the simplified case, with the data on the left and the edited output on the right:
data     output
1 a      1 a
2 b      2 c
3 c      3 a
4 a      4 c
5 b
6 c
[...]
This pattern is repeating itself a couple of hundred times, so some kinda loop has to be implemented if this is going to be quicker than just cut and paste.
There are both numbers, characters and tables.
Thanks,
Tord
Tord on 10 Jun 2014
This being classified data and me being a new employee, no. But I can illustrate for you:
[start] Analysis nr: 1234
Name: Example
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Circle index: 1111
[end] - then repeat hundreds of times.
And in every one of them I want to remove, e.g., "Name: Example" and "Circle index: 1111" (random selection).
With dpb's answer I guess it would read: nr=length(file); ix=unique([1:2:nr 1:6:nr]); file(ix)=[];
I suddenly started to wonder if this is all I need. I will try.
Cedric on 10 Jun 2014
Edited: Cedric on 10 Jun 2014
If all blocks have the same length, the same number of lines, etc., you can remove lines periodically with a fixed period. If blocks can vary a bit in length, you either have to analyze the file line by line and make a decision for each line, or perform pattern matching and replacement.
For pattern matching, see my answer.
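The line-by-line option mentioned above could look roughly like the following. This is only a sketch: the sample lines and the two prefixes are assumptions taken from the illustration in the comment above, not from the real file.

```matlab
% Hypothetical sample lines standing in for the real (classified) file.
lines = { ...
    'Analysis nr: 1234'; ...
    'Name: Example'; ...
    'Center of circle is blue. Radius of circle = 10'; ...
    'Circle index: 1111'; ...
    'Analysis nr: 1235'; ...
    'Name: Other'; ...
    'Circle index: 2222'};

% Prefixes that mark a line for removal (assumed from the illustration).
prefixes = {'Name:', 'Circle index:'};

drop = false(size(lines));
for k = 1:numel(prefixes)
    p = prefixes{k};
    % strncmp compares only the first numel(p) characters of each line.
    drop = drop | strncmp(lines, p, numel(p));
end

kept = lines(~drop);   % lines that survive the filter
disp(kept)
```

Because the decision is made per line, it does not matter if one block is a line longer or shorter than the others.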


Accepted Answer

dpb on 10 Jun 2014
Edited: dpb on 10 Jun 2014
...Point is that I want to remove i.e every third and seventh row in these datasets...
If the data are regular in line location(s) (that is, don't have to search for a pattern to locate sections), then it's pretty simple --
A) read the file into a cell array of character data--
file = textread('yourfile', '%s', 'delimiter', '\n', 'whitespace', '');
B) delete the lines not wanted...I'm not sure of the precise definition of "every third and seventh row", but assuming it's the joint combination of [1:2:end] and [1:6:end] then
nr=length(file); % number of rows in file
ix=unique([[1:2:nr] [1:6:nr]]); % selected rows to delete
file(ix)=[]; % remove the rows unwanted
C) rewrite to a file -- NB: either create a backup first or be sure to create a new copy on writing while debugging!!!
You can do it in a single step if you can define a rule for any arbitrary set of lines to be deleted that are fixed in relationship to the beginning of the file, no matter how complex that rule might be.
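Steps A) to C) can be put together in one short script. The following is a minimal sketch under stated assumptions: the file names and the 5-line block content are made up, and fileread plus regexp stand in for textread (which the MathWorks documentation lists as not recommended) so that the sketch is self-contained.

```matlab
% Hypothetical file names; tempname keeps the sketch away from real data.
inName  = [tempname '.txt'];
outName = [tempname '.txt'];

% Build a small input file: 2 blocks of 5 lines each (made-up content).
fid = fopen(inName, 'w');
for b = 1:2
    fprintf(fid, 'Analysis nr: %d\nName: Example\nCenter\nCurve\nCircle index: 1111\n', b);
end
fclose(fid);

% A) read the file into a cell array, one line per cell
txt  = fileread(inName);
file = regexp(txt, '\r?\n', 'split');
file = file(:);
if isempty(file{end}), file(end) = []; end   % empty piece after the last newline

% B) delete rows 2 and 5 of every 5-line block
nr = numel(file);
ix = sort([2:5:nr, 5:5:nr]);
file(ix) = [];

% C) write to a NEW file so the original stays untouched
fid = fopen(outName, 'w');
fprintf(fid, '%s\n', file{:});
fclose(fid);

disp(fileread(outName))
```

Writing to a separate output name is the point of step C): the input file is never opened for writing, so a wrong index vector costs nothing but a rerun.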
ADDENDUM
Following your example of a file, I made a local file of some number of repetitions of same...
>> file=textread('file.txt', '%s', 'delimiter', '\n', 'whitespace','')
file =
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
>> nr=length(file);
>> ix=sort([[2:5:nr] [5:5:nr]]); % no unique; this pattern has no overlap
>> file(ix)=[];
>> fid=fopen('file1.txt','w');
>> for i=1:length(file),fprintf(fid,'%s\n',file{i});end
>> fid=fclose(fid);
>> type file1.txt
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
>>
Voila! Joy ensues... :)
dpb on 10 Jun 2014
There's nothing about the overlap that's a problem as long as the indices aren't duplicated to erroneously remove a row at the wrong location. The invocation of unique was mostly a nicety to remove duplicates and, as a corollary, to sort the index array, which should help runtime. The net result is the same either way; it just looks cleaner with it than without.
If you do need the pattern-matching solution, Cedric's the undoubted wizard on regular expressions while I'm a feeb there...but for the "deadahead" case that you seem to have, this is by far the quicker.
If it does solve the problem, please go ahead and Accept the answer so we know to close the issue.
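The point about unique can be checked on a tiny example. This is a sketch with made-up rows; it only shows that a single deletion with repeated indices gives the same result as the de-duplicated, sorted index.

```matlab
file = {'r1'; 'r2'; 'r3'; 'r4'; 'r5'; 'r6'; 'r7'};   % made-up rows
nr = numel(file);

raw = [1:2:nr, 1:6:nr];   % [1 3 5 7 1 7], rows 1 and 7 appear twice
ix  = unique(raw);        % [1 3 5 7], sorted and de-duplicated

a = file;  a(raw) = [];   % one deletion, index contains duplicates
b = file;  b(ix)  = [];   % one deletion, cleaned index

disp(isequal(a, b))       % both leave r2, r4, r6
```

What does change the result is deleting in two separate passes, because the row numbers shift after the first pass; building one combined index vector and deleting once avoids that.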
Tord on 11 Jun 2014
Yes, overlap could become an issue, because I noticed at least one line that did not exactly match the template I used (a small difference in construction).
I just tried, and failed, with the unique function implemented. I will try some more and then accept the answer regardless of the outcome; I understand that this is below what you guys want to spend your time on.
Once again, thanks.


More Answers (2)

Cedric on 10 Jun 2014
Here is an example using pattern matching and replacement:
% - Get and modify content.
fName = 'tord_1.txt' ;
content = fileread( fName ) ;
content = regexprep( content, 'Name:[^\n]*\n', '' ) ;
content = regexprep( content, 'Circle index:[^\n]*\n', '' ) ;
% - Output modified version to file.
[fPath, fBase, fExt] = fileparts( fName ) ;
fId = fopen( fullfile( fPath, [fBase, '_modified', fExt] ), 'w' ) ;
fwrite( fId, content ) ;
fclose( fId ) ;
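One thing to watch with the patterns above is that they are not tied to the start of a line, so 'Name:' would also match in the middle of a line. A possible refinement, shown here as a sketch on a made-up string rather than on the real file, is to anchor the pattern with ^ and the 'lineanchors' option of regexprep.

```matlab
% Made-up content; sprintf turns \n into real newline characters.
content = sprintf(['Analysis nr: 1234\nName: Example\n' ...
                   'Old Name: kept\nCircle index: 1111\n']);

% Unanchored, as in the answer: also bites into the line "Old Name: kept".
loose = regexprep(content, 'Name:[^\n]*\n', '');

% Anchored: with 'lineanchors', ^ matches at the start of every line.
strict = regexprep(content, '^Name:[^\n]*\n', '', 'lineanchors');
strict = regexprep(strict, '^Circle index:[^\n]*\n', '', 'lineanchors');

disp(loose)
disp(strict)
```

In this example the unanchored version leaves a damaged line "Old Circle index: 1111", while the anchored version keeps "Old Name: kept" intact and removes only the two intended lines. Whether the real data contain such lines is not known from the thread, so this is a precaution, not a required change.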
Cedric on 12 Jun 2014
Hey Tord, don't be sorry, we most likely would have done the same thing (assuming regularity until we observe a shift), and that is how we learn after all ;-)
I'll go on with pattern matching by email.
Tord on 16 Jun 2014
You both being so understanding means a great deal to me, thank you both.
As I told Cedric by mail, I was not connected during the weekend, and now I have to prioritize other tasks at work. I sat up the whole night before Friday editing the text by hand.
But this task needs to be done every now and then, so I will continue working on this so that it is ready by then.
I will look into this later today and keep you all posted on the progress - both what is being done and to what degree I actually understand it.

dpb on 11 Jun 2014
Edited: dpb on 11 Jun 2014
OK, try this...this is a "deadahead" looping solution to build the vector from the information provided --it can be made to look "more Matlaby" but this I could do before my meeting...
Starting with your block definitions and the overall length of the repetitive section...
>> ix=[1 49; 77 85; 106 114; 141 147] % the sections to remove
ix =
1 49
77 85
106 114
141 147
>> N=170; % the overall block length
>> L=42000;
Following is a sanity check to compare lengths to your given ...
>> ceil(L/N)
ans =
248
>> 248*N
ans =
42160
>> L=ans; % sanity check I did on overall lengths
The above look right I presume???
Anyway, back to the building of an overall deletion index...
>> ig=[];for i=1:size(ix,1),ig=[ig; [ix(i,1):ix(i,2)].'];end % One block
Then build the whole thing from repeating the above for the number of blocks in a file
>> ix=ig; % initialize to the first group
>> for i=1:L/N-1 % loop count from 2:L/N
ix=[ix; (i*170)+ig]; % 1:L/N-1 instead of (i-1) as multiplier
end % add the group plus offset and concatenate
Now use ix as the index vector to delete those lines as shown previously. Again, be sure to have a backup while you double-check your counts, etc., before you overwrite the raw data files!!! :)
Another sanity check...
>> L-ix(end)
ans =
23
>> 170-147
ans =
23
Lookin' good... :)
I gotta' run...good luck!
ADDENDUM:
L as above should match length(file), btw as the verification of the counting...
ADDENDUM 2:
Just as a sidepoint, the multiplications can be done away with, also...
for i=2:L/N % loop count from 2:L/N
ig=ig+170; % add the offset
ix=[ix; ig];
end
To make the script simpler to adapt to other files, move the 170 constant also to a variable that you can set at the top--then you change only those constants that define the file structure and you're done for any other similarly-constructed files.
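A parameterized version of the loop might look like this. It is a sketch only: the variable name ranges and the small values of N and L are made-up stand-ins so the result can be checked by eye, and L is assumed to be a whole multiple of N.

```matlab
% File-structure constants (small, made-up stand-ins; set once at the top).
N      = 10;            % lines per block
ranges = [1 3; 7 8];    % [first last] rows to delete within one block
L      = 30;            % total lines; assumed to be a whole multiple of N

assert(mod(L, N) == 0, 'L must be a whole number of blocks');

% Deletion rows for one block.
ig = [];
for i = 1:size(ranges, 1)
    ig = [ig; (ranges(i,1):ranges(i,2)).'];
end

% Repeat for every block by adding the offset N each time.
ix = ig;
for i = 2:L/N
    ig = ig + N;
    ix = [ix; ig];
end

disp(ix.')
```

The same sanity check as before applies: L - ix(end) should equal N minus the last row of the last range, which is 2 for these values.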
And to look ahead a little, next you'll be looking for the answer at the FAQ --
:)
ADDENDUM 3 and (hopefully) final:
Not to be outdone by Cedric ( :) ), the vectorized solution for building the deletion index array --
Given the above ix array of unwanted lines and the block size N and file length L--
ig=cell2mat(arrayfun(@colon,ix(:,1),ix(:,2),'uniformoutput',false).').';
ix= bsxfun(@plus,N*[0:L/N-1],repmat(ig,1,L/N)); ix=ix(:);
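To gain some confidence that the vectorized form builds the same index as the loop, the two can be compared on small numbers. This is a sketch with made-up values (N = 10, L = 30 and the name ix0 are assumptions for illustration); the repmat is left out here because bsxfun expands the column against the row of offsets on its own.

```matlab
N = 10; L = 30; ix0 = [1 3; 7 8];   % small made-up stand-ins

% Loop version: rows for one block, then one copy per block with its offset.
ig = [];
for i = 1:size(ix0, 1), ig = [ig; (ix0(i,1):ix0(i,2)).']; end
loopIx = [];
for b = 0:L/N-1, loopIx = [loopIx; ig + b*N]; end

% Vectorized version, split into steps for readability.
c     = arrayfun(@colon, ix0(:,1), ix0(:,2), 'UniformOutput', false);
igV   = cell2mat(c.').';                  % column of rows for one block
vecIx = bsxfun(@plus, N*(0:L/N-1), igV);  % one column per block
vecIx = vecIx(:);                         % stack the columns block by block

disp(isequal(loopIx, vecIx))
```

Since (:) stacks the matrix column by column and each column is one block, the result comes out in ascending row order without an extra sort.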
Image Analyst on 14 Jun 2014
I'd be interested to know if you've found a use for MATLAB on your farm. For example to control a weather station or see if the animals are back in the barn yet or something. Maybe interfaced an arduino....
dpb on 14 Jun 2014
I haven't to date, other than somewhat superficially, although I had some ideas for it when TMW generously comp'ed the upgraded version; I've not actually done anything along those lines.
There's an opportunity there, I think, for even more integration of the various data sources in the future. The biggest difference between when I left for college and the off-farm career in the mid-60s and when I returned, besides the increased size of a typical operation (which is simply scaling), is the amazing use of technology in everything from GPS auto-steer and tracking to yield monitors and planters that can place an individual seed to within 1/8" for precise planting rates, as well as control side dressings and fertilizers/pesticides/herbicides at a rate that is also tied to soil conditions and other topographical features of the field. I've just not taken the time to do it outside the features available in the vendor-supplied software/firmware interfaces.
