Removing unwanted lines from text file

I am trying to remove all lines that have NaN in column 7 of the attached text file and move them into a new text file.
I have written the code below:
% - Read original.
content = fileread( 'virgorm.txt' ) ;
% - Match and eliminate lines without pattern matching.
sepId = reshape( strfind( content, '|' ), 7, [] ) ;
match = content(sepId(7,:)+1) == 'NaN' ;
lines = strsplit( content, '\n' ) ;
lines(match) = [] ;
% - Export updated content.
fId = fopen( 'virgormwou.txt', 'w' ) ;
fprintf( fId, strjoin( lines, '\n' )) ;
fclose( fId ) ;
But, it doesn't seem to be working. I suspect it is because of line:
match = content(sepId(7,:)+1) == 'NaN' ;
The error I get is:
Error using reshape
Product of known dimensions, 7, not divisible into total number of elements, 6492.

Accepted Answer

Cedric on 6 Aug 2015
Edited: Cedric on 6 Aug 2015
Not far! You actually made two small mistakes. The first is that you have 7 columns, and hence 6 separators per line, so the array of separator IDs must be reshaped using 6 rows:
sepId = reshape( strfind( content, '|' ), 6, [] ) ;
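One way to avoid guessing the row count is to count the separators on the first line and check that the total divides evenly before reshaping. This is only a minimal sketch: it assumes every line has the same number of | characters, and the three-line sample below is made up to stand in for the content of virgorm.txt.

```matlab
% Hypothetical sample standing in for the file content (7 columns, 6 separators).
content = sprintf( 'a|b|c|d|e|f|1.5\na|b|c|d|e|f|NaN\na|b|c|d|e|f|2.0' ) ;
% Count the separators on the first line only.
firstLine = strtok( content, sprintf( '\n' )) ;
nSep      = sum( firstLine == '|' ) ;
% Check that the total count is a multiple of nSep before reshaping.
pos = strfind( content, '|' ) ;
assert( mod( numel( pos ), nSep ) == 0, 'Lines do not all have %d separators.', nSep ) ;
sepId = reshape( pos, nSep, [] ) ;
```

With this sample, nSep is 6 and sepId has 6 rows and 3 columns (one column per line).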
Then you cannot test whether one char/element equals 'NaN' the way you do, because a single character cannot equal a 3-character string. I would just check for the presence of 'N' after the 6th separator:
found = content(sepId(6,:)+1) == 'N' ;
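Testing only the first character assumes that no other value in column 7 starts with 'N'. If that is a concern, a slower but more explicit sketch compares the whole last field with STRCMP, line by line. The sample content is made up, and STRTRIM is there on the assumption that fields may carry stray spaces or carriage returns.

```matlab
% Hypothetical sample: 7 columns, the last one is numeric or NaN.
content = sprintf( 'a|b|c|d|e|f|1.5\na|b|c|d|e|f|NaN\na|b|c|d|e|f|2.0' ) ;
lines   = strsplit( content, '\n' ) ;
found   = false( size( lines )) ;
for k = 1 : numel( lines )
    fields = strsplit( lines{k}, '|' ) ;
    % Compare the whole last field, not just its first character.
    found(k) = strcmp( strtrim( fields{end} ), 'NaN' ) ;
end
```

Here found flags only the second line of the sample.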
Finally, and I renamed the variable match into found for that purpose (which should remind you of one of your previous questions), you can split and export to two files as follows:
lines = strsplit( content, '\n' ) ; % UPDATED: I forgot to copy this line.
fId = fopen( 'output_nan.txt', 'w' ) ;
fprintf( fId, strjoin( lines(found), '\n' )) ;
fclose( fId ) ;
fId = fopen( 'output_noNan.txt', 'w' ) ;
fprintf( fId, strjoin( lines(~found), '\n' )) ;
fclose( fId ) ;
  4 Comments
Cedric on 6 Aug 2015
Did you change the sepId(7,:) into sepId(6,:) as well?
Cedric on 6 Aug 2015
You should actually work on a small example, to get a better understanding of what we do:
>> buffer = sprintf( '1|3|2|~|7\n2|1|5|100|12\n3|2|28|~|137' )
buffer =
1|3|2|~|7
2|1|5|100|12
3|2|28|~|137
This creates a string of characters which has the same structure as your files. The \n is an escape code that creates a new line.
Now we can look for the positions/IDs of | in this string:
>> strfind( buffer, '|' )
ans =
2 4 6 8 12 14 16 20 25 27 30 32
and you can check that these positions are right if you count the new line as a single character. By the way, you can see the ASCII codes of all these characters by converting to numeric (adding 0 triggers an automatic conversion to numeric):
>> buffer + 0
ans =
49 124 51 124 50 124 126 124 55 10 50 124 49 124 53 124 49 48 48 124 49 50 10 51 124 50 124 50 56 124 126 124 49 51 55
Here, 49 is the ASCII code of '1', 51 is the ASCII code of '3', 124 is the ASCII code of '|', and 10 is the ASCII code for a new line. This shows that SPRINTF codes '\n' with 10, which is a single character.
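A quick check of this point, as a sketch: the literal '\n' typed between single quotes is two characters (a backslash and an n), and it only becomes the single character with code 10 once SPRINTF has interpreted it.

```matlab
raw  = '\n' ;              % two characters: backslash and n
nl   = sprintf( '\n' ) ;   % one character: the new line
nRaw = numel( raw ) ;      % 2
nNl  = numel( nl ) ;       % 1
code = double( nl ) ;      % 10
```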
Back to positions: once you account for new lines being single characters, you can check that the positions above are correct. Now if we want the position of the 3rd | on each line, we can compute the start and the step for extracting the relevant positions. Another way is to arrange the positions in an array with one column per line of the buffer; each line has 4 separators, which means reshaping the vector of positions with 4 rows as follows:
>> sepId = reshape( strfind( buffer, '|' ), 4, [] )
sepId =
2 12 25
4 14 27
6 16 30
8 20 32
The table comes out transposed with respect to the lines of text, but you can recognize in the first column all positions associated with line 1, in the second column all positions associated with line 2, etc. So getting the positions/IDs associated with the 3rd | means extracting row 3 of this array:
>> sepId(3,:)
ans =
6 16 30
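The "start and step" alternative mentioned earlier can be sketched as follows. With 4 separators per line, the 3rd separator of each line sits at elements 3, 7, 11, ... of the position vector, so a start of 3 and a step of 4 should give the same result as row 3 of the reshaped array. The variable names pos and third are only illustrative.

```matlab
buffer = sprintf( '1|3|2|~|7\n2|1|5|100|12\n3|2|28|~|137' ) ;
pos    = strfind( buffer, '|' ) ;
% Start at the 3rd position and step by 4 (the number of | per line).
third  = pos(3:4:end) ;
```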
Now we can get the character that follows immediately by extracting elements of buffer at these positions +1 :
>> buffer(sepId(3,:)+1)
ans =
~1~
and we can test whether these characters are '~' or not:
>> found = buffer(sepId(3,:)+1) == '~'
found =
1 0 1
Note that found is a vector of logicals (booleans: true shown as 1, and false shown as 0):
>> class( found )
ans =
logical
which means that we can create "not found":
>> ~found
ans =
0 1 0
We can use both for indexing arrays (logical indexing). If we want to index lines, we have to split buffer into lines, which we do with STRSPLIT using the new line as delimiter:
>> lines = strsplit( buffer, '\n' )
lines =
'1|3|2|~|7' '2|1|5|100|12' '3|2|28|~|137'
This is a cell array of lines/strings:
>> class( lines )
ans =
cell
and we can index its cells using a logical index (true=1 elements flag cells to extract):
>> lines(found)
ans =
'1|3|2|~|7' '3|2|28|~|137'
>> lines(~found)
ans =
'2|1|5|100|12'
Now we can export these to files, but we have to join lines with a new line character:
>> strjoin( lines(found), '\n' )
ans =
1|3|2|~|7
3|2|28|~|137
and the rest you know well: opening files for writing, writing, and closing files.
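For completeness, the export step for this small example might look like the sketch below. The file names are placeholders built with TEMPNAME so that nothing in the current folder is overwritten. The '%s' format passes the joined text through unchanged, which is a precaution in case the data ever contains % or \ characters that FPRINTF would otherwise interpret.

```matlab
buffer = sprintf( '1|3|2|~|7\n2|1|5|100|12\n3|2|28|~|137' ) ;
sepId  = reshape( strfind( buffer, '|' ), 4, [] ) ;
found  = buffer(sepId(3,:)+1) == '~' ;
lines  = strsplit( buffer, '\n' ) ;
% Placeholder output files in the temporary folder.
fileFound    = [tempname, '_found.txt'] ;
fileNotFound = [tempname, '_notfound.txt'] ;
% Lines with '~' after the 3rd separator.
fId = fopen( fileFound, 'w' ) ;
fprintf( fId, '%s', strjoin( lines(found), '\n' )) ;
fclose( fId ) ;
% All other lines.
fId = fopen( fileNotFound, 'w' ) ;
fprintf( fId, '%s', strjoin( lines(~found), '\n' )) ;
fclose( fId ) ;
```

Reading the two files back with FILEREAD is an easy way to confirm that each one holds the expected lines.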
