How to solve issue with strncmp returning incorrect logical values for text comparison...

Question

0 votes

textfile.txt

I need to scan through a text file line by line and pull out the numerical values that follow given line beginnings (e.g. save subjectnumber = 2 for the line 'Subject number: 2').

I am currently attempting to do this by opening the file with fopen, then using fgetl to work through the file one line at a time. The relevant number of characters at the beginning of each line is compared, using the strncmp function, with saved strings which act as 'keys'. If the first 'n' characters of the line match the key, the script would then save the numerical value as a variable in the workspace, to later incorporate into the final data structure.

However the strncmp function does not seem to be working correctly, and I cannot figure out why. Regardless of whether I compare Char array to Char array, or convert to Strings before comparison, the function returns a logical '0' even when the key matches the line. I can copy and paste the retrieved line from the document to the command window, use this to test the strncmp function against the key variable saved in the workspace, and get a logical '1' true result. However in the script itself, the function always returns logical '0'.

Has anybody encountered a similar issue before?

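fileID = fopen('textfile.txt');
subkey = " S u b j e c t :";
rawline = fgetl(fileID);  % first line as char vector, or -1 at end of file
while ischar(rawline)
     tline = string(rawline)  % convert current line to string
     % strlength gives the character count (length of a string scalar is 1)
     submatch = strncmp(tline, subkey, strlength(subkey))  % check match for subject key PROBLEM LINE
     if submatch
          % code to save numerical variable
     end
     rawline = fgetl(fileID);  % get next line
end
fclose(fileID);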

The printed output of fgetl for the line containing the desired information, and of the subsequent strncmp check, is:

tline =

    " S u b j e c t :   1 "    % i.e. identical to the specified key 'subkey' over the first 16 characters

submatch =

logical
   0

2 Comments

Stephen23 on 13 Jun 2018

Edited: Stephen23 on 13 Jun 2018

@nicholas oscar davy: please upload a sample file by clicking the paperclip button.

Without your data file we cannot test your code or investigate what is happening.

nicholas oscar davy on 13 Jun 2018

Thanks Stephen,

I've attached a truncated version of the file I need to process.

Answer 1

Stephen23 on 13 Jun 2018

Edited: Stephen23 on 13 Jun 2018

1 vote

Explanation: The problem is caused by the file encoding, which is little-endian UCS-2, a two-byte character encoding. So what you see as two separate characters (a letter followed by a space) is actually one single two-byte character inside the file. Combining string into the mix just confuses things even more, but does not change this fundamental issue with reading the file.
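
A minimal sketch of the effect described above, assuming the text was written as UTF-16LE. The sample text 'Subject: 1' and the variable names are made up for illustration and are not taken from the attached file:

```matlab
% Encode a short sample text as UTF-16LE bytes (two bytes per character here).
txt   = 'Subject: 1';                        % hypothetical sample line
bytes = unicode2native(txt, 'UTF-16LE');     % uint8 row vector

% Every letter byte is followed by a zero byte.
disp(double(bytes(1:6)))                     % values: 83 0 117 0 98 0

% Treating each byte as its own character gives the "spaced out" text.
asRead = char(bytes);                        % letters interleaved with char(0)

% A key typed with real spaces (ASCII 32) does not match char(0).
keySpaces = 'S u b j e c t :';
keyNulls  = char(unicode2native('Subject:', 'UTF-16LE'));
matchSpaces = strncmp(asRead, keySpaces, numel(keySpaces));   % false
matchNulls  = strncmp(asRead, keyNulls,  numel(keyNulls));    % true
```

The two keys look alike when displayed, which is why a copy-and-paste test in the command window can succeed while the comparison inside the script fails.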

Your key string with space characters does not match because what you see as space characters (i.e. ASCII 32, as used in subkey) are not really spaces at all in the imported data: they are interpreted as NULL characters (ASCII 0). Of course they are not really characters at all, just the trailing byte of a two-byte character. For example the first line of the file apparently contains this (note all the NULL "characters"):

>> +tline
ans =
  Columns 1 through 21
   729   355    42     0    42     0    42     0    32     0    72     0   101     0    97     0   100     0   101     0   114
  Columns 22 through 42
     0    32     0    83     0   116     0    97     0   114     0   116     0    32     0    42     0    42     0    42     0

Also note that the first two values are quite large: they come from the byte order mark at the start of the file, which tells us the byte order and implies something about the file encoding.
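
One possible way to read such a file, sketched under the assumption that the whole file is UTF-16LE with a byte order mark: read the raw bytes and decode them with native2unicode before comparing. The demo file contents, the key 'Subject:' and the variable names are illustrative assumptions, not taken from the attached file:

```matlab
% Write a small UTF-16LE demo file (with BOM) so the sketch is self-contained.
fname = [tempname '.txt'];
demo  = sprintf('*** Header Start ***\r\nSubject: 1\r\nSession: 2\r\n');
fid   = fopen(fname, 'w');
fwrite(fid, [255 254 unicode2native(demo, 'UTF-16LE')], 'uint8');
fclose(fid);

% Read the raw bytes and decode them as UTF-16LE.
fid = fopen(fname, 'r');
raw = fread(fid, [1 Inf], '*uint8');
fclose(fid);
txt = native2unicode(raw, 'UTF-16LE');
if ~isempty(txt) && txt(1) == char(65279)    % drop the byte order mark (U+FEFF)
    txt(1) = [];
end

% Split into lines and look for the key at the start of each line.
textlines = regexp(txt, '\r?\n', 'split');
subkey = 'Subject:';                         % key without padding characters
subjectnumber = NaN;
for k = 1:numel(textlines)
    if strncmp(textlines{k}, subkey, numel(subkey))
        subjectnumber = sscanf(textlines{k}(numel(subkey)+1:end), '%f');
    end
end
delete(fname);                               % remove the demo file
```

Once the text is decoded, the key can be written normally, with no padding between the letters.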

6 Comments

nicholas oscar davy on 13 Jun 2018

Agreed that this is the most secure solution. I'm basically extracting inherited data from text files to save as MAT-files, and for future analysis I should be able to ensure these text files are saved as UTF-8 from the beginning. Thanks

Walter Roberson on 13 Jun 2018

Edited: Walter Roberson on 13 Jun 2018

Minor technical quibble: the file uses the UTF16LE Byte Order Mark, which is defined by Unicode but not UCS-2, so it is Unicode rather than UCS-2. Unicode defines the file encoding standards as well as the codepoints; UCS-2 defines the codepoints.
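
The byte order mark mentioned here can be checked directly from the first two bytes of a file. A minimal sketch, using a temporary demo file written only for illustration (file contents and variable names are assumptions). The UTF-16LE mark is the byte pair FF FE and the big-endian mark is FE FF:

```matlab
% Create a demo file that starts with the UTF-16LE byte order mark (FF FE).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fwrite(fid, [255 254 unicode2native('Subject: 1', 'UTF-16LE')], 'uint8');
fclose(fid);

% Read only the first two bytes and compare them with the known marks.
fid = fopen(fname, 'r');
bom = fread(fid, [1 2], '*uint8');
fclose(fid);
delete(fname);

% Note: FF FE is also the start of the UTF-32LE mark (FF FE 00 00),
% which this simple check does not distinguish.
if isequal(bom, uint8([255 254]))
    encoding = 'UTF-16LE';
elseif isequal(bom, uint8([254 255]))
    encoding = 'UTF-16BE';
else
    encoding = 'unknown';
end
```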
