How to solve issue with strncmp returning incorrect logical values for text comparison...

I need to scan through a text file line by line and pull out numerical variables corresponding to given line beginnings (e.g. save subjectnumber = 2 for the line 'Subject number: 2').
I am currently attempting to do this by opening the file with fopen, then using fgetl to work through the file one line at a time, comparing the first characters of each line with saved strings which act as 'keys', using the strncmp function. If the first 'n' characters of the line match the key, the script then saves the numerical value as a variable in the workspace, to later incorporate into the final data structure.
However, the strncmp function does not seem to be working correctly, and I cannot figure out why. Regardless of whether I compare char array to char array, or convert to strings before comparison, the function returns a logical '0' even when the key matches the line. I can copy and paste the retrieved line from the document into the command window, test it with strncmp against the key variable saved in the workspace, and get a logical '1' (true) result. In the script itself, however, the function always returns logical '0'.
Has anybody encountered a similar issue before?
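fileID = fopen('textfile.txt');
subkey = " S u b j e c t :";
raw = fgetl(fileID); % read the first line (char vector, or -1 at end of file)
while ischar(raw)
    tline = string(raw) % convert to string
    % strlength gives the number of characters; length of a string scalar is 1
    submatch = strncmp(tline, subkey, strlength(subkey)) % check match for subject key PROBLEM LINE
    if submatch
        % code to save numerical variable
    end
    raw = fgetl(fileID); % get next line
end
fclose(fileID);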
The printed output for fgetl for the line containing the desired information, and the subsequent strncmp check is:
tline =
" S u b j e c t : 1 " % i.e. identical to the specified key 'subkey' over the first 16 characters
submatch =
logical
0

2 Comments

@nicholas oscar davy: please upload a sample file by clicking the paperclip button.
Without your data file we cannot test your code or investigate what is happening.
Thanks Stephen,
I've attached a truncated version of the file I need to process.


 Accepted Answer

Explanation: The problem is caused by the file encoding, which is little-endian UCS-2, a two-byte character encoding. So what you see as two separate characters (a letter followed by a space) is actually one single two-byte character inside the file. Combining string into the mix just confuses things even more, but does not change this fundamental issue with reading the file.
The reason that your string with space characters does not match is because what you see as space characters (i.e. ASCII 32) and used in subkey are not really spaces at all in the imported data: they are interpreted as NULL characters (ASCII 0) (of course they are not really characters at all, just the trailing byte of a two-byte character). For example the first line of the file apparently contains this (note all the NULL "characters"):
>> +tline
ans =
Columns 1 through 21
729 355 42 0 42 0 42 0 32 0 72 0 101 0 97 0 100 0 101 0 114
Columns 22 through 42
0 32 0 83 0 116 0 97 0 114 0 116 0 32 0 42 0 42 0 42 0
Also note that the first two values are quite large: they come from the byte order mark at the start of the file, which tells us about the byte order and implies something about the file encoding.
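The effect can be reproduced without the original file. The following is a minimal sketch, assuming a hypothetical sample line 'Subject: 1': it encodes the text as UTF-16LE with unicode2native, treats every byte as its own character (which mimics reading the file with a single-byte encoding), and then compares against a key padded with spaces.

```matlab
% Hypothetical sample line; the content of the real file is only assumed here
txt = 'Subject: 1';

% Encode as UTF-16LE: each ASCII character becomes two bytes, the second being 0
bytes = unicode2native(txt, 'UTF-16LE');

% Treating every byte as a separate character mimics reading the file
% with a single-byte encoding: a NULL follows every letter
misread = char(bytes);
codes = double(misread)

% A key padded with spaces (ASCII 32) cannot match the NULLs (ASCII 0)
spaceKey = 'S u b j e c t :';
matchSpaces = strncmp(misread, spaceKey, numel(spaceKey))

% Decoding the same bytes with the correct encoding recovers the text
decoded = native2unicode(bytes, 'UTF-16LE');
matchDecoded = strncmp(decoded, 'Subject:', 8)
```

Here matchSpaces is false and matchDecoded is true, because the misread text only looks like it contains spaces when displayed.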
You might like to read this:
Solutions:
  • save the file as UTF8, and then you won't have any problems.
  • fopen the file telling MATLAB that it uses two bytes per character, e.g.:
fileID = fopen('textfile.txt','rt','n','UTF16');
When I test this using R2012b, it gives:
>> tline
tline =
*** Header Start ***
>> +tline
ans =
42 42 42 32 72 101 97 100 101 114 32 83 116 97 114 116 32 42 42 42
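The first option can also be carried out inside MATLAB. The following is only a sketch under stated assumptions: it builds a small UTF-16LE sample file with hypothetical content so that it is self-contained, decodes the raw bytes with native2unicode, writes them back as UTF-8, and then scans the converted file with fgetl and strncmp. The key 'Subject:' and the variable names srcFile, dstFile and subjectnumber are illustrative, not taken from the real data.

```matlab
% Build a small UTF-16LE sample file so the sketch is self-contained
% (hypothetical content; point srcFile at the real file instead)
srcFile = [tempname '.txt'];
txt = sprintf('*** Header Start ***\nSubject: 1\n');
fid = fopen(srcFile, 'w');
fwrite(fid, [uint8([255 254]), unicode2native(txt, 'UTF-16LE')], 'uint8'); % BOM + text
fclose(fid);

% Read the raw bytes and decode them explicitly as UTF-16LE
fid = fopen(srcFile, 'r');
bytes = fread(fid, Inf, 'uint8=>uint8')';
fclose(fid);
decoded = native2unicode(bytes, 'UTF-16LE');
if ~isempty(decoded) && double(decoded(1)) == 65279
    decoded(1) = []; % drop the byte order mark (U+FEFF), if present
end

% Save the text as UTF-8, an encoding that fopen documents
dstFile = [tempname '.txt'];
fid = fopen(dstFile, 'w', 'n', 'UTF-8');
fwrite(fid, decoded, 'char');
fclose(fid);

% Scan the converted file line by line, using a key without padding
subkey = 'Subject:';
subjectnumber = NaN;
fid = fopen(dstFile, 'r', 'n', 'UTF-8');
tline = fgetl(fid);
while ischar(tline)
    if strncmp(tline, subkey, numel(subkey))
        % Everything after the key is assumed to be the number
        subjectnumber = str2double(tline(numel(subkey)+1:end));
    end
    tline = fgetl(fid);
end
fclose(fid);

% Remove the temporary files created by this sketch
delete(srcFile);
delete(dstFile);
```

Decoding the bytes explicitly avoids relying on how fopen handles encoding names outside its documented list, at the cost of holding the whole file in memory.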

6 Comments

I suspected encoding might be causing problems, since the lines read from the file show spaces where there are none in the original document. However, I thought inserting spaces into the 'key' strings would remedy this. I will read the link and see if converting to UTF8 remedies the issue.
I'm running MATLAB R2018a.
The second option (using additional inputs to fopen) worked like a charm!
Thank you very much!
@nicholas oscar davy: I hope that it helps. You can accept my answer if it resolves your question.
"The second option (using additional inputs to fopen) worked like a charm!"
I would strongly recommend that you go with the first option though. Note that the 'UTF16', 'UTF16LE' and 'UTF16BE' fopen flags are not officially supported in MATLAB and are not documented. They may stop working in a future version, or there may be unknown problems with them. UTF8 is officially supported by MATLAB and is also more widely supported by other programs.
Agreed that this is the most secure solution. I'm basically extracting inherited data from text file format to save as mat files, and for future analysis I should be able to ensure these text files are saved as UTF8 from the beginning. Thanks
Minor technical quibble: the file uses the UTF16LE Byte Order Mark, which is defined by Unicode but not UCS-2, so it is Unicode rather than UCS-2. Unicode defines the file encoding standards as well as the codepoints; UCS-2 defines the codepoints.
