MATLAB Answers

Problem opening files containing special characters

There seems to be a problem passing filenames that contain non-ASCII characters to fopen. Do I need to encode them somehow?
Any help appreciated, dave r.
OS X 10.6, English language, Swedish region
>> feature('DefaultCharacterSet')
ans =
ISO-8859-1
>> getenv('LANG')
ans =
sv_SE.ISO8859-1
Now suppose I have a file called 'öäå.txt' (in case the encoding gets mangled here, that is ISO 8859-1 characters 246, 228, 229 followed by .txt). In MATLAB, I try to open the file:
>> id = fopen('öäå.txt','r','n','UTF-8')
id =
-1
As a workaround for a single file, I can use:
>> D = dir('*.txt')
D =
name: 'oÌaÌaÌ.txt'
date: '17-Jun-2011 16:02:36'
bytes: 987
isdir: 0
datenum: 7.3467e+05
>> id = fopen(D.name,'r','n','UTF-8')
id =
3
but I would like a solution where I can actually specify the filename directly!

3 Answers

Answer by David Rayner on 20 Jun 2011
 Accepted Answer

So, following Walter's hint that this is actually UTF-8, we find that filenames on Macs are returned in decomposed form (NFD), whereas other systems use composed forms (or perhaps whatever was given by the user). See http://download.oracle.com/javase/6/docs/api/java/text/Normalizer.html
I didn't find any way to handle this natively in MATLAB, but Java provides the required methods:
%% some handy definitions
NFD = javaMethod('valueOf', 'java.text.Normalizer$Form','NFD');
NFC = javaMethod('valueOf', 'java.text.Normalizer$Form','NFC');
UTF8=java.nio.charset.Charset.forName('UTF-8');
%% convert a name of a file from dir to a sensible matlab string:
D = dir('*.txt');
s2 = D.name;
s = java.lang.String(uint8(s2),UTF8);
sc = java.text.Normalizer.normalize(s,NFC);
sc = char(sc);
strcmp(sc,'öäå.txt')
ans =
1
%% and the reverse, to open a file with accented characters:
filename = 'öäå.txt';
s = java.lang.String(filename);
sc = java.text.Normalizer.normalize(s,NFD);
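% getBytes returns signed Java bytes (-128..127); map them to 0..255
bs = single(sc.getBytes(UTF8)');
bs(bs<0) = 256 + bs(bs<0);
% one char per byte, which is the form fopen accepted here
id = fopen(char(bs),'r')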
id =
3
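The byte handling in both directions can probably be shortened with MATLAB's own native2unicode and unicode2native, leaving only the normalization step to Java. This is a minimal sketch, not something tested on the OS X 10.6 setup from the question: the byte vector is the D.name + 0 output posted in this thread, and the variable names (raw, decomposed, composed, fopenName) are only illustrative.

```matlab
% Bytes of the file name as dir returned them in the question (D.name + 0).
raw = uint8([111 204 136 97 204 136 97 204 138 46 116 120 116]);

% Normalization forms, obtained the same way as in the answer above.
NFC = javaMethod('valueOf', 'java.text.Normalizer$Form', 'NFC');
NFD = javaMethod('valueOf', 'java.text.Normalizer$Form', 'NFD');

% dir name -> MATLAB string: decode the UTF-8 bytes, then compose (NFC).
decomposed = native2unicode(raw, 'UTF-8');
composed = char(java.text.Normalizer.normalize(java.lang.String(decomposed), NFC));

% MATLAB string -> name for fopen: decompose (NFD), then encode as UTF-8.
decomposedAgain = char(java.text.Normalizer.normalize(java.lang.String(composed), NFD));
bytes = unicode2native(decomposedAgain, 'UTF-8');

% One char per byte, the same form the workaround above passes to fopen.
fopenName = char(bytes);
```

Because unicode2native returns uint8, the correction for negative Java bytes is not needed on this path. Whether fopen wants one char per byte at all is an assumption carried over from the setup in the question (default character set ISO-8859-1).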


Answer by Walter Roberson on 17 Jun 2011

Could I ask you to show us what
0 + D.name
shows? We might be able to find a unicode2native() formula that works.

  1 Comment

>> D.name + 0
ans =
111 204 136 97 204 136 97 204 138 46 116 120 116
I've played with that without success, but I am sure there are combinations I have not tried!
Thanks!


Answer by Walter Roberson on 17 Jun 2011

Bleh. This is an internal mess in Windows. See http://stackoverflow.com/questions/3298569/difference-between-mbcs-and-utf-8-on-windows That leads us to Windows' WideCharToMultiByte function, http://msdn.microsoft.com/en-us/library/dd374130%28v=vs.85%29.aspx which in turn leads us to the flag value WC_COMPOSITECHECK, described as:
Convert composite characters, consisting of a base character and a nonspacing character, each with different character values. Translate these characters to precomposed characters, which have a single character value for a base-nonspacing character combination. For example, in the character è, the e is the base character and the accent grave mark is the nonspacing character.
Now what you have in the actual filename as reported by D.name is the composite (decomposed) version, not the precomposed one. The first byte 111 is 'o', the second byte 204 means roughly "an accent for the previous character follows", and the third byte 136 tells you which accent, the dieresis in this case. Then the 97 byte tells you the base character of the next sequence is 'a', the 204 tells you that an accent for the previous character follows, and the 136 is the dieresis (¨) again. The 97 byte of the last sequence tells you the base character is 'a', the 204 tells you an accent follows, and the 138 indicates a different accent, the small circle (ring above) in this case.
This is, I gather, the mechanism that would be used by one or more of the Windows Code Page encodings, but I have not searched to see if I can find details about this translation.
First you need to do some trial and error: construct a test filename with each character of interest and see how it comes out as a name.
Once you have the translation table worked out, you can use regexprep() to process the string -- but be sure to do 204 first as it will be introduced by the other translations.
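A translation table of that kind might look like the sketch below. It is only an illustration: the table covers just the three characters from the question, the name is built from the byte values posted in the comment above, and strrep stands in for regexprep because the patterns are literal strings rather than regular expressions. The variable names are made up for the example.

```matlab
% File name as dir reports it, one char per byte (values from D.name + 0).
name = char([111 204 136 97 204 136 97 204 138 46 116 120 116]);

% Each row: byte sequence seen in the dir name, ISO 8859-1 character wanted.
translation = { ...
    char([111 204 136]), char(246); ...   % o + dieresis    -> o-umlaut
    char([ 97 204 136]), char(228); ...   % a + dieresis    -> a-umlaut
    char([ 97 204 138]), char(229)};      % a + small circle -> a-ring

% Apply the literal replacements one after another.
for k = 1:size(translation, 1)
    name = strrep(name, translation{k, 1}, translation{k, 2});
end

% The character codes are now 246 228 229 followed by '.txt'.
codes = double(name);
```

A fuller table would need one row per base letter and accent pair of interest, which is where the trial and error comes in.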
===
Well, I'll be, there is some logic to it!
The encoding being used is valid UTF-8 (!!). Each of those characters has an alternate Unicode decomposed form, which is the base character followed by a code point for the combining accent. The combining accents start at U+0300 in Unicode, and for the first 64 of them (U+0300 to U+033F) the UTF-8 representation is 0xCC (decimal 204) followed by 128 plus the offset relative to 0x0300. Have a look at http://www.fileformat.info/info/unicode/block/combining_diacritical_marks/utf8test.htm and hover over the 0x08 cell in the first row, and notice it is for U+0308 COMBINING DIAERESIS. 128+8 is 136, the byte we saw in the .name after the 204 (the lead byte indicating which UTF-8 range is in use).
So there is logic to it, but my advice to use regexprep() remains the same.
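The byte arithmetic can be checked directly with unicode2native. A small sketch, assuming nothing beyond the code points discussed in this thread (U+0308 for the dieresis, U+030A for the small circle); the variable names are illustrative:

```matlab
% UTF-8 bytes of U+0308 (the combining dieresis).
cp = hex2dec('0308');
bytes = unicode2native(char(cp), 'UTF-8');

% Prediction from the rule: 204, then 128 plus the offset from 0x0300.
offset = cp - hex2dec('0300');
predicted = [204, 128 + offset];

% Same check for U+030A (the combining small circle / ring above).
ringBytes = unicode2native(char(hex2dec('030A')), 'UTF-8');

disp(double(bytes))       % 204 136
disp(predicted)           % 204 136
disp(double(ringBytes))   % 204 138
```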
