Problem opening files containing special characters

It seems there is a problem specifying non-ascii characters in filenames to fopen. Do I need to encode these somehow?
Any help appreciated, dave r.
OSX10.6, English language, Swedish Region
>> feature('DefaultCharacterSet')
ans =
ISO-8859-1
>> getenv('LANG')
ans =
sv_SE.ISO8859-1
Now suppose I have a file called 'öäå.txt' (in case the encoding gets mangled here: those are ISO 8859-1 characters 246, 228, 229, followed by .txt). In MATLAB, I try to open the file:
>> id = fopen('öäå.txt','r','n','UTF-8')
id =
-1
As a workaround for a single file, I can use:
>> D = dir('*.txt')
D =
name: 'oÌaÌaÌ.txt'
date: '17-Jun-2011 16:02:36'
bytes: 987
isdir: 0
datenum: 7.3467e+05
>> id = fopen(D.name,'r','n','UTF-8')
id =
3
but I would like a solution where I can actually specify the filename directly!

Accepted Answer

David Rayner on 20 Jun 2011
Following Walter's hint that this is actually UTF-8, we find that filenames on Macs are returned in decomposed form, whereas other systems use composed forms (or perhaps whatever was given by the user). See http://download.oracle.com/javase/6/docs/api/java/text/Normalizer.html
I didn't find any way to handle this in native MATLAB, but Java provides the required methods:
%% some handy definitions
NFD = javaMethod('valueOf', 'java.text.Normalizer$Form','NFD');
NFC = javaMethod('valueOf', 'java.text.Normalizer$Form','NFC');
UTF8=java.nio.charset.Charset.forName('UTF-8');
%% convert a name of a file from dir to a sensible matlab string:
D = dir('*.txt');
s2 = D.name;
s = java.lang.String(uint8(s2),UTF8);
sc = java.text.Normalizer.normalize(s,NFC);
sc = char(sc);
strcmp(sc,'öäå.txt')
ans =
1
%% and the reverse, to open a file with accented characters:
filename = 'öäå.txt';
s = java.lang.String(filename);
sc = java.text.Normalizer.normalize(s,NFD);
% getBytes returns signed bytes (-128..127); shift the negative ones into 128..255 before char()
bs=single(sc.getBytes(UTF8)');
bs(bs<0) = 256+(bs(bs<0));
id = fopen(char(bs),'r')
id =
3
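The two conversions above could be wrapped in small helpers. This is only a sketch under assumptions: the names toDiskName and fromDiskName are hypothetical, it assumes unicode2native and native2unicode accept 'UTF-8' on your release, and it assumes the file system returns decomposed (NFD) names as described above. It swaps unicode2native in for getBytes, which sidesteps the signed-byte fix-up, and it builds the filename from character codes so the example does not depend on how the script itself is encoded.

```matlab
% Hypothetical helpers; the names are illustrative, not part of MATLAB.
NFD = javaMethod('valueOf', 'java.text.Normalizer$Form', 'NFD');
NFC = javaMethod('valueOf', 'java.text.Normalizer$Form', 'NFC');

% Composed MATLAB string -> decomposed UTF-8 bytes held in a char array,
% i.e. the form that dir returned on this Mac.
toDiskName = @(name) char(unicode2native( ...
    char(java.text.Normalizer.normalize(java.lang.String(name), NFD)), 'UTF-8'));

% Name as returned by dir -> composed MATLAB string.
fromDiskName = @(raw) char(java.text.Normalizer.normalize( ...
    java.lang.String(native2unicode(uint8(raw), 'UTF-8')), NFC));

% 246, 228, 229 are the ISO 8859-1 codes from the question.
raw = toDiskName(char([246 228 229 '.txt']));
disp(double(raw))            % compare with the output of D.name + 0

id = fopen(raw, 'r');        % returns -1 if no such file exists
if id ~= -1
    fclose(id);
end
```

If the assumptions hold, double(raw) should match the byte sequence that D.name + 0 printed in the comment thread below.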

More Answers (2)

Walter Roberson on 17 Jun 2011
Could I ask you to show us what
0 + D.name
shows? We might be able to find a unicode2native() formula that works.
  1 Comment
David Rayner on 17 Jun 2011
>> D.name + 0
ans =
111 204 136 97 204 136 97 204 138 46 116 120 116
I've played with that, but with no success, but I am sure there are combinations I have not tried!
Thanks!

Walter Roberson on 17 Jun 2011
Bleh. This is an internal mess in Windows. http://stackoverflow.com/questions/3298569/difference-between-mbcs-and-utf-8-on-windows Which leads us to Windows' WideCharToMultiByte function http://msdn.microsoft.com/en-us/library/dd374130%28v=vs.85%29.aspx which leads us to the flag value WC_COMPOSITECHECK which is described as,
Convert composite characters, consisting of a base character and a nonspacing character, each with different character values. Translate these characters to precomposed characters, which have a single character value for a base-nonspacing character combination. For example, in the character è, the e is the base character and the accent grave mark is the nonspacing character.
Now what you have in the actual filename as reported by D.name is the composite (decomposed) version. The first byte, 111, is 'o'; the second byte, 204, means roughly "an accent for the previous character follows"; and the third byte, 136, tells you which accent, the dieresis in this case. Then the 97 byte tells you the base character of the next sequence is 'a', the 204 again announces an accent, and the 136 is the dieresis again. In the last sequence, the 97 is the base character 'a', the 204 announces an accent, and the 138 indicates a different accent, the small circle (ring) in this case.
This is, I gather, the mechanism that would be used by one or more of the Windows Code Page encodings, but I have not searched to see if I can find details about this translation.
First you need to do some trial and error: construct a test filename with each character of interest and see how it comes out as a name.
Once you have the translation table worked out, you can use regexprep() to process the string -- but be sure to do 204 first as it will be introduced by the other translations.
===
Well, I'll be, there is some logic to it!
The encoding being used is valid UTF-8 (!!). Each of those characters has an alternate decomposed Unicode form, which is the base character followed by a code indicating the combining accent. The combining accents start at 0x0300 in Unicode, and for the first 64 of them (0x0300 to 0x033F) the UTF-8 representation is 0xCC (decimal 204) followed by 128 plus the offset relative to 0x0300. Have a look at http://www.fileformat.info/info/unicode/block/combining_diacritical_marks/utf8test.htm and hover over the 0x08 cell in the first row, and notice it is for 0x0308 COMBINING DIAERESIS. 128+8 is 136, the byte we saw in the .name after the 204 (a byte indicating which UTF-8 range is in use). Likewise, 0x030A COMBINING RING ABOVE gives 128+10 = 138.
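The arithmetic can be checked from the MATLAB prompt. A minimal sketch, assuming native2unicode and unicode2native accept 'UTF-8' on your release; the byte vector is the one reported by D.name + 0 in the comment above.

```matlab
% Bytes reported by D.name + 0.
b = uint8([111 204 136 97 204 136 97 204 138 46 116 120 116]);

% Decode the bytes as UTF-8 to see the underlying Unicode code points.
cp = double(native2unicode(b, 'UTF-8'));
disp(cp)

% Encode the two combining marks on their own.
dieresis = unicode2native(char(hex2dec('0308')), 'UTF-8');  % U+0308
ring     = unicode2native(char(hex2dec('030A')), 'UTF-8');  % U+030A
disp(dieresis)
disp(ring)
```

The decoded code points should come out as 111 776 97 776 97 778 46 116 120 116, where 776 is 0x0308 and 778 is 0x030A, and the two marks should encode to 204 136 and 204 138 respectively.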
So there is logic to it, but my advice to use regexprep() remains the same.
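A translation table along those lines might look like the sketch below. It is an illustration only: the variable name map is hypothetical, the table covers just the three characters from this thread, and strrep is swapped in for regexprep because the patterns are literal byte sequences with nothing to match as a pattern.

```matlab
% Decomposed byte sequences and the precomposed ISO 8859-1 code for each.
% Only the three characters from this thread are listed.
map = { [111 204 136], 246;     % o + dieresis -> o-umlaut
        [ 97 204 136], 228;     % a + dieresis -> a-umlaut
        [ 97 204 138], 229 };   % a + ring     -> a-ring

% Name as returned by dir, written out as character codes.
raw = char([111 204 136 97 204 136 97 204 138 46 116 120 116]);

name = raw;
for k = 1:size(map, 1)
    name = strrep(name, char(map{k, 1}), char(map{k, 2}));
end
disp(double(name))
```

For this input, double(name) should be 246 228 229 46 116 120 116, the composed ISO 8859-1 form of the filename. A table like this only goes from the name returned by dir to the composed form; opening a file by name needs the reverse mapping.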
