How can I access the underlying unicode data of MATLAB strings through the MATLAB Engine or MEX interface?
Here's an example. Let's put unicode characters in a UTF-8 encoded file test.txt, then read it as
fid=fopen('test.txt','r','l','UTF-8'); s=fscanf(fid, '%s')
If I first do feature('DefaultCharacterSet', 'UTF-8'), then engEvalString(ep, "s"), I get back the text from the file as UTF-8. This suggests that MATLAB stores it as unicode internally. However, if I do mxArrayToString(engGetVariable(ep, "s")), I get what unicode2native(s, 'Latin-1') would give me in MATLAB: all non-Latin-1 characters replaced by code 26. What I need is access to the underlying unicode data as a C string in any unicode format (UTF-8, UTF-16, etc.). Is this possible?
Addendum: The doc page of mxArrayToString explicitly states that
So how can I get the multibyte non-Latin-1 characters then?
Original long version: What character encoding does MATLAB use internally, if any, and is there a way to control this? To be precise, I would like to know if there is a way to guarantee that any character array I retrieve is going to adhere to a particular encoding, preferably a unicode one.
I am interfacing MATLAB with another library through the MATLAB Engine interface and I need to guarantee a character encoding when sending strings to the other library. Is this possible at all, or are MATLAB's strings plain char arrays with no associated encoding?
Related things I found:
In light of this finding, I'd like to know: does MATLAB support unicode for all its strings? If yes, how do I get access to these from the C interface? (Any unicode encoding is acceptable: UTF-8, UTF-16, UTF-32, etc.) If it doesn't support unicode, is ISO-Latin-1 its default? Can I assume that all strings I retrieve through the C interface can be interpreted as ISO-Latin-1?
Also, any pointers to the relevant documentation on the issue are most welcome.
(I should probably mention that I was testing this on OS X, since I'm aware that there are differences in the implementation of the MATLAB Engine interface between platforms.)
Sorry, this does not match your question exactly, but perhaps it is useful for the topic.
See also: Answers: Matlab string to wchar. I got this message from the support:
I believe mxChar was originally intended to be UTF-16, however the surrogate pair style unicode characters do not appear to be fully supported. However I suspect passing these characters through MATLAB 'mxChar' to the operating system should still be fine as MATLAB links against ICU (International Components for Unicode).
For compilers that have 'wchar_t' as a 16-bit value and use encoding schemes UTF-16 / UCS-2, this code will be safe.
For 32-bit 'wchar_t' values, you would need to do a conversion from UTF-16 to the encoding scheme employed by the operating system. For basic MATLAB strings to UTF-32, you could potentially just leave the upper 16-bits at zero. However as you expect, there may be certain strings obtained from the operating system that are in surrogate pair form, which require a slightly more advanced conversion. It may be better to utilize a separate library such as ICU to do the conversion between UTF-16 and the Linux encoding scheme.
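To make the surrogate pair conversion mentioned above a bit more concrete, here is a minimal sketch in plain C. It assumes you already have the string as an array of 16-bit code units (for example, what mxGetChars together with mxGetNumberOfElements would presumably hand you for a char array) and that those units really are UTF-16. The function names utf16_to_utf32 and utf8_encode are my own and not part of any MATLAB API, and I have not tested this against a real mxArray, so treat it as an illustration rather than a verified solution.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode n UTF-16 code units into UTF-32 code points.
 * Writes at most cap code points to out and returns the number written.
 * Unpaired surrogates are replaced with U+FFFD (replacement character). */
size_t utf16_to_utf32(const uint16_t *in, size_t n, uint32_t *out, size_t cap)
{
    size_t i = 0, k = 0;
    assert(in != NULL || n == 0);
    assert(out != NULL || cap == 0);
    while (i < n && k < cap) {
        uint32_t u = in[i++];
        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: needs a low surrogate right after it. */
            if (i < n && in[i] >= 0xDC00 && in[i] <= 0xDFFF) {
                u = 0x10000 + ((u - 0xD800) << 10) + (uint32_t)(in[i++] - 0xDC00);
            } else {
                u = 0xFFFD;
            }
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            /* Low surrogate without a preceding high surrogate. */
            u = 0xFFFD;
        }
        /* Everything else is a BMP character: upper 16 bits stay zero. */
        out[k++] = u;
    }
    return k;
}

/* Hypothetical helper: encode one code point (<= 0x10FFFF) as UTF-8.
 * buf must have room for 4 bytes; returns the number of bytes written. */
size_t utf8_encode(uint32_t cp, unsigned char *buf)
{
    assert(buf != NULL);
    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    buf[0] = (unsigned char)(0xF0 | (cp >> 18));
    buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

For strings that contain only BMP characters, the first function reduces to the "leave the upper 16 bits at zero" case from the support message. For anything beyond a quick experiment, a library such as ICU is probably the safer choice, as the message suggests.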