How to access unicode strings through MEX/Engine C interfaces?

Short version
How can I access the underlying unicode data of MATLAB strings through the MATLAB Engine or MEX interface?
Here's an example. Let's put unicode characters in a UTF-8 encoded file test.txt, then read it as
fid=fopen('test.txt','r','l','UTF-8');
s=fscanf(fid, '%s')
If I first do feature('DefaultCharacterSet', 'UTF-8'), then engEvalString(ep, "s"), I get back the text from the file as UTF-8. This proves that MATLAB stores it as Unicode internally. However, if I do mxArrayToString(engGetVariable(ep, "s")), I get what unicode2native(s, 'Latin-1') would give me in MATLAB: all non-Latin-1 characters are replaced by code 26 (0x1A, the SUB control character). What I need is access to the underlying Unicode data as a C string in any Unicode format (UTF-8, UTF-16, etc.). Is this possible?
Addendum: The doc page of mxArrayToString explicitly states that
  • "[mxArrayToString] supports multibyte encoded characters."
So how can I get the multibyte non-Latin-1 characters then?
----
Original long version
What character encoding does MATLAB use internally (if any), and is there a way to control this? To be precise, I would like to know if there is a way to guarantee that any character array I retrieve is going to adhere to a particular encoding, preferably a Unicode one.
I am interfacing MATLAB with another library through the MATLAB Engine interface and I need to guarantee a character encoding when sending strings to the other library. Is this possible at all, or are MATLAB's strings plain char arrays with no associated encoding?
Related things I found:
  • This here says that it uses UTF-16, but that's not what I see when I retrieve strings in C code.
  • I found references to feature('DefaultCharacterSet', 'UTF-8') on the web. What this appears to do is control what encoding the input commands (engEvalString) are assumed to have, and how the output is encoded. If I supply a UTF-8 encoded á as s='á', then retrieve this in C, I get an ISO-Latin-1 encoded á. If I send something that's not in Latin-1, I get nonsense (actually character code 26). (At least this is my impression after a few simple tests, which are time-consuming.)
In light of this finding, I'd like to know: does MATLAB support Unicode for all its strings? If yes, how do I get access to these from the C interface? (Any Unicode encoding is acceptable: UTF-8, UTF-16, UTF-32, etc.) If it doesn't support Unicode, is ISO-Latin-1 its default? Can I assume that all strings I retrieve through the C interface can be interpreted as ISO-Latin-1?
Also, any pointers to the relevant documentation on the issue are most welcome.
(I should probably mention that I was testing this on OS X, as I'm aware that there are differences in the implementation of the MATLAB Engine interface between platforms.)

Accepted Answer

Jan on 18 Feb 2013
Sorry, this does not match your question exactly, but perhaps it is useful for the topic.
See also: Answers: Matlab string to wchar. I got this message from the support:
I believe mxChar was originally intended to be UTF-16, however the surrogate pair style unicode characters do not appear to be fully supported. However I suspect passing these characters through MATLAB 'mxChar' to the operating system should still be fine as MATLAB links against ICU (International Components for Unicode).
For compilers that have 'wchar_t' as a 16-bit value and use encoding schemes UTF-16 / UCS-2, this code will be safe.
For 32-bit 'wchar_t' values, you would need to do a conversion from UTF-16 to the encoding scheme employed by the operating system. For basic MATLAB strings to UTF-32, you could potentially just leave the upper 16 bits at zero. However, as you expect, there may be certain strings obtained from the operating system that are in surrogate pair form, which require a slightly more advanced conversion. It may be better to utilize a separate library such as ICU to do the conversion between UTF-16 and the Linux encoding scheme.
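To make the "slightly more advanced conversion" concrete, here is a minimal sketch in plain C of turning UTF-16 code units into UTF-32 code points, including surrogate pairs. It is an illustration under assumptions, not code from the support message: the function name utf16_to_utf32 is made up, the input is a plain uint16_t array standing in for mxChar data (in a MEX or Engine program it would presumably come from mxGetChars and mxGetNumberOfElements), and replacing unpaired surrogates with U+FFFD is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert UTF-16 code units to UTF-32 code points.
 * src, src_len: input code units (a stand-in for mxChar data, so the
 *               sketch compiles without the MATLAB headers).
 * dst:          output buffer with room for at least src_len code points.
 * Returns the number of code points written.
 * Unpaired surrogates become U+FFFD; that policy is an assumption of
 * this sketch, not something MATLAB prescribes. */
size_t utf16_to_utf32(const uint16_t *src, size_t src_len, uint32_t *dst)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < src_len; ++i) {
        uint32_t u = src[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: combine with the following low surrogate. */
            if (i + 1 < src_len && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
                uint32_t lo = src[++i];
                u = 0x10000u + ((u - 0xD800u) << 10) + (lo - 0xDC00u);
            } else {
                u = 0xFFFDu; /* high surrogate with no partner */
            }
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            u = 0xFFFDu;     /* stray low surrogate */
        }
        dst[n++] = u;
    }
    return n;
}
```

For strings without surrogates the loop reduces to the "leave the upper 16 bits at zero" case described above. For anything beyond a quick conversion, a library such as ICU remains the safer choice.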
  2 Comments
Szabolcs on 18 Feb 2013
This is actually very useful, as well as this, which I did not find until now. I did not realize that mxChars were two (or four) bytes, unfortunately. Also, the library I am interfacing with does not support surrogate pairs either (which is pretty annoying in general, but somewhat convenient in this case).
I'll accept once I manage to get things working using this information.
Szabolcs on 18 Feb 2013
@Jan, I know this is not related, but since you seem to have a lot of experience with the C interface, could you take a look at this? Have you ever tried to transfer "class" type values? When I attempt this, MATLAB crashes. If you don't have experience with this, please just ignore this comment. (Background: I'm interfacing MATLAB with another language, so all the unusual edge cases are coming out...)


More Answers (1)

Walter Roberson on 18 Feb 2013
MATLAB uses a 16-bit character type internally, but it does not use UTF-anything. It simply uses the first 65536 Unicode code points.
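If that description holds, each 16-bit value is a code point by itself, and producing the UTF-8 C string asked for in the question needs at most three bytes per character. The following is a rough sketch in plain C under that assumption. The name bmp_to_utf8 is hypothetical, the uint16_t input stands in for mxChar data (which in a real MEX or Engine program would presumably be read with mxGetChars and mxGetNumberOfElements), and mapping surrogate values to U+FFFD is a choice made only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode 16-bit code units, each treated as one BMP code point, as a
 * NUL-terminated UTF-8 string.
 * dst must have room for 3 * src_len + 1 bytes.
 * Returns the number of bytes written, not counting the terminator. */
size_t bmp_to_utf8(const uint16_t *src, size_t src_len, char *dst)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < src_len; ++i) {
        uint32_t c = src[i];
        if (c >= 0xD800 && c <= 0xDFFF) {
            c = 0xFFFD; /* surrogate values are not code points (sketch policy) */
        }
        if (c < 0x80) {                 /* 1 byte: ASCII */
            dst[n++] = (char)c;
        } else if (c < 0x800) {         /* 2 bytes */
            dst[n++] = (char)(0xC0 | (c >> 6));
            dst[n++] = (char)(0x80 | (c & 0x3F));
        } else {                        /* 3 bytes: rest of the BMP */
            dst[n++] = (char)(0xE0 | (c >> 12));
            dst[n++] = (char)(0x80 | ((c >> 6) & 0x3F));
            dst[n++] = (char)(0x80 | (c & 0x3F));
        }
    }
    dst[n] = '\0';
    return n;
}
```

Called on the three code units for "Aá€" (0x0041, 0x00E1, 0x20AC), this writes the bytes 41 C3 A1 E2 82 AC. Later MATLAB releases also document an mxArrayToUTF8String function in the C Matrix API, which may make a hand-written encoder unnecessary where it is available.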
