What are the internal differences between Matlab strings and character arrays?
MATLAB strings were introduced in R2016b, presumably to make working with text more similar to other languages than character arrays allow.
Although the documentation clearly states most details that people will want to know about a string, I'm a bit unclear as to how a string and character array are different, other than having different methods.
Presumably strings are still encoded using UTF-16. Although I haven't tried it, I wouldn't expect that mex files support strings.
Also in this post, https://blogs.mathworks.com/loren/2016/09/15/introducing-string-arrays/, Loren mentions that "string arrays" are "more efficient" for storing text data. Why? (This might be a string array question and not something specific to strings vs chars).
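As a rough way to probe the storage question, the sketch below compares what whos reports for a char vector, a padded char matrix, and a string array. The variable names are made up for illustration. Only the char figures are predictable (2 bytes per element, which fits 16-bit code units); the number reported for the string array depends on the release and includes per-element overhead, so it may well be larger for small arrays.

```matlab
% Sketch: memory footprint as reported by whos (variable names are illustrative).
c = 'apple';                                      % 1x5 char vector
padded = char('apple', 'banana', 'watermelon');   % 3x10 char matrix, short rows padded with blanks
s = ["apple"; "banana"; "watermelon"];            % 3x1 string array, no padding

infoC = whos('c');
infoP = whos('padded');
infoS = whos('s');

fprintf('char vector : %d bytes\n', infoC.bytes);  % 2 bytes per element
fprintf('char matrix : %d bytes\n', infoP.bytes);  % 2 bytes per element, padding included
fprintf('string array: %d bytes\n', infoS.bytes);  % release-dependent, not asserted here
```

If the string array turns out to use more bytes than the char matrix, the "more efficient" claim may refer to processing convenience rather than raw storage; that reading is an assumption, not something the blog post states.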
6 Comments
James Tursa
on 23 Oct 2017
Are string arrays really just cell arrays of character vectors?
S = ["123","456"];
cellfun(@class,S,'uni',0) % CELLFUN works without error !
arrayfun(@class,S,'uni',0) % ARRAYFUN parses each element of the string array.
James Tursa
on 6 May 2021
This is a strange result. Strings are stored fundamentally differently than char arrays, and my belief is that an array of strings is not stored the same as a cell array. If they were then I could get at them in a mex routine with the API cell functionality, but I can't. I don't know why cellfun would report the elements are char type.
Steven Lord
on 6 May 2021
I believe you are correct, that string arrays are not stored as cell arrays. As for why cellfun can accept them, that's because it's documented that it can accept a string array. The description of the C input argument on the documentation page for the cellfun function is "Input array, specified as a cell array or a string array. If C is a string array, then cellfun treats each element of C as though it were a character vector, not a string."
I believe that was done to facilitate migration of code that used to call cellfun on cell arrays containing char vectors (a cellstr) to use string arrays instead. Making cellfun accept string arrays would require less effort for users than requiring those users to replace the cellstrs with string arrays and change cellfun to arrayfun in quite possibly years or decades of existing code.
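A small sketch of what that documented behavior means in practice, assuming a release in which cellfun accepts string arrays: the same function handle sees a char vector under cellfun but a 1-by-1 string under arrayfun, so numel gives different answers. The variable names are illustrative.

```matlab
S = ["apple", "banana"];

lenViaCellfun   = cellfun(@numel, S);    % each element is passed as a char vector
lenViaArrayfun  = arrayfun(@numel, S);   % each element is passed as a 1x1 string
lenViaStrlength = strlength(S);          % string-native way to get the lengths

disp(lenViaCellfun)     % 5 6
disp(lenViaArrayfun)    % 1 1
disp(lenViaStrlength)   % 5 6
```

For new code, strlength avoids relying on the cellfun compatibility behavior at all.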
James Tursa
on 6 May 2021
Edited: James Tursa
on 6 May 2021
@Steven Lord The robust functionality for users I understand. What I can't understand is why cellfun would report the class of the elements as char.
[This may seem like a non sequitur at first, but bear with me for a moment.] Technically, the matrix product A*B is only mathematically defined if the number of columns in A matches the number of rows in B. But MATLAB breaks that rule slightly by allowing either A or B to be a scalar even if the other is not a vector. It does so in order to not force users to create (potentially large) temporary arrays.
n = 6; % Imagine if n were 600 or 6000 instead of 6
A1 = 5;
A2 = diag(repmat(A1, 1, n));
B = magic(n);
C1 = A1*B;
C2 = A2*B;
isequal(C1, C2) % true
whos % A2 is larger than A1
Technically, cellfun should only operate on cell arrays, including cellstrs, and should throw an error if you pass in a string array. However, in order to not force users to convert string arrays into cellstrs manually (the equivalent of creating A2 from A1), which could also be expensive in terms of time and/or memory, cellfun treats those strings as though they were cellstrs. This includes treating them like cellstrs for the purposes of calling class on the contents.
C = {'abc', 'def'};
cellResult = cellfun(@class, C, 'UniformOutput', false)
Cs = string(C);
stringResult = cellfun(@class, Cs, 'UniformOutput', false) % Treats Cs like it were C
Accepted Answer
More Answers (3)
Steven Lord
on 23 Oct 2017
For purposes of this answer, I'm going to use the word "phrase" kind of liberally to mean a chunk of textual data. That could be a character, a word, a sentence or phrase, a book, etc. A couple of differences that make string arrays more efficient to work with than char arrays:
- A string array treats each phrase as a unit, whereas a char array treats each character as a unit. In the past we've seen plenty of people do something like this:
c = 'apple';
f = c(1) % expecting f to be 'apple', but it is 'a'
With a string:
s = "apple";
f = s(1) % expecting f to be "apple", which it is
- Storing phrases of different lengths in a char matrix requires padding with blanks. This means you need to remove the padding when you want to use each phrase later on. A string array doesn't require this padding.
c = ['apple '; 'banana'; 'cherry'];
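c = strvcat(c, 'watermelon'); % pads every row to 10 columns; the documentation now recommends char(c, 'watermelon') over strvcat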
size(c)
f = ['{' c(1, :) '}'] % Note the extra spaces between the {}
s = ["apple"; "banana"; "cherry"];
s = [s; "watermelon"];
size(s)
f1 = ['{' s(1) '}'] % Note that f1 is now a 1x3 array; each of the braces is a separate string
f2 = '{' + s(1) + '}' % Note no extra spaces between the braces and the phrase apple
- In the past one way to store multiple char arrays of different lengths without padding was to store them in a cell array. But MATLAB functions that needed to process the textual data would need to check (using something like iscellstr) whether or not every element of the cell contained a char vector. That checking takes time. A string array can only contain string data, so it doesn't need to check each element in the array for "string-ness". That extra validation probably doesn't take a lot of time, unless you need to do it often and/or on a large cell of char data.
- Regarding MEX-file support, I'm not certain. If you want to request MEX-file support for string arrays, or learn what support there is (nothing's listed in the documentation as far as I could find) I recommend contacting Technical Support directly using the Contact Us link in the upper-right corner of this page.
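The padding and validation points in the list above can be sketched in a few lines. This is only an illustration with made-up variable names, not a benchmark: it shows that padding is part of each char matrix row, that a cell array needs an element-by-element check, and that a string array converts what is assigned into it.

```matlab
% Padded char matrix: trailing blanks are part of every row.
c = char('apple', 'banana', 'watermelon');
row1 = c(1, :);              % 'apple' followed by 5 blanks
trimmed = strtrim(row1);     % padding has to be removed by hand

% Cell array: nothing stops a non-char element from getting in,
% so consumers have to validate with iscellstr.
C = {'apple', 'banana', 42};
okCell = iscellstr(C);       % false because of the numeric element

% String array: every element is a string by construction.
s = ["apple", "banana"];
s(3) = 42;                   % the number is converted to the string "42"
okString = isstring(s);      % a single class check covers the whole array
```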
Yair Altman
on 24 Oct 2017
Edited: Yair Altman
on 24 Oct 2017
4 votes
The new strings are simply Matlab classes (MCOS objects) that extend 3 superclasses (matlab.mixin.internal.MatrixDisplay, matlab.mixin.internal.indexing.Paren, matlab.mixin.internal.indexing.ParenAssign). The ability to use double quotes (") to signify strings (as in s="apple") is simply syntactic sugar for the new string class. As a class object, the new string class defines 3 dozen internal class methods, such as cellstr(), char(), split() etc.
The string class is defined with class attributes Sealed and RestrictsSubclassing, to ensure that nobody can override its behavior. Moreover, TMW was extra-careful (way more than it usually is) to close most of the doors that can be used to access the internals. It's no wonder that MathWorkers on this page ignore the explicit repeated requests for information about the internals.
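For anyone who wants to look at this themselves, here is a minimal sketch using the documented metaclass interface (meta.class.fromName and the methods function). What it prints may differ between releases, so no particular output is claimed; the variable names are illustrative.

```matlab
mc = meta.class.fromName('string');   % metaclass object describing the string class

disp(mc.Sealed)                       % logical flag for the Sealed attribute
disp({mc.SuperclassList.Name})        % names of the direct superclasses, if any are exposed

publicMethods = methods('string');    % cell array of public method names
fprintf('%d public methods\n', numel(publicMethods));
```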
The internal string data is stored inside a class property called "data", which is private and hidden and so is not regularly accessible except via the class methods. If you want to access it, you can't simply use struct(), but you could try using James Tursa's mxGetPropertyPtr, as explained here: https://undocumentedmatlab.com/blog/accessing-private-object-properties
As for the discussion above regarding the specific UTF representation, I think that you will find the following discussion interesting, especially in the comments thread: https://undocumentedmatlab.com/blog/couple-of-matlab-bugs-and-workarounds
2 Comments
Stephen23
on 24 Oct 2017
+1 for the breakdown. Hopefully this will prompt more investigation.
Jim Hokanson
on 24 Oct 2017
Edited: Jim Hokanson
on 24 Oct 2017
Sruthi Geetha
on 23 Oct 2017
2 votes
First of all, strings in MATLAB were introduced in R2017a.
The main difference between strings and character arrays is that a string can be considered a complete object, whereas a character array is a vector of chars. Therefore, with the latter you can access individual characters via indexing, whereas with the former you cannot. Example:
>> s = "hi"
s = "hi"
>> sc = 'hi'
sc = 'hi'
>> sc(1)
ans = 'h'
>> s(1)
ans = "hi"
>> s(2)
Error: Index exceeds matrix dimensions.
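Parenthesis indexing selects whole strings, but the characters inside a string are still reachable. A minimal sketch with illustrative variable names: curly-brace indexing returns the element as a char vector, and functions such as extractBefore and strlength work on the string directly.

```matlab
s = "hi";

c = s{1};                           % curly braces return the char vector 'hi'
firstChar  = c(1);                  % 'h' as a char
firstAsStr = extractBefore(s, 2);   % "h", still a string
n  = strlength(s);                  % number of characters in the string
sc = char(s);                       % explicit conversion to a char vector
```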
3 Comments
Walter Roberson
on 23 Oct 2017
@Sruthi Geetha: this does not answer the question in any way whatsoever, as you do not clarify the "internal differences" that the title requests. Jim Hokanson already quotes the documentation in the question, so repeating information that can be gleaned from the help is hardly telling us what we all want to know: what are strings like inside: how are they stored in memory, with what encoding, how are they related to any other data types? Rather tantalizingly you wrote that "strings can be considered a complete object": sure, we already know that. But what kind of object?
Perhaps staff are not permitted to answer this question?
Steven Lord
on 23 Oct 2017
The string class was introduced in release R2016b as Walter noted.
The ability to define a string using double quotes like "apple" was introduced in release R2017a. Perhaps that's what Sruthi had in mind.