How to use Unicode numeric values in regexprep?
Vlad Atanasiu
on 28 Mar 2024 at 12:36
How can "Häagen-Dasz" be converted to "Haagen-Dasz" using Unicode numeric values? For example,
regexprep('Häagen-Dasz','ä','a')
works fine, but
regexprep('Häagen-Dasz','\x{C4}','a')
does not. Here, the hexadecimal \x{C4} stands for [latin capital letter a] with diaeresis, i.e. [ä].
1 Comment
VBBV
on 28 Mar 2024 at 13:11
I am not sure I understand your question correctly, but read the answer below.
Accepted Answer
Yash
on 28 Mar 2024 at 13:00
Edited: Yash
on 28 Mar 2024 at 13:03
Hi Vlad,
'\x{C4}' represents the Unicode character Ä (Latin Capital Letter A with Diaeresis) in hexadecimal notation.
If you want to replace ä (Latin Small Letter A with Diaeresis), you should use \x{E4}, which is its Unicode hexadecimal representation.
Since you want to replace ä with a, use the Unicode numeric value of ä in the pattern and a as the replacement:
regexprep('Häagen-Dasz','\x{E4}','a')
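If you are unsure which hexadecimal value a character has, you can ask MATLAB for it and build the pattern from the result. The following is a minimal sketch, not part of the original answer; the variable names (str, codeHex, pattern) are only illustrative, and char(228) is used instead of a literal ä so that the result does not depend on the file encoding.

```matlab
% Build the test string; char(228) is U+00E4 (Latin small letter a with diaeresis).
str = ['H' char(228) 'agen-Dasz'];

% Look up the numeric value of the second character as hexadecimal text.
codeHex = dec2hex(double(str(2)));      % 'E4'

% Assemble the regexprep pattern \x{E4} from that value.
pattern = sprintf('\\x{%s}', codeHex);

% Replace the character with a plain 'a'.
out = regexprep(str, pattern, 'a');     % 'Haagen-Dasz'

% For comparison: \x{C4} is the capital letter, which does not occur in str,
% so the string comes back unchanged.
unchanged = regexprep(str, '\x{C4}', 'a');
```

Because regexprep is case-sensitive by default, the capital-letter value C4 never matches the lowercase ä, which is why the original attempt appeared to do nothing.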
Hope this helps!
More Answers (2)
Stephen23
on 28 Mar 2024 at 13:43
inp = 'Häagen-Dasz';
baz = @(v)char(v(1)); % only need the first decomposed character.
out = arrayfun(@(c)baz(py.unicodedata.normalize('NFKD',c)),inp) % remove diacritics.
Read more:
https://docs.python.org/3/library/unicodedata.html
https://stackoverflow.com/questions/16467479/normalizing-unicode
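The approach above needs a Python installation that MATLAB can call. If that is not available, a rough alternative is to map a few accented code points to their base letters directly with regexprep. This is only a sketch under stated assumptions: the table below (classes and bases are hypothetical names) covers a handful of lowercase Latin-1 letters and is far from a full Unicode normalization.

```matlab
% Test string; char(228) is U+00E4 (Latin small letter a with diaeresis).
str = ['H' char(228) 'agen-Dasz'];

% Small, incomplete lookup table: each character class lists accented
% variants of one base letter, given by their numeric code points.
classes = { ...
    ['[' char(224:229) ']'], ...   U+00E0..U+00E5: a with grave, acute, circumflex, tilde, diaeresis, ring
    ['[' char(232:235) ']'], ...   U+00E8..U+00EB: e with grave, acute, circumflex, diaeresis
    ['[' char(242:246) ']'] ...    U+00F2..U+00F6: o with grave, acute, circumflex, tilde, diaeresis
    };
bases = {'a', 'e', 'o'};

% regexprep accepts cell arrays and applies each pattern/replacement pair in turn.
out = regexprep(str, classes, bases);   % 'Haagen-Dasz'
```

Characters outside the table are left untouched, so the table has to be extended for every additional letter you want to handle.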