I want to get the Unicode code point for a character. Could you please help me? Which encoding should I choose: UTF-8 or Unicode?

I want to convert text to speech for the Malayalam language (the native language of Kerala). First I need the Unicode value of each letter. I saved the characters in a text file. What is the command to read a text file and get the Unicode values?

1 Comment

I saved a Malayalam phoneme in a text file as UTF-8. I want to get the Unicode value of that letter. Could you please send me the code to read the file and get the corresponding Unicode value?

Answers (1)

The available characters are listed at https://en.wikipedia.org/wiki/Malayalam_(Unicode_block) . They start at U+0D00 which is char(3328) for the first entry.
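
As a small illustration of how the hexadecimal code points in that chart relate to the decimal values MATLAB works with, here is a minimal sketch that uses only built-in functions. The variable names are made up for the example.

```matlab
% Convert between the hexadecimal code points shown in Unicode charts
% and the decimal values that char() and double() use.
firstCode = hex2dec('0D00');          % 3328, start of the Malayalam block
letterA   = char(hex2dec('0D05'));    % the letter at U+0D05
codeOfA   = double(letterA);          % back to a decimal code point
hexOfA    = dec2hex(codeOfA, 4);      % 4-digit hex text, as used in the charts
fprintf('U+%s is decimal %d\n', hexOfA, codeOfA);
```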
The way to read the file to get the Unicode values depends on exactly how the file was stored. Sometimes the method can be quite simple, but until you know which encoding is being used you have to be more careful. See https://www.mathworks.com/matlabcentral/answers/267176-read-and-seperate-csv-data#answer_209938 for some code of mine that figures out how a document has been encoded.

8 Comments

You changed the title of your posting to ask the additional question,
"Which encoding type should I need to choose? UTF8 or Unicode???"
UTF-8 is one of the ways of representing Unicode. It is probably the most common way of representing Unicode.
Unicode code points in the range you need, U+0Dxx, require 3 bytes each in UTF-8. In UTF-16LE or UTF-16BE they would require only 2 bytes each, which could matter for a long document, but fewer applications expect UTF-16, so UTF-8 is sometimes more convenient.
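
The byte counts can be checked directly. The following is a rough sketch using the built-in unicode2native function; it assumes the encoding names 'UTF-8' and 'UTF-16LE' are supported on your system, and the variable names are only illustrative.

```matlab
% Compare how many bytes one Malayalam letter needs in each encoding.
letterA    = char(hex2dec('0D05'));                % U+0D05
utf8Bytes  = unicode2native(letterA, 'UTF-8');     % expected: 3 bytes (E0 B4 85)
utf16Bytes = unicode2native(letterA, 'UTF-16LE');  % expected: 2 bytes
fprintf('UTF-8: %d bytes, UTF-16LE: %d bytes\n', ...
    numel(utf8Bytes), numel(utf16Bytes));
```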
"I saved a Malayalam phoneme in a text file with utf8. I want to get the Unicode of that letter. Could you please send me the code to fetch the file and get the corresponding Unicode value"
Use fileread() to read the file. If you assign the result to the variable S, then the Unicode code point corresponding to each character is double(S). Just be careful about the fact that Unicode charts are organized in hexadecimal rather than in decimal. If you want to see the hex version of the Unicode code point numbers, use dec2hex(double(S), 4).
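
One caveat: depending on the MATLAB release, fileread() may decode the bytes with the system default encoding rather than UTF-8. A sketch that states the encoding explicitly through fopen is shown here. It writes its own small test file so it can be run as is; the temporary file name and the variable names are assumptions for the example, not part of the original question.

```matlab
% Write a small UTF-8 test file so the example is self-contained.
testFile = [tempname '.txt'];                  % hypothetical temporary file
fid = fopen(testFile, 'w');
fwrite(fid, uint8([224 180 133]), 'uint8');    % UTF-8 bytes of U+0D05
fclose(fid);

% Read it back, telling MATLAB explicitly that the bytes are UTF-8.
[fid, msg] = fopen(testFile, 'r', 'n', 'UTF-8');
if fid < 0
    error('Could not open "%s": %s', testFile, msg);
end
S = fread(fid, [1 inf], '*char');              % decoded characters, row vector
fclose(fid);
delete(testFile);

codePoints = double(S);                        % decimal code points
hexCodes   = dec2hex(codePoints, 4);           % one row of hex text per character
disp(hexCodes)
```

For a real file, replace testFile with the path to your own text file and drop the part that writes and deletes it.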
After running the program it gives a Unicode value. But the Unicode value of the first swarakshara in Malayalam, 'അ', is 0D05 (from the wiki), while I got 2026 as the answer. Please help me.
I am attaching the text file, which contains the first swarakshara in Malayalam, 'അ', and the program I used to find the Unicode value.
fid = fopen('a.txt', 'r', 'n', 'UTF-8');    % open the file as UTF-8
S = fread(fid, [1 inf], '*char');           % size comes before precision
fclose(fid);
if S(1) == 65279; S(1) = ''; end            % remove the UTF-8 byte order mark (U+FEFF)
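
A plausible explanation for the 2026 reported above is that the file's UTF-8 bytes were decoded with a single-byte encoding such as Windows-1252. That is an assumption about the asker's system, not something confirmed in the thread. The sketch below decodes the three UTF-8 bytes of U+0D05 both ways with the built-in native2unicode function.

```matlab
% Decode the UTF-8 bytes of U+0D05 in two different ways.
rawBytes = uint8([224 180 133]);                      % E0 B4 85
asUtf8   = native2unicode(rawBytes, 'UTF-8');         % one character
asCp1252 = native2unicode(rawBytes, 'windows-1252');  % three characters
disp(dec2hex(double(asUtf8), 4))      % hex code point of the single character
disp(dec2hex(double(asCp1252), 4))    % one row per wrongly decoded character
```

With the wrong encoding, the last byte 0x85 maps to U+2026 (the ellipsis character), which matches the value that was reported.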
It shows an error when using fread (invalid file identifier).
audiodir = 'C:\Users\NeeK\Documents\MATLAB\EE403\Final Project\malayalm\wav'; %adjust as appropriate
[filename, pathname] = uigetfile('*.txt', 'Choose a text file');
if ~ischar(filename)
    fprintf('Cancel!\n')
    return; %user cancel
end
fullname = fullfile(pathname, filename);
[fid, msg] = fopen(fullname, 'r', 'n', 'UTF-8');
if fid < 0
    error('Failed to open file "%s" because "%s"', fullname, msg);
end
S = fread(fid, [1 inf], '*char');   %size comes before precision
fclose(fid);
if isempty(S)
    fprintf('Text file "%s" is empty!\n', fullname);
    return
end
if S(1) == 65279; S(1) = ''; end    %remove byte order mark
audio_data = [];
fs = 1;
for thischar = S
    %audio files are expected to be named by 4-digit hex code point, e.g. 0d05.wav
    basename = sprintf('%04x.wav', double(thischar));
    this_filename = fullfile(audiodir, basename);
    if ~exist(this_filename, 'file')
        fprintf('audio file "%s" not found, skipping character "%c"\n', basename, thischar);
    else
        [thissound, fs] = audioread(this_filename);
        if isempty(audio_data)
            audio_data = thissound;
        else
            %pad whichever signal has fewer channels with zero columns
            oldchan = size(audio_data, 2);
            newchan = size(thissound, 2);
            if newchan < oldchan
                thissound(end,oldchan) = 0;
            elseif oldchan < newchan
                audio_data(end,newchan) = 0;
            end
            audio_data = [audio_data; thissound];
        end
    end
end

Asked:

on 3 Mar 2018

Commented:

on 8 Mar 2018
