Read Unicode Files

Reads Unicode strings from a file and outputs a cell array of strings

Vlad Atanasiu

Version 1.7 (76.9 KB)

3.7K Downloads

6 Jan 2015

function C = textscanu(filename, encoding, del_sym, eol_sym, wb)
% TEXTSCANU Reads Unicode strings from a file and outputs a cell array of strings
%
% -------------
% INPUT
% -------------
% filename - string with the file's name and extension
% - example: 'textscanu.m.txt'
% encoding - encoding of the file
% - default: UTF-16LE
% - examples: UTF-16LE (little endian), UTF-8.
% - See http://www.iana.org/assignments/character-sets
% - MS Notepad saves in UTF-16LE ('Unicode'),
% UTF-16BE ('Unicode big endian'), UTF-8 and ANSI.
% del_sym - column delimiter symbol as an ASCII numeric code
% - default: 9 (tab)
% eol_sym - end-of-line delimiter symbol as an ASCII numeric code
% - default: 13 (carriage return) [Note: line feed=10]
% - on MS Windows use 13, on Unix 10
% wb - displays a waitbar if wb = 'waitbar'
%
% Defaults:
% -------------
% BOM - the first character of the file is assumed to be a
% Byte Order Mark and is removed if its unicode2native()
% value is 26
% byte_encoding - this value is read from the last two characters
% of the encoding input variable if they are 'LE' or 'BE',
% otherwise 'little endian' is the default for Windows and
% 'big endian' for Unix
% eol_len - number of characters used as end of line markers;
% on Windows with eol_sym = 13, eol_len is 2,
% otherwise 1
%
% -------------
% OUTPUT
% -------------
% C - cell array of strings
%
% -------------
% EXAMPLE
% -------------
% C = textscanu('textscanu.txt', 'UTF-8', 9, 13, 'waitbar');
% Reads the UTF-8 encoded file 'textscanu.txt', whose
% columns are delimited by tabs and whose lines are
% delimited by carriage returns. Shows a waitbar to make
% the progress of the function visible.
%
% -------------
% NOTES
% -------------
% 1. Matlab's textscan function doesn't seem to handle
% multiscript Unicode files properly. Characters
% outside the ASCII range are given the \u001a or
% ASCII 26 value, which usually renders on the
% screen as a box.
%
% Additional information at "Loren on the Art of Matlab":
% http://blogs.mathworks.com/loren/2006/09/20/
% working-with-low-level-file-io-and-encodings/#comment-26764
%
% 2. Text editors such as Microsoft Notepad or Notepad++ use
% a carriage return (CR, ASCII 13) followed by a line feed
% (LF, ASCII 10) to mark line ends (when you hit the Enter key,
% for example), instead of a single character: a line feed
% on Unix, a carriage return in Microsoft Word.
%
% In textscanu use ASCII 13 as the delimiter when line ends
% are marked with the CR/LF combination. Since the LF
% is beyond the end of a given line and not part of the next,
% it is disregarded by the function.
%
% 3. If you get spaces between characters, try changing
% the encoding parameter.
%
% -------------
% BUG
% -------------
% When inspecting the output with the Array Editor,
% in the Workspace or through the Command Window,
% boxes might appear instead of Unicode characters.
% Type C{1,1} at the prompt or in Array Editor click
% on C then C{1,1}: you will see the correct string
% if you have a Unicode font for the appropriate
% character ranges installed and enabled for the Command
% Window and Array Editor (File > Preferences > Fonts).
%
% However, up to Matlab R2010a at least, Unicode
% characters display as boxes in figures, even if
% data is correctly stored in Matlab as Unicode.
%
% -------------
% REQUIREMENTS
% -------------
% Matlab version: starting with R2006b
%
% See also: textscan
%
% -------------
% REVISIONS LOG
% -------------
% 2015.01.06 - [fix] eol_len now set for all number of input arguments
% 2014.05.04 - [fix] attempt to close figure only if it exists
% 2011.01.17 - [new] support for Unix
% - [new] automatic detection of BOM presence
% 2010.12.31 - [new] no requirement anymore not to end the
% file with end of line marks
% - [fix] define default waitbar handle value
% and make the message more informative
% 2010.10.04 - [fix] upgrade to Matlab version 2007a
% 2009.06.13 - [new] added option to display a waitbar
% 2008.02.27 - function creation
%
% -------------
% CREDITS
% -------------
% Vlad Atanasiu
% atanasiu@alum.mit.edu, http://www.waqwaq.info/atanasiu/
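
The help text above describes three steps: decode the file with the given encoding, drop a leading Byte Order Mark, and split the text at the end-of-line and column delimiters. The following is a minimal sketch of those steps using only built-in MATLAB functions. It is not the code of textscanu itself: the file name and sample data are hypothetical, the BOM is detected by comparing the first character with U+FEFF rather than through the unicode2native() value mentioned above, and both CR/LF and bare LF line ends are accepted for simplicity.

```matlab
% Hypothetical demo file in the temporary directory.
fname = fullfile(tempdir, 'textscanu_demo.txt');

% Two rows, two tab-separated columns, CR/LF line ends, leading BOM (U+FEFF).
% Non-ASCII text is built from code points to avoid source-encoding issues.
row1 = ['alpha' char(9) char([946 951 964 945])];           % Greek letters
row2 = [char([1075 1072 1084 1084 1072]) char(9) 'delta'];  % Cyrillic letters
txt  = [char(65279) row1 char([13 10]) row2 char([13 10])];

% Write the sample as raw UTF-16LE bytes.
fid = fopen(fname, 'w');
fwrite(fid, unicode2native(txt, 'UTF-16LE'), 'uint8');
fclose(fid);

% Step 1: read the raw bytes and decode them with the known encoding.
fid   = fopen(fname, 'r');
bytes = fread(fid, Inf, '*uint8')';
fclose(fid);
str = native2unicode(bytes, 'UTF-16LE');

% Step 2: drop a leading Byte Order Mark if one is present.
if ~isempty(str) && double(str(1)) == 65279
    str(1) = [];
end

% Step 3: split into lines, discard empty lines (for example the one
% after the final line end), then split each line at tabs.
lines = regexp(str, '\r\n?|\n', 'split');
lines = lines(~cellfun(@isempty, lines));

% Assumes every line has the same number of columns as the first one.
ncols = numel(regexp(lines{1}, '\t', 'split'));
C = cell(numel(lines), ncols);
for k = 1:numel(lines)
    C(k, :) = regexp(lines{k}, '\t', 'split');
end

delete(fname);   % remove the demo file
```

With the File Exchange function on the path, the corresponding call for such a file would presumably be C = textscanu(fname, 'UTF-16LE', 9, 13), issued before the demo file is deleted.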

Cite As

Vlad Atanasiu (2026). Read Unicode Files (https://www.mathworks.com/matlabcentral/fileexchange/18956-read-unicode-files), MATLAB Central File Exchange. Retrieved May 5, 2026.

Acknowledgements

Inspired by: Information-based Similarity Toolbox

Inspired: Information-based Similarity Toolbox, WH-1080 weather station data viewer


MATLAB Release Compatibility

Compatible with any release

Platform Compatibility

Windows
macOS
Linux

Version	Published	Release Notes
1.7	6 Jan 2015	[fix] eol_len now set for all number of input arguments
1.6.0.0	19 Nov 2014	Published as Matlab toolbox (.mltbx file).
1.5.0.0	8 May 2014	[fix] attempt to close figure only if figure exists
1.4.0.0	1 Jan 2011	[new] no requirement anymore not to end the file with end of line marks; [fix] define default waitbar handle value and make the message more informative; [fix] upgrade to Matlab version 2007a
1.1.0.0	13 Jun 2009	Added option to display a waitbar showing the progress of data reading.
1.0.0.0	23 Jun 2008	Adding sample file, clarifying format of input file.
