Code covered by the BSD License  


File Size: 2.8 KB, File ID: #18956

Read Unicode Files

by Vlad Atanasiu

 

27 Feb 2008 (Updated 01 Jan 2011)

Reads Unicode strings from a file and outputs a cell array of strings


File Information
Description

TEXTSCANU Reads Unicode strings from a file and outputs a cell array of strings

-------------
INPUT
-------------
filename - string with the file's name and extension
           - example: 'textscanu.m.txt'
encoding - encoding of the file
           - default: UTF-16LE
           - examples: UTF-16LE (little endian), UTF-8
           - See http://www.iana.org/assignments/character-sets
           - MS Notepad saves in UTF-16LE ('Unicode'),
             UTF-16BE ('Unicode big endian'), UTF-8 and ANSI.
del_sym - column delimiter symbol, given as its ASCII numeric code
           - default: 9 (tab)
eol_sym - end-of-line delimiter symbol, given as its ASCII numeric code
           - default: 13 (carriage return) [Note: line feed = 10]
wb - displays a waitbar if wb = 'waitbar'

-------------
OUTPUT
-------------
C - cell array of strings

-------------
EXAMPLE
-------------
C = textscanu('textscanu.m.txt', 'UTF8', 9, 13, 'waitbar');
Reads the UTF-8 encoded file 'textscanu.m.txt', whose columns
are delimited by tabs and whose lines are delimited by
carriage returns. Shows a waitbar to make the function's
progress visible.

-------------
NOTES
-------------
1. Matlab's textscan function does not seem to handle
multiscript Unicode files properly. Characters
outside the ASCII range are replaced with \u001a
(ASCII 26), which usually renders on the
screen as a box.

Additional information at "Loren on the Art of Matlab":
http://blogs.mathworks.com/loren/2006/09/20/working-with-low-level-file-io-and-encodings/#comment-26764
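
As an illustration of the low-level file I/O approach discussed at that link, the following is a minimal sketch, not the code of textscanu.m. It assumes that opening the file with an explicit encoding and reading it with fread is enough to recover the characters. The sample string and the temporary file name are made up for the example.

```matlab
% Sketch only: write a small UTF-8 file, then read it back as characters.
str   = ['A' char(945) char(1073)];       % Latin A, Greek alpha, Cyrillic be
fname = [tempname '.txt'];                % temporary file (illustrative)

% Write the string as raw UTF-8 bytes.
fid = fopen(fname, 'w');
fwrite(fid, unicode2native(str, 'UTF-8'), 'uint8');
fclose(fid);

% Reopen with an explicit encoding and decode the bytes into characters.
fid = fopen(fname, 'r', 'n', 'UTF-8');
txt = fread(fid, '*char')';               % row vector of characters
fclose(fid);
delete(fname);

disp(double(txt))                         % code points of the characters read
```

If the decoding worked, txt holds the same code points as str, with no ASCII 26 substitutes.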

2. Text editors such as Microsoft Notepad or Notepad++ use
a carriage return (CR, ASCII 13) followed by a line feed
(LF, ASCII 10) to mark line ends (when you hit the Enter key,
for example), instead of a single character: Unix uses
just a line feed, while Microsoft Word uses just a
carriage return.

For files whose lines end with the CR/LF combination, use
ASCII 13 as the end-of-line delimiter in textscanu. Since the
LF lies beyond the end of a given line and is not treated as
part of the next, the function disregards it.
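
The CR/LF handling described in this note can be sketched as follows. This is an illustrative reconstruction, not the implementation in textscanu.m: the variable names are hypothetical, every row is assumed to have two columns, and the 'split' option of regexp may not be available in the oldest Matlab releases that textscanu supports.

```matlab
% Sketch only: split CR/LF-terminated, tab-delimited text into a cell array.
CR = char(13); LF = char(10); TAB = char(9);
txt = ['a' TAB 'b' CR LF 'c' TAB 'd' CR LF];   % made-up sample content

rows = regexp(txt, CR, 'split');        % split on the carriage return only
rows = regexprep(rows, ['^' LF], '');   % drop the LF left at the start of later pieces
rows = rows(~cellfun(@isempty, rows));  % ignore the empty piece after the final CR/LF

C = cell(numel(rows), 2);               % assumes two columns in every row
for k = 1:numel(rows)
    C(k, :) = regexp(rows{k}, TAB, 'split');
end
```

Splitting on CR alone leaves each LF at the head of the following piece, where it can be discarded without affecting the column contents.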

-------------
BUG
-------------
When inspecting the output with the Array Editor,
in the Workspace or through the Command Window,
boxes might appear instead of Unicode characters.
Type C{1,1} at the prompt, or in the Array Editor click
on C and then on C{1,1}: you will see the correct string
if you have a Unicode font covering the appropriate
character ranges installed and enabled for the Command
Window and Array Editor (File > Preferences > Fonts).

However, up to Matlab R2010a at least, Unicode
characters display as boxes in figures, even if
data is correctly stored in Matlab as Unicode.
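
One way to check that the data is stored correctly even when it displays as boxes is to look at the numeric code points rather than the rendered glyphs. The snippet below is a small sketch with a made-up sample string; in practice s would be an element of the output, such as C{1,1}.

```matlab
% Sketch only: inspect code points instead of relying on font rendering.
s = char([1575 1604 1593]);      % hypothetical sample: three Arabic letters
codes = double(s);               % numeric code points stored in the char array

fprintf('U+%04X ', codes);       % print each code point in hexadecimal
fprintf('\n');
```

Values of 26 in codes would indicate that characters were lost on import. Any other values outside the ASCII range suggest that the problem is only one of display.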

-------------
REQUIREMENTS
-------------
Matlab version: starting with R2006b

-------------
REVISIONS LOG
-------------
2010.12.31 - [new] the file may now end with
                   end-of-line marks
           - [fix] define default waitbar handle value
                   and make the message more informative
2010.10.04 - [fix] upgrade to Matlab version 2007a
2009.06.13 - [new] added option to display a waitbar
2008.02.27 - function creation

-------------
CREDITS
-------------
Vlad Atanasiu
atanasiu@alum.mit.edu | http://www.waqwaq.info/atanasiu/

MATLAB release MATLAB 7.3 (R2006b)
Comments and Ratings (2)
28 Feb 2008 Siyi Deng

I got this error message:

??? Undefined function or variable "eos".

Error in ==> textscanu at 75
    sos = eos + 2; eos = crt(n) - 1;

21 Jun 2008 Vlad Atanasiu

Siyi: Check the format of the input file. Each row must have as many strings as the other rows, the strings have to be tab-delimited, and have carriage-returns at the end of rows. See the sample file zipped with textscanu.m. / Hope this helps. / Vlad

Updates
01 Mar 2008

The function now supports single column files.

23 Jun 2008

Adding sample file, clarifying format of input file.

13 Jun 2009

Added option to display a waitbar showing the progress of data reading.

01 Jan 2011

[new] the file may now end with end-of-line marks
[fix] define default waitbar handle value and make the message more informative
[fix] upgrade to Matlab version 2007a

Tag Activity for this File
Tag            Applied By      Date/Time
data import    Vlad Atanasiu   22 Oct 2008 09:50:51
data export    Vlad Atanasiu   22 Oct 2008 09:50:51
unicode        Vlad Atanasiu   22 Oct 2008 09:50:51
files          Vlad Atanasiu   22 Oct 2008 09:50:51
importexport   Vlad Atanasiu   22 Oct 2008 09:50:51
