Path: news.mathworks.com!newsfeed-00.mathworks.com!newsfeed2.dallas1.level3.net!news.level3.com!postnews.google.com!r36g2000vbn.googlegroups.com!not-for-mail
From: Bryan Heit <bryans.spam.trap@gmail.com>
Newsgroups: comp.soft-sys.matlab
Subject: Reading textfile
Date: Wed, 16 Sep 2009 17:01:03 -0700 (PDT)
Organization: http://groups.google.com
Lines: 44
Message-ID: <30193250-6f5b-4bfe-ae07-3ead1a055732@r36g2000vbn.googlegroups.com>
NNTP-Posting-Host: 99.242.24.172
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-Trace: posting.google.com 1253145663 18981 127.0.0.1 (17 Sep 2009 00:01:03 GMT)
X-Complaints-To: groups-abuse@google.com
NNTP-Posting-Date: Thu, 17 Sep 2009 00:01:03 +0000 (UTC)
Complaints-To: groups-abuse@google.com
Injection-Info: r36g2000vbn.googlegroups.com; posting-host=99.242.24.172; 
	posting-account=i6rmKwoAAACP4NlMjQrEgszvWbe-df6L
User-Agent: G2/1.0
X-HTTP-UserAgent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.14) 
	Gecko/2009090217 Ubuntu/9.04 (jaunty) Firefox/3.0.14,gzip(gfe),gzip(gfe)
Xref: news.mathworks.com comp.soft-sys.matlab:570950


I am having trouble reading in a text file.  What I want is to
generate an array of strings, 1 column wide by as many rows long as
there are lines in the dataset.  The dataset is an HTML page saved as
text, containing bioinformatic information.  I'm working on a script
that'll pull specific species data out of the dataset, but cannot make
much progress.  I've tried several ways of reading the data
(importdata, textscan, etc.) to no avail.  At best the first 4-5 lines
get read in, then the read process is terminated (there are thousands
of lines).  The data itself looks as follows:

--------------------------------------------------------------------------------
NPSA gnl|sp|P0C9I2  (1107L_ASFK5) Protein MGF 110-7L OS=African swine
fever virus (isolate Pig/Kenya/KEN-50/1950) GN=Ken-016 PE=3 SV=1

*****› PATTERN 1
 Site :    56-   64, Identity
   tyvescrfcw_DCEDGVCTS_riwgnnstsi
--------------------------------------------------------------------------------
NPSA gnl|sp|P0C9I3  (1107L_ASFM2) Protein MGF 110-7L OS=African swine
fever virus (isolate Tick/Malawi/Lil 20-1/1983) GN=Mal-013 PE=3 SV=1

*****› PATTERN 1
 Site :    56-   64, Identity
   tyvescrfcw_DCEDGVCTS_rvwgnnstsi
--------------------------------------------------------------------------------
NPSA gnl|sp|P0C9I4  (1107L_ASFP4) Protein MGF 110-7L OS=African swine
fever virus (isolate Tick/South Africa/Pretoriuskop Pr4/1996)
GN=Pret-017 PE=3 SV=1

*****› PATTERN 1
 Site :    56-   64, Identity
   tyvescrfcw_DCEDGICTS_rvwgnnstsi
--------------------------------------------------------------------------------

This goes on and on.  I would like to read every line, even the
'-----' ones and blank ones, into the data array.
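
To make the goal concrete, here is a minimal sketch of the kind of
line-by-line read described above, using fopen/fgetl rather than
importdata or textscan.  The temporary file and its four sample lines
are stand-ins made up for illustration (the real dataset would be
opened by its own path), and the variable names are arbitrary.  This is
one possible approach under those assumptions, not a diagnosis of why
the earlier attempts stopped after a few lines.

```matlab
% Write a small stand-in file (hypothetical sample, not the real dataset).
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
fprintf(fid, '-----\nNPSA gnl|sp|P0C9I2  (1107L_ASFK5)\n\n Site :    56-   64, Identity\n');
fclose(fid);

% Read the file one line at a time into an N-by-1 cell array of strings.
fid = fopen(filename, 'r');
if fid == -1
    error('Could not open file: %s', filename);
end
lines = {};
tline = fgetl(fid);              % fgetl returns -1 (numeric) at end of file
while ischar(tline)              % blank lines come back as empty char arrays
    lines{end+1, 1} = tline;     %#ok<SAGROW> grow by one row per line
    tline = fgetl(fid);
end
fclose(fid);
delete(filename);                % remove the stand-in file
```

After this runs, lines{k} holds the k-th line of the file, with the
'-----' separators and blank lines kept as their own rows.  For a file
with thousands of lines the growing cell array is slow but workable;
preallocating would need the line count in advance.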

Any help would be greatly appreciated.

Bryan