MATLAB Answers

Why does MATLAB save strings from a delimited text file as individual characters, and how can I prevent it?

Asked by Sjouke Rinsma on 8 Sep 2017
Latest activity Edited by Sjouke Rinsma on 12 Sep 2017
So, I have a cell structure in Matlab (containing words, dates and numbers separated by ";" loaded from a very large file) which I take certain lines from, then do some calculations on and finally write each field to a separate file as a table (the words being the headers, the dates and numbers the data).
I have the script functioning more or less okay, except that I keep running into one particular problem: after splitting the lines using strsplit, every entry is a char array, so it is treated as a sequence of individual characters. So when I select a cell entry and add an index, for example A.a{1,1}(2), it returns the second letter of the string. It also does this for numbers, making manipulation difficult. Because the entries are split strings, MATLAB treats multi-digit numbers as sequences of single characters, so when I do A.a{1,2} it returns 122, but when I do A.a{1,2}*2 I get ans = 98 100 100 rather than 244. Now I could use str2num, but that doesn't work for words or dates, so it can become pretty cumbersome... I have a hard time finding the right command to convert all entries to single 'words'. I've also tried using cell2array and array2table commands, but I somehow keep running into issues. Any help would be appreciated!
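The behaviour described above can be reproduced with a short sketch. The sample text and variable names here are made up for illustration and are not taken from the actual data file; the point is only that strsplit returns char vectors, so arithmetic operates on character codes until the text is converted with str2double.

```matlab
% Hypothetical sample line; the field contents are made up for illustration.
txt = 'pump;122;07-09-2017';
parts = strsplit(txt, ';');      % 1x3 cell array of char vectors

val = parts{2};                  % the char vector '122', not the number 122
secondChar = val(2);             % indexing a char vector gives one character: '2'
codes = val * 2;                 % arithmetic uses character codes: [98 100 100]

num = str2double(val);           % convert the text to the number 122
doubled = num * 2;               % 244

nums = str2double(parts);        % whole cell array at once; non-numeric text becomes NaN
```

Unlike str2num, str2double accepts a cell array and returns NaN for entries that are not numbers, so words and dates can be picked out afterwards with isnan.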

  4 Comments

Hello Stephen, thanks for your quick reply. I already thought it would not be clear enough. It's just that I work with a 200M+ lines file, so it wasn't so easy to just upload the files. I've used an older version of the script (which still contains some errors) and attached part of a data file to it. I am aware that in this case the numeric values in the result variable are all okay, and that the dates are in a different form. However, when running the similar but bigger script I get a seemingly random mix of quoted and unquoted numeric values (probably due to me not using the data types correctly, I've been somewhat messy!). Nevertheless, I hope this helps. Thanks in advance.
@Sjouke Rinsma: Thank you for uploading some sample data. I note that all of the columns appear to be numeric, except for the date in the first column. I have no idea why you are wasting your time with importing that data as characters. Why not simply import the data directly as numeric?
Hi Stephen; I get what you're saying, though I'm somewhat fuzzy on how to import a ;-delimited text file as numeric data, since this one also contains the 'non-numeric dates'. dlmread does not recognize these, and readtable still imports everything as chars. But maybe I'm just not familiar with the right function to use in this case, or I'm just completely overlooking something.
Nevertheless, as far as I can see, by the time I've reached line 22 I've got a completely numeric array (if I remove the ; at the end) in which I then rewrite the date. Also, for the files I've uploaded, the script seems to work fine, though as I mentioned before, when I'm working with the larger file I somehow get a matrix where, toward the rightmost columns of a field, the data types become mixed (randomly quoted and non-quoted entries in the same column). This also results in written files where some numbers are written as numeric and others as chars (?), resulting in different numbers of digits, which makes everything look really messy (I've uploaded the resulting mat-file of the result structure and the final text file for one field, if you're interested). Especially that last part has got me puzzled... I would assume it's not because of the large data set, since that is actually the reason I'm using Matlab in the first place.


1 Answer

Answer by Stephen Cobeldick on 8 Sep 2017
Edited by Stephen Cobeldick on 8 Sep 2017
 Accepted Answer

Rather than wasting time importing the data as character, you would be much better off using textscan to import numeric values as numeric data. For example, this reads your entire example file:
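opt = {'Delimiter',';', 'CollectOutput',true}; % ';'-separated fields; merge same-type columns
fid = fopen('merged.txt','rt');
hdr = fgetl(fid); % read the header line
fmt = ['%s',repmat('%f',1,nnz(hdr==';'))]; % one date string, then one number per remaining column
C = textscan(fid,fmt,opt{:});
fclose(fid);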
and checking:
>> size(C{1}) % the number of date strings
ans =
6076 1
>> size(C{2}) % the size of the numeric matrix
ans =
6076 47
>> C{1}{[1,end]} % the first and last dates
ans = 07-09-2017 08:25:33
ans = 07-09-2017 10:40:54
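If the date strings in C{1} are needed for calculations, they can be converted after import. The following is only a sketch: it hard-codes the two dates shown above rather than reading the file, and it assumes the format is day-month-year (dd-MM-yyyy), which the sample output does not confirm.

```matlab
% The two date strings shown above; with real data this would be C{1}.
dates = {'07-09-2017 08:25:33'; '07-09-2017 10:40:54'};

% Assumption: day-month-year order. Swap to 'MM-dd-yyyy' if the file is month-first.
t = datetime(dates, 'InputFormat', 'dd-MM-yyyy HH:mm:ss');

elapsed = t(end) - t(1);          % a duration value
elapsedSeconds = seconds(elapsed); % the same span as a plain number of seconds
```

datetime requires R2014b or later; on older releases datenum with a matching format string would be the closest equivalent.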
" I work with a 200M+ lines file"
If you have a very large file that cannot be imported at once then you can adapt the code I have shown above using the method given in the MATLAB documentation for textscan, which reads the data one block at a time.
Basically the trick is to use the third optional input to specify how many lines to read, and call textscan in a loop.
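A minimal sketch of such a loop follows. It is not taken from the documentation: the demo file, the block size of 3 and the running totals are assumptions chosen so the example is self-contained, and the per-block work would be replaced by the real calculations.

```matlab
% Write a small hypothetical demo file standing in for the large ;-delimited file.
fname = [tempname, '.txt'];
fid = fopen(fname, 'wt');
fprintf(fid, 'date;a;b\n');
for k = 1:7
    fprintf(fid, '07-09-2017 08:25:%02d;%d;%g\n', k, k, k/2);
end
fclose(fid);

blockSize = 3;                          % lines per block (assumed; tune to available memory)
opt = {'Delimiter',';', 'CollectOutput',true};
fid = fopen(fname, 'rt');
hdr = fgetl(fid);                       % header line, also used to count the columns
nNum = nnz(hdr == ';');                 % number of numeric columns
fmt = ['%s', repmat('%f', 1, nNum)];

nRows = 0;                              % running totals stand in for the real per-block work
colSum = zeros(1, nNum);
while ~feof(fid)
    C = textscan(fid, fmt, blockSize, opt{:});   % third input: read at most blockSize lines
    if isempty(C{1}), break; end        % nothing left to read
    nRows = nRows + size(C{2}, 1);
    colSum = colSum + sum(C{2}, 1);
end
fclose(fid);
delete(fname);                          % remove the demo file
```

Each pass keeps only one block in memory, so results should be accumulated or written out inside the loop rather than concatenated into one large array.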

  1 Comment

Should've refreshed before answering that previous post... nevertheless thanks for this, I will definitely look into it!
And so I did. Seems to be working fine now, thanks :)
