Thread Subject:
Importing data from a textfile

Subject: Importing data from a textfile

From: Jane T

Date: 13 Jul, 2011 20:39:10

Message: 1 of 8

Carrying on from a thread I started a while ago (http://www.mathworks.com/matlabcentral/newsreader/view_thread/300663#811546), my problem has now become more complex and I could really use some advice.

Basically, my company is running a field trial involving many different instruments and sensors. The data will be stored in a single comma-delimited text file, and the first 6 characters of each line will contain an instrument code defining what the string contains, e.g. $GFDTA, $GFDBG, $AUXP1, $AUXP2. Whilst all strings of the same type share the same structure, the different types of string contain different numbers of variables, different variable types, etc.

I need to be able to process and analyse the data before outputting the required parts in a tab-delimited textfile. It gets more complicated...
1) I don't know how many different types of string there are beforehand.
2) I don't know the structure of each string type.
3) At least one variable in each string will be text rather than numeric

That's not entirely true, but I would like to keep the code as generic as possible so that it can be reused in other studies.

The first stage will be to read the text file and separate the different types of string, based on the instrument code. The other thread contains code that reads the text file line by line and then uses a switch statement to switch between the already determined instrument codes. The structure of each string type is also already defined. Is there a way of generalising this code so that the instrument codes and string structures do not need to be known in advance, or can at least be passed into the function as inputs?

Also, these new files will have ~200 000 lines, so speed will be more of an issue this time around.

The code I am hoping to generalise and extend is below, and I would be grateful for any advice, even if just the name of a built-in function that might be more relevant.

Many thanks
Jane

% Select file to import
[filename, pathname] = uigetfile('*.gvl');
fid = fopen([pathname,filename]);

% Read first line of file
tline = fgetl(fid);

% Initialise storage arrays
A = [];
G = [];
D = [];

% Loop through each line in the file determining the data type held in the
% line and storing the data to the appropriate array
while ischar(tline)
    switch tline(1:6)
        case '$GF3DW'
            temp = textscan(tline,'%s%n%n%n%n%s','delimiter',',');
            A(end+1,:) = [temp{2:5},...
                datenum(temp{6}{1}(1:end-3),'yyyy/mm/dd HH:MM:SS')];
        case '$GFDTA'
            temp = textscan(tline,'%s%n%n%n%n%n%s%s%s%s','delimiter',',');
            if temp{9}{1}(1:end-3) == '1'
                G(end+1,:) = [temp{2:6},datenum(temp{7}{1},...
                    'yyyy/mm/dd HH:MM:SS'),hex2dec(temp{9}{1}(1:end-3))];
            end
        case '$GFDBG'
            temp =textscan(tline,'%s%n%n%n%n%n%n%n%n%n%s','delimiter',',');
            D(end+1,:) =[temp{2:5},hex2dec(num2str(temp{6})),temp{7:10},...
                datenum(temp{11}{1}(1:end-3),'yyyy/mm/dd HH:MM:SS')];
        otherwise
    end
    tline = fgetl(fid);
end

% Close the .gvl file
fclose(fid);

Subject: Importing data from a textfile

From: Matthew

Date: 14 Jul, 2011 01:31:10

Message: 2 of 8

Some sample data would help a lot, but I would become familiar with the regexp functions, especially the 'split' option. You can use it to break an arbitrarily long comma delimited list into a cell array of parts, then operate at will on the parts.

If you know that the instrument code will always start with a $, I would do something like:

%%%%%%%%%%%%%%%%%%%%
instruments = struct;

tline = fgetl(fid); % read the first line
while ischar(tline)

  chunks = regexp(tline, ',' , 'split');
  inst_name = chunks{1}(2:end); %truncate the leading $
  if isfield(instruments, inst_name)
       instruments.(inst_name){end+1} = chunks(2:end);
  else
      instruments.(inst_name){1} = chunks(2:end);
  end

  tline = fgetl(fid); % read the next line; fgetl returns -1 at end of file
end


This will create an instruments struct with a unique entry for each instrument. Each entry is a cell array of cell arrays representing the data. If you know that the format for each instrument will be the same (each instrument will have the same number of elements per line), you could make this a single cell array per instrument using this notation:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%

tline = fgetl(fid); % read the first line
while ischar(tline)

  chunks = regexp(tline, ',' , 'split');
  inst_name = chunks{1}(2:end); %truncate the leading $
  if isfield(instruments, inst_name)
       instruments.(inst_name)(end+1,:) = chunks(2:end);
  else
      instruments.(inst_name)(1,:) = chunks(2:end);
  end

  tline = fgetl(fid); % read the next line; fgetl returns -1 at end of file
end


Another thing you might experiment with to speed things up is to read the entire file into memory with the 'fileread()' command, and then use regexp and cellfun to search the contents in memory.
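
As a rough illustration of the in-memory approach, the sketch below groups lines by instrument code without a per-line parsing loop. It is only a sketch: the literal txt string stands in for the output of fileread(), the variable names (txt, lines, codes, names, idx) are made up for the example, and it assumes every instrument code is a valid struct field name.

```matlab
% Stand-in for: txt = fileread(fullfile(pathname, filename));
txt = sprintf('%s\n', ...
    '$GF3DW, 2.44, 244.2, -6.5, 1.62, 2008/03/12 09:17:02*23', ...
    '$GFDTA,1,0.0,0,1.0,15,2008/03/12 09:17:02, CH4OP-1023,2830*57', ...
    '$GF3DW, 2.47, 244.7, -4.1, 1.62, 2008/03/12 09:17:02*23');

% Split into lines (tolerating Windows line endings) and drop empty lines
lines = regexp(txt, '\r?\n', 'split');
lines = lines(~cellfun(@isempty, lines));

% Extract the text between the leading '$' and the first comma of every line
codes = regexp(lines, '^\$(\w+),', 'tokens', 'once');

% Keep only the lines that actually carry an instrument code
hasCode = ~cellfun(@isempty, codes);
lines = lines(hasCode);
codes = cellfun(@(t) t{1}, codes(hasCode), 'UniformOutput', false);

% Group the raw lines by code; assumes each code is a valid field name
[names, ~, idx] = unique(codes);
instruments = struct();
for k = 1:numel(names)
    instruments.(names{k}) = lines(idx == k);
end
```

Each field of instruments then holds the raw lines for one instrument, ready to be split or parsed further.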

Subject: Importing data from a textfile

From: Jane T

Date: 14 Jul, 2011 17:41:32

Message: 3 of 8

Thanks a lot for your help. Structures are the obvious way to go; I can't believe it didn't occur to me, especially as I've just been working on another function that uses them extensively!

I agree that some sample data would be useful and would love to have some myself! However, I'm trying to build some code for a dataset that doesn't exist yet, hence the reason for trying to keep it as generalised as possible. I have some from a much simpler system, but this next trial is going to include a larger number of instruments.

$GF3DW, 2.44, 244.2, -6.5, 1.62, 2008/03/12 09:17:02*23
$GF3DW, 2.47, 244.7, -4.1, 1.62, 2008/03/12 09:17:02*23
$GFDTA,1,0.0,0,1.0,15,2008/03/12 09:17:02, CH4OP-1023,2830*57
$GF3DW, 2.44, 246.3, -4.1, 1.63, 2008/03/12 09:17:02*27
$GF3DW, 2.41, 247.6, -3.9, 1.63, 2008/03/12 09:17:02*29
$GF3DW, 2.24, 248.5, -3.2, 1.59, 2008/03/12 09:17:02*24
$GF3DW, 2.17, 245.2, -0.7, 1.61, 2008/03/12 09:17:02*23
$GF3DW, 2.16, 245.8, 2.9, 1.59, 2008/03/12 09:17:03*03
$GF3DW, 2.05, 250.5, 0.8, 1.59, 2008/03/12 09:17:03*0B
$GFDTA,1,0.0,0,1.0,18,2008/03/12 09:17:03, CH4OP-1023,2830*5B
$GF3DW, 2.11, 254.3, -0.7, 1.62, 2008/03/12 09:17:03*26
$GF3DW, 2.20, 257.0, -0.3, 1.64, 2008/03/12 09:17:03*26
$GF3DW, 2.12, 251.8, -2.2, 1.62, 2008/03/12 09:17:03*2C
$GF3DW, 2.14, 253.6, -3.2, 1.64, 2008/03/12 09:17:03*21
$GF3DW, 2.11, 255.8, -1.6, 1.62, 2008/03/12 09:17:03*2C
$GF3DW, 2.00, 259.2, -0.7, 1.62, 2008/03/12 09:17:03*2A
$GF3DW, 1.98, 263.2, -1.2, 1.59, 2008/03/12 09:17:04*2A
$GF3DW, 1.89, 261.6, 0.5, 1.62, 2008/03/12 09:17:04*0F

You are correct in assuming that all data strings associated with the same instrument will have the same format. However, since the date is not numeric, it won't go into an array without first being converted to a number. I think I can adapt your code for initialising each new field so that it searches each element in the line to locate the date and time string, and can then apply datenum(). I suspect that analysis of the data will be easier if I store it in a numeric array rather than a cell array.
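
One way the date search described above might look is sketched here. The pattern and the variable names (datePattern, isDate, dateCol, stamp) are assumptions made for illustration, based only on the sample lines shown above; the trailing *23 style suffix is simply left out of the match.

```matlab
% One sample line, split into fields with surrounding blanks removed
tline  = '$GF3DW, 2.44, 244.2, -6.5, 1.62, 2008/03/12 09:17:02*23';
chunks = strtrim(regexp(tline, ',', 'split'));

% A field counts as a date if it starts with yyyy/mm/dd HH:MM:SS
datePattern = '^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}';
isDate  = ~cellfun(@isempty, regexp(chunks, datePattern, 'once'));
dateCol = find(isDate, 1);   % position of the first date field

% Keep only the matched part (drops the *23 suffix) and convert it
stamp = regexp(chunks{dateCol}, datePattern, 'match', 'once');
t = datenum(stamp, 'yyyy/mm/dd HH:MM:SS');
```

Because every line for a given instrument has the same layout, dateCol only needs to be found once per instrument and can then be reused.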

In terms of using fileread(), I like that idea too. I tried something similar using the importdata() function. My concern is that I'll run out of memory, as my text files are likely to have ~200 000 lines. What is the maximum number of rows allowed in a cell array, or does it depend on the system memory available?

Thanks again for your suggestions
Jane

Subject: Importing data from a textfile

From: Matthew

Date: 14 Jul, 2011 18:05:15

Message: 4 of 8

The regexp 'split' option returns the chunks as a cell array. The members of a cell array can be of mixed classes: you can have a cell array with strings, doubles, singles, structures, other cell arrays... whatever.

When you assign a cell array to a variable using array notation (e.g.):

a = cell(1,3);
b = {'foo', 'bar', 3}

c(1,:) = a;
c(2,:) = b;

c is a 2x3 cell array, so as long as you keep assigning 1x3 cell arrays to each row, it doesn't matter what is in those cell arrays. That was example 2.

If you can't be sure that chunks will always be the same length (for each unique probe), then you can use a cell array of cell arrays (example 1), where each element of the outer cell array is another cell array that can be any length you want. The difference is that indexing is simpler with the rectangular layout of example 2.
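
To make the indexing difference concrete, here is a small sketch of both layouts. The variable names (rect, nested and so on) are hypothetical and the values are borrowed from the sample data above.

```matlab
% Rectangular layout (example 2): one row per line, same length every time
rect = cell(0, 3);
rect(end+1, :) = {'2.44', '244.2', '-6.5'};
rect(end+1, :) = {'2.47', '244.7', '-4.1'};
secondColumn = rect(:, 2);   % a whole column in one indexing step
oneValue     = rect{2, 3};   % row 2, column 3

% Nested layout (example 1): rows may have different lengths
nested = {};
nested{end+1} = {'2.44', '244.2', '-6.5'};
nested{end+1} = {'2.47', '244.7'};
sameRowValue = nested{2}{2}; % two levels of brace indexing

% Pulling out a "column" from the nested layout needs cellfun or a loop
secondNested = cellfun(@(row) row{2}, nested, 'UniformOutput', false);
```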

Subject: Importing data from a textfile

From: Jane T

Date: 14 Jul, 2011 22:56:14

Message: 5 of 8

Thanks again. I now have the following code:

% Read file contents to memory
C = fileread([pathname,filename]);

% Delimit into the separate rows
C = regexp(C,'\n','split');

instruments = struct;

for i = 1:length(C)
  tline = C{i};
  chunks = regexp(tline, ',' , 'split');
  inst_name = chunks{1}(2:end); %truncate the leading $
  if isfield(instruments, inst_name)
       instruments.(inst_name)(end+1,:) = chunks(2:end);
  elseif isvarname(inst_name)
      instruments.(inst_name)(1,:) = chunks(2:end);
  end
end

where
instruments =
    GF3DW: {20529x5 cell}
    GFDTA: {8047x8 cell}
    GFDBG: {123x10 cell}
    GFCSF: {5x1 cell}

However everything is a string.

Every row of the textfile consists of one date and time string and the rest should be numbers. Is there a way I can search the data to determine what each element holds? Maybe when each new field in the structure is initialised. Maybe adapt this line from my previous code where I knew the structure of each line

temp = textscan(tline,'%s%n%n%n%n%s','delimiter',',');

Many thanks again

Subject: Importing data from a textfile

From: Matthew

Date: 15 Jul, 2011 00:04:11

Message: 6 of 8

The simple way would be to write a function that would take a single input, figure out if it needs to be converted to a number or not, then use cellfun on chunks to convert them. However, cellfun uses a for loop inside it, so it is not a fast function.

The way I would go about it would be to write a function that takes the 'chunks' cell array as an input, figures out the ideal format for each element in the array, and outputs a format string for use in textscan. Or, if you know that each line will have the same format but just don't know how many elements there will be, you can use the number of elements in the chunks array to build the string. Cache this string in a variable, then reuse it every time that instrument is found.

e.g:

instruments = struct();
format_strings = struct();
 
 for i = 1:length(C)
   tline = C{i};

   chunks = regexp(tline, ',' , 'split');

   inst_name = chunks{1}(2:end); %truncate the leading $

   if isfield(instruments, inst_name)
        instruments.(inst_name)(end+1,:) = textscan(tline, format_strings.(inst_name), 'delimiter',',');
   elseif isvarname(inst_name)
       format_strings.(inst_name) = generateFormat(chunks); % pass every field, since textscan parses the whole line
       instruments.(inst_name)(1,:) = textscan(tline, format_strings.(inst_name), 'delimiter',',');
   end
 end

function format_string = generateFormat(chunks)
    % Save this function in its own file, generateFormat.m

    %option1 - works for any arrangement of numeric and string data, but is slower
    numeric = cellfun(@str2num,chunks,'UniformOutput',false);
    strings = cellfun(@isempty, numeric);

    format_string = cell(1,numel(chunks));
    format_string(strings) = {'%s'};
    format_string(~strings) = {'%n'};
    format_string = cell2mat(format_string);

    %option2 - use INSTEAD of option1 if we know it is going to be %s %n .... %s
    % format_string = cell(1, numel(chunks));
    % format_string(:) = {'%n'};
    % format_string([1 end]) = {'%s'};
    % format_string = cell2mat(format_string);
end
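
Since speed matters for ~200 000 lines, one possible refinement is to call textscan once per instrument rather than once per line. The sketch below assumes the lines for one instrument have already been collected into a cell array (called gf3dw here, a hypothetical name) and that the format string is already known or cached; the three lines are taken from the sample data earlier in the thread.

```matlab
% Lines already grouped for one instrument (sample data from the thread)
gf3dw = { ...
    '$GF3DW, 2.44, 244.2, -6.5, 1.62, 2008/03/12 09:17:02*23', ...
    '$GF3DW, 2.47, 244.7, -4.1, 1.62, 2008/03/12 09:17:02*23', ...
    '$GF3DW, 2.16, 245.8, 2.9, 1.59, 2008/03/12 09:17:03*03'};

% Cached format string for this instrument
fmt = '%s%n%n%n%n%s';

% Join the lines into one newline-separated string (no trailing newline)
block = sprintf('%s\n', gf3dw{:});
block = block(1:end-1);

% A single textscan call returns one column per format specifier
cols = textscan(block, fmt, 'Delimiter', ',');

% Columns 2 to 5 are numeric column vectors, so they concatenate directly
values = [cols{2:5}];
```

Here cols{1} and cols{6} remain cell arrays of strings, while values is an ordinary numeric matrix with one row per line.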
    
   

Subject: Importing data from a textfile

From: Jane T

Date: 19 Jul, 2011 18:17:09

Message: 7 of 8

Thank you so much for your help Matthew, I'm pretty sure that I now have some really generalised code that gets the data into a format I can use.

I have made a few edits to the code you supplied though; I was finding that the str2num() function was giving an output even for the date strings. The code I now have, which appears to work, is shown below. Parts of it are still "clunky", but they seem to work at least. Thanks again for your help, I don't think I could have got here without it!

%% Select file to import
[filename, pathname] = uigetfile('*.gvl');

% Read file contents to memory
C = fileread([pathname,filename]);

% Delimit into the separate rows
C = regexp(C,'\n','split');

instruments = struct();
format_strings = struct();
 
 for i = 1:length(C)
   tline = C{i};

   chunks = regexp(tline, ',' , 'split');

   inst_name = chunks{1}(2:end); %truncate the leading $

   if isfield(instruments, inst_name)
        instruments.(inst_name)(end+1,:) = textscan(tline, format_strings.(inst_name), 'delimiter',',');
   elseif isvarname(inst_name)
       format_strings.(inst_name) = generateFormat(chunks);
       instruments.(inst_name)(1,:) = textscan(tline, format_strings.(inst_name), 'delimiter',',');
   end
 end


function format_string = generateFormat(chunks)
    numeric = cellfun(@testString,chunks,'UniformOutput',true);

    format_string = cell(1,numel(chunks));
    format_string(~numeric) = {'%s'};
    format_string(numeric) = {'%n'};
    format_string = cell2mat(format_string);
end


function output = testString(input)

    x = strtrim(input);
    output = all( (x>='0' & x<='9') | (x=='+') | (x=='.') | (x=='-') );
end
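
A possible alternative to the character test in testString is str2double, which returns NaN for anything it cannot read as a single number. This is only a sketch with made-up variable names (isNum, fmt). Unlike the character test, it accepts exponent notation such as 1e-3 and rejects strings such as '-' or '1.2.3', but fields containing text like 'NaN', 'Inf' or 'i' are edge cases that would need separate handling.

```matlab
% One sample line split into its comma-separated fields
tline  = '$GF3DW, 2.44, 244.2, -6.5, 1.62, 2008/03/12 09:17:02*23';
chunks = regexp(tline, ',', 'split');

% str2double works on a whole cell array of strings at once and
% returns NaN for every field that is not a single number
isNum = ~isnan(str2double(strtrim(chunks)));

% Build the textscan format: %n for numeric fields, %s for the rest
fmt = repmat({'%s'}, 1, numel(chunks));
fmt(isNum) = {'%n'};
fmt = [fmt{:}];
```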

Subject: Importing data from a textfile

From: Matthew

Date: 20 Jul, 2011 11:56:08

Message: 8 of 8

No problem. Glad I could help.
