Converting numeric strings in cell to number for very large dataset

I have read this post which does something similar: https://uk.mathworks.com/matlabcentral/answers/375506-coverting-numeric-strings-in-cell-to-numbers
However I wanted to know if there is a more efficient way to do this as my data-set is very large. My dataset contains both text and numbers I what to convert numbers only to number format.
load('Data.mat')
tic
Qnum=str2double(Q);
toc
This code takes about 124 seconds to run
I have tried preallocating Qnum but the run time is about the same:
Qnum=zeros(length(Q),1);

6 Comments

My full code is as follows. When I tried importing the 5th column (data attached as cell) directly to numbers using %f my code only gave the first line. Sorry the full file is 329MB so is too large to attach.
clear all
clc
year = '2009';
%Ensure header speck is same as file speck except replace %f with %s to
%give header text for numer varables
%The code must have %*s to miss values in the same places to give the
%correct headers
fileIDHeader = fopen('MO_Column_Headers.txt');
formatSpecHeader = '%s %s %s %s %s %*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%s';
DataHeader = textscan(fileIDHeader,formatSpecHeader, 'delimiter',',','EmptyValue',-inf);
NameOfFile = strcat('midas_marine-obs-lon-band-f_',year,'01-',year,'12.txt');
fileID = fopen(NameOfFile);
N=1000000; %Specify a block size
formatSpec = '%D %f %f %s %s %*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%f%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%s';
%In a while loop, call textscan to read each block of data. The file identifier, the format specifer, and the segment size (N), are the first three inputs to textscan. Ignore the commented lines using the CommentStyle name-value pair argument. Specify the tab delimiter using the Delimiter name-value pair argument. Then, process the data in the block. In this example, call scatter to display a scatter plot of temperature and humidity values in the block. The commands within the loop execute while the file pointer is not at the end of the file.
k=0;
Data=[]
while ~feof(fileID)
tic
k = k+1;
Data = textscan(fileID,formatSpec,N, 'delimiter',',','EmptyValue',-inf);
%Data=[Data; DataNew];
end
% selects from the MIDAS data the useful data and seperates the time stamp
% into year month day and hour.
[year,month,day] = ymd(Data{1});
[hour,minute] = hms(Data{1});
UsefulData(:,1) = year; % year
UsefulData(:,2) = month; % month
UsefulData(:,3) = day; % day
UsefulData(:,4) = hour; % hour
UsefulData(:,5) = minute; % minute
UsefulData(:,6) = Data{2}; % LATITUDE
UsefulData(:,7) = Data{3}; % LONGITUDE
UsefulData(:,8) = str2double(Data{1,4}); % stationID number
toc
"Sorry the full file is 329MB so is too large to attach."
You do not need to give us the complete file: all we need is enough so that we have a representative file to work with: in particular the header lines and a megabyte or so of data.
The approach in your question (importing as strings in a cell array then using str2double) is very inefficient. If you really want help with this then please give us a representative file to work with and upload it here: create a new comment, and then click the paperclip button.
What is the best way to extract part of the file to post here? I have extract part of the file here by opening in Excel and then saving as txt but my code dooesn't appear working the way it was with the full file after doing that.
"What is the best way to extract part of the file to post here?"
Use Notepad++.
All the required files are attached in the zip including an extract of the data. I have tested that the files in the zip run the same way with this extract of the data-set as the full data-set.
For an explanation of my MATLAB code formatSpec = '%D %f %f %s %s %*s%*s% ... in line 26 see the Excel file attached where this code is a concatenation of A2:EC2. This file is not required to run the code. All files required to run the code are in the zip in my last comment.

Sign in to comment.

 Accepted Answer

The simplest way to read the example file, assuming you're on R2013b or later:
data = readtable('midas_marine-obs-lon-band-f_200901-200912.txt');
That's it! Numbers are read as numbers, text as text, dates as datetime. readtable figures it all out for you.
If you want to name the columns according to the other file, it's just as easy:
headers = readtable('MO_Column_Headers.txt');
data.Properties.VariableNames = headers.Properties.VariableNames
All done, only takes a few seconds to run.

2 Comments

Thank you for some very simple code, would have saved me about half the week if I had written my code like that in the first place. This code takes about 150 sec to run for a file of data size 935722x133. Is it possible to not spend run-time importing the columns I don't need?
"Is it possible to not spend run-time importing the columns I don't need?"
Possibly. readtable is extremely customisable. You have two options:
  • Use the Format option of readtable and use %*fieldtype for the columns you want to ignore just as you would with textscan. The downside is that you now have to figure out the format.
  • Use detectimportoptions on the file and then edit the SelectedVariableNames property of the options, then pass that to readtable. I've not looked at the details of how detectimportoptions work. It may read the whole file which would not save you any time.

Sign in to comment.

More Answers (0)

Asked:

on 10 Jan 2018

Commented:

on 12 Jan 2018

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!