How increase bufsize for importdata

Question

0 votes

I am using the importdata function to import data from tab-separated and comma-separated text files. This works great for files up to at least 10Mb, but fails on files with an identical format in the 70Mb range with the following error.

Caused by: Error using ==> textscan Buffer overflow (bufsize = 1000005) while reading string from file (row 1, field 1). Use 'bufsize' option. See HELP TEXTSCAN.

Is there an easy way to increase bufsize directly in importdata call, without mucking around in the textscan function? I understand that as an alternative I can rewrite my code using textscan directly, but my current M-code file is working with importdata for smaller imports and I am looking for the simplest solution to allow import of larger data sets.

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

Oleg Komarov on 12 Mar 2011

Open in MATLAB Online

1 vote

You can try to edit line 319 of importdata:

bufsize = min(1000000, max(numel(fileString),100)) + 5;

Set the minimum threshold 1000000 to something higher.

EDIT 14 March 02:14 GMT 00

You had to specify that "NA" should be treated as empty:

fid = fopen('C:\Users\Oleg\Desktop\ancestry-probs-par2.tsv');
% Column headers
colHead = fgetl(fid); 
colHead = textscan(colHead,'%s');
colHead = colHead{1};  
% get # data columns
numH = length(colHead);
% make fmt
fmt = ['%s', repmat('%f',1,numH)];

Import the file in bulk (if enough memory)

% Import file
data = textscan(fid,fmt,'HeaderLines',1,'TreatAsEmpty','NA');
fid = fclose(fid);

Import line by line (26 seconds on my pc, preallocation doesn't give the boost since just 191 lines...)

% Import file
data = cell(0,2);
while ~feof(fid)
data = [data; textscan(fid,fmt,1,'HeaderLines',1,'TreatAsEmpty','NA','CollectOutput',1)];
end
fid = fclose(fid);
rowHead = cat(1,data{:,1});
data = cat(1,data{:,2});

Oleg

9 Comments
Show 7 older comments Hide 7 older comments

David on 13 Mar 2011

Sounds like a great offer! Thanks.

Here's my function. I read one line at a time, because I know it works! Then save this to a string cell. I have managed to get the row and column labels, but I can't quite figure out how to extract the numeric data. Ideally, I would like the numeric data in an array. I have tried cell2mat, but I can't quite get it to work. At the bottom, I paste a test file.

function [data,row_labels,column_labels] = readprobdata_fgetl(filename,dir)

file = [dir,'/',filename];

[status, result] = system( ['wc -l ', file] );

numlines = textscan(result,'%f');

numlines = cell2mat(numlines);

%%Use fgetl

fid = fopen(file);

%get marker names

%read entire file into cell array

for row = 1:numlines

row_str = fgetl(fid);

row_str_cell(row,1) = textscan(row_str,'%s');

end

%get columns labels

column_labels=row_str_cell{1,1};

%get row labels and data

for row = 2:numlines

row_labels{row-1,1} = row_str_cell{row,1}{1,1};

temp_col = row_str_cell{row,1}';

%this line doesn't work

data{row-1,1}=temp_col(2:end,1);

end

%convert cell strings to numeric array

Here is the test file.

test.tsv

2:18372 2:19109 2:19683 2:19696 3:19697 4:20084 X:20117 X:20330

indivA10_GAAGTG .95 1 1 1 1 1 1 1

indivA11_AAAGCG 0 0 0 0 0 .01 .02 .03

indivA12_AATAAG 1 1 1 1 1 1 1 1

indivA1_AAATAG .5 .5 0 0 0 0 0 0

indivA2_TAATTG 1 1 1 1 1 1 1 1

David on 13 Mar 2011

Yes, I understand this would work for this number of columns. I need a solution for a variable and large number of column. I have extended your approach below to get the # cols and rows from the file to calculate the needed buffer size. Unfortunately, this still works fine for small and medium size files, but not for large files. I get col and row labels, but an empty array for data.

function [data,rowHead,colHead] = get_tsv(filename,dir)

file = [dir,'/',filename];

fid =fopen(file);

t = fgetl(fid);

colHead = textscan(t,'%s');

%get # data columns

length = size(colHead{1,1},1);

%make string with correct format for length data

format = ['%s',repmat('%f',1,length)];

%get parameters for buffer size

%get number column characters

[status, num_columns] = system( ['head -n 1 ', file, '| wc -m'] );

num_columns = str2num(num_columns);

%get number of rows

[status, num_rows] = system( ['wc -l ', file] );

num_rows = textscan(num_rows,'%f');

num_rows = cell2mat(num_rows);

bufsize = num_columns * num_rows;

data = textscan(fid,format,'BufSize',bufsize);

%store data

colHead = [colHead{:}];

rowHead = data{1};

data = [data{2:end}];

end

Oleg Komarov on 14 Mar 2011

I forgot to put "'CollectOutput',1" in the bulk import with textscan.

Michael S on 15 Jun 2011

Thanks Oleg this was very helpful. To others if you have a CSV format do not forget that white space is the default delimiter so you need to add 'Delimiter',',' to the textscan arguments. i.e. textscan(fid,fmt,'HeaderLines',1,'Delimiter',',','CollectOutput',1)

Sign in to comment.

Answer 2

Walter Roberson on 12 Mar 2011

0 votes

It looks to me as if it is thinking that the first line is more than 1000000 characters.

How long is the first line?

1 Comment
Show -1 older comments Hide -1 older comments

David on 13 Mar 2011

head -n 1 test_file.tsv | wc -m

1612061

So, I tried bufsize = 1612061 + 100

Same error.

Sign in to comment.

How increase bufsize for importdata

0 Comments
Show -2 older comments Hide -2 older comments

Accepted Answer

9 Comments
Show 7 older comments Hide 7 older comments

More Answers (1)

1 Comment
Show -1 older comments Hide -1 older comments

Categories

Tags

Community Treasure Hunt

How increase bufsize for importdata

0 Comments Show -2 older comments Hide -2 older comments

Accepted Answer

9 Comments Show 7 older comments Hide 7 older comments

More Answers (1)

1 Comment Show -1 older comments Hide -1 older comments

Categories

Tags

See Also

Community Treasure Hunt

0 Comments
Show -2 older comments Hide -2 older comments

9 Comments
Show 7 older comments Hide 7 older comments

1 Comment
Show -1 older comments Hide -1 older comments