"readcell()" command does not read my entire file

I am trying to read a CSV file that is about 43,000 KB (~43 MB) in size with readcell.
The file mixes numbers and strings, and when I try a smaller CSV file of the same type, it reads with no problem.
When I read the bigger file, only part of it is read.
How can I solve this issue?

11 Comments

Can you show the readcell command you ran that only read in part of the CSV file as well as the line in the CSV file where readcell stops importing the data (as well as a few lines before and after that line for context)? You won't be able to attach the whole CSV file since it's big, but showing what's around where readcell stops (or extracting those lines to a new, smaller file and attaching that file, in case there are non-printable characters) may help in determining what stops readcell from reading in all the data.
"You won't be able to attach the whole CSV file since it's big,..."
But, might be able to zip it up and then upload. Otherwise, @Steven Lord's suggestion.
With things like this, it's simply not possible to answer without the data because the problem is going to be data-related -- unless there's a memory issue but that usually will give indications if so.
The alternative to try is to first see how many lines it does read (and is the last line complete?) and then retry with
C1=readcell('yourfile.csv','FileType','text'); % read with expected failure
N=height(C1); % how many rows did it return?
C2=readcell('yourfile.csv','FileType','text','NumHeaderLines',N); % try to pick up from there
...
You can also subtract a few lines from N and retry, to probe whether something in the file at the point of failure is the culprit.
Or, alternatively, use readlines and then inspect the content around the point of failure...so many options...
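A minimal sketch of that readlines inspection; the file name and the row number N here are placeholders for your own values:

```matlab
% Sketch: inspect the raw text around the row where readcell stopped
S = readlines('yourfile.csv');          % string array, one element per line
N = 12345;                              % hypothetical row where the read stopped
disp(S(max(1,N-3):min(numel(S),N+3)))   % a few lines either side, for context
```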
Tevel
Tevel on 25 Jun 2025
Edited: Tevel on 25 Jun 2025
I can't upload the data because it is work related and thus confidential.
I am pretty sure it is due to the size of the CSV file. When I remove rows from a certain row to the last row, readcell reads it with no problem.
Did you try any of the above experiments?
Normally, one would get an "Out of Memory" error if memory really were the issue...suppose it is possible otherwise.
What error, if any do you get?
You can look into the MATLAB tools for large datasets including tall and supporting tools if memory really is the issue.
There is an experiment you can do to determine whether the problem is due to file size, or due to file content.
Prepare a second version of the file with a bunch of the leading content removed, but still leaving in some. If you are able to read about as many records out of the second version as from the first version, then you are running out of memory. If instead the reading stops at the same content location, then you know that there is an issue with the content.
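A rough sketch of that experiment, assuming the file is plain text; the cut point (20,000 lines here) is arbitrary:

```matlab
% Sketch: compare a full read against a copy with the leading content removed
S = readlines('yourfile.csv');
writelines(S(20001:end),'tail.csv');    % second version, first 20000 lines dropped
C1 = readcell('yourfile.csv');
C2 = readcell('tail.csv');
fprintf('Full: %d rows, tail: %d rows\n', size(C1,1), size(C2,1))
% Similar counts -> likely memory; stopping at the same content -> a data issue
```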
dpb
dpb on 25 Jun 2025
Edited: dpb on 25 Jun 2025
My earlier suggested experiment above with the second read should illustrate the same thing, particularly if as noted were to subtract a few lines from the value of N.
Although 43 MB doesn't seem terribly big, except when you add in the overhead of cell arrays.
Maybe somehow the user is running out of Java memory, and increasing the Java memory might solve / delay the problem ?
dpb
dpb on 25 Jun 2025
Edited: dpb on 25 Jun 2025
Wouldn't you expect the user would have received the "OutOfMemoryError: Java heap space" error if so? Or a regular "OutOfMemory" error if actual memory limit?
On the same track, the user should check Preferences > Workspace to see whether the MATLAB array size limit is enabled and, perchance, set to something less than 100% of RAM. Or, if it is checked, uncheck it and see whether the read works when allowed to swap to disk; on a modern machine with an SSD instead of a conventional drive, performance might not be too bad...
But it would be better to simply upload the few lines around the point of failure; surely there isn't anything so sensitive that, without context, anybody could make detrimental use of it. Then again, of course, "policy is policy" despite reason/logic.
Hi and thank you for the active engagement.
I no longer need to use "readcell", because I have switched to using "fopen" instead.
On what you have said: with a different file of the same type, it stops reading around the same place, while a short file of that type reads with no problem, so I am convinced it is a size issue.
Regarding the Java heap space and RAM allocation, I don't know how to do any of that.
"Regarding the Java heep space and RAM alocation, I don't know how to do any of that."
Click on the "Preferences" icon in the toolstrip "Environment" section and explore...all kinds of tweaks you can make there.
The Java heap memory setting is under "General" while the array size limit is under "Workspace"
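If you just want to see the current ceiling before touching anything, something like this should report it from the command line (the setting itself is changed under Preferences > General > Java Heap Memory):

```matlab
% Sketch: query the Java heap limit from the MATLAB command line
maxHeap = double(java.lang.Runtime.getRuntime.maxMemory);  % bytes
fprintf('Java heap limit: %.0f MB\n', maxHeap/2^20)
```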
" because I have switched to using "fopen" instead."
Of course, fopen by itself doesn't do anything except return a file handle; it takes other explicit code to actually read the file content. It would be interesting to see the full code used... I was going to suggest one could revert to lower-level i/o as an alternative, but lacking the file format that wasn't really much of an option.
It would be a very interesting exercise to understand if, indeed, MATLAB is failing with readcell on a file it can read/store in memory otherwise; that would be very significant fodder for Mathworks in enhancing performance and finding/fixing wasteful memory use.
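For illustration only, a lower-level read might look like the following; the '%f%s' format is a hypothetical layout (one numeric column, one text column) since the actual file format wasn't shown:

```matlab
% Sketch: low-level alternative to readcell using fopen/textscan
fid = fopen('yourfile.csv','r');
C = textscan(fid,'%f%s','Delimiter',',','HeaderLines',1);
fclose(fid);
% C is a 1x2 cell: C{1} is the numeric column, C{2} the text column
```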
"Although 43MB doesn't seem terribly big other than when add in the overhead of cell arrays."
What are the dimensions of the CSV file -- how many variables and of what type per field? Are the string data fields of varying length or some known size (or at least maximum)? How many rows would be typical?
It can be demonstrated about the overhead of a cell array for simple cases to get an estimate of how much memory should be required...
d=ones; md=whos('d');
c={d}; mc=whos('c');
fprintf('Double: %d, Cell: %d, Overhead: %d bytes\n',md.bytes, mc.bytes, mc.bytes-md.bytes)
Double: 8, Cell: 112, Overhead: 104 bytes
d=ones(1,2); md=whos('d');
c={d}; mc=whos('c');
fprintf('Double: %d, Cell: %d, Overhead: %d bytes\n',md.bytes, mc.bytes, mc.bytes-md.bytes)
Double: 16, Cell: 120, Overhead: 104 bytes
d=ones(2); md=whos('d');
c={d}; mc=whos('c');
fprintf('Double: %d, Cell: %d, Overhead: %d bytes\n',md.bytes, mc.bytes, mc.bytes-md.bytes)
Double: 32, Cell: 136, Overhead: 104 bytes
d=ones; d=[d d]; md=whos('d');
c=num2cell(d); mc=whos('c');
fprintf('Double: %d, Cell: %d, Overhead: %d bytes\n',md.bytes, mc.bytes, mc.bytes-md.bytes)
Double: 16, Cell: 224, Overhead: 208 bytes
From which one can deduce the cell array overhead is 104 bytes per cell element over the base data storage. The same can be shown for character arrays with 2 bytes/element instead of 8, of course.
Consequently, given today's typical memory footprint, an extra N*104 bytes per cell could begin to add up with very long and wide files...
But, to bring the same data into MATLAB as one variable array would require the same overhead to put the disparate types into a cell array, so the internal footprint would be the same. @Tevel didn't tell/show us what alternate form was used with fopen; but if textscan can succeed where readcell fails, then there's a major flaw in readcell, as textscan must also return a cell array when data types are mixed.


Answers (0)

Release: R2024b

Asked: on 24 Jun 2025
Edited: dpb on 27 Jun 2025
