Transmission time of reading the first 200 lines of a large .csv file using datastore from AWS S3 to a local computer

I used the script below to read the first 200 lines of a .csv file on AWS S3 to a local computer. I tried it with a small file (a few kB) and a large one (60 MB), both with the same first 200 lines. It took ~0.5 s for the small file and 10-15 s for the large one. Since the 200 lines are identical, the transmission time should be about the same. Is MATLAB reading/transmitting the entire file?
ds = tabularTextDatastore(s3_filePath);  % s3_filePath is an s3:// URL
ds.ReadSize = 200;                       % read 200 rows per call to read()
data_first_200 = read(ds);               % first 200 rows as a table

Answers (1)

Walter Roberson on 25 Nov 2025 at 0:59
I suspect that more than 200 lines are being examined automatically in order to deduce the format of the data. If, for example, row 999 had an additional column, then the inferred format would probably include the extra column, even though none of the first 200 lines use it.
If I am right, then speed would be improved by specifying the TextscanFormats option to tabularTextDatastore().
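If the layout of the first 200 rows is known in advance, the format can be supplied up front so the datastore does not have to scan the remote file to infer variable types. A minimal sketch, assuming the rows of interest are two numeric columns and the file has no header row (both assumptions about this particular file):

```matlab
% Sketch: assumes two numeric columns and no header row.
% Supplying TextscanFormats up front avoids the automatic
% format-detection pass over the remote file.
ds = tabularTextDatastore(s3_filePath, ...
        'TextscanFormats', {'%f', '%f'}, ...  % assumed column types
        'ReadVariableNames', false);          % assumed: no header row
ds.ReadSize = 200;
data_first_200 = read(ds);
```

Adjust the conversion specifiers (e.g. '%s' for text, '%D' for dates) to match the actual column types.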
  1 Comment
NX on 25 Nov 2025 at 18:40
Thanks for your answer. Indeed, the number of columns is not constant in the .csv file: the first 200 lines have 2 columns, and rows beyond 200 have more. I assume TextscanFormats is for a constant number of columns in a table? How do I specify it for data with a varying number of columns?


Release

R2025b
