read small selection of data from large file

I have several large .csv files (up to around 8GB; the arrays have about 10^5 rows and up to 15k columns) from which I would like to read data. Most of these read operations will pull only 1,000 to 10,000 data points at a time (generally just a single row of data or a subset of a row). However, dlmread seems to be doing something inefficiently, since each read operation takes several minutes. Is there a lower-level read function that can do this significantly faster? It really needs to be orders of magnitude faster; even a 2x increase in speed isn't going to cut it. Should I use another format for the data? I thought about building a MySQL database for it, but I have no experience with this. Is MATLAB even the right environment for this sort of thing? Thanks in advance.
Josh

Accepted Answer

Walter Roberson on 24 May 2011

Will the same file be read over and over again? If so, it can be worthwhile to create an index of where the lines begin: read a line, use ftell() to determine the current position, store that value, read more lines, and so on. Once the index has been generated, you can use it to fseek() to a position at or before the target row and then read just that row.

This indexing process does require reading the file through once, but it avoids having to read it from the beginning each time.
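A minimal sketch of the indexing idea described above, assuming an all-numeric, comma-separated file. The file name `index_demo.csv`, the sample matrix and the variable names are hypothetical and only there to make the example self-contained; the real file would be indexed once and the offsets reused.

```matlab
% Create a small sample CSV so the sketch is self-contained (hypothetical data).
csvFile = fullfile(tempdir, 'index_demo.csv');
A = magic(6);
dlmwrite(csvFile, A);              % comma-delimited by default

% Pass 1: record the byte offset at which every line begins.
fid = fopen(csvFile, 'r');
lineStart = zeros(0, 1);           % grows as lines are found
while true
    pos = ftell(fid);              % position before reading the next line
    tline = fgets(fid);            % read one line, keeping the newline
    if ~ischar(tline)              % fgets returns -1 at end of file
        break
    end
    lineStart(end+1, 1) = pos;     %#ok<AGROW>
end

% Later reads: jump straight to the wanted row and read only that line.
targetRow = 4;
fseek(fid, lineStart(targetRow), 'bof');
rowText = fgetl(fid);              % one line, newline stripped
fclose(fid);

rowData = sscanf(rowText, '%f,').';   % parse comma-separated doubles into a row vector
```

For a file with about 10^5 rows the index is only about 10^5 doubles, so holding it in memory or saving it next to the data should be cheap compared with rescanning the file.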

None of the operating systems that MATLAB runs on support positioning directly to a particular line in a text file (not unless the lines are all exactly the same length.)

  3 Comments
Walter Roberson on 24 May 2011
For the first pass of generating the index, use fgets() to read single lines. You might not need to record the ftell() value for every line: for example, if you recorded the ftell() value every 10 lines, then to position to a particular line you would fseek() to the last recorded position at or before that line and then fgets() the lines in between.
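A sketch of that sparser index, under the same all-numeric CSV assumption. The stride of 10 matches the example above; the file name `sparse_index_demo.csv` and the variable names `checkpoint` and `stride` are hypothetical.

```matlab
% Sample file: 35 rows of 5 values each (hypothetical data).
csvFile = fullfile(tempdir, 'sparse_index_demo.csv');
A = reshape(1:175, 5, 35).';
dlmwrite(csvFile, A);

stride = 10;                       % record one offset per 10 lines
fid = fopen(csvFile, 'r');
checkpoint = zeros(0, 1);          % checkpoint(k) = offset of line (k-1)*stride + 1
lineNo = 0;                        % number of lines read so far
while true
    pos = ftell(fid);
    tline = fgets(fid);
    if ~ischar(tline), break, end
    if mod(lineNo, stride) == 0
        checkpoint(end+1, 1) = pos;   %#ok<AGROW>
    end
    lineNo = lineNo + 1;
end

% Position to the checkpoint at or before the target, then skip forward.
targetRow = 27;
k = floor((targetRow - 1) / stride) + 1;
fseek(fid, checkpoint(k), 'bof');
for skip = 1:(targetRow - 1 - (k - 1) * stride)
    fgets(fid);                    % discard the lines between checkpoint and target
end
rowText = fgetl(fid);
fclose(fid);

rowData = sscanf(rowText, '%f,').';
```

The trade-off is a smaller index against up to `stride - 1` extra line reads per lookup.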
To read data once you are positioned where you want, the best method can depend upon how complex the row is. For example, is it all numeric with commas between the fields, or are there sometimes strings? Is it possible that any of the strings might contain embedded commas? If you have a consistent format, sometimes the fastest approach is to fgetl() the line, find() the places where the line == ',', index the result to determine the beginning and ending positions of the columns of interest, and convert only that range (possibly with sscanf() or textscan()).
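The comma-finding approach might look roughly like this for a purely numeric line. The sample string stands in for what fgetl() would return, and the names `firstCol`, `lastCol` and `bounds` are made up for the illustration.

```matlab
rowText  = '1.5,2.25,-3,4e2,5,6.75';   % hypothetical line as returned by fgetl()
firstCol = 3;                          % first column of interest
lastCol  = 5;                          % last column of interest

commaPos = find(rowText == ',');       % character positions of all delimiters

% Treat the start and end of the line as virtual delimiters, so that
% field k occupies characters bounds(k)+1 through bounds(k+1)-1.
bounds = [0, commaPos, numel(rowText) + 1];
startChar = bounds(firstCol) + 1;
endChar   = bounds(lastCol + 1) - 1;

% Convert only the characters that belong to the requested columns.
subset = sscanf(rowText(startChar:endChar), '%f,').';
```

With rows of up to 15k columns, converting only the needed slice avoids parsing thousands of numbers that would be thrown away.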
Josh Warren on 24 May 2011
Oh, it's very simple; they're all doubles separated by commas. Essentially it's a giant numeric MATLAB array that has been exported to a .csv file. Let me see if I can make it work; thanks again.


More Answers (1)

Ashish Uthama on 25 May 2011
If you have the option to change the source, or if you plan to use the data over and over again, it might be best to change the format to plain binary, i.e., use fwrite to write it out as doubles rather than a text format like csv. (Unless, of course, each line has a varying number of entries and this structure is integral to your data.)
This would probably be the fastest, since it is the simplest. Also, the file size might be smaller.
You will be able to index into the file to read a subset more easily. You can easily compute the offset to the (i,j)th element, since you know exactly how much space a single double takes in a binary file.
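A sketch of the binary approach, assuming the data is stored row by row as 8-byte doubles and that the number of columns is known when reading (for instance, kept in a separate file or a small header, which this example does not implement). The file name `binary_demo.bin` and the variable names are hypothetical.

```matlab
binFile = fullfile(tempdir, 'binary_demo.bin');
A = magic(6);                      % stand-in for the large numeric array
[nRows, nCols] = size(A);

% fwrite emits elements in column-major order, so write the transpose
% to get each row of A stored contiguously on disk.
fid = fopen(binFile, 'w');
fwrite(fid, A.', 'double');
fclose(fid);

% Read n values from row rowIdx, starting at column colIdx.
rowIdx = 4;
colIdx = 2;
n      = 3;
bytesPerDouble = 8;
offset = ((rowIdx - 1) * nCols + (colIdx - 1)) * bytesPerDouble;

fid = fopen(binFile, 'r');
fseek(fid, offset, 'bof');         % jump directly to element (rowIdx, colIdx)
values = fread(fid, n, 'double').';
fclose(fid);
```

No index file is needed here because every element has the same size, so the offset is pure arithmetic.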
  2 Comments
Josh Warren on 25 May 2011
Thanks Ashish. I'm still a total noob when it comes to actual programming, and I just figured out how to implement what Walter was talking about, so I think I'll take that approach, although I definitely need to familiarize myself with how to navigate binary vs. ASCII files for future projects. I'm in the process of writing my own "high level" read function which checks whether a file containing the line indices exists in the directory of the file to be read; if it doesn't, the function generates this file first before using fseek to get to the line on which the data is contained. Thanks again to both of you for your help.
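One way the check-then-build logic described here could be sketched, written as a script instead of a function to keep it short. This is not Josh's actual code: storing the index in a MAT-file named `<csv name>.idx.mat` is an assumed convention, and the sample data is hypothetical.

```matlab
csvFile = fullfile(tempdir, 'cached_index_demo.csv');
idxFile = [csvFile, '.idx.mat'];   % hypothetical naming convention for the index
A = magic(5);
dlmwrite(csvFile, A);
if exist(idxFile, 'file')
    delete(idxFile);               % demo only: avoid a stale index from an earlier run
end

if exist(idxFile, 'file')
    S = load(idxFile);             % reuse the stored line offsets
    lineStart = S.lineStart;
else
    fid = fopen(csvFile, 'r');     % first use: scan the file once
    lineStart = zeros(0, 1);
    while true
        pos = ftell(fid);
        tline = fgets(fid);
        if ~ischar(tline), break, end
        lineStart(end+1, 1) = pos; %#ok<AGROW>
    end
    fclose(fid);
    save(idxFile, 'lineStart');    % cache the index beside the data file
end

targetRow = 3;
fid = fopen(csvFile, 'r');
fseek(fid, lineStart(targetRow), 'bof');
rowData = sscanf(fgetl(fid), '%f,').';
fclose(fid);
```

In real use the index would need rebuilding whenever the CSV file changes, for example by comparing file modification dates, which this sketch leaves out.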
