Time and Space Efficiency Problem in Multi-Format .csv file Splicing and Saving
5 views (last 30 days)
Show older comments
Hi,
I'm recently dealing with a bunch of csv files which are stored in different folders named by date sequence. The aim is to splice the data with the same csv filename (among all folders) together and save for further use.
First, I go over all folders and use unique function to derive a list of .csv file names. Next, I splice the data with identical filename according to date sequence.
data=[];
for i = 1:num_folder
data = [data; readtable(tmp_file)]; % Splicing
end
save(strcat(pwd,'\',filename,'.mat'),'data') % Saving
The saving is -v7.3 by default setting. (the file size range from hundreds of Mbs to several Gbs)
The whole process is running quite slow. I checked the profile viewer and it turns out save function consumed 70% of time and 'splicing' the other 30%. I would love to know how to elevate the time&space efficiency of my code.
[P.S.: The original .csv file have 60 columns. The Majority of them are numbers (double). One is filled with 'true' or 'false'; one filled with 'HH:MM:SS.TTT' and the other filled with 'HH:MM:SS.TTTTTT'. (Since I read them with readtable function, they are save in the table format in a cell manner.) I tried to 'simplify' the table by converting all elemets into double (I thought it could save some space?) but failed in correctly attaining all microseconds (TTTTTT). The textscan can only gives me HH, MM and SS.TTTT.]
Accepted Answer
Steven Lord
on 1 Feb 2023
Assuming your data files don't have any header rows, I'd consider avoiding the read/save process and just use operating system commands to staple together the files. If you then want to process the data in MATLAB you'd only need to read it once (or you could assemble a datastore and create a tall array to work with the data.)
For Windows take a look at the last couple examples on the Microsoft documentation page for the copy command. You could assemble the command in MATLAB from the file names then call system to execute it.
More Answers (1)
dpb
on 1 Feb 2023
Edited: dpb
on 1 Feb 2023
Per usual when things are not totally plain vanilla in some fashion, you need to make use of the extended features supplied in MATLAB. We didn't need but a half-dozen lines to see the file format; I made a shorter version of the original by the following code (which post as may give you some ideas regarding the "splicing" operations you spoke of)
fid=fopen('OI401.csv','r');
fod=fopen('OI401_1.csv','w');
for i=1:100, fwrite(fod,fgets(fid)); end
fclose('all')
to leave me with a big enough file to do something with but still not be huge...
Anyways, that done,
opt=detectImportOptions('OI401.csv'); % create basic import object
opt=setvartype(opt,{'tradingday','logDay'},"datetime"); % set datetime data type
opt=setvaropts(opt,'tradingday','logDay'},"InputFormat",'yyyyDDDD'); % and the input format
opt=setvartype(opt,'isTrading','logical');
opt.SelectedVariableNames=opt.VariableNames([1:6 end-4:end]); % just pick a small subset for demo
tOI=readtable('OI401_1.csv',opt); % and read in with this help
The above returned
>> head(tOI)
ans =
8×11 table
updatetime tradingday mktstatus last high low recvtime logTime logDay status isTrading
__________ __________ _________ ____ ____ ____ _______________ _______________ ________ ______ _________
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:46:21.430146 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:46:31.430749 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:46:41.433568 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:46:51.431701 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:47:01.431491 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:47:11.436309 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:47:21.435230 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:47:31.438434 20230117 0 true
>>
And, I didn't think to do it on the import option object, so to prove what we got,
>> ans.tradingday.Format='default'
ans =
8×11 table
updatetime tradingday mktstatus last high low recvtime logTime logDay status isTrading
__________ ___________ _________ ____ ____ ____ _______________ _______________ ________ ______ _________
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:46:21.430146 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:46:31.430749 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:46:41.433568 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:46:51.431701 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:47:01.431491 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:47:11.436309 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:47:21.435230 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:47:31.438434 20230117 0 true
>>
shows the dates were imported as datetime.
Now, as far as what to use for analyses, that'll be your call and will depend heavily upon what you intend to do with the data once you have it. There are all kinds of builtin analysis tools for tables that let you select variables by variable name that are quite convenient; whether they'll be of any direct use for your purposes will all depend upon what those purposes are...
See Also
Categories
Find more on Tables in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!