**You are now following this question**

- You will see updates in your followed content feed.
- You may receive emails, depending on your communication preferences.

# Creating a matrix after giving initial conditions

8 views (last 30 days)

Show older comments

Hi,

I need to create a matrix when given the begin year, month, day, time and end year, month, day and time.

for example; (I do not need to see the header in the matrix)

I would like to enter:

begin year = 1998

begin month = 1

begin day = 1

begin time = 0

end year = 1998

end month = 2

end day = 15

end time = 12

(as shown in the table below)

But, if I enter leap years (e.g. 2000,2004, etc) it should include Feb 29 and do the same time steps (0 to 21) as shown in the table.

Then, the program should generate the output for the time periods given above. See the sample below.

Year Month Day Time

1998 1 1 0

1998 1 1 3

1998 1 1 6

1998 1 1 9

1998 1 1 12

1998 1 1 15

1998 1 1 18

1998 1 1 21

1998 1 2 0

.

.

.

1998 2 15 12

Any help is appreciated.

Thanks in advance.

### Accepted Answer

dpb
on 12 Mar 2015

Edited: dpb
on 12 Mar 2015

yr=2000;

dn=datevec(datenum(yr,1,1,[0:3:24*(365+isleapyr(yr))-1].',0,0))

>> [dn(1:5,:);dn(470:485,:);dn(length(dn)-10,:)]

ans =

2000 1 1 0 0 0

2000 1 1 3 0 0

2000 1 1 6 0 0

2000 1 1 9 0 0

2000 1 1 12 0 0

...

2000 2 28 15 0 0

2000 2 28 18 0 0

2000 2 28 21 0 0

2000 2 29 0 0 0

2000 2 29 3 0 0

2000 2 29 6 0 0

2000 2 29 9 0 0

2000 2 29 12 0 0

2000 2 29 15 0 0

2000 2 29 18 0 0

2000 2 29 21 0 0

2000 3 1 0 0 0

2000 3 1 3 0 0

2000 3 1 6 0 0

2000 3 1 9 0 0

2000 3 1 12 0 0

...

2000 12 30 15 0 0

2000 12 30 18 0 0

2000 12 30 21 0 0

2000 12 31 0 0 0

2000 12 31 3 0 0

2000 12 31 6 0 0

2000 12 31 9 0 0

2000 12 31 12 0 0

2000 12 31 15 0 0

2000 12 31 18 0 0

2000 12 31 21 0 0

>>

I introduced the ellipses in the output to show the breaks clearly...

Salt to suit...

Oh, isleapyr is one of my utilities...

function is=isleapyr(yr)

% returns T for input year being a leapyear

is=(datenum(yr+1,1,1)-datenum(yr,1,1))==366;

there may be a TMW-supplied equivalent since R2012b which lacks it.

##### 11 Comments

Damith
on 12 Mar 2015

Thanks a lot.

How can I remove column 5 and 6 in the above table you produced? And, I need to terminate this table at the given date mentioned in my question (1998 2 15 12). How can I achieve that?

dpb
on 12 Mar 2015

datevec returns the 6-vector; save it to an array and then simply delete the unwanted columns. Or, you can save the explicit variables as

[y,m,d,h]=datevec(...

if it's convenient to have them by variable.

I figured it'd be clear how to stop from the example for a year, sorry--compute the number of periods between begin and end dates--for the 3-hr interval it'd be

>> (datenum(1998,2,15,12,0,0)-datenum(1998,1,1,0,0,0))*24/3

ans =

364

>>

A date number is an integer number of days from an arbitrary point and the fraction is the fractional day of the last day time from 0 to <1. So the above takes that time interval, multiplies by 24 hr/day and divides by the 3-hr interval to find how many intervals are included. Use the end correction of -1 or not depending on whether is to be inclusive or exclusive upper bound; your choice.

Damith
on 12 Mar 2015

OK. How can I include in the code to stop on 1998 2 15 12 0 0

nd=(datenum(1998,2,15,12,0,0)-datenum(1998,1,1,0,0,0))*24/3;

I tried this but it didn't work

yr=1998;

nd=(datenum(1998,2,15,12,0,0)-datenum(1998,1,1,0,0,0))*24/3;

dn=datevec(datenum(yr,1,1,[0:3:24*nd].',0,0));

dpb
on 12 Mar 2015

The timestep is in hours; nd above is the number of 3-hr intervals between to two dates. So, use

dn=datevec(datenum(yr,1,1,[0:3:nd*3].',0,0));

or don't divide by 3 earlier. Again was trying to demonstrate a principle instead of writing code...

Changing variable names to be more reflective and "real" code--

>> nHrs=(datenum(1998,2,15,12,0,0)-datenum(1998,1,1,0,0,0))*24;

>> dv=datevec(datenum(yr,1,1,[0:3:nHrs].',0,0));

>> datestr(dv(end,:))

ans =

15-Feb-1998 12:00:00

>>

Again, the principle is that datenum is smart enough to wrap the lower time fields correctly accounting for leap years and such; the caveat is that for consistency you must use the integer steps rather than a floating point difference of, say, 3/24 fractional days or the floating point roundoff will cause discrepancies if try to do precise date comparisons; the floating point deltas will inevitably be off a bit or two and an exact comparison will fail. If the integer steps are used, they are exact and internally the rounding will always occur the same way so the net result is consistency.

dpb
on 12 Mar 2015

Edited: dpb
on 15 Mar 2015

No problem; sometimes I tend to err on the pedagogical side--it comes from mentoring too many new hires for too long where I was more concerned in getting them to think through how to solve the problem than just answering the question. :)

Anyhoo, to add here, the really bad thing (tm) about the date comparisons if use floating point deltas (as seems perfectly logical to do) is that most of the results will be ok, it's only that occasional case where the rounding is different by that last bit or two so it's possible either to entirely miss that the algorithm isn't exactly correct or it seems infuriatingly near random as to why some/most dates are selected but a few just always get missed. The above eliminates that as long as the generation is consistent.

ADDENDUM

The really, really difficult thing is that when the divisor can be/is exactly represented in floating point, then the two techniques are, in fact, identical. In this case 3/24-->0.125 is, in fact exactly representable and so if remove the datevec cast and return date numbers instead for dv and then

>> df=(datenum(yr,1,[1:3/24:nHrs*10/24].',0,0,0));

>> all(df==dv(1:length(df)))

ans =

1

>>

we do see in this case could've gotten by. This of course, inevitably leads to changing to a dt of 2 hours instead and then the above scenario striketh and there's seemingly no reason why to the initiate.

Damith
on 16 Mar 2015

Hi again,

I know this could be frustrating. Thanks for your willingness to understand and I appreciate that. But, I am trying to achieve my ultimate goal of creating a netCDF file using MATLAB. Since I have not done that in MATLAB, I am trying to figure it out through reading online materials.

Let me elaborate here: The first task I realized is to create a 3D array of time, lat, lon of precipitation field. To achieve that, I have hundreds of .csv files (one file is for time (3hrs) in year folders. For example in year 1998 Jan 01 00 time step file name: TRMM_1998_01_0100_newntcl.csv. I have year folders from 1998 to 2014. The last 2 digits on the file name are for 3hr time intervals. Next time is, TRMM_1998_01_0103_newntcl.csv and so on. So, I was able to figure out to read multiple .csv files in one folder in MATLAB and stored them in a cell array.

That is my "out" cell matrix.

My first objective is to create netCDF file year by year. So, I first picked year 1998. First, I need to sum 3hr time intervals for a day e.g. from 00 to 21. That's what I was trying to do through my original post.

In 1998 folder, I have 365 csv files, but I am interested in summing 3hr file for a complete day. Since, I have complete 3hr time interval files upto Feb 14 (TRMM_1998_02_1421_newntcl.csv) that's why I entered the end data here: (See the code below)

yr=1998;

nd=(datenum(1998,2,14,21,0,0)-datenum(1998,1,1,0,0,0))*24/3;

dn=datevec(datenum(yr,1,1,[0:3:nd*3].',0,0));

dn(:,5:6)=[];

dnT=dn(:,4)'; % dnT = 360 here

then the code below to get the sum

out = vertcat(out{:});

A = cell2mat(out(:,3));

B = reshape(A(1:360,[],8);

C=sum(B,2);

When I do the vertcat for the original "out" cell array, below image shows how it looks.

Array A has all 3rd col of "out" and has 18863565 rows. But when I do

B = reshape(A(1:360),[],8);

it only creates 45 x 8 of for the 3rd col of first row of "out". But, I need to have 45 x 8 matrices for all the rows (3rd col) in out matrix. I need to have 1 x 365 cell matrix with each cell has 45 x 8 cells.

Hope I explained well here. Any help is appreciated.

Thanks again.

dpb
on 16 Mar 2015

Well, let's go back a little...seems to me we've gotten of onto a sidetrack maybe in that I thought the point of the date vector was going to be to do some selection across a larger set of data -- doesn't sound like that's really the case.

OK, let me see if I understand--you have a group of .csv files, one file per day that contains --- well, ok here I am confused. You've got a file for each day of the year as you mention 365 files for a given year...now where are the three-hour values again? Try to just outline the actual file storage/naming alone first, we can address the coding once get a clear understanding of the structure 'cuz now I'm lost, sorry...

dpb
on 17 Mar 2015

OK, I think I got the idea now...just a couple more questions before I cogitate on the way I think probably most efficacious approach here...

If there are complete files, then mod(Number,8)==0 for each subdirectory. What is to be done with days that have missing 3-hour periods? For example, your case here of 365/8=45.625...there are five files for the last day or some others are randomly missing to make up that number.

A complete directory would be either 2920 or 2928 depending on whether it is, isn't a leap year. Is this correct?

Anyway, I think I'd forget about creating the date vectors discussed above and simply go thru a dir() tree for each, but I'll wait for answers above before actually writing code "in anger" (as an old Scottish power engineer I worked with years ago used to say when ready to begin a test sequence after finishing the preliminaries).

dpb
on 17 Mar 2015

OK, so the answer is "ignore days that are incomplete".

I've gotta' do some other things here just now, but I think I have a plan... :) I'll try to get back again in the morning and outline what I think'll work pretty well. Sometimes one only knows how well when actually working out the code if don't write it out on paper and for these exercises, what's the fun in that??? :)

Damith
on 18 Mar 2015

yes, ignore the days that are incomplete.

Will see whether your plan works or not :)

### More Answers (2)

dpb
on 18 Mar 2015

Edited: dpb
on 20 Mar 2015

Since dir returns a sorted list, I think I'd just process it similarly as you are but simply test the days within the loop by parsing the date/time string within a double loop. One could, as mentioned, get fancy and do some serious attempts at matching datenums but just doesn't seem to be needed here...

I'd start with something like the following which builds an array of the summed third column value in matching days by year. Each day is in a given column, only a small additional logic is needed to keep track of which day is present, a month and day row filled in by column while in the loop...

d=dir('TRMM_*_newntcl.csv');

ymdh=cell2mat(textscan([d.name], ...

'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...

'collectoutput',true));

idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations

N=length(idx);

idx=[idx;length(ymdh)+1];

precip=nan(51681,fix(N)); % preallocate on num whole days

k=0;

for i=1:length(idx)-1

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

if ~isOK, continue, end % either aren't eight or it's a wrong day mixed in

data=zeros(51681,8); % preallocate based on num lat/lon points

for j=0:7 % ok, process each of these, they're one day's worth

data(:,j+1)=csvread(d(idx(i)+j).name,0,2); % add hourly data

end

k=k+1;

precip(:,k)=sum(data,2); % the daily sums are in an array

fn=d(idx(i)).name; fn(16:17)=[]; % fn(1:1)='O';

disp(fn)

csvwrite(fn,precip(:,k))

end

precip(:,k+1:end)=[]; % remove columns to leave complete days only

This snippet reads only the third column on the assumption you can easily add the constant position columns as you see fit.

I've never used a netCDF file so on casting it into that form I'm not of much use but the above should provide the basic building block to get the data consolidated by day. Note I specifically avoided cell arrays here and did the summing in place while accumulating the data to keep total memory down to minimal amount...as you've already done, use this inside the higher level loop to traverse the directories.

Unless there aren't multiple partial days, the final value of k will be the same as the floor(L/8) calculation. If there are there will be that many zero columns that can simply remove.

Hope that helps...

##### 30 Comments

Damith
on 18 Mar 2015

OK. Thanks.

Does it removes/disregard partial days (days does not have complete time intervals from 00 to 21). I ran your code and it gives me an error below. I have a script to convert to netcdf but if i can get the daily sum output for each day as a csv file and then I can use that script.

Undefined function 'ix1' for input arguments of type 'double'.

dpb
on 18 Mar 2015

"...Does it removes/disregard partial days"

Yes, that's what the logic test checks for--that there are eight sets with the same day consecutively in the grouping. If not, the continue goes on to the next 0 hours location.

Sorry, a typo; I had used ix1 as the index variable in a test at the command line on the logic in that line and missed editing out one 'x' in the cleanup. In "real" code I'd probably turn i1 into idx or somesuch as the '1' is hard to read. Initially I was thinking I'd have an ix1 and ix2 as the starting/stopping indices but decided on the one instead and didn't clean it up to reflect that. Your choice...

Damith
on 18 Mar 2015

Edited: Damith
on 18 Mar 2015

OK. I changed "ix1" to "idx" in the code (See below), and modified

if ~OK, continue, end

to

if ~isOK, continue, end

as OK was undefined.

but I still get an error in the csv read part:

Error using csvread

Too many input arguments.

any idea?

d=dir('TRMM_*_newntcl.csv');

ymdh=cell2mat(textscan([d.name], ...

'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...

'collectoutput',true));

idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations

precip=zeros(51681,fix(length(idx/8))); % preallocate on num whole days

k=0; % counter for valid day column

for i=idx:length(idx)

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

if ~isOK, continue, end % either aren't eight or it's a wrong day mixed in

for j=0:7 % ok, process each of these, they're one day's worth

data=zeros(51681,8); % preallocate based on num lat/lon points

data(:,j+1)=csvread(d(idx+j).name,0,2); % add hourly data; reads only 3rd column

end

k=k+1;

precip(:,i)=sum(data,2); % the daily sums are in an array

end

dpb
on 18 Mar 2015

Edited: dpb
on 18 Mar 2015

Seems functional here...I copied your file from the link to local...

>> dir TRM*.csv

TRMM_1998_01_0100_newntcl.csv

>>

>> d=dir('TRM*.csv');

>> for i=1:length(d),data=csvread(d(i).name,0,2);end

>> whos data

Name Size Bytes Class Attributes

data 51681x1 413448 double

>> d.name

ans =

TRMM_1998_01_0100_newntcl.csv

How about the section inline from the command window showing everything in context? Maybe can spot something that way I can't see from the above...certainly nothing sticks out to me.

ADDENDUM

for i=idx:length(idx)

Above should be

for i=1:length(idx)

Since idx is a vector, this reassigned the loop index i to that vector so then the argument of d(i) is a multiple group of results instead of just one.

That's a remnant of my initial planning on use i1:i2 as a range and I didn't catch in the cleanup either. The loop is processing each entry in the index vector; the index vector is those locations in the directory list for which the time field is identically zero.

Damith
on 18 Mar 2015

Edited: dpb
on 20 Mar 2015

Anyways, I sent you all the csv files in 1998 folder (I renamed as "test"). Please download from the below link.

hmm. I did this in command window for the 1998 folder where I had 365 csv files.

>> csvread(d(idx+j).name,0,2)

Error using csvread

Too many input arguments.

But, when I did this

>> (d(idx+j).name)

it produced file names of all 00 time intervals from Jan 01 1998 to 1998 Feb 15

ans =

TRMM_1998_01_0100_newntcl.csv

ans =

TRMM_1998_01_0200_newntcl.csv

upto

ans =

TRMM_1998_02_1500_newntcl.csv

dpb
on 19 Mar 2015

Edited: dpb
on 19 Mar 2015

Got the whole vector idx where should just be i subscript to process from the first of the set thru the 8, offset j=0:7 ...

...

data(:,j+1)=csvread(d(idx(i)+j).name,0,2);

...

I believe I caught all references in the original code in toto...

Anyway, if you understand what the logic is doing, should be able to fix any remaining typos pretty quickly.

Again, we're

- finding the locations of each 00 hour in the directory listing, then
- iterating through that list
- makomg sure are eight entries with the same day before the next 00 hours
- if so, loop over the initial plus seven more reading a single file
- when that loop finishes, sum across the eight columns and store

Damith
on 19 Mar 2015

OK. i got it. How can I offset j by 1 at the last loop since it will be 47. So, I am getting this error:

Attempted to access idx(47); index out of bounds because numel(idx)=46.

d=dir('TRMM_*_newntcl.csv');

ymdh=cell2mat(textscan([d.name], ...

'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...

'collectoutput',true));

idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations

precip=zeros(51681,fix(length(idx/8))); % preallocate on num whole days

k=0; % counter for valid day column

for i=1:length(idx)

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

if ~isOK, continue, end % either aren't eight or it's a wrong day mixed in

for j=0:7 % ok, process each of these, they're one day's worth

data=zeros(51681,8); % preallocate based on num lat/lon points

data(:,j+1)=csvread(d(idx(i)+j).name,0,2); % add hourly data; reads only 3rd column

end

k=k+1;

precip(:,i)=sum(data,2); % the daily sums are in an array

end

dpb
on 19 Mar 2015

Edited: dpb
on 19 Mar 2015

Oh, yeah, the last 00hrs element won't have a subsequent to test. I don't have time to really think too much about it at the moment but run the loop from 1:length(idx)-1 instead of over the full length.

That change handles all except the end case. So, after the loop finishes on the next-to-last, add one more section that checks to see if the length of the directory listing is greater than idx(end); if so, and is seven more it's another complete (assuming again the days are all the same so it's not just some mishmash of random dates/times that happens to add up to the right number of entries) then run the j loop one more time and increment k one more time and accumulate the sum as well.

Alternatively, you could add an additional if...end clause in the loop for the last iteration and change the test of isOK to the above instead of the existing for that...or you could add one more record as a dummy at the end of the list altho that's probably more effort than either of the other two choices.

Anyway, it's just an end case that needs handling just a little differently than the rest...

ADDENDUM

Woke up this morning and the (almost) trivial fixup came to me...the modification of the index vector is the easiest after all. When create it, simply augment it with the value of length(ymdh)+1 as if were another 00hr entry, but run the outer loop over the original length. Then the +1 index doesn't run over the end on the spacing check but you also don't try to run the loop too many times. See the modification in the ANSWER above...

The changes are...

Add the additional to the end of the index vector...

idx=[idx; length(ymdh)+1];

and change the loop upper limit to length(idx)-1

for i=1:length(idx)-1

and should work.

dpb
on 19 Mar 2015

Damith
on 19 Mar 2015

Edited: dpb
on 19 Mar 2015

OK. I considered it and the modfied code is below. I tried with reduced number of csv files (18) in attached "test_22" attachment to see whether it sum the precipitation across days. Unfortunately, it does not. "test_22" folder has 18 files starting days from 1998 Jan 01, 1998 Jan 2 and 1998 Jan 3 (incomplete time intervals).

For example, I modified the precipitation data in col 3 for Jan 01 for of 8 csv files. So, the sum of precipitation should be 3.5.

First row -40.13 29.88 of

1998 Jan 01 (00, 03, 06, 09, 12, 15, 18, 21) = sum(0.5,0.2,0.2,0.1,0.5,1.0,0.8,0.2 ) = 3.5

Please see the below image to see what the code is producing. Code is producing 0.2 (1st col) that represents Jan 01 which is wrong.

I modified here

data=zeros(51681,fix(length(idx/8)));

but still producing the wrong results. Any idea?

d=dir('TRMM_*_newntcl.csv');

ymdh=cell2mat(textscan([d.name], ...

'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...

'collectoutput',true));

idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations

%idx=[idx; length(ymdh)];

precip=zeros(51681,fix(length(idx/8))); % preallocate on num whole days

k=0; % counter for valid day column

for i=1:length(idx)-1

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

if ~isOK, continue, end % either aren't eight or it's a wrong day mixed in

for j=0:7 % ok, process each of these, they're one day's worth

data=zeros(51681,fix(length(idx/8))); % preallocate based on num lat/lon points

data(:,j+1)=csvread(d(idx(i)+j).name,0,2); % add hourly data; reads only 3rd column

end

k=k+1;

precip(:,i)=sum(data,2); % the daily sums are in an array

end

dpb
on 19 Mar 2015

dpb
on 19 Mar 2015

Edited: dpb
on 19 Mar 2015

Well, how stupid...the data array is getting reinitialized to zeros every pass thru the j loop so only the last column is ever saved. DOH!!! Can't believe I hadn't noticed that misplacement long before now. Problem with "air code"; often one doesn't pick up on the obvious.

OK, I made a couple of other minor cleanups, and tested it on your subset of data...

>> damith

precip(1:10,:)

ans =

3.5000 3.0000

3.5000 3.0000

3.5000 3.0000

3.5000 3.0000

3.5000 3.0000

3.5000 3.0000

3.5000 3.0000

3.5000 3.0000

3.5000 3.0000

3.5000 3.0000

>>

That look "more better"???

I updated the script in the Answer section to keep it all together; I just replaced the previous in its entirety rather than try to keep all the iterations...

ADDENDUM

BTW, it seemed to take longer than I'dve thunk it should here so I checked on whether the column-specific read in csvread was the culprit...turns out it's quicker with it than without. I didn't try any other alternative input functions, will leave that as "exercise for the student". :)

I didn't convert the script to a function so that all JIT effects are active, either; that may fix a multitude of sins (or again, may not).

Damith
on 19 Mar 2015

Edited: Damith
on 19 Mar 2015

This is better. I achieved the task like the way below.

Now, is there a way to employ the "csvwrite" function to write the "precip" columns (one by one) corresponding to the "length(idx)-1"? First and second columns (similar to input files) of output files are identical. Or is there any other efficient method to do this?

Basically filename would be 'TRMM_1998_01_01_newntcl.csv' without the time interval.

d=dir('TRMM_*_newntcl.csv');

ymdh=cell2mat(textscan([d.name], ...

'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...

'collectoutput',true));

idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations

idx=[idx; length(ymdh)];

precip=zeros(51681,length(idx/8)-1); % preallocate on num whole days

k=0; % counter for valid day column

data=zeros(51681,fix(length(idx/8)));

for i=1:length(idx)-1

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

if ~isOK, continue, end % either aren't eight or it's a wrong day mixed in

for j=0:7 % ok, process each of these, they're one day's worth

data(:,j+1)=csvread(d(idx(i)+j).name,0,2); % add hourly data; reads only 3rd column

end

k=k+1;

precip(:,i)=sum(data,2); % the daily sums are in an array

end

precip(:,k+1:end)=[]; % remove columns to leave complete days only

dpb
on 19 Mar 2015

"I achieved the task like the way below"

...

idx=[idx; length(ymdh)];

The above is wrong; see my code. This will fail if the last day is full because the length computed won't be eight but seven. To prove it to yourself try it with your sample subdirectory if you remove the last two entries and leave 16 instead of 18; you'll only get one day's result instead of two.

If you don't have any reason to have the daily sums as a group there's no point in saving the whole array; I had presumed having those in one location for further analyses would be a primary point of the exercise; continuing to have so many files with so much repetitive data seems counter productive to the max. But, sure, inside the loop on i simply use a substring operation on the file name to remove the time section; doesn't matter which one so may as well use the last when the j loop has completed...

fnameout=d(idx(i)+j).name; % save to a temporary

fnameout(16:17)=[]; % remove the two hour digits from ddhh

csvwrite(fnameout,[xy precip(:,i)] % write to a csv file

xy is the array of coordinates saved from an initial read of one file previously, of course, since they're constant across all files.

Again, if this is all you're going to be doing with the sums I'd not even bother to save the precip array at all; just use the result of the sum directly in the concatenation.

I know you've said your goal is a netCDF file but if you're really going to be doing something with these data in Matlab a .mat file would be much smaller and faster keeping stuff together in a larger conglomerate.

Damith
on 19 Mar 2015

Yes. You are quite correct. I removed the last two files and it produced only the first one. So, I corrected the code like yours. Thanks for correcting me. This shows how ignorant I am in MATLAB.

I tried again to write the output to csv files since I have a another piece of code written in python to convert to a netCDF.

IS your method producing csv file for each day? I need to have a separate csv file for each day? (Only for complete days) For example 'TRMM_1998_01_01_newntcl.csv', 'TRMM_1998_01_02_newntcl.csv'

Anyways, I tried this but give me an error.

fnameout=d(idx(i)+j).name; % save to a temporary

fnameout(16:17)=[]; % remove the two hour digits from ddhh

csvwrite(fnameout,[xy precip(:,i)]) % write to a csv file

Index exceeds matrix dimensions.

.....................................................

xy=xlsread('TRMM_1998_01_0100_newntcl.csv', '', 'A1:B51681');

d=dir('TRMM_*_newntcl.csv');

ymdh=cell2mat(textscan([d.name], ...

'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...

'collectoutput',true));

idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations

N=length(idx); % the unaugmented length

idx=[idx;length(ymdh)+1];

precip=nan(51681,fix(N)); % preallocate on number starting days

k=0;

for i=1:N

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

if ~isOK, continue, end % either aren't eight or a wrong day mixed in

data=zeros(51681,8); % preallocate based on num lat/lon points

for j=0:7 % ok, process each of these, they're one day's worth

data(:,j+1)=csvread(d(idx(i)+j).name,0,2); % add hourly data

end

k=k+1;

precip(:,k)=sum(data,2); % the daily sums are in an array

end

precip(:,k+1:end)=[]; % remove columns to leave complete days only

fnameout=d(idx(i)+j).name; % save to a temporary

fnameout(16:17)=[]; % remove the two hour digits from ddhh

csvwrite(fnameout,[xy precip(:,i)]) % write to a csv file

dpb
on 19 Mar 2015

Well, it won't put it out for each day unless it's inside the loop over i, no. That you've got it outside all the loops is the problem.

Again, if this is all you're going to do, I'd remove the precip array entirely and make it look something like...

for i=1:N

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

if ~isOK, continue, end

data=zeros(51681,8);

for j=0:7

data(:,j+1)=csvread(d(idx(i)+j).name,0,2);

end

fname=d(idx(i)+j).name; fname(16:17)=[];

csvwrite([xy fnsum(data,2)]

end

There's no point in saving what you're never going to use. I ran the filename creation portion here w/o a trailing semicolon and got...

>> damith

fn =

TRMM_1998_01_01_newntcl.csv

fn =

TRMM_1998_01_02_newntcl.csv

>>

(Clearly I just used "fn" as the variable name for testing)

Damith
on 20 Mar 2015

Edited: Damith
on 20 Mar 2015

Bingo, it worked.:) I modified your code a bit. See the code below.

But one problem, I just noticed when I ran this code your for the folder where I had 365 files from 1998 Jan 01 to 1998 Feb 14 (for complete days) this code only produced the output csv files only upto 1998 Jan 31. Code ran smoothly but no error message but no daily output from 1998 Feb 01 to 1998 Feb 14.

Could not figure out why? Any idea?

If I need to obtain the monthly and yearly sum, how can I modify within the loop. Again, I need csv output for each month and year.? Rememver I have years from 1998 to 2014.

xy=xlsread('TRMM_1998_01_0100_newntcl.csv', '', 'A1:B51681');

d=dir('TRMM_*_newntcl.csv');

ymdh=cell2mat(textscan([d.name], ...

'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...

'collectoutput',true));

idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations

N=length(idx); % the unaugmented length

idx=[idx;length(ymdh)+1];

precip=nan(51681,fix(N)); % preallocate on number starting days

k=0;

for i=1:N

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

if ~isOK, continue, end

data=zeros(51681,8);

for j=0:7

data(:,j+1)=csvread(d(idx(i)+j).name,0,2);

end

fname=d(idx(i)+j).name; fname(16:17)=[];

csvwrite(fname,[xy,sum(data,2)])

end

dpb
on 20 Mar 2015

Well, the first thing to do is to check what the list of files returned is and the length of the index vector--firstly, does it have the proper number of entries and then secondly is all(diff(idx)==8) true (1) or false (0)? If false, how many are true (sum(diff(idx)))? That's how many would be at least candidates. Double check that the directory returned is, indeed, sorted and that the names are such that the sorting is correct.

After that, don't see a particular reason otomh, but it surely can't be too complex if you do some diagnostic checking.

As for monthly and yearly sums, you're changing ground rules--I don't know how many times I mentioned "if this is all you're going to do" with the data and you never said word boo! it wasn't... :(

In that case you've got to start accumulating the daily sums again and keeping them such that you can then sum over them over the proper dimensions. Question I'd have is, however, how do you define what to sum over for a month and a year--do you need complete data sets for all days of the month, for example, since we've got cases here where we've got missing days since we ignored days for which didn't have a complete set of 3-hour observations.

dpb
on 20 Mar 2015

Edited: dpb
on 20 Mar 2015

ERRATUM Was too late last night, actually the above should be sum(diff(idx)==8) as well to see how many valid days there are found. Turns out here that returns 45 and

>> idx(end-1)/8

ans =

45.1250

>>

so '45' is the right answer. I did download the files and it ran to completion here ok as expected. Just to make it easy I changed the output file name to have the first character be an 'O' for output just so could do an easy dir...the result is

>> o=dir('o*.csv');

>> length(o)

ans =

45

>> o(end).name

ans =

ORMM_1998_02_14_newntcl.csv

>>

Hence, I conclude the code as I posted in the above 'Answer' in one location is functioning as expected. Just to be certain it's in synch with what is here I just replaced the code there with that taken from the editor window here that produced the above results (with the 'O' patch commented out)...

dpb
on 20 Mar 2015

"...to obtain the monthly and yearly sum ..."

OK, it's just an extension of the above idea but as noted above you've got to decide what it is you want to do about the missing cases. Assuming the data are complete-enough you can do as above and ignore the missing entirely, something like the following logic should work...

Within each subdirectory with this loop you've got a year's worth of data if it is complete.

While processing the above loop for each day, we've built an array of preciptation values by day for complete days but incomplete days were skipped over entirely and the array has no space for them. If, otoh, we built a 3D array that is 31 columns wide for the longest months and 12 planes deep for each month and stored the valid daily data in the proper column then can simply sum over the 2nd dimension (columns) for the monthly sums and over the planes or the yearly. Now whether either of these is valid is dependent upon whether there are any "holes" in it or not and how you want to treat those.

Unless you also want to aggregate over multiple years, that's all you have to do other than simply iterate the above over the various subdirectories.

Damith
on 20 Mar 2015

Thanks a lot.

The modified code above produced output file names upto TRMM_1998_02_14_newntcl.csv as expected. So, the generation of daily part is done. Now, I have a code written in Python which can be used to convert to netCDF. But, it only reads csv files and I need to compare with monthly and annual precip values of producing from a different methodology. That means I have to deveop daily, monthy and annual netCDF files.

dpb
on 21 Mar 2015

Edited: dpb
on 21 Mar 2015

Not the second with the working code base...

As for the second part, NB you have ymdh array around and available for the indexing mentioned above...

ADDENDUM

Having a working function for the days, you might look at a File Exchange submission <filefun--apply a function to files> to abstract the code for the higher levels. While I've not used it yet, it is highly rated and just might make writing the iteration stuff much simpler.

UPDATE

I did download and look at filefun -- while very nicely done and would be very useful for many iterative/repetitive tasks, in this particular case "not so much" since the files needs must be processed as groups instead of actually applying a function to each individually. So, you do need the loop to traverse the subdirectories...guess the function needed here would be one higher level of abstraction dirfun that would apply the function to the contents of a directory which will be your outer loop when you're done.

Damith
on 23 Mar 2015

Where can I accept your reply? I could not find. Let me know.

I have started working on to obtain the monthly and annual values. In your code above, what does this line do?

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

I am trying to understand and modify a bit.

Please let me know.

Thanks again.

dpb
on 23 Mar 2015

Edited: dpb
on 24 Mar 2015

Maybe once you accepted the first on the dates it won't let you accept a 2nd answer, too, maybe? I hadn't tried, I only respond rarely ask questions (and the ones I have have been rather esoteric so don't necessarily get answers... :) )

That was discussed way up earlier in response to your first question after the posting of the code:

"...Does it removes/disregard partial days"

Yes, that's what the logic test checks for--that there are eight sets with the same day consecutively in the grouping. If not, the continue goes on to the next 0 hours location.

It's looking at the 3rd column (days) of the ymdh array and checking that there are no differences >0 amongst those in the group between the current 0-hr entry and the next and then that there are a total of eight between groups. The first is the same as

all(ymdh(idx(i):idx(i+1)-1,3)==ymdh(idx(i),3))

I just chose to check the differences were all zero instead of that all were the same value as the first, explicitly. It's the same end result.

If you're going to process the daily files for complete days I don't think that's the place to be modifying...you should be able to simply process the daily data from the precip array as mentioned before after accumulating it for each day (again, presuming you keep track as suggested of the actual day of month instead of just incrementing a counter and store in the column based on day.

Damith
on 24 Mar 2015

Thanks.

I was not successful in implementing this and simply do not know where and how to start with the precip array. Some coding help/hint will energize me. I need monthly and annual values for complete sets of data.

Your help is appreciated again.

dpb
on 24 Mar 2015

dpb
on 24 Mar 2015

Edited: dpb
on 29 Mar 2015

d=dir('TRMM_*_newntcl.csv');

ymdh=cell2mat(textscan([d.name], ...

'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...

'collectoutput',true));

idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations

N=length(idx);

idx=[idx;length(ymdh)+1];

precip=nan(51681,31,12); % preallocate on num whole days

%fprintf('Building daily sums for %d: \n',ymdh(1,1))

dispstat('','init')

for i=1:length(idx)-1

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

if ~isOK, continue, end % either aren't eight or it's a wrong day mixed in

%msg=sprintf(' Month: %2d Day: %2d',ymdh(idx(i),2:3));

%dispstat(msg,'keepthis')

data=zeros(51681,8); % preallocate based on num lat/lon points

for j=0:7 % ok, process each of these, they're one day's worth

%msg=sprintf('Reading hourly file: %s: ',d(idx(i)+j).name);

%dispstat(msg)

data(:,j+1)=csvread(d(idx(i)+j).name,0,2); % add hourly data

end

mo=ymdh(idx(i),2); % this month

da=ymdh(idx(i),3); % this day

precip(:,da,mo)=sum(data,2);

fn=d(idx(i)).name; fn(16:17)=[]; fn(1:1)='O';

%disp(fn)

csvwrite(fn,precip(:,k))

end

mTot=squeeze(sum(precip,2)); % monthly totals in each column 1-12

yTot=sum(mTot,2); % yearly total

Now just write mTot and yTot to files as you want them. As noted before, you'll probably want to do some checking also on whether there's data in every day of the month before accumulating that total and then for the months before the yearly total or you may have totals that aren't actually representative of the entire month/year.

Using nan to initialize instead of zero is one way, then any sums with NaN as a value are indicative altho there you've got to fix up the detail of how many days are in each month but that's pretty simple bookkeeping...

##### 30 Comments

Damith
on 24 Mar 2015

OK. I get this error just after running the code below. I am debugging the code to understand what's happening. I think it's I think it's

precip(:,d,m)=sum(data,2);

Improper index matrix reference.

clear all

cd ('C:\Users\Desktop\test_1998')

%xy=xlsread('TRMM_1998_01_0100_newntcl.csv', '', 'A1:B51681');

d=dir('TRMM_*_newntcl.csv');

ymdh=cell2mat(textscan([d.name], ...

'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...

'collectoutput',true));

idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations

N=length(idx);

idx=[idx;length(ymdh)+1];

precip=nan(51681,31,12); % preallocate on max days/mo by 12 mo/yr

k=0;

for i=1:length(idx)-1

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

if ~isOK, continue, end % either aren't eight or it's a wrong day mixed in

data=zeros(51681,8); % preallocate based on num lat/lon points

for j=0:7 % ok, process each of these, they're one day's worth

data(:,j+1)=csvread(d(idx(i)+j).name,0,2); % add hourly data

end

m=ymdh(idx(i),2); % this month

d=ymdh(idx(i),3); % this day

precip(:,d,m)=sum(data,2); % the daily sums in column day of month

fn=d(idx(i)).name; fn(16:17)=[]; % fn(1:1)='O';

disp(fn)

csvwrite(fn,precip(:,k))

end

mTot=squeeze(sum(precip,2)); % monthly totals in each column 1-12

yTot=sum(mTot,2); % yearly total

dpb
on 24 Mar 2015

Edited: dpb
on 24 Mar 2015

Reused d again, change the day index retrieved from the ymdh array to something else; that just wiped out the directory structure. That's a lesson in debugging, use the debugger and see what is going on...

BTW w/ that correction (I used da and mo instead of d and m), the above ran to completion with your subdirectory of files.

Again, you'll have to decide how to deal with the issue of missing data somehow; perhaps that's not an issue in general but the samples you've given do have at least one day missing a 3-hr period that was ignored so that means by similarity that that corresponding month/year is missing a day and hence also is missing which means that the year altogether would be missing. That seems like probably too onerous of a criterion but I can't tell you how to treat missing values; that's a problem you've got to address.

Damith
on 24 Mar 2015

Edited: Damith
on 24 Mar 2015

I modified the code as shown below and checked the outputs of mTot and yTot and the results seems fine. But I want to write the monthly outputs as csv files. I tried the code below the for loop at the end but did not write the csv file. Any idea?

clear all

cd ('C:\Users\Desktop\test_1999')

dd=dir('TRMM_*_newntcl.csv');

ymdh=cell2mat(textscan([dd.name], ...

'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...

'collectoutput',true));

idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations

N=length(idx);

idx=[idx;length(ymdh)+1];

precip=nan(51681,31,12); % preallocate on max days/mo by 12 mo/yr

k=0;

kk=0;

for i=1:length(idx)-1

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

if ~isOK, continue, end % either aren't eight or it's a wrong day mixed in

data=zeros(51681,8); % preallocate based on num lat/lon points

for j=0:7 % ok, process each of these, they're one day's worth

data(:,j+1)=csvread(dd(idx(i)+j).name,0,2); % add hourly data

end

k=k+1;

m=ymdh(idx(i),2); % this month

d=ymdh(idx(i),3); % this day

precip(:,d,m)=sum(data,2); % the daily sums in column day of month

fn=dd(idx(i)).name; fn(16:17)=[]; % fn(1:1)='O';

disp(fn)

csvwrite(fn,precip(:,k))

end

for ii=1:length(m)-1

mTot=squeeze(sum(precip,2)); % monthly totals in each column 1-12

fnn=dd(idx(ii)).name; fnn(13:17)=[]; % fn(1:1)='O';

kk=kk+1;

disp(fnn)

csvwrite(fnn,mTot(:,kk))

end

yTot=sum(mTot,2);

dpb
on 25 Mar 2015

for ii=1:length(m)-1

There is no "m"???

The array is complete already after the previous sum; simply write each column to a file of choice; the squeeze results in the original 3D array being converted to 2D as the collapsed previous 2nd dimension of 1 is removed.

At this point I'm going to leave accomplishing the above as "exercise for the student"; you need to develop some facility on your own, I'm not always going to be around. But, the loop if you do want a single file for every month and year goes after the sum, not before.

Damith
on 25 Mar 2015

Edited: Damith
on 29 Mar 2015

Thanks again.

I have made mistake previously. See the code below. I think I am having a problem here. I tested the code with 'test_2003' data and looked at the mTot matrix (see the image below). The code does not produce the monthly sum for the months of 2 and 4 (see the image below).

Please download 'test_2003' data at :

https://bft.usu.edu/r7ht8

My problem to obtain the file name to loop through with year and month write to csv files from the _mTot_matrix.

clear all

cd ('C:\Users\Desktop\test_2003')

xy=xlsread('TRMM_2003_01_0100_newntcl.csv', '', 'A1:B51681');

d=dir('TRMM_*_newntcl.csv');

ymdh=cell2mat(textscan([d.name], ...

'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...

'collectoutput',true));

idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations

N=length(idx);

idx=[idx;length(ymdh)+1];

precip=nan(51681,31,12); % preallocate on max days/mo by 12 mo/yr

k=0;

for i=1:length(idx)-1

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

if ~isOK, continue, end % either aren't eight or it's a wrong day mixed in

data=zeros(51681,8); % preallocate based on num lat/lon points

for j=0:7 % ok, process each of these, they're one day's worth

data(:,j+1)=csvread(d(idx(i)+j).name,0,2); % add hourly data

end

k=k+1;

mo=ymdh(idx(i),2); % this month

da=ymdh(idx(i),3); % this day

precip(:,da,mo)=sum(data,2); % the daily sums in column day of month

fn=d(idx(i)).name; fn(16:17)=[]; % fn(1:1)='O';

disp(fn)

csvwrite(fn,[xy precip(:,k)])

end

mTot=squeeze(sum(precip,2));

dpb
on 25 Mar 2015

Edited: dpb
on 25 Mar 2015

"...does not produce the monthly sum for the months of 2 and 4..."

That would indicate there's missing values at least one day within those months. That's what I outlined just a couple of comments ago that you've got to decide how to treat missing data--

"Again, you'll have to decide how to deal with the issue of missing data somehow; perhaps that's not an issue in general but the samples you've given do have at least one day missing a 3-hr period that was ignored so that means by similarity that that corresponding month/year is missing a day and hence also is missing which means that the year altogether would be missing. That seems like probably too onerous of a criterion but I can't tell you how to treat missing values; that's a problem you've got to address."

As that note says, if you carry thru to the higher levels the same logic of ignoring a missing hour or two meaning to ignore that day, then if that day is missing, the corresponding month isn't complete and so that month will be missing. Consequently, at the next step, that month is missing and so the year will be incomplete and the year will also be missing.

I don't (and can't) know what you want to do about missing values...you can pretend they're zero, or try to impute something for those missing times or you can simply live with the resulting holes. The data are what they are, I don't think there's anything wrong with the code altho not having that particular set of data I can't reproduce it here (and won't, this is, after all in the end, your research, not mine...)

If you want to "just see what happens" you can either

- Use zeros instead of nan to initialize the array which is the "assume zero" option, or
- Use nansum if have the Stat Toolbox to only sum the finite values (which is, in essence, the same assumption, just computed differently and that you do have the info to know which are and are not complete datasets that gets lost the other way)

Alternatively, you could go back and forget about the "missing hours" test and at least pick up partial days where they do exist, but I expect there will probably still be some holes where a full day may have been uncollected.

It's not necessarily an easy problem to solve cleanly...missing data are a pita albeit a fact of life. Again, I can't tell you what is most appropriate for you to do...

Damith
on 25 Mar 2015

yes, thats due to NaN for months which do not have 31 days (e.g. Feb, April etc) , so I used nansum funtion so I modified as follows,

mTot=squeeze(nansum(precip,2));

gives me the correct result.

So, the modified last piece of code is below. But, if I can get the number of columns which has data I can autimate the loop.

mTot=squeeze(nansum(precip,2)); % monthly totals in each column 1-12

for ik=1:12

cm1=sprintf('%3d',100+ik);

cm=cm1(2:3);

fnn=['TRMM_',cm,'_newntcl.csv'];

disp(fnn)

csvwrite(fnn,[xy mTot(:,ik)])

end

dpb
on 25 Mar 2015

Edited: dpb
on 26 Mar 2015

"due to NaN for months which do not have 31 days..."

Ah, yes. I didn't include it explicitly but my thinking was that one would use another logical array that overlays the rectangular array that is T for existing days in year by month. It would automagically account for leap year by looking at the year.

I tend to simply assume an implementer/end user will do the dirty work of finishing off a basic idea thrown together here, sorry...

ADDENDUM

I presume you've got this all working by now, and I've not done anything else beyond this point to sand off the rough edges but it came to me that perhaps the "clever" way is while doing the loop on days to zero fill the particular column/plane if there is data; then the summation will work as wanted for existing data but you'll still have the NaN indication for the missing locations. Still don't know how significant that's going to be in the larger scheme of things, but it's a starting point that will let you know rather than hiding a potential problem if simply either all zero fill or just nansum w/o checking.

Damith
on 26 Mar 2015

@dbp,

I was checking the daily output files again for year 2004 and I just noticed that the daily precipitation values calculated using the previous code (i.e. Code 1 below) is correct opposed to the code posted later with sum for monthly (i.e. Code 2). Therefore, I am extending the code 1 to calculate montly sum. Any comments?

Code 1:

d=dir('TRMM_*_newntcl.csv');

ymdh=cell2mat(textscan([d.name], ...

'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...

'collectoutput',true));

idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations

N=length(idx);

idx=[idx;length(ymdh)+1];

precip=nan(51681,fix(N)); % preallocate on num whole days

k=0;

for i=1:length(idx)-1

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

if ~isOK, continue, end % either aren't eight or it's a wrong day mixed in

data=zeros(51681,8); % preallocate based on num lat/lon points

for j=0:7 % ok, process each of these, they're one day's worth

data(:,j+1)=csvread(d(idx(i)+j).name,0,2); % add hourly data

end

k=k+1;

precip(:,k)=sum(data,2); % the daily sums are in an array

fn=d(idx(i)).name; fn(16:17)=[]; %fn(1:1)='O';

disp(fn)

csvwrite(fn,[xy precip(:,k)])

end

precip(:,k+1:end)=[]; % remove columns to leave complete days only

Code 2:

d=dir('TRMM_*_newntcl.csv');

ymdh=cell2mat(textscan([d.name], ...

'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...

'collectoutput',true));

idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations

N=length(idx);

idx=[idx;length(ymdh)+1];

precip=nan(51681,31,12); % preallocate on max days/mo by 12 mo/yr

k=0;

for i=1:length(idx)-1

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

if ~isOK, continue, end % either aren't eight or it's a wrong day mixed in

data=zeros(51681,8); % preallocate based on num lat/lon points

for j=0:7 % ok, process each of these, they're one day's worth

data(:,j+1)=csvread(d(idx(i)+j).name,0,2); % add hourly data

end

k=k+1;

mo=ymdh(idx(i),2); % this month

da=ymdh(idx(i),3); % this day

precip(:,da,mo)=sum(data,2); % the daily sums in column day of month

fn=d(idx(i)).name; fn(16:17)=[]; % fn(1:1)='O';

disp(fn)

csvwrite(fn,[xy precip(:,k)])

end

Damith
on 27 Mar 2015

@dpb

I am trying to understand why the results are different from the two codes. Code 1 seems correct in terms of producing daily results. But, I'm trying to understand why code 2 producing different daily results. Can you help me to expand code 1 by reading the daily csv files (e.g.TRMM_2000_01_30_newntcl.csv) but not storing the output into 3D array. Seems kind of complex to me.

dpb
on 27 Mar 2015

Edited: dpb
on 27 Mar 2015

The difference in the two is that the first simply increments a counter for the column irrespective of whether there is or is not data specifically for that day. The latter, otoh, puts the daily data into the column based on the day of the month.

ADDENDUM

Also remember that the former adds a new column for the same day for the subsequent month whereas the latter puts the total for the day in the same column but in the subsequent plane based on the actual month, 1-12.

I'd probably automate some checks--as suggested once very early on, you could add two rows to the first solution and keep the month and day for each column k in them. Then, use those when traversing the columns to look up the column/plane intersection of the second and see that all() of those corresponding elements are the same. I'm virtually certain it'll check out ok. END

I don't see any reason there should be any other difference between the two nor why there should be any problem in storing the data in the 3D array. Use the debugger to step through a few steps to satisfy yourself the indexing is correct but I'd first suspect k isn't the same as the day value is the cause of any discrepancy.

I suppose it's possible there's some other logic flaw but in all the testing I did here I didn't see any sign of it.

dpb
on 27 Mar 2015

Edited: dpb
on 27 Mar 2015

OK, I did the above test...I modified the original version using simply an index by adding

moda=zeros(2,N); % allocate space for da, mo

and inside the loop when the precip() array is computed and stored, added

mo=ymdh(idx(i),2); % this month

da=ymdh(idx(i),3); % this day

moda(1,k)=mo; moda(2,k)=da; % save the date

which is the same logic as the second for the month and day so that's consistent with its storage. I then saved this set of precipitation totals as precip1 to not overwrite the 3D precip array and ran the two versions on the largest set of data I have here which is that for 1998.

After that, I wrote the following test at the command line --

>> for i=1:length(moda)

if ~all(precip1(:,i)==precip(:,moda(2,k),moda(1,k))),

fprintf('Mo%d Da%d',moda(:,i)),end,end

>>

Note I got no output so that there were no discrepancies found(*); the storage in the 3D array is, commensurate to that in the 2D array and the data in the corresponding vectors when retrieved for the same month and date are identical, as expected.

I see no reason the 3D won't work correctly for every data set as written.

() Just to prove to meself the logic was actually doing what intended, I did things like put a counter in an *else clause to calculate the number of matches and checked that it did in the end equal the length of the array (45 in this case).

Damith
on 27 Mar 2015

Edited: Damith
on 27 Mar 2015

OK. Are you using the daily csv files to read and calculate monthly? or simply not creating a 3D array?

What is precip1 in the abobe code? or what does this do:

for i=1:length(moda)

if ~all(precip1(:,i)==precip(:,moda(2,k),moda(1,k)))

Can I simply replace precip(:,da,mo)=sum(data,2); by precip(:,moda(2,k),moda(1,k)))

Do you think I can still use the squeeze function to calculate monthly with precip

d=dir('TRMM_*_newntcl.csv');

ymdh=cell2mat(textscan([d.name], ...

'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...

'collectoutput',true));

idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations

N=length(idx);

idx=[idx;length(ymdh)+1];

precip=nan(51681,31,12); % preallocate on max days/mo by 12 mo/yr

k=0;

moda=zeros(2,N); % allocate space for da, mo

for i=1:length(idx)-1

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

if ~isOK, continue, end % either aren't eight or it's a wrong day mixed in

data=zeros(51681,8); % preallocate based on num lat/lon points

for j=0:7 % ok, process each of these, they're one day's worth

data(:,j+1)=csvread(d(idx(i)+j).name,0,2); % add hourly data

end

k=k+1;

mo=ymdh(idx(i),2); % this month

da=ymdh(idx(i),3); % this day

moda(1,k)=mo; moda(2,k)=da; % save the date

precip(:,da,mo)=sum(data,2); % the daily sums in column day of month

fn=d(idx(i)).name; fn(16:17)=[]; % fn(1:1)='O';

disp(fn)

csvwrite(fn,[xy precip(:,k)])

end

dpb
on 27 Mar 2015

Edited: dpb
on 27 Mar 2015

As said, I ran two scripts...the one that uses the incremented column irrespective of what the day really is and the second that builds the the 3D array; so that I still had both in memory when they were done, I just renamed precip in the former to precip1.

The if simply tests that all values in the column in the 3D array for the specific month and day are identical to those for the same month and day in the 2D array accounting for where the two are stored relative to each other. As noted, the linear column addressing doesn't account for which month/day it is, specifically, it just adds another column if there are data; if, for example, there's a day missing for Jan 2 of the first year then Jan 3 will be in column 2 and everybody else will also move up one. The same thing happens if there's any other missing days; the subsequent then will be that many days shifted from the nominal position of day-of-year. OTOH, the 3D solution has the advantage it does put all the data in the right location automagically, each 3rd day of the month in column 3, each month of the year in plane M for the correct month, 1 thru 12. The indexing in that loop simply extracts the day and month that were saved for each of the k columns and uses that to translate into the 3D array to get the same actual month/day data which are, of course, then shown to be identical.

The result of the test is to show that the logic building the 3D array works correctly; I don't see why you would want to do anything different...whatever the issue you've got has to be related to the fact that the columns you're comparing aren't the ones that should be compared owing to the storage location absolute column number difference outlined.

ADDENDUM

Both scripts computed the results independently from the .csv files; yes, that was the point of the exercise--to prove that the results are, in fact, identical if you compare the correct columns to each other.

Damith
on 27 Mar 2015

Edited: Damith
on 27 Mar 2015

OK. So, I modified the code as shown below. squeeze doesn't do the correct job. I calculated the monthly values separately and the values in mTot is different after 1st month (i.e. Jan).

I used zeros for precip initialization (see the code below) and used sum instead of namsum. I noticed the problem now is the squeeze function. It does not produce the expected outcome. Is there a way to get the sum from a 3D array differently w/out using squeeze function.?

d=dir('TRMM_*_newntcl.csv');

ymdh=cell2mat(textscan([d.name], ...

'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...

'collectoutput',true));

idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations

N=length(idx);

idx=[idx;length(ymdh)+1];

precip=zeros(51681,31,12); % preallocate on max days/mo by 12 mo/yr

k=0;

moda=zeros(2,N); % allocate space for da, mo

for i=1:length(idx)-1

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

if ~isOK, continue, end % either aren't eight or it's a wrong day mixed in

data=zeros(51681,8); % preallocate based on num lat/lon points

for j=0:7 % ok, process each of these, they're one day's worth

data(:,j+1)=csvread(d(idx(i)+j).name,0,2); % add hourly data

end

k=k+1;

mo=ymdh(idx(i),2); % this month

da=ymdh(idx(i),3); % this day

moda(1,k)=mo; moda(2,k)=da; % save the date

%precip(:,da,mo)=sum(data,2); % the daily sums in column day of month

precip(:,moda(2,k),moda(1,k))=sum(data,2); % the daily sums in column day of month

fn=d(idx(i)).name; fn(16:17)=[]; % fn(1:1)='O';

disp(fn)

csvwrite(fn,[xy precip(:,k)])

end

mTot=squeeze(sum(precip,2));

dpb
on 27 Mar 2015

Well, w/o trying to debug what you've modified needlessly, my first guess is that the change you made is wrong. You're now mixing up the k indexing with the 3D and I'm betting it's not consistent that way w/o really working on it which I'm not spending time doing since the original 3D version was shown to be correct above. If you're going to use the 3D array, use it; if not do whatever you want however you see fit.

With the properly aligned 3D array wherein the months are all in the same column on top of each other, squeeze after the summation does "do the right thing"; it collapses the resulting one dimension leaving a subsequent 2D array.

To illustrate, we'll use some dummy data on a smaller scale; the principle is identical...

>> p=repmat([0.1:0.1:0.7],10,1); % small set of a vector to 2D

>> for i=2:3, p(:,:,i)=i*p(:,:,1); % add two more planes for months

>> p

p(:,:,1) =

0.1000 0.2000 0.3000 0.4000 0.5000 0.6000 0.7000

0.1000 0.2000 0.3000 0.4000 0.5000 0.6000 0.7000

0.1000 0.2000 0.3000 0.4000 0.5000 0.6000 0.7000

...

0.1000 0.2000 0.3000 0.4000 0.5000 0.6000 0.7000

0.1000 0.2000 0.3000 0.4000 0.5000 0.6000 0.7000

p(:,:,2) =

0.2000 0.4000 0.6000 0.8000 1.0000 1.2000 1.4000

0.2000 0.4000 0.6000 0.8000 1.0000 1.2000 1.4000

0.2000 0.4000 0.6000 0.8000 1.0000 1.2000 1.4000

...

0.2000 0.4000 0.6000 0.8000 1.0000 1.2000 1.4000

0.2000 0.4000 0.6000 0.8000 1.0000 1.2000 1.4000

0.2000 0.4000 0.6000 0.8000 1.0000 1.2000 1.4000

p(:,:,3) =

0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000

0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000

0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000

0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000

0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000

0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000

0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000

0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000

0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000

0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000

>>

So, above we've got a small test case of 7 days by 3 months...let's create the "monthly" and "yearly" sums...

>> mTot=sum(p,2)

mTot(:,:,1) =

2.8000

2.8000

...

2.8000

2.8000

2.8000

2.8000

mTot(:,:,2) =

5.6000

5.6000

5.6000

...

5.6000

mTot(:,:,3) =

8.4000

8.4000

...

8.4000

8.4000

>> whos mTot

Name Size Bytes Class Attributes

mTot 10x1x3 240 double

OK, the above is obviously the correct sum for the seven columns but it's spread out over three planes with the values for each coordinate a vector in that plane. That's inconvenient so instead

>> mTot=squeeze(mTot)

mTot =

2.8000 5.6000 8.4000

2.8000 5.6000 8.4000

2.8000 5.6000 8.4000

2.8000 5.6000 8.4000

2.8000 5.6000 8.4000

2.8000 5.6000 8.4000

2.8000 5.6000 8.4000

2.8000 5.6000 8.4000

2.8000 5.6000 8.4000

2.8000 5.6000 8.4000

>>

So the squeeze operation removes the singleton second subscript and leaves us with a 2D array of the same values where now the 2nd dimension is that of the months. It's identical in result whether use two steps as above or collapse into one as I did in the original code.

Now, the yearly total is again the sum across the columns which will leave a single vector of length of the number of coordinates.

>> yTot=sum(mTot,2);

yTot =

16.8000

16.8000

16.8000

16.8000

16.8000

16.8000

16.8000

16.8000

16.8000

16.8000

>>

The size of the arrays is absolutely immaterial to the shape operations as outlined above the only difference in this example and the real thing is the size of the arrays; the resulting shape after the summation and dimension reduction is identical and is, as shown above, indeed what you want/need.

The only "trick" is again treating the shorter months and missing days; as I suggested if you fill the full column with zeros initially for each day that has data, then the only NaN left will be those for missing days and sum will reflect that in the totals.

Damith
on 27 Mar 2015

Edited: Damith
on 29 Mar 2015

@dbp,

So, frustrated, I checked the output using a simple script to observe the daily and monthly outputs. I am sorry to bother you so much but I am kind of frustrated since the code below still gives me the wrong outputs. PLEASE DOWNLOAD THE DATA FROM THE LINK ATTACHED AND TRY IT. (data for year 2004, my intention is to run this code yearly, i.e. in one folder there are no mixed years).

So, please do the necessary changes to code below.

d=dir('TRMM_*_newntcl.csv');

ymdh=cell2mat(textscan([d.name], ...

'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...

'collectoutput',true));

idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations

N=length(idx);

idx=[idx;length(ymdh)+1];

precip=zeros(51681,31,12); % preallocate on max days/mo by 12 mo/yr

k=0;

moda=zeros(2,N); % allocate space for da, mo

for i=1:length(idx)-1

isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;

if ~isOK, continue, end % either aren't eight or it's a wrong day mixed in

data=zeros(51681,8); % preallocate based on num lat/lon points

for j=0:7 % ok, process each of these, they're one day's worth

data(:,j+1)=csvread(d(idx(i)+j).name,0,2); % add hourly data

end

k=k+1;

mo=ymdh(idx(i),2); % this month

da=ymdh(idx(i),3); % this day

moda(1,k)=mo; moda(2,k)=da; % save the date

precip(:,da,mo)=sum(data,2); % the daily sums in column day of month

fn=d(idx(i)).name; fn(16:17)=[]; % fn(1:1)='O';

disp(fn)

csvwrite(fn,[xy precip(:,k)])

end

I checked the daily and monthly outputs using the code below from the precip 3D array and compared with the sum of data directly read from the csv (sub-daily) files. I compared the mTot_Jan, mTot_Feb etc with the values I directly read from the sub-daily csv files. So, the monthly values are WRONG for some months.

for iJan=1:31

JanT(:,iJan)=precip(:,iJan,1);

end

if(mod(ymdh(1,1),4)==0)

mend=29;

else mend=28;

end

for iFeb=1:mend

FebT(:,iFeb)=precip(:,iFeb,2);

end

for iMar=1:31

MarT(:,iMar)=precip(:,iMar,3);

end

for iApr=1:30

AprT(:,iApr)=precip(:,iApr,4);

end

for iMay=1:31

MayT(:,iMay)=precip(:,iMay,5);

end

for iJun=1:30

JunT(:,iJun)=precip(:,iJun,6);

end

for iJul=1:31

JulT(:,iJul)=precip(:,iJul,7);

end

for iAug=1:31

AugT(:,iAug)=precip(:,iAug,8);

end

for iSep=1:30

SepT(:,iSep)=precip(:,iSep,9);

end

for iOct=1:31

OctT(:,iOct)=precip(:,iOct,10);

end

for iNov=1:30

NovT(:,iNov)=precip(:,iNov,11);

end

for iDec=1:31

DecT(:,iDec)=precip(:,iDec,12);

end

mTot_Jan=sum(JanT,2);

mTot_Feb=sum(FebT,2);

mTot_Mar=sum(MarT,2);

mTot_Apr=sum(AprT,2);

mTot_May=sum(MayT,2);

mTot_Jun=sum(JunT,2);

mTot_Jul=sum(JulT,2);

mTot_Aug=sum(AugT,2);

mTot_Sep=sum(SepT,2);

mTot_Oct=sum(OctT,2);

mTot_Nov=sum(NovT,2);

mTot_Dec=sum(DecT,2);

dpb
on 28 Mar 2015

"...monthly values are WRONG for some months."

How about a particular month and value you think is in error?

dpb
on 28 Mar 2015

Edited: dpb
on 28 Mar 2015

I ran the identical code as that I had in the updated "Answer" http://www.mathworks.com/matlabcentral/answers/182710-creating-a-matrix-after-giving-initial-conditions#answer_172554 with the exception of a progress indicator of which file was being processed and the day being built which have no bearing on the actual logic and looked at a particular location...the first node for all days for the first month...

>> precip(1,:,1)

ans =

Columns 1 through 9

0 0.0400 0 0 0 0.9400 0 0 0

Columns 10 through 18

0 0 0 0.3900 0.0100 0 0 0 0

Columns 19 through 27

0 0 0.1600 0 0 8.8100 0.0200 2.1000 0

Columns 28 through 31

0 0.0200 0 0

That's the first month...day 24 seems interesting; it's got a sizable total accumulation so let's see if it's ok...

>> d24=dir('TRMM_2004_01_24*_newntcl.csv'); % return all files for Jan24

>> tot=0;for i=1:length(d24)

d=csvread(d24(i).name,0,2);

tot=tot+d(1); disp([d(1) tot]),end

0 0

0.1700 0.1700

8.6400 8.8100

0 8.8100

0 8.8100

0 8.8100

0 8.8100

0 8.8100

>>

Voila!! They match. That a given day/month matches since it's a generic routine seems pretty conclusive to me there's not a logic error in the building of the 3D array nor the summations. I have essentially 100% confidence the same result would hold for any combination of node and day and month. I can still only guess that you've got something amiss in the indexing.

Note there are only six months included and only the first 24 days of that month are complete in the data that was in the download unless if you somehow have additional data other than that and are getting some other values that aren't in the processing by a file-naming issue or such.

ADDENDUM

Oh, the other thing that would screw up a comparison would be to include the values from the incomplete June 25 data in your totals as compared to the processed data that requires a full day (or any other missing days; I glanced at the list as it was processed and it looked complete but I didn't check that religiously; it's possible there's another incomplete day in there that I didn't notice...but if you aren't culling out those incomplete days your comparisons will be different assuming there's any accumulation in one of those hourly files that was discarded because the day wasn't complete.

If that were the case then we're back to the previous discussion of your having to decide how you want to treat them.

ADDENDUM 2

Just to check on the monthly totals from raw data as well...

>> jTot=0;

>> dJan=dir('TRMM_2004_01_*_newntcl.csv');

>> for i=1:length(dJan)

p=csvread(dJan(i).name,0,2);

jTot=jTot+p(1);

if p(1)>0

disp([dJan(i).name ' ' num2str([p(1) jTot])]),end,end

TRMM_2004_01_0206_newntcl.csv 0.04 0.04

TRMM_2004_01_0612_newntcl.csv 0.94 0.98

TRMM_2004_01_1315_newntcl.csv 0.39 1.37

TRMM_2004_01_1412_newntcl.csv 0.01 1.38

TRMM_2004_01_2115_newntcl.csv 0.16 1.54

TRMM_2004_01_2403_newntcl.csv 0.17 1.71

TRMM_2004_01_2406_newntcl.csv 8.64 10.35

TRMM_2004_01_2515_newntcl.csv 0.02 10.37

TRMM_2004_01_2603_newntcl.csv 1.55 11.92

TRMM_2004_01_2606_newntcl.csv 0.55 12.47

TRMM_2004_01_2909_newntcl.csv 0.02 12.49

>> mTot(1,:)

ans =

Columns 1 through 9

12.4900 21.4800 3.0000 8.7200 27.4500 9.7000 0 0 0

Columns 10 through 12

0 0 0

>>

Again, the values add up correctly.

Damith
on 28 Mar 2015

Edited: dpb
on 28 Mar 2015

The monthly values are wrong for the months from March, April, May and June for year 2004. (Download the attached excel file: https://bft.usu.edu/vbw9r And I checked for other years, and it's the same. So, if I remove any days that does not complete for any given month (i.e. in this case if I remove days from Jun 1 to 24, will it calculate the monthly values correct.?

I prefer to check whether there are any incomplete days for a given month (for example for 2004, June is incomplete so remove the days from Jun 1 to 24 for calulation for monthly totals.

dpb
on 28 Mar 2015

Edited: dpb
on 28 Mar 2015

March looks ok to me...

>> mTot(1,:)

ans =

Columns 1 through 9

12.4900 21.4800 3.0000 8.7200 27.4500 9.7000 ...

Columns 10 through 12

0 0 0

>> dMar=dir('TRMM_2004_03_*_newntcl.csv');

>> length(dMar)/8 % check it's got full complement

ans =

31

>> mrTot=0;

>> for i=1:length(dMar),p=csvread(dMar(i).name,0,2);mrTot=mrTot+p(1);if p(1)>0,disp([dMar(i).name ' ' num2str([p(1) mrTot])]),end,end

TRMM_2004_03_0209_newntcl.csv 0.15 0.15

TRMM_2004_03_0212_newntcl.csv 0.02 0.17

TRMM_2004_03_0218_newntcl.csv 0.02 0.19

TRMM_2004_03_0821_newntcl.csv 0.04 0.23

TRMM_2004_03_1200_newntcl.csv 0.41 0.64

TRMM_2004_03_1306_newntcl.csv 0.17 0.81

TRMM_2004_03_1312_newntcl.csv 0.05 0.86

TRMM_2004_03_1618_newntcl.csv 0.01 0.87

TRMM_2004_03_1709_newntcl.csv 0.23 1.1

TRMM_2004_03_1721_newntcl.csv 0.01 1.11

TRMM_2004_03_2009_newntcl.csv 0.56 1.67

TRMM_2004_03_2115_newntcl.csv 0.09 1.76

TRMM_2004_03_2303_newntcl.csv 1.01 2.77

TRMM_2004_03_2412_newntcl.csv 0.01 2.78

TRMM_2004_03_2709_newntcl.csv 0.1 2.88

TRMM_2004_03_2915_newntcl.csv 0.04 2.92

TRMM_2004_03_3003_newntcl.csv 0.08 3

>>

That matches mTot(1,3) as expected. Try another random point...

>> idx-randi(length(precip))

idx =

9399

>> dMar=dir('TRMM_2004_03_*_newntcl.csv');

>> mrTot=0;

>> for i=1:length(dMar),p=csvread(dMar(i).name,0,2);mrTot=mrTot+p(idx);

if p(idx)>0,disp([dMar(i).name ' ' num2str([p(idx) mrTot])]),end,end

TRMM_2004_03_0809_newntcl.csv 0.01 0.01

TRMM_2004_03_0815_newntcl.csv 2.83 2.84

TRMM_2004_03_1300_newntcl.csv 0.04 2.88

TRMM_2004_03_1512_newntcl.csv 0.08 2.96

TRMM_2004_03_1806_newntcl.csv 0.02 2.98

TRMM_2004_03_2303_newntcl.csv 11.07 14.05

TRMM_2004_03_2306_newntcl.csv 0.04 14.09

TRMM_2004_03_2309_newntcl.csv 1.43 15.52

TRMM_2004_03_2706_newntcl.csv 0.03 15.55

TRMM_2004_03_3015_newntcl.csv 0.08 15.63

TRMM_2004_03_3103_newntcl.csv 0.29 15.92

>> mTot(idx,3)

ans =

15.9200

>>

I can't seem to find a value that doesn't compare ok as long as there are full days.

I don't understand your other comment regarding what you prefer; as requested the script ignores all days which do not have all eight 3-hr readings as we've discussed before.

Again, how you want to handle the missing values is a decision you have to make as to what it means to have data recorded for a partial day and then how that is going to be propagated up to monthly and yearly sums. I've no real input on how you should do that or whether you even should or not.

As previously noted, however, the zero-fill and the nansum options both equate to the same thing as presuming the data are complete with no precipitation for any of those hours/days which clearly isn't the right answer in general altho it undoubtedly is for specific hourly observations.

It's why I had suggested NaN for the initial full array allocation and then to zero-fill for a full column that is processed before storing the data. That then only propagates a NaN for sums that do include no day but avoids doing so for the months with less than 31 days without negative consequences on those months totals but will show you where there are holes in you cumulative data. As noted, I fear that may result in having no yearly totals because you probably have no records that are truly complete. If so, yet again, then, you're back to the problem of what's a reasonable alternative that doesn't corrupt the database too badly? I've no klew on what to tell you about that conundrum, unfortunately, sorry...

But, my conclusion is the code is working as intended to process the 3-hr data files to aggregate them; I don't see any case that actually generates a bogus result for the summations as compared to the individual files.

dpb
on 28 Mar 2015

Edited: dpb
on 28 Mar 2015

"As previously noted, however, the zero-fill and the nansum options both equate to the same thing as presuming the data are complete with no precipitation for any of those hours/days ..."

That is, once you accumulate the sums by either of these ways you've lost the indication of which were or were not full days; that piece of information has been lost. Consequently, it appears from all that anybody could tell by reading your newly-created data files that that is what the results were as measured in toto; it's a false impression, however, we know there are at least some days that have partial records that aren't being reported and we also know there are large chunks missing entirely (or at least with the data I've seen so far). That surely should be made clear to whoever is planning on using these data for anything other than just observing numbers where there happen to be some--surely one could make no claim of an annual total of any accuracy that way or compare to other historical data or whatever one had in mind to do with the data.

dpb
on 28 Mar 2015

OK, one last try to verify the previous conjecture--let's look at June.

>> juTot=0;

>> for i=1:length(dJun)

p=csvread(dJun(i).name,0,2);

juTot=juTot+p;

if any(p)>0,disp([dJun(i).name ]),end,end

TRMM_2004_06_0100_newntcl.csv

TRMM_2004_06_0106_newntcl.csv

TRMM_2004_06_0112_newntcl.csv

TRMM_2004_06_0118_newntcl.csv

...

TRMM_2004_06_2200_newntcl.csv

TRMM_2004_06_2206_newntcl.csv

TRMM_2004_06_2212_newntcl.csv

TRMM_2004_06_2218_newntcl.csv

TRMM_2004_06_2300_newntcl.csv

TRMM_2004_06_2306_newntcl.csv

TRMM_2004_06_2312_newntcl.csv

TRMM_2004_06_2318_newntcl.csv

TRMM_2004_06_2400_newntcl.csv

TRMM_2004_06_2406_newntcl.csv

TRMM_2004_06_2412_newntcl.csv

TRMM_2004_06_2418_newntcl.csv

TRMM_2004_06_2500_newntcl.csv

TRMM_2004_06_2506_newntcl.csv

TRMM_2004_06_2512_newntcl.csv

TRMM_2004_06_2518_newntcl.csv

>>

>> max(abs(juTot-mTot(:,6))) % oooh, there is a problem!!!

ans =

49.6600

>>

OK, now let's limit the loop to only the full days of 1-24...

>> juTot=0;

>> N=8*floor(length(dJun)/8)

for i=1:N

p=csvread(dJun(i).name,0,2);

juTot=juTot+p;

if any(p)>0,disp([dJun(i).name ]),end,end

TRMM_2004_06_0100_newntcl.csv

TRMM_2004_06_0106_newntcl.csv

TRMM_2004_06_0112_newntcl.csv

TRMM_2004_06_0118_newntcl.csv

TRMM_2004_06_0200_newntcl.csv

TRMM_2004_06_0206_newntcl.csv

TRMM_2004_06_0212_newntcl.csv

TRMM_2004_06_0218_newntcl.csv

TRMM_2004_06_0300_newntcl.csv

TRMM_2004_06_0306_newntcl.csv

...

TRMM_2004_06_2400_newntcl.csv

TRMM_2004_06_2406_newntcl.csv

TRMM_2004_06_2412_newntcl.csv

TRMM_2004_06_2418_newntcl.csv

>> i

i =

192

>> max(abs(juTot-mTot(:,6)))

ans =

8.5265e-14

>> all(abs(juTot-mTot(:,6))<5*eps(max(juTot)))

ans =

1

>>

OK, they're not absolutely identical owing to the difference in order in the summations between the script and the one-at-a-time processing, but the difference is in the expected magnitude of floating point roundoff.

NB: also that this test is accumulating the full-length summation rather than just a single index so we've proven the contention made earlier than the point index used is immaterial to testing the logic; testing one value is as good as testing them all for storage location.

It also proves the point made earlier that as long as one processes only complete data then the sums being built in the script are ok based on the storage location by column/plane by day/month; it's when you add in the values that are collected for a day that isn't complete that a real discrepancy occurs.

So, it's a decision you've got to make about how to handle the incomplete and missing data.

Damith
on 29 Mar 2015

Edited: Damith
on 29 Mar 2015

OK. I understand. Can you please help me to modify the code which calculates monthly values using complete days.? For example, for year 2004, months from Jan to May since June monthly days are not complete.

I need to obtain sum of monthly precip for months which has complete days. For example, for year 2004, I need to obtain the sum of monthy values from Jan to May since for year 2004 since number of days for for June is not complete. The same thing we adopted when calculating daily values using sub-daily, i.e. we disregard the days which did not have 8 sub-daily data.

dpb
on 29 Mar 2015

Edited: dpb
on 29 Mar 2015

That's testing whether there is NaN in location that is valid for the day of month for the given month (remembering leap year of course). That's the idea I outlined earlier of the logic array to select those valid days. How much effort you must go to is dependent upon whether it's possible there are months with "holes" in them or whether it's always only whether there are 1:N<=M where N is the number of days of data and M is the number of days in the given month.

Another possible tack is to add a sum for the number of days that have been accumulated for each month by month in the script at the point the precip array is being populated; then you can simply check that the content of that array is the same as the expected number of days in the month for the given year.

Unfortunately, the simple expedient of zero-ing the day vector for the month doesn't work to prevent the partial months from being totalled; it then would include June in the 2004 sum so have to go to the (slight) extra effort.

The second idea above would be something like

nDays=zeros(12,1); % initialize before the outer loop

and then after the computation of the month,day variables mo, da insert

nDays(mo)=nDays(mo)+1;

This will give you the count of the actual full days that were found which you can compare to the expected number.

Damith
on 29 Mar 2015

Edited: Damith
on 29 Mar 2015

OK. I tried this as suggested. But still mTot gives the exact same result as shown in the Excelwork sheet. I checked the number of sub-daily files and there are no missing files. For examples, from Jan 01 - Jun 24 (176 days) and I have 176 * 8 = 1408 files.

I tried something like this by reading daily csv files (e.g.TRMM_2004_01_01_newntcl.csv and so on). Can you please suggest me a efficient method that I can adopt for all years and counting for months which has complete daily csv files?

Now the total is matching with the Excel file. (See the image below)

d=dir('TRMM_*_newntcl.csv');

ymdh=cell2mat(textscan([d.name], ...

'TRMM_%4d_%2d_%2d_newntcl.csv', ...

'collectoutput',true));

idx=find(ymdh(:,end)==1);

N=length(idx);

idx=[idx;length(ymdh)+1];

arr=length(idx)-1:-1:1;

Jan=zeros(51681,31);

Feb=zeros(51681,29);

Mar=zeros(51681,31);

Apr=zeros(51681,30);

May=zeros(51681,31);

Jun=zeros(51681,30);

Jul=zeros(51681,31);

Aug=zeros(51681,31);

Sep=zeros(51681,30);

Oct=zeros(51681,31);

Nov=zeros(51681,30);

Dec=zeros(51681,31);

% January

for ik=1:length(idx)-arr(1)

for j=0:30 % ok, process each of these, they're one day's worth

Jan(:,j+1)=csvread(d(idx(ik)+j).name,0,2); % add hourly data

end

end

% February

if(mod(ymdh(1,1),4)==0)

mend=29;

else mend=28;

end

for ik=1:length(idx)-arr(2)

for j=0:mend-1 % ok, process each of these, they're one day's worth

Feb(:,j+1)=csvread(d(idx(ik)+j).name,0,2); % add hourly data

end

end

% March

for ik=1:length(idx)-arr(3)

for j=0:30 % ok, process each of these, they're one day's worth

Mar(:,j+1)=csvread(d(idx(ik)+j).name,0,2); % add hourly data

end

end

% April

for ik=1:length(idx)-arr(4)

for j=0:30-1 % ok, process each of these, they're one day's worth

Apr(:,j+1)=csvread(d(idx(ik)+j).name,0,2); % add hourly data

end

end

% May

for ik=1:length(idx)-arr(5)

for j=0:30 % ok, process each of these, they're one day's worth

May(:,j+1)=csvread(d(idx(ik)+j).name,0,2); % add hourly data

end

end

dpb
on 29 Mar 2015

Well, of course it does unless you make use of the information collected...as requested, I suggested (two, actually) a way to find out how to sum the full months only. It's an exercise-for-the-student to finish the detail of completing the summation of the subset of those for which

nDays(month)==Expected(month)

dpb
on 30 Mar 2015

Edited: dpb
on 30 Mar 2015

Somewhat against my better judgment (I think you need to develop some more critical thinking of solving the details of the logic problem outlined rather than relying on somebody else to provide complete solutions down to every detail), the following should work for any combination of missing days...

In the early portion of the script after building the ymdh array of dates/times from the file names insert

edpm=bsxfun(@eomday,double(ymdh(1)),[1:12].');

Then, after building the precip array, before the summation for the totals,

precip(isnan(precip(:,:,nDays==edpm)))=0;

Then revert to using builtin sum instead of nansum in

mTot=squeeze(sum(precip,2));

and joy should ensue.

I'll leave it to you to decipher why/how it works; hopefully from that exercise you'll gain a little bit of the thought process of using logical addressing within Matlab which is a key concept to grasp in effective use thereof.

ADDENDUM

The above will ensure complete months, if you do want to include any month in which there is a complete day, change the logic test from "==" to ">0". If you want there to be some minimum number of days in a month, then that would replace the 0, of course. In this case there's no need for the expected number of days/month, it's simply a test of whether there were any (or some number of) days counted.

I'll also reiterate the caution made previously in that if you do this, the data files created need to somewhere contain the information that some month(s) may be incomplete. This information is lost in the final output above; only keeping the result of the logic test for the specific month along with the totals will provide any way for the user of such data to know that is so from that file alone. Doing anything less is, in my opinion, a serious breach of scientific ethics.

One way, at a minimum, if you for some reason can't include the data in the files themselves would be to add an additional numeric code into the file name that would be the nDays(mo) value for the month. Then, at least the user would be able to check that against the expected days/month for complete datasets.

Similar arguments apply to the yearly totals, of course, but one presumes there may not be any of those that are complete in having every day of every month for the entire year. Still, encoding the days that are included would be significant information if can't include in the file the complete record of which days are missing/present.

dpb
on 31 Mar 2015

BTW, if you haven't yet figured this out, you could save tons of time if you were to run the above script once for each year subdirectory and then

save precip

in that working directory. Then you can simply

load precip

for that year and experiment with how to create and save the monthly and yearly sums to your heart's content and build whatever files from there without having to reread the same values over and over and over which is quite slow.

### See Also

### Categories

### Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!**An Error Occurred**

Unable to complete the action because of changes made to the page. Reload the page to see its updated state.

Select a Web Site

Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .

You can also select a web site from the following list

How to Get Best Site Performance

Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.

Americas

- América Latina (Español)
- Canada (English)
- United States (English)

Europe

- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)

- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom (English)

Asia Pacific

- Australia (English)
- India (English)
- New Zealand (English)
- 中国
- 日本Japanese (日本語)
- 한국Korean (한국어)