How to reshape efficiently Panel Data

I have a database where I store historical prices of ~1000 stocks for a +10yr period in a panel data format (nx4) ([Ticker] [Date] [Price] [Return]). I import the data into Matlab and then try to construct two historical matrices (one for Price and another for Returns) in the format (columns->tickers, rows->dates, values -> Price/Return). In order to do that I use the following code:
historical_returns; %panel data cell array imported from the database
historical_dates; %array that includes all historical dates
tickers; %array that includes all the tickers
Matrix_Prices = zeros(length(historical_dates),length(tickers));
Matrix_Returns = zeros(length(historical_dates),length(tickers));
for i=1:size(historical_returns,1)
temp_ticker = historical_returns{i,1};
temp_date = historical_returns{i,2};
temp_price = historical_returns{i,3};
temp_return = historical_returns{i,4};
row = find(strcmpi(historical_dates,temp_date));
column = find(strcmpi(tickers,temp_ticker));
Matrix_Prices(row,column) = temp_price;
Matrix_Returns(row,column) = temp_return;
end
The code above takes ~200sec to run assuming historical_returns has a size of 1mmx4 (which increases as the # of tickers and dates increase). I am trying to optimize the code (if possible), so I am not sure if there is a faster way to construct Matrix_Returns. I have thought of storing the data in a different format, but given the limit of column size in Access and SQL databases, I cannot create a new column for each ticker.

 Accepted Answer

Guillaume
Guillaume on 6 Jan 2015
Edited: Guillaume on 6 Jan 2015
You can replace the loop with the following:
Matrix_Prices = zeros(numel(historical_dates), numel(tickers));
Matrix_Returns = zeros(numel(historical_dates), numel(tickers));
[~, rows] = ismember(lower(historical_returns(:, 1)), lower(historical_dates)); %use lower case to make case insensitive comparison
[~, cols] = ismember(lower(historical_returns(:, 2)), lower(tickers));
indices = sub2ind(size(Matrix_Prices), rows, cols);
Matrix_Prices(indices) = cell2mat(historical_returns(:, 3));
Matrix_Returns(indices) = cell2mat(historical_returns(:, 4));

9 Comments

The code above returns the following error
Error using sub2ind (line 52)
Out of range subscript.
Sorry, I made a mistake in the predeclaration of the two output matrices. The number of rows should have been numel(historical_dates) instead of numel(historical_returns). I've now corrected my answer.
You will also get an Out of range subscript error if any of the date or ticker is not found, since in that case ismember returns 0, which is not a valid subscript. Your original code would also fail under these conditions.
The principle of the code is sound:
  • Use ismember to get the position of each date / ticker from historical_returns in historical_dates / tickers
  • Convert these positions (rows for date, columns for tickers) into linear indices with sub2ind
  • Use the linear indices to copy the prices / returns in one go into the output matrices.
I just noticed that the suggested method does not work if the tickers array includes duplicates. For example, if tickers = ['AAPL', 'AAPL'], historical_dates = ['1/2/2015' '1/5/2015'] and historical_returns = ['AAPL' '1/2/2014' '109.33' '-0.0095';'AAPL' '1/5/2014' '106.25' '-0.0282'], then the Matrix_Return using the suggested method will look like Matrix_Returns = [-0.0095 0; -0.0282 0] instead of [-0.0095 -0.0095;-0.0282 -0.0282]
Yes, it is assumed that there is no duplicates in either tickers or historical_dates. Why would you want them and how do you decide which one should receive which value anyway? Extending your example:
tickers = {'AAPL', 'AAPL'};
historical_dates = {'1/2/2015', '1/4/2015', '1/5/2015'};
historical_returns = {'1/2/2015' 'AAPL' 109.33 -0.0095
'1/4/2015' 'AAPL' 100.11 -0.0101
'1/5/2015' 'AAPL' 106.25 -0.0282};
How do you decide which row of historical_returns go into column 1 or column 2?
Note that you can use unique to remove duplicates:
tickers = unique(tickers);
The reason that there are duplicates is that the tickers are tagged, for example one trade might be 'AAPL' vs 'MSFT' ($50 each side) while another 'AAPL' vs 'IBM' ($30 each side). I already have the position matrix i.e.
Header = [ 'AAPL' 'MSFT' 'AAPL' 'IBM'
'trade1' 'trade1' 'trade2' 'trade2']
Position = [ 50 -50 -30 30
49 -51 -29.5 28 ]
Then I create the returns matrix
Matrix Return = [ 0.0000 + 0.00 0.0000 0.00
-0.0095 + 0.02 -0.0095 -0.02 ]
and estimate portfolio VaR, trade VaR etc. Note that,
tickers = Header(1,:);
This doesn't seem to be anything like what you wrote in your original question and unless the tag is also stored into historical_returns I still don't see how you decide which ticker to choose.
If the tag is also stored in historical_returns then you'd use ismember with the 'rows' options to perform the key lookup based on several columns:
[~, col] = ismember(lower(historical_returns(:, [tickercol tagcol]), lower(header'), 'rows');
sm
sm on 6 Jan 2015
Edited: sm on 7 Jan 2015
There is either a misunderstanding or I am missing something. Answering the question you asked before " How do you decide which row of historical_returns go into column 1 or column 2? ", notice that the historical_returns array does not include duplicates i.e. I store in the database only once 'AAPL' return per day. However, it might be the case that 'AAPL' has two entries in the position array for the reason explained above. Thus, when I try to construct the Matrix_Returns, the tickers might include duplicates ( tickers array include only ticker names and not the tags i.e. tickers = {'AAPL' 'MSFT' 'AAPL' 'IBM'}). Therefore, if for example 'AAPL' appears twice in the ticker array (column 1 and 3) then the Matrix_Returns column 1 and 3 should be identical since both reflect 'AAPL' return. That's what the original code does. To give you an idea about the end-goal, I have a position matrix (similar to Matrix_Returns) where I keep track of the position for each ticker (see Position matrix above). Finally, I multiply the Position matrix with the Matrix_Returns to estimate the portfolio's performance. By keeping non-unique entries it allows me to analyze particular trades (i.e. partial VaR etc), without any extra formatting. In order to clarify things further, assume that historical dates are unique, but tickers are not. As a result, if you check my original code, this implies that row will be 1x1 matrix, while column can be a 1xm matrix. The issue with your suggested code (which is noticeably faster) is that it does not allow duplicate values in tickers array.
Ok, I don't really see the point of duplicating columns in the output matrices, but if that's what you want:
[u_tickers, ~, tickerpos] = unique(tickers);
%same code as before, but using u_tickers:
Matrix_Prices = zeros(numel(historical_dates), numel(u_tickers));
Matrix_Returns = zeros(numel(historical_dates), numel(tickers));
[~, rows] = ismember(lower(historical_returns(:, 1)), lower(historical_dates));
[~, cols] = ismember(lower(historical_returns(:, 2)), lower(u_tickers));
indices = sub2ind(size(Matrix_Prices), rows, cols);
Matrix_Prices(indices) = cell2mat(historical_returns(:, 3));
Matrix_Returns(indices) = cell2mat(historical_returns(:, 4));
%now replicate columns according to tickerpos:
Matrix_Prices(:, 1:numel(tickers)) = Matrix_Prices(:, tickerpos);
Matrix_Returns(:, 1:numel(tickers)) = Matrix_Returns(:, tickerpos);
Thank you. The code above works great.

Sign in to comment.

More Answers (1)

This is a one-liner using the unstack function on a table. It would be something like
wideData = unstack(tallData,{'Price' 'Return'},'Ticker')
where tallData is a table with four variables, Ticker, Date, Price, and Return. Hope this helps.

Categories

Find more on Financial Toolbox in Help Center and File Exchange

Tags

Asked:

sm
on 6 Jan 2015

Commented:

sm
on 7 Jan 2015

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!