Code covered by the BSD License  

Highlights from
Pivot/unPivot

image thumbnail

Pivot/unPivot

by

 

16 Dec 2009 (Updated )

Group a flat dataset into pivot table and unpivot it back

PIVOT AND UNPIVOT: a user's guide

PIVOT AND UNPIVOT: a user's guide

The scope of these two functions is to organize a flat dataset into a PIVOT table (and eventually undo it). This guide will illustrate how to use them.

Contents

CONCEPT: flat dataset

A flat dataset usually contains a column of values and "n" columns of features. So, each row hs its value with its features.

1. A flat dataset with two features (Region and Year)

load Demodata.mat
display([{'Year','Region','People'}; Demodata.ex1])
    'Year'    'Region'    'People'
    [2004]    'USA'       [ 38889]
    [2002]    'EUROPE'    [ 45475]
    [2008]    'USA'       [ 24669]
    [2007]    'EUROPE'    [ 78443]
    [2008]    'USA'       [ 88284]
    [2002]    'EUROPE'    [ 91372]
    [2006]    'USA'       [ 55829]
    [2006]    'EUROPE'    [ 59887]
    [2001]    'USA'       [ 14888]
    [2004]    'EUROPE'    [ 89972]

CONCEPT: pivot table

A pivot table as intended by the T-SQL command pivot or as implemented by spreadsheet softwares is a data visualization tool which groups a flat dataset by two chosen dimension.

2. Format Demodata.ex1 into a pivot table

warning off warnPivot:funGroup      % we'll come back later on the warning
display(Pivot(Demodata.ex1))
    [   NaN]    [ 2001]    [  2002]    [ 2004]    [ 2006]    [ 2007]    [  2008]
    'EUROPE'    [  NaN]    [136847]    [89972]    [59887]    [78443]    [   NaN]
    'USA'       [14888]    [   NaN]    [38889]    [55829]    [  NaN]    [112953]

PIVOT input constraints

$$Pivot(In, Fun, noHeaders, Pad)$$

Not every type of input is supported by PIVOT. Only a flat dataset with max three columns is supported. The first column should be the grouping dimension which will appear as the column header and the second column the dimension that will appear as the row header. The last column should contain the values that will be "intersected" and grouped.

Here are listed all the constraints to the inputs (please mail me if something is left out):

  • IN should have 3 columns
  • IN can be numeric or a cell array of scalars or strings if headers (no mixing)
  • Nested cells are not allowed
  • NaNs or empty cells in the first two columns of IN are not allowed
  • If IN is a cell the column with the values must be a cell array of scalars
  • FUN must be a function handle
  • FUN applied to a vector must return a scalar
try Pivot([1,3;2,4]); catch; err = lasterror; display(err.message); end %#ok
try Pivot('try'); catch; err = lasterror; display(err.message); end %#ok
try Pivot({'1999',1,2;1999,1,2}); catch; err = lasterror; display(err.message); end %#ok
try Pivot({1,2,3;1,{2},3}); catch; err = lasterror; display(err.message); end %#ok
try Pivot({NaN,1,9;1,3,4}); catch; err = lasterror; display(err.message); end %#ok
try Pivot([1,NaN,9;1,3,4]); catch; err = lasterror; display(err.message); end %#ok
try Pivot({[],23,9;1,3,4}); catch; err = lasterror; display(err.message); end %#ok
try Pivot({23,[],9;1,3,4}); catch; err = lasterror; display(err.message); end %#ok
try Pivot({23,1,9;1,3,'9'}); catch; err = lasterror; display(err.message); end %#ok
try Pivot(Demodata.ex1(2:end,:),'sum'); catch; err = lasterror; display(err.message); end %#ok
try Pivot(Demodata.ex1(2:end,:),@diff); catch; err = lasterror; display(err.message); end %#ok
Error using ==> Pivot at 60
IN must be a n by 3 matrix
Error using ==> Pivot at 80
IN can be numeric or a cell array of strings or numbers
Error using ==> Pivot at 74
IN can be numeric or a cell array of strings or numbers
Error using ==> Pivot at 94
IN can be numeric or a cell array of strings or numbers
Error using ==> Pivot at 73
Column 1, NaNs not allowed
Error using ==> Pivot at 99
Column 2, NaNs not allowed
Error using ==> Pivot at 67
Column 1, empty cell not allowed
Error using ==> Pivot at 87
Column 2, empty cell not allowed
Error using ==> Pivot at 108
Column 3 should be a cell array of scalars
Error using ==> Pivot at 120
FUN invalid format
Error using ==> Pivot at 160
DIFF is not supported for grouping operations.

PIVOT examples

How to use this function:

1. IN numeric or cellarray of scalars. No grouping involved. Padded by default with NaN

disp(Pivot(Demodata.ex2))
disp(Pivot(num2cell(Demodata.ex2))) % OUT will be automatically converted to numeric
         NaN        2002        2006        2007        2008
           1        3174         NaN         NaN         NaN
           3         NaN         NaN        8141         NaN
           4        8196         NaN         NaN         NaN
           5         NaN        8366         NaN        3032

         NaN        2002        2006        2007        2008
           1        3174         NaN         NaN         NaN
           3         NaN         NaN        8141         NaN
           4        8196         NaN         NaN         NaN
           5         NaN        8366         NaN        3032

2. Grouping involved using default FUN (@sum) or @min

disp(Pivot(Demodata.ex3))
disp(Pivot(Demodata.ex3,@min))
         NaN           1           2           3           4
        1999         -11          -4         -21         NaN
        2000          -9         -15         -49         NaN
        2001          16          -2          29         NaN
        2010         NaN         NaN         NaN           1

         NaN           1           2           3           4
        1999         -15          -3         -21         NaN
        2000         -13         -13         -25         NaN
        2001          -1          -4          -7         NaN
        2010         NaN         NaN         NaN           1

3. OUT without headers

disp(Pivot(Demodata.ex3,[],true))
   -11    -4   -21   NaN
    -9   -15   -49   NaN
    16    -2    29   NaN
   NaN   NaN   NaN     1

4. Pad with zeros or with '#' and no headers (OUT will be cell).

disp(Pivot(Demodata.ex3,[],false,0))
disp(Pivot(Demodata.ex3,[],1,'#'))
         NaN           1           2           3           4
        1999         -11          -4         -21           0
        2000          -9         -15         -49           0
        2001          16          -2          29           0
        2010           0           0           0           1

    [-11]    [ -4]    [-21]    '#'
    [ -9]    [-15]    [-49]    '#'
    [ 16]    [ -2]    [ 29]    '#'
    '#'      '#'      '#'      [1]

5. The same syntaxes could be used with IN headers as cellarrays of strings

disp(Pivot(Demodata.ex4))
disp(Pivot(Demodata.ex4,@max))
disp(Pivot(Demodata.ex4,@max,1))
disp(Pivot(Demodata.ex4,@max,1,'???'))
    [ NaN]    'Group1'    'Group2'    'Group3'    'Group4'
    [1999]    [   -11]    [    -4]    [   -21]    [   NaN]
    [2000]    [    -9]    [   -15]    [   -49]    [   NaN]
    [2001]    [    16]    [    -2]    [    29]    [   NaN]
    [2010]    [   NaN]    [   NaN]    [   NaN]    [     1]

    [ NaN]    'Group1'    'Group2'    'Group3'    'Group4'
    [1999]    [    20]    [    -1]    [   -21]    [   NaN]
    [2000]    [     7]    [     6]    [    -4]    [   NaN]
    [2001]    [    16]    [     2]    [    26]    [   NaN]
    [2010]    [   NaN]    [   NaN]    [   NaN]    [     1]

    [ 20]    [ -1]    [-21]    [NaN]
    [  7]    [  6]    [ -4]    [NaN]
    [ 16]    [  2]    [ 26]    [NaN]
    [NaN]    [NaN]    [NaN]    [  1]

    [ 20]    [ -1]    [-21]    '???'
    [  7]    [  6]    [ -4]    '???'
    [ 16]    [  2]    [ 26]    '???'
    '???'    '???'    '???'    [  1]

6. Full output

[Out,colHeader,rowHeader,Settings] = Pivot(Demodata.ex4,@max,1,'???');
disp([{'colHeader','rowHeader'};[colHeader,rowHeader]])
disp('    Settings')
disp(Settings)
    'colHeader'    'rowHeader'
    'Group1'       [     1999]
    'Group2'       [     2000]
    'Group3'       [     2001]
    'Group4'       [     2010]

    Settings
    Aggregation_function: @max
            With_headers: 0
             Padded_with: '???'

PIVOT warning

We set warning off warnPivot:funGroup before. This was meant to avoid the following message

disp(lastwarn)
SUM is being used as grouping function

being thrown each time PIVOT is applied to a dataset which has multiple values on the same intersection between the two grouping dimensions. The warning is meant to remember the user that some values are being grouped

UNPIVOT input constraints

$$unPivot(In, dim, rmPad)$$

Only pivot tables are accepted as inputs. The first elements of IN should be NaN. The first row and the first column (except the IN(1) element) are the headers/features that will define the values in IN(2:end,2:end).

Here are listed all the constraints to the inputs (please mail me if something is left out):

  • IN should be a pivot table
  • IN(1) must be NaN
  • Nested cells are not allowed
  • DIM can be 1 or 2 or empty
  • RMPAD can be a char or a scalar
try unPivot([1,3]); catch; err = lasterror; display(err.message); end %#ok
try unPivot({'G1',1999;'G2',24}); catch; err = lasterror; display(err.message); end %#ok
try unPivot({NaN,1999;'G2',{24}}); catch; err = lasterror; display(err.message); end %#ok
try unPivot(Demodata.ex5,3); catch; err = lasterror; display(err.message); end %#ok
try unPivot(Demodata.ex5,'1'); catch; err = lasterror; display(err.message); end %#ok
try unPivot(Demodata.ex5,[],{23}); catch; err = lasterror; display(err.message); end %#ok
Error using ==> unPivot at 40
IN is not a pivot table
Error using ==> unPivot at 42
IN first element must be NaN
Error using ==> unPivot at 45
IN nested cells not admitted
Error using ==> unPivot at 56
DIM must be 1 or 2
Error using ==> unPivot at 58
DIM invalid format
Error using ==> unPivot at 65
RMPAD invalid format

UNPIVOT examples

How to use this function:

1. Simplest case

IN numeric or cellarray of scalars. Removes by default the NaNs. Sorts by default according the column headers (second column in the unPivoted table)

A = unPivot(Demodata.ex5); disp(A)
B = unPivot(Demodata.ex5,[]);
C = unPivot(Demodata.ex5,1);
if isequal(A,B,C); fprintf('A, B and C are equivalent'); else error('something''s wrong'); end
           1        2002        3174
           4        2002        8196
           5        2006        8366
           3        2007        8141
           5        2008        3032

A, B and C are equivalent

2. Sort by row header (first column in the unPivoted table)

disp(unPivot(Demodata.ex5,2))
           1        2002        3174
           3        2007        8141
           4        2002        8196
           5        2006        8366
           5        2008        3032

3. Remove padded value other than NaN

disp(unPivot(Demodata.ex5,1,3174))
           4        2002        8196
           5        2006        8366
           3        2007        8141
           5        2008        3032

4. Same syntaxes can be applied to a cell array. Empty cells removed by default.

disp(unPivot(Demodata.ex6))
disp(unPivot(Demodata.ex6,2))
disp(unPivot(Demodata.ex6,[],20))
disp(unPivot(Demodata.ex6,2,'???'))
    [1999]    'Group1'    [ 20]
    [2000]    'Group1'    [  7]
    [2001]    'Group1'    [ 16]
    [2010]    'Group1'    '???'
    [1999]    'Group2'    [ -1]
    [2000]    'Group2'    [  6]
    [2010]    'Group2'    '???'

    [1999]    'Group1'    [ 20]
    [1999]    'Group2'    [ -1]
    [2000]    'Group1'    [  7]
    [2000]    'Group2'    [  6]
    [2001]    'Group1'    [ 16]
    [2010]    'Group1'    '???'
    [2010]    'Group2'    '???'

    [2000]    'Group1'    [  7]
    [2001]    'Group1'    [ 16]
    [2010]    'Group1'    '???'
    [1999]    'Group2'    [ -1]
    [2000]    'Group2'    [  6]
    [2010]    'Group2'    '???'

    [1999]    'Group1'    [20]
    [1999]    'Group2'    [-1]
    [2000]    'Group1'    [ 7]
    [2000]    'Group2'    [ 6]
    [2001]    'Group1'    [16]

Performance issues

Lets test the performances of PIVOT and UNPIVOT.

First we create our Test input (which isn't included in Demodata since its about 35 MB)

    Nums1 = ceil(rand(100,1)*100);
    Nums2 = num2str(ceil(rand(100,1)*100));
    Test = [cellstr(strcat('Year', num2str(Nums1+1900))),...
            cellstr(strcat('Group', Nums2)),...
            num2cell(Nums1)];
    Test = repmat(Test,1000,1);

Then using the profile...

    profile on
    Testpivoted = Pivot(Test,[],[],' ');
    Testback = unPivot(Testpivoted,1,' ');
    profile off
    stats = profile('info');
    stats = [{stats.FunctionTable.FunctionName}',{stats.FunctionTable.TotalTime}'];
    disp(stats([strmatch('Pivot',stats(:,1)); strmatch('unPivot',stats(:,1))],:))
    'Pivot'                  [8.0240]
    'Pivot>cell2floatmod'    [6.6298]
    'unPivot'                [0.0631]

While UNPIVOT is fast enough PIVOT is slowed down by cell2mat like behaviour. Here is implemented cell2floatmod which is a compact version of the cell2float by Jos(10584)

Also notice that the Test that was originally created is different from the unpivoted version of the same Test because of the grouping.

    fprintf('Are they equal: ''%d''', isequalwithequalnans(Test, Testback))
Are they equal: '0'

Other Pivot-like functions

Statistics Toolbox owners:

pivoting could be accomplished on a dataset obj by the functions stack and unstack. These functions use a similar engine approach to Pivot and unPivot and the performances lie on the same scale but the syntax is different.

FEX submission 20963-reshape-a-matrix by Dimitri Shvorob (ID:17777):

The functions are called wide2tall and tall2wide and are probably the foundations for the statistics TB functions because of the syntax and the names. Has drawbacks on performance.

Contact us