Splitting data array into sub arrays

Hi, I want to split the data of an array by the first 2 numbers of each data. This is to split firms by the first 2 digit of their SIC codes.
Example: 3301 would be in the sub array 33 which would represent a sector.
Any suggestions would be helpful.
Regards,
Frank

2 Comments

Would need more than just one sample number to know the format...if it's always the first two digits of a variable-length number/string, then converting to character and extracting the first two characters is probably as simple a solution as any.
If it's always a four-digit number like the example, then
mod(v,100)
would work, but wouldn't if were a five- or three-digit number.
Need a complete definition of the input patterns possible.
Thanks for the help.
The format are four digits numbers that needs to be classified by the first two numbers.
Example : 3301, 4502, 3306, 4602, 4510...
Both numbers that starts with 33 and 45 would be classified into the same sub array so I would then be able to easily compute the median of each sub array.
Speaking of scale there's about 100 sub array possible on about 20 000 firms.
Regards,
Frank

Sign in to comment.

 Accepted Answer

>> SIC=[3301, 4502, 3306, 4602, 4510].';
>> splitapply(@median,SIC,findgroups(fix(SIC/100)))
ans =
3303.50
4506.00
4602.00
>>
NB: The use of mod above was in error, dunno how I came up with that, but as long as they're all four-digit codes, the above should be about as easy as it gets.

7 Comments

Thanks again for the help.
But in fact, those numbers which are firm's SIC code are associate to other numbers which are those that I need to compute the median.
Ex: 3301 12 32 21 92
4502 32 45 32 65
So like that, I need to regroup the firms that have the same first 2 numbers and then compute the median for few key stats.
Regards,
dpb
dpb on 21 Jan 2021
Edited: dpb on 21 Jan 2021
I thought it peculiar to use the SID as median, but...that's all we had to work on! :)
Same code, just use the variable of interest as the array argument in place of SIC; the grouping variable is still the same.
>> SIC=[3301 12 32 21 92
4502 32 45 32 65];
>> splitapply(@median,SIC(:,2:end),findgroups(fix(SIC(:,1)/100)))
ans =
26.50
38.50
>> median(SIC(1,2:end)) % to double-check...
ans =
26.50
>> d
Or, put it all into a table and use rowfun and you'll get a new, neat table that has everything all put together for you.
>> tSIC=table(SIC(:,1),SIC(:,2:end),'VariableNames',{'SIC','Data'})
tSIC =
2×2 table
SIC Data
_______ ________________________________
3301.00 12.00 32.00 21.00 92.00
4502.00 32.00 45.00 32.00 65.00
>> rowfun(@median,tSIC,'GroupingVariables','SIC', ...
'InputVariables','Data', ...
'OutputVariableNames','Median')
ans =
2×3 table
SIC GroupCount Median
_______ __________ ______
3301.00 1.00 26.50
4502.00 1.00 38.50
>>
ADDENDUM:
If there are multiple lines of each code then you'll have to modifiy the function a little to compute the median over all elements; otherwise by default median will operate over columns.
rowfun(@(v)median(v,'all'),tSIC,'GroupingVariables','SIC', ...
'InputVariables','Data', ...
'OutputVariableNames','Median')
where we've used an anonymous function so can specify the optional 'all' parameter to median
Thanks.
Is there a reason why it works with 2 row but when adding one more (in facts I need to add a lot more) it won't work.
Best regards,
"Is there a reason why..."
I'm sure there is, but w/o even what error it is you're getting and no example data or code, until the MindReading Toolbox is finally released, it'll be hard to say what that reason might be!
I often complain that the crystal ball is back in the shop for repair again, too.... :)
Yeah I agree!
I found the error and it worked.
The only thing I wish I could do would be to group every SIC code by forming groups that are represented by the first 2 digit of the code.
Example:
SIC = [3301, 4502, 3306, 4602, 4510];
This would be 3 groups that are : 33, 45 and 46. Then at this point, I would be able to compute the median of each industry much more easily.
Best regards,
Frank
I showed you how to do that -- use fix(SIC/100) as the grouping variable.
However, that said, I see that I forgot to include that in the previous code and just used SIC.
The way things work in identifying the grouping variable, you can't pass a calculation in rowfun so you need to create the variable for the purpose.
>> tSIC.Industry=fix(tSIC.SIC/100); % define industry code for grouping
>> tSIC=tSIC(:,[1 end 2]); % rearrange table for convenience
>> tSIC % show resulting table
tSIC =
2×3 table
SIC Industry Data
____ ________ ____________________
3301 33 12 32 21 92
4502 45 32 45 32 65
>> rowfun(@median,tSIC,'GroupingVariables','Industry', ...
'InputVariables','Data', ...
'OutputVariableNames','Median')
ans =
2×3 table
Industry GroupCount Median
________ __________ ______
33 1 26.5
45 1 38.5
>>
As noted above, will need to modify to handle multiple cases of the same ID by using the anonymous function and the 'all' parameter to compute overall group median.
Awesome!
I forgot change the 'SIC' to 'Industry' but finally figured it out.
Best regards,
Frank

Sign in to comment.

More Answers (0)

Products

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!