How to deal with a vector containing numbers and characters?

Having a vector of mixed numbers and characters like
M=['(1,3)',
'(5,4)',
'(9,7),(6,8)',
'(7,3)',
'(7,9),(2,4),(3,10)',
'(5,8)']
how can I compute the minimal set of numbers S such that each row of M has at least one number belonging to set S?
For the above simple example the answer is
S=[1 5 7]
However, I need a function that can compute the set S for M with any large size.

 Accepted Answer

"how can I compute the minimal set of numbers S such that each row of M has at least one number belonging to set S?"
You could try my function mincoverset (attached). For small problems, like the one in your question, it will finish and return the set/s. For untractably large problems, like your .mat file data, you should set the first argument to limit how it will run for. When I tried it for ten minutes it had found several 359-element minimum cover sets:
>> S = load('M.mat');
>> M = S.M;
>> C = cellfun(@(s)sscanf(s,'(%d,%d),'),M,'uni',0);
>> Z = mincoverset(10/(60*24), C{:});
>> size(Z)
ans =
6 359
"For the above simple example the answer is S=[1 5 7]"
Actually I found six minimal cover sets for your example data:
>> M = {'(1,3)','(5,4)','(9,7),(6,8)','(7,3)','(7,9),(2,4),(3,10)','(5,8)'};
>> C = cellfun(@(s)sscanf(s,'(%d,%d),'),M,'uni',0);
>> Z = mincoverset(Inf, C{:})
Z =
1 5 7
3 5 9
3 5 7
3 5 6
3 5 8
3 4 8
An alternative method would be to represent the data in conjunctive normal form, and then convert to a disjunctive normal form:

9 Comments

Hi Stephan, I faced with the below error after running your code.
Error using cellfun
Unknown option.
Error in mincoverset (line 35)
isc = all(cellfun('ischar',varargin));
Error in DetSet (line 390)
Z = mincoverset(C{:}) % run as long as you want, then ctrl+c
I'm using MATLAB R2017.a.
isc = all(cellfun('isclass',varargin,'cell'));
Now it works for the simple example. But for the .mat file I faced to the below error after pressing ctrl+c after 20 min.
Operation terminated by user during mincoverset/recurse (line 68)
In mincoverset/recurse (line 66)
recurse(X+1, V);
In mincoverset/recurse (line 68)
recurse(X+1, [V,varargin{X}(k)]);
.
.
.
.
.
.
.
In mincoverset/recurse (line 68)
recurse(X+1, [V,varargin{X}(k)]);
In mincoverset/recurse (line 68)
recurse(X+1, [V,varargin{X}(k)]);
In mincoverset/recurse (line 68)
recurse(X+1, [V,varargin{X}(k)]);
In mincoverset (line 45)
recurse(1, [])
@Hossein: when you terminate the code using ctrl+c it is as if an error occured at the point where the code was terminated. That means the error message is entirely expected. Follow the instructions in my answer to get the most recent best sets.
I am following the procedure. The error is still there. The function works for small size arrays (with around 20 elements). For example it found the solution when I modified M.mat so that it had only the first 17 elements and it found the solution in 12 seconds. But for a larger file it faces with error. It could not solve for a 50 elements M even after 20 minutes.
@Hossein: you have not thought about how complex this problem actually is. While the problem might be theoretically solvable, in practice all humans will have ceased to exist and this universe will have long since achieved some kind of stable low-energy state before you would have solved this problem using a brute-force method like my code uses.
Consider the example in your question, it has 384 combinations that could be checked, although some of them might be skipped by my function (because it does not check sets that are already longer than the current best solution, but this depends on the data!). This gives us an approximate measure of how fast this algorithm is:
>> M = {'(1,3)','(5,4)','(9,7),(6,8)','(7,3)','(7,9),(2,4),(3,10)','(5,8)'};
>> C = cellfun(@(s)sscanf(s,'(%d,%d),'),M,'uni',0);
>> N1 = prod(cellfun('prodofsize',C)) % max combinations
N1 = 384
>> tic, Z = mincoverset(Inf,C{:}), T1 = toc
Z =
1 5 7
3 5 9
3 5 7
3 5 6
3 5 8
3 4 8
T1 = 0.077000
So approximately 384 combinations in 0.08 seconds. Not bad. Now lets look at your .mat file problem:
>> S = load('M.mat');
>> M = S.M;
>> C = cellfun(@(s)sscanf(s,'(%d,%d),'),M,'uni',0);
>> N2 = prod(cellfun('prodofsize',C)) % max combinations
N2 = 4.1714e+170
Note that it has many more combinations than this entire universe has particles. How long would it take to process them all? We can get an approximation of this time, in seconds:
>> N2*T1/N1
ans = 8.3645e+166
Seconds are not very useful. How about in years?:
>> Y2 = N2*T1/N1/(60*60*24*365)
Y2 = 2.6524e+159
>> num2words(Y2, 'sigfig',1, 'type','money', 'unit','years')
ans = three duoquinquagintillion years
Well, that is probably the first and last time I will ever use the word "duoquinquagintillion" ! Note that even if my calculation is wrong by a few decillion, or even by a few centillion, the required time will not be anything like "20 minutes" as you wrote in your comment.
Even for the first fifty elements it will take a few years to calculate:
>> C = cellfun(@(s)sscanf(s,'(%d,%d),'),M(1:50),'uni',0);
>> N3 = prod(cellfun('prodofsize',C)) % max combinations
N3 = 3377699720527872
>> Y3 = N3*T1/N1/(60*60*24*365)
Y3 = 22313.76112
But you are right about the usability of the function: I have improved this: the first input now specifies the maximum number of days that you want to run the function for, e.g. for 10 minutes use this:
mincoverset(10/(60*24), ...)
It will then return the current best covering set/s. I have updated the question with the new function.
Thanks a lot Stephan. Really appreciate your help.
@Hossein: cellfun supports a few special syntaxes (which are quite fast), where prodofsize is equivalent to the product of the size of the input array, i.e.:
cellfun(@(x)prod(size(x)),...)
or equivalently:
cellfun(@numel,...)
"I don't know how the number of permutations has been obtained? For example, how for 6 elements in C, the number of permutations is 384?"
The number of combinations (not permutations) is simply how many elements are in each vector within C, all multiplied together:
>> V = cellfun(@numel,C) % equivalent to 'prodofsize'
V =
2 2 4 2 6 2
>> prod(V)
ans = 384
Another way to check this is to download Jos' excellent allcomb:
>> size(allcomb(C{:}),1)
ans = 384
We take the combinations because their order is irrelevant.
Thanks Stephan. If the elements in C be the generators of automorphisms of a graph, is there a way that we could compute the number of automorphisms obtained from C?
Also, is there a method to derive all automorphisms from the set of generators?

Sign in to comment.

More Answers (1)

a = {'(1,3),'
'(5,4),'
'(9,7),(6,8),'
'(7,3),'
'(7,9),(2,4),(3,10),'
'(5,8)' };
nums = cellfun(@str2double, regexp(a, '\d+', 'match'), 'uniform', 0);
nums will now be a cell array of numeric row vectors.
Finding the minimal set would come after that. I think perhaps you might be able to use the sorts of techniques used to find a covering set
For example if you were to take nums and form the matrix
M = [
1 0 1 0 0 0 0 0 0 0
0 0 0 1 1 0 0 0 0 0
0 0 0 0 0 1 1 1 1 0
0 0 1 0 0 0 1 0 0 0
0 1 1 1 0 0 1 0 1 1
0 0 0 0 1 0 0 1 0 0
]
then the question would be to find the binary x with the lowest sum such that M * x' >= 1 for each row
This could probably be found with integer programming.

16 Comments

Unfortunately, I had a mistake typing my question. It is edited now. The matrix 'a' is the same as M.
Please see the question again.
M = {'(1,3),'
'(5,4),'
'(9,7),(6,8),'
'(7,3),'
'(7,9),(2,4),(3,10),'
'(5,8)' };
nums = cellfun(@str2double, regexp(M, '\d+', 'match'), 'uniform', 0);
The M you posted is not valid MATLAB.
Perhaps your M is a single character vector that has embedded newline characters?
I edited M as well like the M you proposed.
Please attach a sample data as a .mat file.
My real data has obtained from another software. It has 788 rows. The first rows are
(985,1019),
(982,1016),(983,1017),(984,1018),
(981,1015),
(980,1014),
(977,1011),(978,1012),(979,1013),
(976,1010),
(973,1007),(974,1008),(975,1009),
(972,1006),
(969,1003),(970,1004),(971,1005),
(968,1002),
(967,1001),
I don't know how to save it in a .mat file.
Use the save command to save it as a .mat . For example,
save 415738.mat M
The regexp() I posted works fine with your M.
>> celldisp(nums(1:20))
ans{1} =
1174 1175
ans{2} =
1149 1162
ans{3} =
1148 1149
ans{4} =
1147 1148
ans{5} =
1145 1146
ans{6} =
1138 1140
ans{7} =
1137 1160
ans{8} =
1134 1136
ans{9} =
1130 1131
ans{10} =
1130 1134 1131 1136 1132 1135
ans{11} =
1128 1129
ans{12} =
1127 1133
ans{13} =
1109 1114
ans{14} =
1108 1109
ans{15} =
1107 1108
ans{16} =
1106 1107
ans{17} =
1104 1105
ans{18} =
1092 1106
ans{19} =
1091 1092
ans{20} =
1090 1093
Could you please help me how to find the set S from the .mat file that I attached?
I think you can frame this as a Vertex Cover problem. If I am correct, then use the routine grMinVerCover() from https://www.mathworks.com/matlabcentral/fileexchange/4266-grtheory-graph-theory-toolbox
Vertex cover will lead to a very conservative solution than what I am looking for.
You asked for the minimal set, and solving the Vertex Cover would give you the minimal set. Either the set is one of the minimal ones or it is not, so if it calculates the solution correctly than it cannot be too conservative.
The vertex cover for a graph with edges like
{(1,2),
(2,3),
(1,3),
(2,4),
(2,5),
(2,6)}
will lead to S=[2,3] but for the same data set with the below setup
{(1,2),
(2,3),
(2,4),
(2,5),(1,3),
(2,6)}
we have S=[2].
Anyway, it might give a clue mentioning that my original problem is "finding the determining set or fixing set of a graph", and the data set M.mat is the generators of the automorphism group of the graph.
You are not setting up the right vertex cover problem.
Create one vertex for each unique value in the list of values. Then, for each row in the M table, create a vertex for it and create an edge from that vertex to each of the nodes mentioned in the row, independent of order. Now run the vertex cover algorithm.
Thanks for your time but I am not getting your explanations.
+----------------------------------------+
| |
(1)------------------------[1, 2] |---[1,3]
| |
---(2)-------------------------+ +--[2, 3] |
| | | | |
| |---------------------------------+ | |
| | |
| (3)------------------------------------+ |
| | |
| +--------------------------------------------+
+----------------------------------------------------[2,4]
|
(4)-----------------------------------------------+
and so on.
Each node in your table gets connected to each of the nodes in the list of numbers for that node.
Now run the Vertex Cover algorithm. Or the Edge Cover algorithm, whichever turns out to be more appropriate.

Sign in to comment.

Tags

Asked:

on 22 Aug 2018

Commented:

on 28 Aug 2018

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!