Error using kmeans ---X must have more rows than the number of clusters.

I have this data:
% by layer
Sw_MF16_Tf=LeArquivo('Water Saturation Time 2020-05-31.txt',81,58,20,2);
Sw_MF16_Ti=LeArquivo('Water Saturation Time 2013-05-31.txt',81,58,20,2);
% layer variation
delta_MF16_l8 = Sw_MF16_Tf(:,:,8)-Sw_MF16_Ti(:,:,8);
delta_MF16_l9 = Sw_MF16_Tf(:,:,9)-Sw_MF16_Ti(:,:,9);
% mean of L8 and L9
MF16_L8_L9_mean = (delta_MF16_l8 + delta_MF16_l9)/2;
% normalized mean dSw layer
M_MF16 = mean(MF16_L8_L9_mean,'omitnan');
S_MF16 = std(MF16_L8_L9_mean,'omitnan');
N_MF16 =(((MF16_L8_L9_mean-M_MF16)./S_MF16)./100);
% data read
data = N_MF16 ;
% Perform clustering
k = 9;
[idx, centroids] = kmeans(data, k);
% Plot clustered data
figure;
scatter(data(:,1), data(:,2), [], idx, 'filled');
title(sprintf('K-Means Clustering with k = %d', k));
xlabel('Feature 1');
ylabel('Feature 2');
colormap(parula(k));
colorbar;
% Plot centroids
hold on;
scatter(centroids(:,1), centroids(:,2), 100, 'k', 'filled');
legend('Data (colored by cluster)', 'Centroids');
How do I cluster this by region?

Answers (1)

dpb on 9 Mar 2023
Wow! That's hard to read with all the obfuscated_with_underscores_and_suffixes variable names! Simplify, simplify!!
Anyway, stylistic points aside, in
...
% layer variation
delta_MF16_l8 = Sw_MF16_Tf(:,:,8)-Sw_MF16_Ti(:,:,8);
delta_MF16_l9 = Sw_MF16_Tf(:,:,9)-Sw_MF16_Ti(:,:,9);
MF16_L8_L9_mean = (delta_MF16_l8 + delta_MF16_l9)/2;
you've reduced the data to a single plane: the mean of the differences of only two layers (8 and 9). Then, going on
% normalized mean dSw layer
M_MF16 = mean(MF16_L8_L9_mean,'omitnan');
S_MF16 = std(MF16_L8_L9_mean,'omitnan');
N_MF16 =(((MF16_L8_L9_mean-M_MF16)./S_MF16)./100);
% data read
data = N_MF16 ;
you've reduced it further to a single vector of the means of each column, which is a row vector.
% Perform clustering
k = 9;
[idx, centroids] = kmeans(data, k);
...
you've reduced data down to a single row by using the mean everywhere. kmeans treats a vector input as a column vector whichever orientation is passed, so the conclusion must be that your dataset has fewer than 9 columns. Looking at the input file, that appears to be true: there are only six (6) columns.
As for what it might mean to take the means along the height of the array instead of by column, there's no way of knowing: we have no idea what the data actually are, or whether those would be meaningful effects/variables.
Since you didn't provide the function LeArquivo, nobody here could even try to poke around and see what they might make of the data -- the extreme presence of missing data would appear to be troubling.
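A quick way to check what kmeans actually receives, offered as a sketch and not as a diagnosis of the actual file: kmeans drops every row of its input that contains at least one NaN, so the number of usable rows can be far smaller than size(data,1). Also, in releases with implicit expansion (R2016b and later), subtracting a row vector from a matrix returns an array the size of the matrix, so it is worth inspecting size(data) directly. The matrix below is a made-up stand-in for data, not the poster's values.

```matlab
% Hypothetical stand-in for the poster's 'data' (81-by-58 with missing cells).
rng(0);                                  % reproducible placeholder values
data = randn(81, 58);
data(rand(81, 58) < 0.10) = NaN;         % sprinkle NaNs over about 10% of cells

k = 9;
fprintf('size(data) = %d-by-%d\n', size(data, 1), size(data, 2));

% kmeans ignores any row that contains at least one NaN,
% so count the rows that would survive.
completeRows = ~any(isnan(data), 2);
nUsable = nnz(completeRows);
fprintf('rows without NaN: %d, clusters requested: %d\n', nUsable, k);

if nUsable < k
    warning('Fewer complete rows (%d) than clusters (%d).', nUsable, k);
end
```

If the count of complete rows comes out below k on the real data, the missing values, not the cluster count alone, would be the first thing to address.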
  3 Comments
dpb on 9 Mar 2023
As noted above, there aren't enough variables left to form that many groups, at least as you've structured the problem.
What's the variable that you think would segregate the data into such regions? There's certainly nothing in the other image that would indicate any reason to do so; there are a few isolated splotches of different heights (of whatever it is that is being plotted), but certainly no pattern that looks even remotely like your second image.
Luã Monteiro on 9 Mar 2023
This is a water saturation map from an oil field that extends over several kilometers. The idea of clustering is to guide better well placement, since it is impossible to explore a large oil field with only a few wells. Small patterns in geology and reservoir engineering make a huge difference in well placement.
Here is an example of the water saturation data in .xls; it would be great if I could cluster it by region.
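One possible reading of "cluster by region", sketched under stated assumptions: treat every grid cell as one observation whose features are its (row, column) position and its saturation change, drop the cells with no data, and run kmeans on that tall matrix. The map below is synthetic, the names dSw and w are hypothetical, and the weight on the value column is an arbitrary choice that would need tuning on the real field. kmeans and zscore require the Statistics and Machine Learning Toolbox.

```matlab
% Synthetic placeholder for an 81-by-58 delta-Sw map (NOT the field data).
rng(1);
[colIdx, rowIdx] = meshgrid(1:58, 1:81);
dSw = 0.1*sin(rowIdx/10) + 0.1*cos(colIdx/8) + 0.02*randn(81, 58);
dSw(rand(81, 58) < 0.3) = NaN;            % inactive / missing cells

% One observation per valid cell: [row, column, value]
valid = ~isnan(dSw);
X = [rowIdx(valid), colIdx(valid), dSw(valid)];

% Put the features on comparable scales; w is a hypothetical weight
% controlling how much the saturation change matters versus position.
w = 2;
Xs = zscore(X);
Xs(:, 3) = w * Xs(:, 3);

k = 9;
idx = kmeans(Xs, k, 'Replicates', 5);

% Map the labels back onto the grid; cells without data stay NaN
labels = nan(size(dSw));
labels(valid) = idx;

figure;
h = imagesc(labels);
set(h, 'AlphaData', valid);               % hide cells without data
axis image;
colormap(parula(k));
colorbar;
title(sprintf('Regions from k-means, k = %d', k));
```

Raising w pushes the clusters to follow the saturation values more closely, while lowering it yields more compact spatial regions. Because each valid cell is a row, the input has thousands of rows and k = 9 is no longer a problem.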
