Why does the PageRank algorithm return the same scores for all sentences?

Hi Everyone
I am just doing this code for fun, as I am interested in automatic text generation and text summarization. I tried to implement the TextRank algorithm, following the tutorial at https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/
I read in a PDF file using
[file,path] = uigetfile('*.pdf');
I then do some minor cleaning of the sentences
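str = lower(extractFileText(fullfile(path,file)));
sentences = splitSentences(str);
% '\n' in single quotes is a literal backslash followed by n, so use newline to match real line breaks
sentences = strrep(sentences,"-" + newline,"");   % rejoin words hyphenated across a line break
sentences = strrep(sentences,newline," ");        % a space keeps the words on either side separate
sentences = strrep(sentences,' - ','');
sentences = strrep(sentences,'-','');
nw = 6;                        % keep only sentences with more than nw words
l = false(size(sentences));    % preallocate the logical mask
for i = 1:length(sentences)
    s = strsplit(sentences{i},' ');
    l(i) = length(s) > nw;
end
sentences = sentences(l);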
I then tokenize the sentences
documents = tokenizedDocument(sentences);
documents = addPartOfSpeechDetails(documents);
documents = removeStopWords(documents);
documents = normalizeWords(documents,'Style','lemma');
And then use a word embedding
emb = fastTextWordEmbedding;
sequences = doc2sequence(emb,documents,'PaddingDirection','none');
for i = 1:length(sequences)
    w(:,i) = mean(sequences{i},2);   % average the word vectors of sentence i
end
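I then compute the cosine similarity between sentences
n = size(w,2);   % number of sentences; length(w) returns the larger dimension of w
sim = zeros(n,n,'like',w);
for i = 1:n
    for j = 1:n
        if i ~= j
            sim(i,j) = sum(w(:,i).*w(:,j))./sqrt(sum(w(:,i).^2))./sqrt(sum(w(:,j).^2));
        end
    end
end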
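The pairwise loop above can also be written without loops. The following is only a minimal sketch on a made-up toy matrix standing in for w (one column per sentence); the name wn is introduced here for illustration and is not part of the original code.

```matlab
% Toy stand-in for w: 3-dimensional vectors for 4 sentences (made-up numbers)
w = [1 0 1 2;
     0 1 1 0;
     2 1 0 1];

n   = size(w,2);        % number of sentences (columns of w)
wn  = w ./ vecnorm(w);  % scale every column to unit Euclidean length
sim = wn' * wn;         % cosine similarity between every pair of columns
sim(1:n+1:end) = 0;     % zero the diagonal so digraph(sim) creates no self-loops
```

Setting the diagonal to zero mirrors the `i ~= j` check in the loop version, since digraph only creates edges for nonzero entries.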
Up to this point, I think everything is working. The first 5 rows and columns of the matrix sim look like this:
ans =
5×5 single matrix
0 0.8294 0.8986 0.8466 0.8617
0.8294 0 0.8337 0.6841 0.8196
0.8986 0.8337 0 0.7743 0.8096
0.8466 0.6841 0.7743 0 0.7050
0.8617 0.8196 0.8096 0.7050 0
Next, I build a directed graph from sim:
G = digraph(sim);
Finally, I run it through the PageRank algorithm:
pr = centrality(G,'pagerank','FollowProbability',0.85,'MaxIterations',200);
However, when I inspect pr, I get all the same values:
0.0016
0.0016
0.0016
0.0016
0.0016
0.0016
0.0016
0.0016
0.0016
0.0016
I am guessing that I do not understand well enough what happens when I convert my matrix to a graph (digraph), or that I do not understand the inputs to centrality. I have researched what I could be doing wrong, but unfortunately there seem to be few MATLAB implementations of TextRank. Any help would be greatly appreciated.
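A likely explanation, offered as a sketch rather than a verified fix for this data: digraph(sim) stores the similarities as edge weights, but centrality does not use them for 'pagerank' unless they are passed through the 'Importance' option. With every pair of sentences connected and all edges counted equally, each node receives the same score. The example below reuses the 5×5 block of sim printed above as a toy input; the names prUnweighted and prWeighted are made up for illustration.

```matlab
% Toy input: the 5-by-5 block of sim printed earlier in this post
sim = [0      0.8294 0.8986 0.8466 0.8617
       0.8294 0      0.8337 0.6841 0.8196
       0.8986 0.8337 0      0.7743 0.8096
       0.8466 0.6841 0.7743 0      0.7050
       0.8617 0.8196 0.8096 0.7050 0     ];
G = digraph(sim);   % nonzero entries become weighted edges

% Without 'Importance' every edge counts the same, so a fully
% connected graph gives every node an equal score
prUnweighted = centrality(G,'pagerank','FollowProbability',0.85);

% Passing the edge weights as 'Importance' lets the similarities
% influence the ranking
prWeighted = centrality(G,'pagerank','FollowProbability',0.85, ...
    'Importance',G.Edges.Weight);

disp([prUnweighted prWeighted])   % compare the two columns side by side
```

On the full matrix the same option would be added to the existing centrality call, keeping 'MaxIterations' as it is.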
