MATLAB Answers

Why does the PageRank algorithm return the same score for every sentence?

Asked by Alexander Dumont on 23 Jan 2019
Hi Everyone
I am just doing this code for fun, and I am interested in automatic text generation and text summarization. I tried to implement the TextRank algorithm, following the website https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/
I read in a PDF file using
[file,path] = uigetfile('*.pdf');
I then do some minor cleaning of the sentences
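str = lower(extractFileText([path file]));
sentences = splitSentences(str);
% '\n' in single quotes is a literal backslash followed by n, so use newline to match real line breaks
sentences = strrep(sentences,['-' newline],'');   % rejoin words hyphenated across a line break
sentences = strrep(sentences,newline,' ');        % turn remaining line breaks into spaces
sentences = strrep(sentences,' - ','');
sentences = strrep(sentences,'-','');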
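nw = 6;                                  % keep only sentences with more than nw words
l = false(size(sentences));              % preallocate so the mask always matches sentences
for i = 1:numel(sentences)
    s = strsplit(sentences{i},' ');
    if numel(s) > nw
        l(i) = true;
    end
end
sentences = sentences(l);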
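One detail in the cleaning step is worth checking. In MATLAB, '\n' inside single quotes is the two literal characters backslash and n, not a line break, so strrep only matches it if the text really contains a backslash. A minimal sketch, assuming the extracted text uses newline (char(10)) line breaks. The sample string is made up for illustration:

```matlab
% Hypothetical sample: a word hyphenated across a line break
s = ['auto-' newline 'matic text'];

a = strrep(s, '-\n', '');            % pattern is '-', '\', 'n': no match, s is unchanged
b = strrep(s, ['-' newline], '');    % pattern contains a real line break: the word is rejoined

disp(isequal(a, s))                  % displays 1 (logical true)
disp(b)                              % automatic text
```

sprintf('-\n') would build the same pattern as ['-' newline], because sprintf does interpret the escape sequence.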
I then tokenize the sentences
documents = tokenizedDocument(sentences);
documents = addPartOfSpeechDetails(documents);
documents = removeStopWords(documents);
documents = normalizeWords(documents,'Style','lemma');
And then use a word embedding
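emb = fastTextWordEmbedding;
sequences = doc2sequence(emb,documents,'PaddingDirection','none');
w = zeros(emb.Dimension,numel(sequences),'single');   % one column per sentence
for i = 1:numel(sequences)
    w(:,i) = mean(sequences{i},2);                     % average word vector of sentence i
end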
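If a sentence ends up with no words in the embedding vocabulary (an assumption about the data, not something observed here), its sequence would have zero columns, and the mean over an empty dimension is NaN, which would then propagate into any similarity computed from it. A small sketch with made-up 3-dimensional sequences standing in for the doc2sequence output:

```matlab
% Made-up stand-ins for doc2sequence output: each cell is D-by-S (D = 3 here)
sequences = {single([1 3; 2 4; 0 2]), zeros(3,0,'single'), single([1; 1; 1])};

D = 3;
w = zeros(D, numel(sequences), 'single');
for i = 1:numel(sequences)
    w(:,i) = mean(sequences{i}, 2);     % an empty sequence yields a column of NaN
end

emptyCols = any(isnan(w), 1);           % flag sentences with no usable word vectors
w = w(:, ~emptyCols);                   % drop them before computing similarities
```

The same mask would need to be applied to the sentence list so that column indices still line up with sentences.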
I then compute the cosine similarity between sentences
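n = size(w,2);              % number of sentences; length(w) returns the larger dimension of w
sim = zeros(n,'single');    % n-by-n, diagonal stays 0
for i = 1:n
    for j = 1:n
        if i ~= j
            sim(i,j) = sum(w(:,i).*w(:,j))./sqrt(sum(w(:,i).^2))./sqrt(sum(w(:,j).^2));
        end
    end
end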
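The same pairwise cosine similarity can also be written without loops, by scaling each column to unit length and taking all dot products at once. This is only a sketch on a made-up 3-by-4 matrix; it assumes a release that has vecnorm and implicit expansion:

```matlab
% Made-up 3-by-4 matrix: one column per sentence vector
w = single([1 0 1 2; 0 1 1 0; 0 0 1 1]);

wn  = w ./ vecnorm(w);              % scale each column to unit length
sim = wn' * wn;                     % cosine similarity of every pair of columns
sim(logical(eye(size(sim)))) = 0;   % zero the diagonal, as in the loop above
```

For a few hundred sentences, the loop and the matrix product should give the same values up to rounding; the vectorized form is mainly shorter.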
Up to this point, I think everything is working. The first 5 rows and columns of the matrix sim look like this:
ans =
5×5 single matrix
0 0.8294 0.8986 0.8466 0.8617
0.8294 0 0.8337 0.6841 0.8196
0.8986 0.8337 0 0.7743 0.8096
0.8466 0.6841 0.7743 0 0.7050
0.8617 0.8196 0.8096 0.7050 0
Next, I pass sim to digraph to build a weighted directed graph:
G = digraph(sim);
And finally I run it through the PageRank algorithm
pr = centrality(G,'pagerank','FollowProbability',0.85,'MaxIterations',200);
However, when I inspect pr, I get all the same values:
0.0016
0.0016
0.0016
0.0016
0.0016
0.0016
0.0016
0.0016
0.0016
0.0016
I am guessing that I either do not understand well enough what happens when I convert my matrix to a graph with digraph, or I do not understand the inputs to centrality. I have tried to work out what I am doing wrong, but unfortunately there seem to be few MATLAB implementations of TextRank. Any help would be greatly appreciated.
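One possible explanation, offered as a sketch rather than a confirmed diagnosis: centrality treats all edges as equally important unless edge weights are passed through the 'Importance' option, so the weights stored by digraph(sim) are not used by the call to centrality. Because every off-diagonal entry of sim is nonzero, the graph is complete, and an unweighted PageRank on a complete graph gives every node the same score, 1/N. The small matrix below is made up for illustration. Negative similarities, if they occur in real data, would presumably need to be clipped or shifted first, since importance values are expected to be nonnegative.

```matlab
% Made-up similarity matrix (zero diagonal, all other entries positive)
sim = [0   0.9 0.8 0.2;
       0.9 0   0.7 0.1;
       0.8 0.7 0   0.3;
       0.2 0.1 0.3 0  ];
G = digraph(sim);                       % entries of sim become G.Edges.Weight

% Default call: edge weights are ignored, so all scores come out equal
prPlain = centrality(G, 'pagerank', 'FollowProbability', 0.85);

% Weighted call: each edge is followed in proportion to its similarity
prWeighted = centrality(G, 'pagerank', 'FollowProbability', 0.85, ...
                        'Importance', G.Edges.Weight);
```

In this toy case prPlain is 0.25 for every node, while prWeighted gives the lowest score to the fourth node, whose similarities to the others are smallest. Sorting prWeighted in descending order would then give a candidate sentence ranking.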


0 Answers