Split function for text (split(...) or strsplit(...))
Hello,
I have a PDF document. How can I divide it into whole paragraphs, each ending in a full stop, using split(...) or strsplit(...)? For instance, the text below is made up of two paragraphs, and I need to split it into those two paragraphs.
"
In economics the demand curve is the graphical representation of the relationship between the price and the quantity that consumers are willing to purchase. The curve shows how the price of a commodity or service changes as the quantity demanded increases. Every point on the curve is an amount of consumer demand and the corresponding market price. The graph shows the law of demand, which states that people will buy less of something if the price goes up and vice versa.
The slope of a linear demand curve is constant. The elasticity of demand changes continuously as one moves down the demand curve because the ratio of price to quantity continuously falls. At the point the demand curve intersects the y-axis PED is infinitely elastic, because the variable Q appearing in the denominator of the elasticity formula is zero there. At the point the demand curve intersects the x-axis PED is zero, because the variable P appearing in the numerator of the elasticity formula is zero there.[2] At one point on the demand curve PED is unitary elastic: PED equals one. Above the point of unitary elasticity is the elastic range of the demand curve (meaning that the elasticity is greater than one). Below is the inelastic range, in which the elasticity is less than one. The decline in elasticity as one moves down the curve is due to the falling P/Q ratio.
"
Thanks.
6 Comments
"...into two paragrah divided by full stop. "
This is a little confusing. It seems you want to break up the text into sentences (full stop = ".") and then break it up further by paragraphs? Do you want a cell array where the first element contains a set of strings, all of which are single sentences from the first paragraph, and the second element is a set of strings, all of which are single sentences from the second paragraph?
Also, how are you reading the text into MATLAB?
Lastly, what's wrong with split(txt,'.')?
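For reference, here is a minimal sketch of what split(txt,'.') gives on a short piece of text. The sample text and the variable name sentences are made up for illustration; they are not the poster's data. Note that splitting on '.' also breaks at periods inside abbreviations or decimal numbers, so it is only a rough sentence splitter.

```matlab
% Hypothetical sample text (not taken from the poster's PDF).
txt = 'The slope is constant. The elasticity changes. PED equals one.';

% split on a char vector returns a column cell array of char vectors.
% The text after the final '.' is empty, so the last element is ''.
sentences = split(txt, '.');

% Remove the leading space left in front of each sentence.
sentences = strtrim(sentences);

% Drop the empty trailing element.
sentences = sentences(~cellfun(@isempty, sentences));
```

After this, sentences holds one sentence per element without the terminating full stop.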
Adam Danz
on 6 Jun 2019
What does this look like?
str = extractFileText(filename);
t = split(str{:},newline);
Also, please use this comment section unless you're proposing an answer.
Humberto Bernal
on 6 Jun 2019
Adam Danz
on 6 Jun 2019
Yes, that's expected. But do you have empty elements in the cell array for lines that are between paragraphs? If so, they can be used to parse the paragraphs.
Humberto Bernal
on 6 Jun 2019
Adam Danz
on 6 Jun 2019
See my answer below.
Answers (1)
Try this out. I don't have your data so I'm taking a shot in the dark. It may require a small tweak.
str = extractFileText(filename);
t = split(str{:},newline);
emptyLineIdx = cellfun(@isempty,t); % find empty rows
paraGroups = cumsum(emptyLineIdx)+1; % assign a paragraph group number to each line
t(emptyLineIdx) = []; % get rid of the empty lines
paraGroups(emptyLineIdx) = [];
paraGroups = findgroups(paraGroups); % renumber groups as 1..N; splitapply errors if consecutive empty lines leave gaps
c = splitapply(@(x){strjoin(x,'\n')},t,paraGroups) % produce cell array; one element per paragraph
I feel like there's a more direct way to do this but this approach should also work. I wonder if there's a "new paragraph" indicator in regular expressions.
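On the regular-expression idea: there is no dedicated "new paragraph" token, but a blank line can be matched as a line break, optional whitespace, and another line break. The following is only a sketch, assuming paragraphs in the extracted text really are separated by at least one blank line, which may not hold for every PDF. The sample text and variable names are hypothetical.

```matlab
% Hypothetical sample text: two paragraphs separated by a blank line.
txt = sprintf('First sentence. Second sentence.\n\nThird sentence. Fourth sentence.');

% Split at a line break followed by optional whitespace and another
% line break. \s also matches line breaks, so runs of several blank
% lines are consumed as a single separator.
paragraphs = regexp(txt, '\r?\n\s*\n', 'split');

% Drop empty pieces that leading or trailing blank lines would leave.
paragraphs = paragraphs(~cellfun(@isempty, paragraphs));
```

With a char vector as input, regexp with the 'split' option returns a row cell array, here with one element per paragraph. If the extracted text uses only single line breaks, this pattern finds nothing to split on and the whole text comes back as one element.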
2 Comments
Stephen23
on 7 Jun 2019
Thanks Adam for your answer,
Yes, I want two pieces of text: the first one has to contain the first paragraph and the second has to contain the second paragraph. The idea is that I can analyse the text by paragraphs, which are divided by a full stop (.).
rng('default')
filename = "Deamand.pdf";
str = extractFileText(filename);
data = readPDFFormData(filename);
newDocuments = strsplit(str, "?");
newDocuments_1 = erasePunctuation(newDocuments);
.
.
.
.