How to replace a particular string in text file
I have an efficiency problem. The code below replaces the strings <s> and </s> with '' and ' .' respectively. It works properly for small text files, but when the text file has approx. 40,0000+ lines it takes so long that nobody can wait for it. Can anyone suggest something different that runs faster than this? Thanks in advance.
fid = fopen('input.txt','r');
f=fread(fid,'*char')';
fclose(fid);
f = regexprep(f,'<s>','');
f = regexprep(f,'</s>',' .');
fid = fopen('output.txt','w');
fprintf(fid,'%s',f);
fclose(fid);
Accepted Answer
Azzi Abdelmalek
on 18 Oct 2013
Edited: Azzi Abdelmalek
on 18 Oct 2013
strrep is faster than regexprep
f = strrep(f,'<s>','');
f = strrep(f,'</s>',' .');
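A rough way to check the speed claim on your own machine is sketched below. The sample line, the line count, and the variable names are assumptions made for illustration, not the asker's data, and the timings will vary with MATLAB version and hardware.

```matlab
% Build a synthetic test string (hypothetical sample, not the asker's file).
oneLine = sprintf('<s> did_VBD new_JJ on_IN posted_VBN recipe_NN </s>\r\n');
f = repmat(oneLine, 1, 100000);      % 100,000 lines as one char row vector

% Literal replacement with strrep.
tic;
a = strrep(f, '<s>', '');
a = strrep(a, '</s>', ' .');
tStrrep = toc;

% Same replacement with regexprep (the patterns contain no metacharacters).
tic;
b = regexprep(f, '<s>', '');
b = regexprep(b, '</s>', ' .');
tRegexprep = toc;

fprintf('strrep: %.3f s, regexprep: %.3f s\n', tStrrep, tRegexprep);
assert(isequal(a, b));               % both approaches produce the same text
```

Because both search strings are plain literals, strrep and regexprep return identical text here; only the running time should differ.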
17 Comments
Azzi Abdelmalek
on 18 Oct 2013
strrep is much faster, but when it comes to complex parsing, regexprep is more powerful
arun
on 18 Oct 2013
Then can I not replace these lines with something else?
f = regexprep( f, '([^\n\r]+)', '<s> $1' );
f = regexprep(f,' \w*_|\,_',' ');
Cedric
on 18 Oct 2013
What is the purpose of the code? Note that if you wanted to wrap all lines in <s> and </s> tags, you could probably achieve that with
f = ['<s>', strrep(f, '\r\n', '</s>\r\n<s>'), '</s>'] ;
(I changed the order of new line and carriage return, make the change back if it is inverted for any reason in your files)
arun
on 18 Oct 2013
The purpose of my code is this: I have a text file which contains text like
did_VBD new_JJ on_IN posted_VBN recipe_NN see_VB the_DT website_NN ._.
did_VBD new_JJ on_IN posted_VBN recipe_NN see_VB the_DT website_NN ._.
did_VBD new_JJ on_IN posted_VBN recipe_NN see_VBP website_NN the_DT ._.
did_VBD new_JJ on_IN posted_VBN recipe_NN see_VBP website_NN the_DT ._.
First of all, I want to wrap all sentences like
<s> did_VBD new_JJ on_IN posted_VBN recipe_NN see_VB the_DT website_NN </s>
<s> did_VBD new_JJ on_IN posted_VBN recipe_NN see_VB the_DT website_NN </s>
<s> did_VBD new_JJ on_IN posted_VBN recipe_NN see_VBP website_NN the_DT </s>
<s> did_VBD new_JJ on_IN posted_VBN recipe_NN see_VBP website_NN the_DT </s>
and for that I am using the code 'f = regexprep(f,'\._.','</s>');' and 'f = regexprep( f, '([^\n\r]+)', '<s> $1' );'
After that, I want to extract the POS tags:
VBD JJ IN VBN NN VB DT NN
VBD JJ IN VBN NN VB DT NN
VBD JJ IN VBN NN VBP NN DT
VBD JJ IN VBN NN VBP NN DT
For this I am using 'f = regexprep(f,' \w*_|\,_',' ');'
The code you suggested above,
f = ['<s>', strrep(f, '\r\n', '</s>\r\n<s>'), '</s>'] ;
gives the result
did_VBD new_JJ on_IN posted_VBN recipe_NN see_VB the_DT website_NN ._.
did_VBD new_JJ on_IN posted_VBN recipe_NN see_VB the_DT website_NN ._.
did_VBD new_JJ on_IN posted_VBN recipe_NN see_VBP website_NN the_DT ._.
did_VBD new_JJ on_IN posted_VBN recipe_NN see_VBP website_NN the_DT ._.
Cedric
on 18 Oct 2013
Edited: Cedric
on 18 Oct 2013
Did you try with
f = ['<s>', strrep(f, '\n\r', '</s>\n\r<s>'), '</s>'] ;
as I suggested in my note in parentheses? If it works, you can then modify it so that it removes '._.' as well, assuming that there is a white space between the last . and the newline character:
f = ['<s>', strrep(f, '._. \n\r', '</s>\n\r<s>'), '</s>'] ;
arun
on 18 Oct 2013
Yes, I noticed that and also tried it.
f = ['<s>', strrep(f, '._. \r\n', '</s>\r\n<s>'), '</s>'] ;
f = ['<s>', strrep(f, '._. \n\r', '</s>\r\n<s>'), '</s>'] ;
f = ['<s>', strrep(f, '._.\n\r', '</s>\r\n<s>'), '</s>'] ;
f = ['<s>', strrep(f, '._.\n\r', '</s>\r\n<s>'), '</s>'] ;
I have also tried many other variants, but none of them work: they do not replace ._. with </s>, and they do not start each line with <s>.
I think the code reads the whole file at once and does not recognize the newline character:
fid = fopen('input.txt','r');
f=fread(fid,'*char')';
fclose(fid);
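One likely reason the strrep attempts above never match: strrep treats '\r\n' as four literal characters (backslash, r, backslash, n) and does not translate escape sequences, whereas a regexprep pattern does interpret \r and \n. A minimal sketch on an in-memory sample, assuming the file uses CR LF line endings (the sample text is made up for illustration):

```matlab
% Hypothetical two-line sample with real CR LF (char 13, char 10) line endings.
f = ['line one ._. ', char([13 10]), 'line two ._. ', char([13 10])];

% '\r\n' in single quotes is four literal characters: \ r \ n.
assert(numel('\r\n') == 4);
unchanged = strrep(f, '\r\n', 'X');
assert(isequal(unchanged, f));       % nothing matched, so nothing was replaced

% sprintf (or char([13 10])) produces the actual CR LF pair.
crlf = sprintf('\r\n');
assert(isequal(crlf, char([13 10])));
changed = strrep(f, crlf, 'X');      % now both line endings are replaced
disp(changed)
```

If your files end lines with LF only, the pair to search for would be char(10) instead.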
Cedric
on 18 Oct 2013
I thought that you were already matching '\r\n' and that it was working but too slow. Was it not working? Could you attach one of these files to your question so I can try?
arun
on 18 Oct 2013
It is not slow; I am trying these on 4 sentences (lines), using the code given above.
Cedric
on 18 Oct 2013
Edited: Cedric
on 18 Oct 2013
You wrote "the code is working properly", and later that you were using '\r' and '\n' in a regexp pattern. Was it just the first part which was working properly?
In any case, could you attach a file or a chunk of file to your question? It would be easier if I could experiment with your file, because then I can check directly what special chars you have in there and how to match them or use them in replacements. If you post a large enough file, I can also try to optimize. If you cannot attach the file to a public forum page, you can send it to me by email.
arun
on 18 Oct 2013
I have attached two files, 'input.txt' and 'code.txt'. They are copies of the files I am currently using to get the expected output.
Cedric
on 18 Oct 2013
Ok, try the following:
content = fileread( 'inputtextfile.txt' ) ;
newContent = strrep( content, '._. ', '' ) ;
newContent = strrep( newContent, char([13,10]), sprintf('</s>\r\n<s>') ) ;
newContent = ['<s>', newContent, '</s>'] ;
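The same steps can be tried without a file. The sketch below runs them on a small in-memory sample, assuming CR LF line endings, a '._. ' terminator on every line, and that the goal is to wrap each line in <s> ... </s>; the sample sentences are invented for illustration.

```matlab
crlf = char([13 10]);
% Hypothetical two-sentence sample standing in for the file contents.
content = ['did_VBD new_JJ website_NN ._. ', crlf, ...
           'see_VBP the_DT recipe_NN ._. ', crlf];

newContent = strrep(content, '._. ', '');                      % drop sentence terminators
newContent = strrep(newContent, crlf, ['</s>', crlf, '<s>']);  % close and reopen tags at each line break
newContent = ['<s>', newContent, '</s>'];                      % wrap the whole text

% A trailing line break leaves an empty '<s></s>' pair at the very end; drop it.
if numel(newContent) >= 7 && strcmp(newContent(end-6:end), '<s></s>')
    newContent = newContent(1:end-7);
end
disp(newContent)
```

Everything here is a literal replacement, so strrep is sufficient and no regular expression is needed for the wrapping step.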
arun
on 19 Oct 2013
Edited: arun
on 19 Oct 2013
Yes, it is working:
content = fileread( 'inputtextfile.txt' ) ;
newContent = strrep( content, '._. ', '' ) ;
newContent = strrep( newContent, char([13,10]), sprintf('</s>\r\n<s> ') ) ;
newContent = ['<s> ', newContent, '</s>'] ;
newContent = strrep( newContent, '<s> </s>', '' ) ; % removes the extra '<s> </s>' left at the end of the file
But I think 'strrep' can't be used instead of 'regexprep' for the last step, which produces the output file:
newContent = regexprep(newContent,' \w*_|\,_',' ');
Cedric
on 19 Oct 2013
Edited: Cedric
on 19 Oct 2013
So you want to remove (or replace with a white space) all prefixes like 'new_', 'on_', etc, as well as precisely the string ',_' ? If so, you can simplify the process by using STRREP for removing all ',_', which allows you to reduce the OR statement in the regexp pattern and keep only the first part ' \w*_'.
If it works, then you can profile REGEXP with other patterns which could apply as well to your case and be more efficient than '\w*', e.g. '\S*'.
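A small sketch of that split, on a made-up sample line: the literal ',_' goes through strrep, and the remaining word prefixes go through a single-alternative regexprep pattern. Replacing ',_' with a space (rather than removing it) is an assumption made here so the result can be compared with the original combined pattern.

```matlab
% Hypothetical sample; the leading space mimics a line that starts with '<s> '.
f0 = ' did_VBD new_JJ ,_, on_IN website_NN';

% Original single-pass pattern with the OR.
combined = regexprep(f0, ' \w*_|\,_', ' ');

% Literal part first with strrep, then the simpler regexp pattern.
f = strrep(f0, ',_', ' ');
viaW = regexprep(f, ' \w*_', ' ');   % word characters up to the underscore
viaS = regexprep(f, ' \S*_', ' ');   % any non-whitespace up to the underscore

disp(viaW)
assert(isequal(combined, viaW));     % same result on this sample
assert(isequal(viaW, viaS));
```

The equality is only checked on this sample; on real data, compare the outputs on a chunk of the file before switching patterns, then time each variant with tic/toc.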
arun
on 19 Oct 2013
Yes, now I am using
f = regexp(f,'\S*_','split')
to get the following output:
VBD JJ IN VBN NN VB DT NN
VBD JJ IN VBN NN VB DT NN
These statements are much better.
Thanks for your efforts and for your valuable suggestions.
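Note that regexp with the 'split' option returns a cell array of pieces, not a single string. A minimal sketch of turning those pieces back into one line of tags, assuming a single-line sample (the sample text and variable names are illustrative):

```matlab
f = 'did_VBD new_JJ on_IN website_NN';   % hypothetical one-line sample

parts = regexp(f, '\S*_', 'split');      % cell array of the text between matches
parts = parts(~cellfun(@isempty, parts));% drop the empty piece before the first match
tags  = strtrim(parts);                  % remove the trailing space on each piece
joined = strjoin(tags, ' ');             % back to one char row vector

disp(joined)
```

If a single string is all that is needed, regexprep(f, '\S*_', '') gives the same text in one call without the intermediate cell array.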