How to get multiline matches using regular expressions (regexp)

I'm trying to filter a large data file for some particular information. This is what part of the data file looks like:
text = ['<node id="309134964" lat="48.0685823" lon="11.6592565" version="4" timestamp="2015-02-16T12:52:33Z" changeset="28884856" uid="8748" user="ToniE">'...
'<tag k="power" v="substation"/>'...
'<tag k="source" v="survey"/>'...
'<tag k="operator" v="Energieversorgung Ottobrunn"/>'...
'</node>'...
'<node id="309202573" lat="49.0064035" lon="9.1332687" version="6" timestamp="2015-08-09T09:24:34Z" changeset="33215175" uid="2672520" user="Stingray80"/>'...
'<node id="309209816" lat="47.9344289" lon="11.1041431" version="1" timestamp="2008-11-01T19:21:22Z" changeset="651519" uid="39150" user="account_deleted_1011"/>'...
'<node id="309209818" lat="47.9335507" lon="11.103726" version="2" timestamp="2014-07-30T20:48:19Z" changeset="24451882" uid="12096" user="HCX Biker"/>'...
'<node id="309209819" lat="47.9333751" lon="11.1045838" version="2" timestamp="2011-03-24T18:45:17Z" changeset="7658567" uid="313675" user="alphax"/>'...
'<node id="309209822" lat="47.9339823" lon="11.1047609" version="1" timestamp="2008-11-01T19:21:22Z" changeset="651519" uid="39150" user="account_deleted_1011"/>'...
'<node id="309209824" lat="47.9342688" lon="11.1048045" version="1" timestamp="2008-11-01T19:21:22Z" changeset="651519" uid="39150" user="account_deleted_1011"/>'...
'<node id="309245115" lat="48.074924" lon="11.6531406" version="6" timestamp="2014-02-03T21:13:35Z" changeset="20361115" uid="8748" user="ToniE">'...
'<tag k="power" v="substation"/>'...
'<tag k="source" v="survey"/>'...
'<tag k="operator" v="Energieversorgung Ottobunn"/>'...
'</node>'...
'<node id="309424891" lat="52.5676698" lon="13.0440382" version="4" timestamp="2015-03-08T19:18:44Z" changeset="29337113" uid="2149159" user="bergaufsee">'...
'<tag k="power" v="substation"/>'...
'</node>'];
I would like to keep only those nodes that are followed by one or more tags,
i.e. my three outputs/matches should be these:
output1:
<node id="309134964" lat="48.0685823" lon="11.6592565" version="4" timestamp="2015-02-16T12:52:33Z" changeset="28884856" uid="8748" user="ToniE">'...
'<tag k="power" v="substation"/>'...
'<tag k="source" v="survey"/>'...
'<tag k="operator" v="Energieversorgung Ottobrunn"/>
output2:
<node id="309245115" lat="48.074924" lon="11.6531406" version="6" timestamp="2014-02-03T21:13:35Z" changeset="20361115" uid="8748" user="ToniE">'...
'<tag k="power" v="substation"/>'...
'<tag k="source" v="survey"/>'...
'<tag k="operator" v="Energieversorgung Ottobunn"/>
output3:
<node id="309424891" lat="52.5676698" lon="13.0440382" version="4" timestamp="2015-03-08T19:18:44Z" changeset="29337113" uid="2149159" user="bergaufsee">'...
'<tag k="power" v="substation"/>
I have used this regular expression for the match:
substation_nodes = regexp(text, '(<node.*?\">(.|\n)*?)(?=<\/node>)','match');
When I run this code in MATLAB, I do not get the outputs above. The first and third matches are as required, but the second match looks like this instead:
output 2:
<node id="309202573" lat="49.0064035" lon="9.1332687" version="6" timestamp="2015-08-09T09:24:34Z" changeset="33215175" uid="2672520" user="Stingray80"/>'...
'<node id="309209816" lat="47.9344289" lon="11.1041431" version="1" timestamp="2008-11-01T19:21:22Z" changeset="651519" uid="39150" user="account_deleted_1011"/>'...
'<node id="309209818" lat="47.9335507" lon="11.103726" version="2" timestamp="2014-07-30T20:48:19Z" changeset="24451882" uid="12096" user="HCX Biker"/>'...
'<node id="309209819" lat="47.9333751" lon="11.1045838" version="2" timestamp="2011-03-24T18:45:17Z" changeset="7658567" uid="313675" user="alphax"/>'...
'<node id="309209822" lat="47.9339823" lon="11.1047609" version="1" timestamp="2008-11-01T19:21:22Z" changeset="651519" uid="39150" user="account_deleted_1011"/>'...
'<node id="309209824" lat="47.9342688" lon="11.1048045" version="1" timestamp="2008-11-01T19:21:22Z" changeset="651519" uid="39150" user="account_deleted_1011"/>'...
'<node id="309245115" lat="48.074924" lon="11.6531406" version="6" timestamp="2014-02-03T21:13:35Z" changeset="20361115" uid="8748" user="ToniE">'...
'<tag k="power" v="substation"/>'...
'<tag k="source" v="survey"/>'...
'<tag k="operator" v="Energieversorgung Ottobunn"/>
The second match also includes the preceding nodes, when I only need the last one (i.e. node id=309245115). I have noticed that on regex101.com or regexr.com the expression works fine as long as I use the /g global modifier. My understanding is that the /g flag retains the index of the last match, allowing iterative searches. Is this possible in MATLAB? Do I have to set a g modifier explicitly in MATLAB, and if so, what is the equivalent flag?
Or is the problem not even related to the global modifier? I am clueless regarding the source of the problem.
Could someone please help me out here? I am new to MATLAB and regular expressions and have not been able to figure this out. Thanks in advance.
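For reference, the behaviour described above can be reproduced on a shortened sample, together with one possible fix. This is a minimal sketch rather than a definitive answer: the sample data is made up in the same shape as the file excerpt, and the variable name pattern is only illustrative. In MATLAB, regexp already returns every match unless the 'once' option is passed, so no global flag is involved. The likely cause is that .*? in <node.*?\"> keeps scanning past the self-closing <node .../> elements until it reaches the first "> further on. Replacing .*? with [^>]* keeps that part of the expression inside a single opening tag.

```matlab
% Shortened, made-up sample in the same shape as the data above
% (attributes trimmed for brevity).
text = ['<node id="1" user="A">' ...
        '<tag k="power" v="substation"/>' ...
        '</node>' ...
        '<node id="2" user="B"/>' ...
        '<node id="3" user="C"/>' ...
        '<node id="4" user="D">' ...
        '<tag k="power" v="substation"/>' ...
        '<tag k="source" v="survey"/>' ...
        '</node>'];

% [^>]* cannot run past the end of the opening tag, so a self-closing
% <node .../> (which ends in "/>" rather than ">) cannot start a match.
% The lazy .*? then collects everything up to the next </node>, which
% the lookahead checks for without including it in the match.
pattern = '<node[^>]*">.*?(?=</node>)';
substation_nodes = regexp(text, pattern, 'match');

% Print each match in the output1/output2/... style used above.
for k = 1:numel(substation_nodes)
    fprintf('output%d:\n%s\n', k, substation_nodes{k});
end
```

On this sample the expression returns two matches, for the nodes with id 1 and id 4, and skips the self-closing nodes. The online testers probably behave differently because there . does not match a newline by default, whereas in MATLAB it does unless the 'dotexceptnewline' option is given.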

Answers (0)
