I have been handed a legacy XML file which is not going to change.
Formatted, it looks like this:
<Result>
<StepSequence>
<RealMeasure>
<Text value="Batman"/>
</RealMeasure>
</StepSequence>
<StepSequence>
<RealMeasure>
<Text value="Superman"/>
</RealMeasure>
</StepSequence>
</Result>
Actually it comes like this:
<Result><StepSequence><RealMeasure><Text value="Batman"/></RealMeasure></StepSequence><StepSequence><RealMeasure><Text value="Superman"/></RealMeasure></StepSequence></Result>
The regex I have come up with is:
<RealMeasure><((\w*)\s+value="(.*)".*?)></RealMeasure>
But it matches too much:
<RealMeasure><Text value="Batman"/></RealMeasure></StepSequence><StepSequence><RealMeasure><Text value="Superman"/></RealMeasure>
I want to select:
<RealMeasure><Text value="Batman"/></RealMeasure>
and
<RealMeasure><Text value="Superman"/></RealMeasure>
I want to get groups so that I can later convert the match to something like:
<RealMeasure type="Text" value="Superman"/>
using pattern like:
<RealMeasure type="$2" value=$3>
Any tips to improve my regex?
Try this -
let reg = /<RealMeasure><((\w+)\s+value="(.*?)".*?)><\/RealMeasure>/g;
let str= `<Result><StepSequence><RealMeasure><Text value="Batman"/></RealMeasure></StepSequence><StepSequence><RealMeasure><Text value="Superman"/></RealMeasure></StepSequence></Result>`;
str.replace(reg, `<RealMeasure type="$2" value="$3"/>`); //<Result><StepSequence><RealMeasure type="Text" value="Batman"/></StepSequence><StepSequence><RealMeasure type="Text" value="Superman"/></StepSequence></Result>
The group value="(.*?)" has to be non-greedy as well. I also changed (\w*) to (\w+) to ensure that the type is not empty.
Also, / in </RealMeasure> has to be escaped like <\/RealMeasure>.
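For comparison, here is a minimal Python sketch of the same fix, run against the sample string from the question. The variable names are mine, and Python writes group references as \2 and \3 rather than $2 and $3; treat it as an illustration of greedy versus non-greedy matching, not as part of the original answer.

```python
import re

xml = ('<Result><StepSequence><RealMeasure><Text value="Batman"/></RealMeasure>'
       '</StepSequence><StepSequence><RealMeasure><Text value="Superman"/>'
       '</RealMeasure></StepSequence></Result>')

# Original pattern: the greedy (.*) runs on to the last quote in the input.
greedy = re.compile(r'<RealMeasure><((\w*)\s+value="(.*)".*?)></RealMeasure>')
# Fixed pattern: (.*?) stops at the first closing quote.
lazy = re.compile(r'<RealMeasure><((\w+)\s+value="(.*?)".*?)></RealMeasure>')

greedy_matches = [m.group(0) for m in greedy.finditer(xml)]
lazy_matches = [m.group(0) for m in lazy.finditer(xml)]

# Group 2 is the inner tag name, group 3 is the attribute value.
converted = lazy.sub(r'<RealMeasure type="\2" value="\3"/>', xml)

print(len(greedy_matches), len(lazy_matches))
print(converted)
```

The greedy pattern yields one oversized match, while the non-greedy one yields one match per RealMeasure element.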
I used the following regex:
<RealMeasure><(\w+).*?("[^"]*").*?<\/RealMeasure>
and it seems to be doing exactly what you want.
Please note that the software that you use might impose some limitations to the regex features that you can use.
Alternatively, use a proper XML parser to extract and reformat the data.
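As a rough idea of what the parser route could look like, here is a sketch using Python's standard xml.etree.ElementTree. It assumes every RealMeasure holds exactly one child element carrying a value attribute, as in the sample; real data may need more checks.

```python
import xml.etree.ElementTree as ET

xml = ('<Result><StepSequence><RealMeasure><Text value="Batman"/></RealMeasure>'
       '</StepSequence><StepSequence><RealMeasure><Text value="Superman"/>'
       '</RealMeasure></StepSequence></Result>')

root = ET.fromstring(xml)

# Copy the list first so removing children does not disturb the iteration.
for measure in list(root.iter("RealMeasure")):
    child = measure[0]  # assumption: exactly one child, e.g. <Text value="..."/>
    measure.set("type", child.tag)
    measure.set("value", child.get("value", ""))
    measure.remove(child)

result = ET.tostring(root, encoding="unicode")
print(result)
```

The serializer decides details such as whitespace before "/>", so compare results by parsing them rather than by exact string.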
I am looking to see how a regex can be used to get attribute/values from an html tag. Yes I know that an xml/html parser can be used, but this is for testing my ability in regex. For example, in this html element:
<input name=dir value=">">
<input value=">" name=dir >
How would I extract out:
(?<name>...) and (?<value>...)
Is it possible once you have matched something to go "back" to the start of the match? For example:
<(?P<element>\w+).+(?:value="(?P<value>[^"])")####.+(?:name="(?P<name>[^"])")
Where #### basically means "go back to the start of the previous match/capture group" (so that I don't have to spell out every possible ordering of the attributes). How could this be done?
Yes, using a parser is the best way.
As stated in the comments, you cannot (easily) extract all information in one sweep.
You can achieve what you want with several regexes:
input.*?name=(?'name'[^ ]+)
input.*?value="(?'value'[^"]+)"
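To show that running the two patterns separately makes attribute order irrelevant, here is a small Python sketch. Python spells named groups as (?P<name>...) instead of (?'name'...), and the extract helper is a hypothetical name introduced only for this example.

```python
import re

# Same two patterns as above, with Python's named-group syntax.
NAME_RE = re.compile(r'input.*?name=(?P<name>[^ ]+)')
VALUE_RE = re.compile(r'input.*?value="(?P<value>[^"]+)"')

def extract(tag):
    """Return (name, value) for one tag; None for an attribute that is absent."""
    name = NAME_RE.search(tag)
    value = VALUE_RE.search(tag)
    return (name.group("name") if name else None,
            value.group("value") if value else None)

tags = ['<input name=dir value=">">', '<input value=">" name=dir >']
results = [extract(tag) for tag in tags]
print(results)
```

Both orderings give the same pair because each pattern scans the tag independently from its start.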
Response text from sampler is :
<input type="hidden" name="pid" value="PID_1498281212971253461">
Most of the correlation examples I have seen use (.+?) as the basic regex extractor. I have read up on the basics of regular expressions and, based on that understanding, tried a second regex (RegEx2), but it is not returning any matches.
Extractor1: RegEx1
Extractor2:RegEx2
Pls. help me in understanding. Appreciate your help.
This is my first post in any channel, pls ignore any comm errors.
You're almost there; your regular expression is basically missing a repetition metacharacter, namely +. In its current state it will match only something like <input type="hidden" name="pid" value="PD_1">
So you need to add a + sign to the end of each character class group, and your regular expression should start working as expected.
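The original expressions are not shown here, so the following Python sketch uses a hypothetical character class, [A-Z_0-9], purely to illustrate what the missing + changes. JMeter uses a different regex engine, but this quantifier behaves the same way there.

```python
import re

response = '<input type="hidden" name="pid" value="PID_1498281212971253461">'

# Hypothetical pattern without a quantifier: the class matches exactly one
# character, so a multi-character value cannot match.
single = re.search(r'name="pid" value="([A-Z_0-9])"', response)

# Same class with +: one or more characters, so the whole value is captured.
repeated = re.search(r'name="pid" value="([A-Z_0-9]+)"', response)

print(single)
print(repeated.group(1))
```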
References:
JMeter: Regular Expressions
Perl 5 Regex Cheat sheet
When it comes to parsing HTML responses, regular expressions are not the best option; you might want to consider using the CSS/JQuery Extractor instead.
You could use the XPath Extractor instead, which will be simpler. Here is the XPath to use:
//*[@name='pid']/@value
Please make sure you check the Use Tidy and Quiet options in the XPath Extractor.
I have an HTML parser doing the hard work, but I need a regex to select anchors that don't have an attribute id="optout". Here's my current regex, which selects all anchors that have an href starting with http. This is great; it just needs to ignore those anchors with id="optout". Any ideas?
Thanks!
<cfset matches = ReMatch('<a[^>]*href="http[^"]*"[^>]*>(.+?)</a>', arguments.htmlCode) />
Regex is the wrong tool for this task, and given that you've already got a HTML parser involved, there's no reason not to keep using it!
Here's the trivial way to do it with a HTML parser (jsoup):
jsoup.parse( Arguments.HtmlCode ).select('a:not([id=optout])')
Here's the far less maintainable regex way to do it:
rematch( '(?i)<a\s*(?:(?!id\s*=\s*[''"]optout[''"])[^>])+>(?:[^<]+|<(?!/a>))+</a>' , Arguments.HtmlCode )
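For readers outside ColdFusion, the parser-based idea can be sketched with Python's standard html.parser module in place of jsoup. The AnchorCollector class and the sample markup are made up for this illustration, and it collects href values rather than whole elements.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect hrefs of <a> tags that start with http and lack id="optout"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        if href.startswith("http") and attr.get("id") != "optout":
            self.hrefs.append(href)

# Made-up sample input for the sketch.
html = ('<p><a href="http://example.com/a">A</a> '
        '<a id="optout" href="http://example.com/out">Opt out</a> '
        '<a href="/local">Local</a></p>')

collector = AnchorCollector()
collector.feed(html)
print(collector.hrefs)
```

Because the parser hands over attributes already split into name/value pairs, attribute order and quoting style stop mattering.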
I need to extract a value from a hidden HTML field, have somewhat figured it out but I'm currently stuck.
My regex looks like:
<input type="hidden" name="form_id" value=".*"
But this extracts the whole string from the HTML.
The string looks like:
<input type="hidden" name="form_id" value="123"/>
I need to extract the "value" from the string, it is always changing, but the "name" is always the same. Is there a way to extract it without doing another expression? I appreciate any help.
(?<=<[^<>]+?name="form_id"[^<>]+value=")(.*)(?=")
I just threw this together. Basically you want to negate any ending > in your request. So you'd likely want to do something of this nature:
<[^>]*hidden[^>]*value="(.*)"[^>]*>
And then read the first capture group (Delphi instructions). This keeps it as reasonably generic as possible although it does assume positional order on "hidden" and "value".
In order to find the value without regard for order, you could use a slightly cleaner lookahead, as was suggested:
^(?=.*name="form_id").*value="([^"]*)".*$
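A quick Python check of why the greedy .* over-matches and how the lookahead version behaves. The sample lines are invented for this sketch, and the extra class attribute is there only to make the greedy problem visible.

```python
import re

line = '<input type="hidden" name="form_id" value="123" class="x"/>'

# Greedy .* runs to the LAST quote on the line.
greedy = re.search(r'value="(.*)"', line).group(1)

# [^"]* cannot cross a quote, so it stops at the end of the value.
bounded = re.search(r'value="([^"]*)"', line).group(1)

# The lookahead checks for name="form_id" anywhere on the line first,
# so the attribute order does not matter.
LOOKAHEAD = re.compile(r'^(?=.*name="form_id").*value="([^"]*)".*$')
in_order = LOOKAHEAD.match(line).group(1)
reordered = LOOKAHEAD.match('<input value="456" type="hidden" name="form_id"/>').group(1)
other = LOOKAHEAD.match('<input type="hidden" name="other" value="9"/>')

print(repr(greedy), bounded, in_order, reordered, other)
```

Note that the lookahead pattern works per line; if several inputs share one line, the greedy .* before value= will pick the last value on that line.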
<[a-zA-Z"= _^>]*value="(\d*)"/>
I have tested this for your example.
If you want to extract for only input tag you can write:
<input[a-zA-Z"= _^>]*value="(\d*)"/>
I am writing a temporary PHP script to update the MySQL database of my vBulletin forum.
Here's what it does: it finds any entry that has a [youtube][/youtube] code, and then it has to replace that code with a link to the YouTube video instead.
So, here is an example of what I have to take:
$string = <<<END
Hi everyone! Check out this video that I just found on YouTube!
[youtube]<object width="425" height="344"><param name="movie" value="http://www.youtube.com/v/Md1E_Rg4MGQ&hl=en&fs=1&"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/Md1E_Rg4MGQ&hl=en&fs=1&" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"></embed></object>[/youtube]
END;
And I have to make that look like this instead:
[URL=http://www.youtube.com/watch/?v=Md1E_Rg4MGQ]http://www.youtube.com/watch/?v=Md1E_Rg4MGQ[/URL]
I'm getting a headache working with the Regex. I don't have enough experience with Regex to figure out what to do.
It has to look something like this:
$string = preg_replace("#\[youtube\]?????\[/youtube\]#i", "[URL=http://www.youtube.com/watch?v=$1]http://www.youtube.com/watch?v=$1[/URL]", $string);
Help Please! ^_^
Something like this perhaps?
$string = preg_replace('#\[youtube\].*?name="movie" value="(.*?)".*?\[/youtube\]#i', "[URL=$1]$1[/URL]", $string);
Note that this is limited in that it expects the string in precisely the name="movie" value="????????????????" format. Technically, there are other valid HTML constructs with the same meaning but a different attribute order, etc. The ideal solution would be to use some sort of DOM parser to grab the value attribute of the movie param, but if you know people will always be using that exact format (i.e. if that is what the copy/paste from YouTube always looks like), then regex can suffice.
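Note that the pattern above captures the whole movie URL (http://www.youtube.com/v/...&hl=en&fs=1&) rather than just the video ID. If the watch-style link from the question is wanted, one possible variant captures only the ID. It is sketched here in Python rather than PHP, and it assumes the URL always has the http://www.youtube.com/v/ID form and that IDs consist of letters, digits, underscores and hyphens.

```python
import re

post = ('Hi everyone! Check out this video that I just found on YouTube!\n'
        '[youtube]<object width="425" height="344"><param name="movie" '
        'value="http://www.youtube.com/v/Md1E_Rg4MGQ&hl=en&fs=1&"></param>'
        '<embed src="http://www.youtube.com/v/Md1E_Rg4MGQ&hl=en&fs=1&"></embed>'
        '</object>[/youtube]')

# Capture only the ID after /v/; DOTALL lets .*? cross line breaks in the embed code.
pattern = re.compile(
    r'\[youtube\].*?name="movie" value="http://www\.youtube\.com/v/([\w-]+)'
    r'.*?\[/youtube\]',
    re.IGNORECASE | re.DOTALL)

link = r'[URL=http://www.youtube.com/watch?v=\1]http://www.youtube.com/watch?v=\1[/URL]'
converted = pattern.sub(link, post)
print(converted)
```

The same pattern should carry over to preg_replace with the i and s modifiers and $1 in place of \1, though I have only run the Python version.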