Regex - Verify multiline content

I am trying to verify an XML structure: I want to check that the ns22:statement tag with the value true is found after the postcode DataItem.
<ns21:DataItem name="country" default="false" />
<ns21:DataItem name="postcode" default="false">
<ns22:statement disabled>true</ns22:statement>
</ns21:DataItem>
I have tried this
(?m)\b.*:DataItem name="postcode" (?s)\b.*>$\n.*\bstatement disabled>true\b
but when I change postcode to country (which is not supposed to return anything), it catches all of the tags: country, postcode and statement true.
I have also created this https://regexr.com/3quso
Any suggestions on how to match only the postcode + statement true?

XPath really does look like the best tool for the job given you're trying to validate XML structure as well as content. So, ignoring namespaces, you could use the following XPath in a soapUI XPath Match assertion:
boolean(//*[local-name()='DataItem'][@name='postcode']/*[local-name()='statement' and .='true'])
Also, in <ns22:statement disabled>true</ns22:statement>, is disabled meant to be part of the element name or an attribute? As it stands, it makes the XML invalid, so I've ignored it.
For good reasons not to use regular expressions to parse XML/HTML, see Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
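Outside soapUI, the same structural check can be sketched with a standard XML parser. The following is a minimal Python illustration, not the soapUI assertion itself. It assumes the fragment is wrapped in a root element that declares the ns21 and ns22 prefixes (the namespace URIs here are made up), and it leaves out the stray "disabled" so that the XML is well-formed. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample document. The root element and the namespace URIs are assumptions
# added only so that the fragment from the question parses.
XML = """<root xmlns:ns21="urn:example:ns21" xmlns:ns22="urn:example:ns22">
  <ns21:DataItem name="country" default="false" />
  <ns21:DataItem name="postcode" default="false">
    <ns22:statement>true</ns22:statement>
  </ns21:DataItem>
</root>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def has_true_statement(xml_text, item_name):
    """Return True if a DataItem with the given name attribute has a
    direct child 'statement' element whose text is 'true'."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if local_name(elem.tag) == "DataItem" and elem.get("name") == item_name:
            for child in elem:
                # Unlike the XPath .='true', surrounding whitespace is ignored here.
                text = (child.text or "").strip()
                if local_name(child.tag) == "statement" and text == "true":
                    return True
    return False
```

Calling has_true_statement(XML, "postcode") succeeds, while the same call with "country" does not, which is the behaviour the regex in the question could not deliver.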

Related

How to extract html attributes via regex

I am looking to see how a regex can be used to get attribute/values from an html tag. Yes I know that an xml/html parser can be used, but this is for testing my ability in regex. For example, in this html element:
<input name=dir value=">">
<input value=">" name=dir >
How would I extract out:
(?<name>...) and (?<value>...)
Is it possible once you have matched something to go "back" to the start of the match? For example:
<(?P<element>\w+).+(?:value="(?P<value>[^"])")####.+(?:name="(?P<name>[^"])")
Where #### basically means "go back to the start of the previous match/capture group" (so that I don't have to spell out every possible ordering of the attributes). How could this be done?

Yes, using a parser is the best way.
As stated in the comments, you cannot (easily) extract all information in one sweep.
You can achieve what you want with several regexes:
input.*?name=(?'name'[^ ]+)
Test here.
input.*?value="(?'value'[^"]+)"
Test here.
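As a rough illustration of the "several regexes" approach, here is the same pair of patterns in Python. Python spells named groups (?P<name>...) rather than (?'name'...), and the function name extract is only an assumption for this sketch. Each pattern is searched independently, so the order of the attributes in the tag does not matter.

```python
import re

# The two patterns from the answer, rewritten with Python's named-group syntax.
NAME_RE = re.compile(r'input.*?name=(?P<name>[^ ]+)')
VALUE_RE = re.compile(r'input.*?value="(?P<value>[^"]+)"')


def extract(tag):
    """Return (name, value) for one input tag; a missing attribute gives None."""
    name_match = NAME_RE.search(tag)
    value_match = VALUE_RE.search(tag)
    name = name_match.group("name") if name_match else None
    value = value_match.group("value") if value_match else None
    return name, value


# Both attribute orderings from the question.
first = extract('<input name=dir value=">">')
second = extract('<input value=">" name=dir >')
```

Both calls give the same pair, name "dir" and value ">", even though the attributes appear in a different order.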

REGEXP_LIKE to match xml tag content that is not like a specific string

I'm trying to do a regular expression matching with REGEXP_LIKE and I'm looking for a regexp to find if the value of a specific tag is not a specific string.
For example:
<person>
<name>John</name>
<age>40</age>
</person>
My goal is to validate that the name tag's value is not John, so the REGEXP_LIKE would return true for input xmls where name is not John.
Thank you in advance for the help!

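A quick and easy way to do this is to simply negate the regex search (note that the column name must not be quoted, otherwise the literal string 'column_name' is searched instead of the column's content):
... WHERE NOT REGEXP_LIKE(column_name, '<name>John</name>')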
However, as should be mentioned every time a question like this is posted, it's generally a bad idea to parse XML with regex. If you find yourself constructing more complex regex patterns to search this XML data, then you should:
Use an XML parser instead of regular expressions, or
Change how you are storing the data! Make person.age a separate table column; don't bung the entire XML structure into a single place.
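If the check can be done outside the database, the parser option might look like the following Python sketch. The function name name_is_not is hypothetical, and the sketch assumes the document's root element is person with name as a direct child, as in the example above.

```python
import xml.etree.ElementTree as ET

PERSON = """<person>
<name>John</name>
<age>40</age>
</person>"""


def name_is_not(xml_text, forbidden):
    """Return True when the <name> child of the root element is not `forbidden`.

    A document without a <name> element also returns True, which mirrors
    what NOT REGEXP_LIKE would do for such a row.
    """
    root = ET.fromstring(xml_text)
    name = root.findtext("name", default="")
    return name.strip() != forbidden
```

For the sample document, name_is_not(PERSON, "John") is false, and it becomes true as soon as the name differs.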

I am doing correlation in Jmeter. I am facing below issue to find the Regular expression

<input type="hidden" name="_csrf"
value="40ea7f46-799b-4ca0-b8cd-4adfba082aed" />
Above is the token I am getting in the request output. I am unable to replace this with a regular expression in Regular Expression Extractor of Jmeter.
<input type="hidden" name="_csrf" value="(.+?)" /> is not working.
Please help.

If your input actually contains a newline character, then you need to account for that in your regex. Furthermore, better be explicit about the valid characters in your regex, .+ is rarely a good thing:
<input type="hidden"\s+name="_csrf"\s+value="([^"]+)"\s*/>
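The difference between the two patterns can be checked quickly outside JMeter. This is a small Python sketch: JMeter uses its own regex engine, but \s and character classes behave the same way for this case. The variable names are assumptions for the illustration.

```python
import re

# The response fragment from the question, with the line break between attributes.
RESPONSE = (
    '<input type="hidden" name="_csrf"\n'
    'value="40ea7f46-799b-4ca0-b8cd-4adfba082aed" />'
)

# Original attempt: expects exactly one space where the input has a newline.
STRICT = re.compile(r'<input type="hidden" name="_csrf" value="(.+?)" />')

# Suggested pattern: \s+ accepts spaces and newlines, [^"]+ stops at the closing quote.
TOLERANT = re.compile(r'<input type="hidden"\s+name="_csrf"\s+value="([^"]+)"\s*/>')

strict_match = STRICT.search(RESPONSE)
tolerant_match = TOLERANT.search(RESPONSE)
token = tolerant_match.group(1) if tolerant_match else None
```

Here strict_match is None because of the line break, while token holds the CSRF value.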

You have to be careful with the spaces/newlines.
Try the following simple regex:
value="(.*?)"\s/>
If it matches more than one element, you can make it unique by adding the name attribute to the regex as follows:
name="_csrf"\s+value="(.*?)"\s/>

This is further evidence against using regular expressions to parse HTML: they are fragile and sensitive to minimal markup changes. A more robust and resilient solution is to use the CSS/JQuery Extractor or the XPath Extractor instead.
The relevant CSS expression is input[name=_csrf]; use value as the "attribute".
The XPath query to get the value is //input[@name='_csrf']/@value
See the How to Load Test CSRF-Protected Web Sites guide for detailed information on bypassing XSRF protection in JMeter tests.
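To show why a parser-based lookup is less sensitive to whitespace and attribute order, here is a minimal Python sketch using the standard library's HTMLParser. It is not how the JMeter extractors are implemented; the class name CsrfFinder and the function find_csrf are hypothetical.

```python
from html.parser import HTMLParser


class CsrfFinder(HTMLParser):
    """Remember the value attribute of the first <input name="_csrf"> seen."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <input ... />.
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == "_csrf" and self.token is None:
            self.token = attributes.get("value")


def find_csrf(html_text):
    finder = CsrfFinder()
    finder.feed(html_text)
    return finder.token


PAGE = (
    '<form><input type="hidden" name="_csrf"\n'
    'value="40ea7f46-799b-4ca0-b8cd-4adfba082aed" /></form>'
)
found = find_csrf(PAGE)
```

The line break inside the tag needs no special handling, and reordering the attributes would not change the result.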

How to get string of everything between these two em tag?

I want to get the string between the em tags, including any other HTML inside them.
for example:
<em>UNIVERSALPOSTAL UNION - International Bureau Circular<br />
By: K.J.S. McKeown</em>
output should be as:
UNIVERSALPOSTAL UNION - International Bureau Circular<br />
By: K.J.S. McKeown
please help me.
Thanks

Use the regular expression function like this:
REMatch("(?s)<em>.*?</em>", html)
See also: http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=regexp_01.html
The (?s) flag switches on single-line mode, in which . also matches line feeds, so the match can span several lines of input. As Peter pointed out in a comment, this mode is not the default and therefore must be set explicitly.
The .*? matches all characters between <em> and </em>. The question mark after the quantifier makes it "non-greedy", so that as few characters as possible are matched. This is needed in case the input HTML contains something like <em>foo</em><em>bar</em>, where a greedy match would otherwise run from the first <em> to the last </em>.
The returned array contains all matches found, i.e. all texts, including HTML, that were in <em> tags.
Note that this can fail where </em> also occurs as attribute text that is incorrectly not HTML-encoded, for example: <em><a title="Help for </em> tag">click</a></em>, or in other rare circumstances (e.g. JavaScript script tags etc.). A regex cannot replace a full HTML/XML parser, and if you need 100% accuracy you should consider using one: http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=functions_t-z_23.html
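The effect of (?s) and of the non-greedy quantifier can be seen in a small experiment. The sketch below uses Python's re module rather than ColdFusion's REMatch, on the assumption that both treat (?s) and .*? the same way for this pattern. The sample strings are made up.

```python
import re

TWO_TAGS = "<em>foo</em><em>bar</em>"
MULTILINE = "<em>line one<br />\nline two</em>"

# Non-greedy: each <em>...</em> pair becomes its own match.
non_greedy = re.findall(r"(?s)<em>.*?</em>", TWO_TAGS)

# Greedy: a single match from the first <em> to the last </em>.
greedy = re.findall(r"(?s)<em>.*</em>", TWO_TAGS)

# Without (?s) the dot stops at the line feed, so nothing matches.
without_flag = re.findall(r"<em>.*?</em>", MULTILINE)
with_flag = re.findall(r"(?s)<em>.*?</em>", MULTILINE)
```

Here non_greedy has two entries and greedy has one, while without_flag is empty and with_flag contains the whole two-line element.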

If your input is exactly in the format given above, you don't even need regex - just strip the outer tags:
<cfsavecontent variable="Input">[text from above]</cfsavecontent>
<cfset Output = mid( Input , 5 , len(Input) - 9 ) />
If your input is more than this (i.e. a significant piece of HTML, or a full HTML document), regex is still not the ideal tool. Instead, you should be using an HTML parser, such as JSoup:
<cfset jsoup = createObject('java','org.jsoup.Jsoup') />
<cfset Output = jsoup.parse(Input).select('em').html() />
(With CF8, this code requires placing the jsoup JAR file in CF's lib directory, or using a tool such as JavaLoader.)
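For comparison, the "just strip the outer tags" idea can be sketched in Python with plain slicing. The function name strip_outer_em is hypothetical, and the sketch only applies when the input starts with <em> and ends with </em>, exactly as in the question.

```python
def strip_outer_em(fragment):
    """Remove one leading <em> and one trailing </em> without any regex."""
    prefix, suffix = "<em>", "</em>"
    if not (fragment.startswith(prefix) and fragment.endswith(suffix)):
        raise ValueError("input is not wrapped in a single <em>...</em> pair")
    # Skip the 4 characters of the opening tag and the 5 of the closing tag.
    return fragment[len(prefix):-len(suffix)]


SAMPLE = (
    "<em>UNIVERSALPOSTAL UNION - International Bureau Circular<br />\n"
    "By: K.J.S. McKeown</em>"
)
inner = strip_outer_em(SAMPLE)
```

The inner <br /> tag is left untouched, which is what the question asks for.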

If you are using jQuery, you can also do this pretty easily:
$("em").html();
This returns the HTML between the tags of the first matching em element.
See this fiddle

I had to remove any text that followed a particular tag. The HTML content was generated dynamically from a database that caters to 5 different languages, so I only had the div tag to help me. I am not sure why REMatch("(?s).*?", html) did not work for me. However, Ben helped me here (http://www.bennadel.com/blog/769-Learning-ColdFusion-8-REMatch-For-Regular-Expression-Matching.htm). My code looks like this:
<cfset extContentArr = REMatch("(?i)<div class=""inlineBlock"" style=""margin-right:30px;"">.+?</div>",qry_getContent.colval) />
<cfif !ArrayIsEmpty(extContentArr)>
Loop the array and do whatever you need with the extract , I just deleted them.
</cfif>

Looking for regex to erase href text

If I have a bunch of urls like this:
<li>Xyz 123</li>
<li>Xyz 345</li>
What would a regex look like to erase the urls inside the hrefs so that they become:
<li>Xyz 123</li>
<li>Xyz 345</li>

The following should do what you like:
/href=\"([^\"]*)\"/
Basically match href="<any text but a '"'>".

Search for <a href="[^"]*" and replace with <a href="".
If you add more details about which language you're using, I can be more specific. Be aware also that regular expressions are usually not the tool of choice when dealing with HTML.
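Since the question does not name a language, here is what the search-and-replace could look like in Python, as a sketch only. The link markup in the sample is hypothetical, because the original URLs are not shown in the question, and the function name blank_hrefs is an assumption.

```python
import re

# Hypothetical input: the real URLs from the question are not known.
HTML = (
    '<li><a href="http://example.com/123">Xyz 123</a></li>\n'
    '<li><a href="http://example.com/345">Xyz 345</a></li>'
)


def blank_hrefs(html_text):
    """Replace every <a href="..." with <a href="" (href must come first)."""
    return re.sub(r'<a href="[^"]*"', '<a href=""', html_text)


cleaned = blank_hrefs(HTML)
```

This simple form only handles tags where href directly follows <a and is quoted with double quotes.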

First of all, do not use regex to parse HTML — why? Have a look here or here.
Process the HTML using an XML reader / XML document processing engine. Then use XPath to find nodes matching your criteria and alter href attributes in the DOM.
Note: For HTML which is not well-formed XML a more-general HTML (SGML) parser is required.

I partially agree with the others but a more complete version would be
/(<a[^>]+href\s*=\s*\")(.*?)("[^>]*>)/$1$3/gi
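The capture-group version keeps everything around the URL and drops only the URL itself. A hedged Python rendering follows: Python writes the backreferences as \1 and \3 instead of $1 and $3, the i flag becomes re.IGNORECASE, and re.sub already replaces all occurrences. The sample tag and the function name are made up for the illustration.

```python
import re

# Group 1: tag start up to the opening quote, group 2: the URL,
# group 3: the closing quote and the rest of the tag.
HREF_RE = re.compile(r'(<a[^>]+href\s*=\s*")(.*?)("[^>]*>)', re.IGNORECASE)


def blank_hrefs_keep_attributes(html_text):
    """Empty the href value while keeping the other attributes of the tag."""
    return HREF_RE.sub(r"\1\3", html_text)


SAMPLE = '<li><A class="x" href = "http://example.com/1" target="_blank">Xyz</A></li>'
result = blank_hrefs_keep_attributes(SAMPLE)
```

Compared with the simpler pattern, this one tolerates other attributes before href, spaces around the equals sign, and upper-case tag names.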
