How to write a regular expression pattern for this scenario - regex

I am trying to find occurrences of special characters in the sample XML below.
<?xml version="1.0"?>
<PayLoad>
<requestRows>****</requestRows>
<requestRowLength>1272</requestRowLength>
<exceptionTimestamp>2012070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>201$2070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>20120(702022810680700</exceptionTimestamp>
<exceptionDetail>NO DATA AVAILABLE FOR TIME PERIOD SPECIFIED =</exceptionDetail>
</PayLoad>
I have to find the entire tags whose content contains any of the characters $, (, = or -. For this I have written the regular expression pattern below:
(<[\w\d]*>\w*(?<value>[^\w]+)\w*\d*</[\w\d]*>)
and it returns the following output (running in the Expresso tool):
<requestRows>****</requestRows>
<exceptionTimestamp>2012070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>20120(702022810680700</exceptionTimestamp>
but it should also return the two entries below.
<exceptionTimestamp>201$2070202281068-0700</exceptionTimestamp>
<exceptionDetail>NO DATA AVAILABLE FOR TIME PERIOD SPECIFIED =</exceptionDetail>
These entries are omitted because they contain more than one special character (including spaces). Can anyone please give me a correct regular expression for the above scenario?
Thanks.

I would use lookaround for the mid part, so instead of
(<[\w\d]*>\w*(?<value>[^\w]+)\w*\d*</[\w\d]*>)
I would use
(<[\w\d]*>(?=[^<]*[^<\w])(?<value>.*)</[\w\d]*>)
Without the ?<value> part that I don't really recognise the syntax of, this becomes
(<[\w\d]*>(?=[^<]*[^<\w]).*</[\w\d]*>)
Just add capturing groups where you like if you want to save anything in particular.
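For anyone who wants to try the lookahead idea outside Expresso, here is a minimal sketch in Python. Two things in it are my assumptions rather than part of the answer above: Python spells the named group (?P<value>...) (the (?<value>...) form is the .NET spelling), and I swapped .* for [^<]* and added a \1 backreference so the closing tag has to match the opening one. As in the original, whitespace counts as a special character under [^<\w].

```python
import re

xml = """<?xml version="1.0"?>
<PayLoad>
<requestRows>****</requestRows>
<requestRowLength>1272</requestRowLength>
<exceptionTimestamp>2012070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>201$2070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>20120(702022810680700</exceptionTimestamp>
<exceptionDetail>NO DATA AVAILABLE FOR TIME PERIOD SPECIFIED =</exceptionDetail>
</PayLoad>"""

# <tag>, then a lookahead requiring at least one non-word character
# before the next '<', then the content, then the matching </tag>.
pattern = re.compile(r"<(\w+)>(?=[^<]*[^<\w])(?P<value>[^<]*)</\1>")

matches = [m.group(0) for m in pattern.finditer(xml)]
for line in matches:
    print(line)
```

The lookahead only checks that a special character exists somewhere in the content, so the number of special characters no longer matters.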

Related

Regular expression to find last match in XML output

I have been working for days to learn regex so that I can extract the last match out of an xml output of a test from a scientific instrument. The instrument buffer can hold multiple tests and I am only interested in the last (most recent) test. I can't figure it out!
<Ticket class="SAMPLE" serialno="6000SP210134" versions="FP6000;Main:V1.25;COM:V1.7;D:V1.11;TEC:V1.6">
<Measurement>
<SampleId>6</SampleId>
<DateTime>2022-10-28T15:16:22</DateTime>
<Value>300</Value>
<Unit>mOsmol/kg</Unit>
<DeviceCode>6000SP210134</DeviceCode>
<CheckSum>50c5656fd477cbcd3b7a5036ba98a542</CheckSum>
</Measurement>
</Ticket>
<Ticket class="SAMPLE" serialno="6000SP210134" versions="FP6000;Main:V1.25;COM:V1.7;D:V1.11;TEC:V1.6">
<Measurement>
<SampleId>7</SampleId>
<DateTime>2022-10-28T15:18:55</DateTime>
<Value>425</Value>
<Unit>mOsmol/kg</Unit>
<DeviceCode>6000SP210134</DeviceCode>
<CheckSum>50c5656fd477cbcd3b7a5036ba98a542</CheckSum>
</Measurement>
</Ticket>
I need to match and return the last value from the last test <Ticket></Ticket> (the number of Tickets is variable). In this example it would be 425.
I thought this might work, but it doesn't...
\<Value>\d{2,4}<\/Value>.*\n$\
This regular expression is executed and interpreted in a lab information management system called LabVantage, not in any language like perl, php, C, etc. A regular expression is the only option I have.
LabVantage does not seem to publicly reveal their regex engine but if you have access to lookarounds then this should work:
<Value>\d{2,4}<\/Value>(?![\s\S]*<\/Value>)
<Value>\d{2,4}<\/Value> - you know what this does, you wrote it =)
(?![\s\S]*<\/Value>) - ahead of me, </Value> does not exist
https://regex101.com/r/XpbOdR/1
If lookbehinds are supported then you can get fancy like this to extract only the digits:
(?<=<Value>)\d{2,4}(?=<\/Value>(?![\s\S]*<\/Value>))
https://regex101.com/r/VCDURX/1
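Both patterns can be checked quickly in Python. This is only a sketch against Python's re module, which is an assumption on my part since LabVantage's engine is unknown; the sample buffer is a shortened version of the two tickets in the question.

```python
import re

buffer_text = """<Ticket class="SAMPLE">
<Measurement>
<SampleId>6</SampleId>
<Value>300</Value>
</Measurement>
</Ticket>
<Ticket class="SAMPLE">
<Measurement>
<SampleId>7</SampleId>
<Value>425</Value>
</Measurement>
</Ticket>"""

# Whole <Value> element, provided no further </Value> follows it.
last_tag = re.search(r"<Value>\d{2,4}</Value>(?![\s\S]*</Value>)", buffer_text)

# Digits only: the lookbehind and lookahead anchor on the tags without consuming them.
last_digits = re.search(
    r"(?<=<Value>)\d{2,4}(?=</Value>(?![\s\S]*</Value>))", buffer_text
)

print(last_tag.group(0))
print(last_digits.group(0))
```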
I was not able to coax LabVantage to work with a regular expression in the ways recommended above. However, if any LabVantage user is looking to solve a similar issue, the way it was resolved was to use a Value Extraction Rule like this:
extract /regex/ extract /regex/
or
extract /regex/ extract last number
This type of expression is not explicitly made visible to the user, but it still works. So the final code that did work is this:
extract /(?s).*Value>/ extract last number
Thanks all who contributed.
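For readers without LabVantage, the rule above can be approximated in Python. This is a guess at what the two steps do, not LabVantage's actual implementation: I am assuming the first extract keeps the text matched by the greedy regex and that "extract last number" picks the last run of digits in that text.

```python
import re

buffer_text = (
    "<Ticket><Measurement><SampleId>6</SampleId><Value>300</Value>"
    "<CheckSum>50c5656f</CheckSum></Measurement></Ticket>\n"
    "<Ticket><Measurement><SampleId>7</SampleId><Value>425</Value>"
    "<CheckSum>50c5656f</CheckSum></Measurement></Ticket>"
)

# Step 1 (assumed): greedy .* with (?s) runs up to the LAST "Value>",
# which is the end of the final </Value> tag.
prefix = re.search(r"(?s).*Value>", buffer_text).group(0)

# Step 2 (assumed): take the last run of digits in that prefix.
last_number = re.findall(r"\d+", prefix)[-1]
print(last_number)
```

The greedy match is what discards the checksum and anything else that follows the final </Value>.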

Problems with finding and replacing

Hey Stack Overflow community. I need help with a huge data file. Is it possible with a regular expression to work on this tag:
<category_name><![CDATA[Prekiniai ženklai&gt;Adler|Kita buitinė technika&gt;Buičiai naudingi prietaisai|Kita buitinė technika&gt;Lygintuvai]]></category_name>
Can I somehow replace all the other data and leave only 'Adler' or 'Lygintuvai'? I'm using Altova to edit XML files, so I can't find any other way than find-and-replace. I'm also new to regex, so I thought maybe you could help me.
#\<category_name\>.+?gt\;([\w]+?)\|.+?gt;([\w]+?)\]\]\>\<\/category_name\>#i
\1 - Adler
\2 - Lygintuvai
PHP
regex101.com
Fields may contain alphanumeric characters without spaces.
If you want to change the set of acceptable characters, replace [\w] with something else:
[a-z] - only letters
[0-9] - only digits
etc.
It's possible, but use of regular expressions to process XML will never be 100% correct (you can prove that using computer science theory), and it may also be very inefficient. For example, the solution given by Luk is incorrect because it doesn't allow whitespace in places where XML allows it. Much better to use XQuery or XSLT, both of which are designed for the job (and both work in Altova). You can then use XPath expressions to locate the element or attribute nodes you are interested in, and you can still use regular expressions (e.g. in the XPath replace() function) to process the content of text or attribute nodes.
Incidentally, your input is rather strange because it uses escape sequences like &gt; within a CDATA section; but XML escape sequences are not recognized in a CDATA section.
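To illustrate the parser-first idea, here is a minimal sketch that uses Python's xml.etree.ElementTree in place of XQuery or XSLT (a substitution on my part, the answer above recommends the latter). It assumes the CDATA content really contains the literal text &gt; as a separator, which the parser leaves untouched.

```python
import xml.etree.ElementTree as ET

doc = (
    "<category_name><![CDATA[Prekiniai ženklai&gt;Adler"
    "|Kita buitinė technika&gt;Buičiai naudingi prietaisai"
    "|Kita buitinė technika&gt;Lygintuvai]]></category_name>"
)

element = ET.fromstring(doc)

# Inside CDATA, "&gt;" is plain text, so we split on it literally.
paths = element.text.split("|")
leaves = [path.split("&gt;")[-1] for path in paths]

first_leaf, last_leaf = leaves[0], leaves[-1]
print(first_leaf, last_leaf)
```

The parser takes care of whitespace and quoting in the markup, so string handling is only needed on the text content itself.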

Last Matched String Issue

I'm using the following regular expression to pull out some html:
(?i)(?:\<tr\s*class='list'[^\>]*\>)[^$+]*\</tr\>
The problem is that it's not separating the TRs correctly. I'm trying to use $+ to reference the tag selector again to ensure that the contents of the match don't contain the start tag again. Here is the sample HTML:
http://www.pastie.org/1311827
There are multiple <tr>s in some matches. Please help.
I don't know what you think [^$+]* means, but it defines a negated character class that matches zero or more times. In other words, it matches an empty string, or one or more characters that aren't a literal dollar sign or plus.
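A quick way to see this is to run the character class on its own. The snippet below is a small sketch using Python's re module (the question does not say which engine is in use, so that choice is an assumption); the HTML fragment is made up for illustration.

```python
import re

negated = re.compile(r"[^$+]*")

# Stops at the first literal '$' or '+'.
stops_at_dollar = negated.match("abc$def").group(0)

# Matches the empty string when the input starts with '$'.
empty_match = negated.match("$abc").group(0)

# With no '$' or '+' in the input, it runs straight across row boundaries.
rows = "<td>a</td></tr><tr class='list'><td>b</td>"
runs_across_rows = negated.match(rows).group(0)

print(repr(stops_at_dollar), repr(empty_match), runs_across_rows == rows)
```

That last case is why several <tr> elements end up inside one match.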
HTML cannot be trivially parsed by regex (unless it is known ahead of time what the structure will look like) because in order to properly parse a document you need to be able to recurse, as elements within the document can be nested within themselves (for instance a <div> can contain another <div>). While some languages (you didn't specify what you're using) support recursive regular expressions (Perl and PHP, for instance), it would likely be more efficient to use a proper DOM parser than a recursive regex anyway, the complexity of the regex notwithstanding.
Use document.getElementsByTagName in your favorite DOM library and iterate through the nodeList with a loop, then parse the getAttribute('class').
I suggest not using regex because it's only a matter of time before the regex breaks, unless you're dealing with very trivial markup. Besides, the DOM is made for exactly this purpose.
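As a rough illustration of the parser route, here is a sketch using Python's standard html.parser rather than a browser DOM (my substitution, since the question does not name a language). The class name ListRowCollector and the sample table are hypothetical; the real markup is only available at the pastie link.

```python
from html.parser import HTMLParser


class ListRowCollector(HTMLParser):
    """Collects the text of every <tr class='list'> element."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._depth = 0   # > 0 while inside a matching <tr>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "tr":
            return
        if self._depth == 0 and dict(attrs).get("class") == "list":
            self._depth = 1
            self._buffer = []
        elif self._depth > 0:
            self._depth += 1  # nested <tr> inside a matching row

    def handle_endtag(self, tag):
        if tag == "tr" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.rows.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


sample = (
    "<table><tr class='list'><td>one</td></tr>"
    "<tr class='other'><td>skip</td></tr>"
    "<tr class='list'><td>two</td></tr></table>"
)

collector = ListRowCollector()
collector.feed(sample)
print(collector.rows)
```

Each row is delimited by the parser's own start and end events, so there is no need to guard against a second start tag inside the match.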

Regex match an attribute value

What would the regular expression be to return 'details.jsp' (without quotes!) from this original tag? I can quite easily match all of value="details.jsp" but am having trouble matching just the contents of the attribute.
<s:include value="details.jsp" />
Any help greatly appreciated!
Thanks
Lawrence
/value=["']([^'"]+)/ would place "details.jsp" in the first capture group.
Edit:
In response to ircmaxell's comment, if you really need it, the following expression is more flexible:
/value=(['"])(.+)\1/
It will match things like <s:include value="something['else']">, but just note that the value will be placed in the second capture group.
But as mentioned before, regex is not what you want to use for parsing XML (unless it's a really simple case), so don't invest too much time into complex regexes when you should be using a parser.
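Both expressions are easy to try out. The sketch below uses Python's re module, which is my choice since the question does not name a language. The last two lines go beyond the answer: they show that the greedy .+ can run into a second attribute on the same tag, and that a lazy .+? avoids it. The two_attrs tag is a made-up example.

```python
import re

tag = '<s:include value="details.jsp" />'
simple = re.search(r"""value=["']([^'"]+)""", tag)
print(simple.group(1))

# The backreference \1 requires the same quote character that opened the value.
nested = '''<s:include value="something['else']">'''
flexible = re.search(r"""value=(['"])(.+)\1""", nested)
print(flexible.group(2))

# Hypothetical tag with two attributes: greedy vs. lazy quantifier.
two_attrs = '<s:include value="a.jsp" other="b" />'
greedy = re.search(r"""value=(['"])(.+)\1""", two_attrs).group(2)
lazy = re.search(r"""value=(['"])(.+?)\1""", two_attrs).group(2)
print(greedy, "|", lazy)
```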

Regular Expression find a phrase not inside an HTML tag

I'm struggling a bit with this regular expression and wondered if anyone would be able to help me, please?
What I need to do is isolate the 1st phrase inside a string which is NOT inside an HTML tag. So the examples I have at the moment are:
This is some test text about <acronym
title="Incomplete Test Syndrome"
class="CustomClass">ITS</acronym> for
the **ITS** department. Also worth
mentioning ABS as well I guess.ITS,
... and ...
This is some **ITS** test text about
<acronym title="Incomplete Test
Syndrome"
class="GOTManager">ITS</acronym> for
the ITS department. Also worth
mentioning ABS as well I guess
So in the first example I want it to ignore the wrapped ITS and give me the ITS at the end of the 1st sentence.
In the second example I want it to return the ITS at the start of the 2nd sentence.
The aim is to replace these with my own custom wrapped acronym tags in a ColdFusion application I'm writing.
Thanks a lot,
James
As the commentators have pointed out, regular expressions are not a good tool to work with XML/HTML-like texts. That is because being "inside" something is very hard to check for in any generality (you never know in which of these possible unlimited nesting levels you are).
For your particular examples, though, it is possible to do. This relies heavily on not having any nested tags. If you do, you should seriously try a different approach.
Your examples work with
^(?:<[^<]*<[^>]*>|.)*?(ITS)
This matches the entire string up to the first occurrence of ITS that is not in a tag (and puts it in the first capturing group), so it should be easy to extract the data you need from there. Matching only this instance of ITS is not possible, since your implementation of regular expressions does not support arbitrary-length look-behinds.
Ask if you want/need the expression explained. =)
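Here is a small sketch of how the capturing group can be used to locate the unwrapped ITS. It runs the same pattern in Python rather than ColdFusion, which is an assumption on my part; the re.S flag lets . cross the line breaks in the samples, and the sample strings are the question's examples joined onto one line.

```python
import re

# Either skip a whole <tag ...>...</tag> element or a single character,
# lazily, until the first ITS outside an element is reached.
pattern = re.compile(r"^(?:<[^<]*<[^>]*>|.)*?(ITS)", re.S)

first_example = (
    'This is some test text about <acronym title="Incomplete Test Syndrome" '
    'class="CustomClass">ITS</acronym> for the ITS department.'
)
second_example = (
    'This is some ITS test text about <acronym title="Incomplete Test Syndrome" '
    'class="GOTManager">ITS</acronym> for the ITS department.'
)

first_hit = pattern.match(first_example)
second_hit = pattern.match(second_example)

# start(1) is the offset of the unwrapped ITS inside the original string.
print(first_example[first_hit.start(1):])
print(second_example[second_hit.start(1):])
```

With the offset of group 1 in hand, the replacement can be spliced in at that position instead of replacing the whole match.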
I will tell you the same thing I told you when you asked a very similar question:
Stuck with Regular Expression code to apply HTML tag to text but exclude if inside <?> tag
You CANNOT parse HTML, including nested elements, with pure regular expressions. This is a known limitation of regex and is well documented.
You can try installing and using an external regular expression engine with extensions, which might work. You can manually walk the string, counting the nesting to see if the string you are looking at is wrapped. You can use a genuine HTML parser, like WebKit, to do this externally.
But you can't do it with regex. Please look for an alternative. Heck, we'll even help.
You say:
The aim is to replace these with my
own custom wrapped acronym tags in a
ColdFusion application I'm writing.
It sounds like using XSL might be more appropriate than regex to transform one tag into another.
UPDATE:
Just threw this together, it seems to work for simple cases:
(NOTE: this will simply strip out the 'acronym' tags. You could use XSL to replace them with your own custom tags, but you didn't specify anything along those lines so I didn't get into that)
XSL:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="*[name() = 'acronym']" />
</xsl:stylesheet>
Input:
<?xml version="1.0" encoding="UTF-8"?>
<root>
This is some test text about <acronym
title="Incomplete Test Syndrome"
class="CustomClass">ITS</acronym> for
the **ITS** department. Also worth
mentioning ABS as well I guess.ITS,
This is some **ITS** test text about
<acronym title="Incomplete Test
Syndrome"
class="GOTManager">ITS</acronym> for
the ITS department. Also worth
mentioning ABS as well I guess
</root>
Output:
<?xml version="1.0" encoding="UTF-8"?>
This is some test text about for
the **ITS** department. Also worth
mentioning ABS as well I guess.ITS,
This is some **ITS** test text about
for
the ITS department. Also worth
mentioning ABS as well I guess
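If no XSLT processor is at hand, the same stripping step can be sketched with Python's xml.etree.ElementTree. This is a substitute technique, not the stylesheet above, and strip_elements is a hypothetical helper name. Like the XSL, it drops each <acronym> together with its content but keeps the text that follows it.

```python
import xml.etree.ElementTree as ET


def strip_elements(root, tag):
    """Remove every element named `tag`, keeping the text that follows it."""
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag == tag:
                tail = child.tail or ""
                # Re-attach the trailing text to whatever precedes the element.
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child
    return root


source = (
    '<root>This is some test text about <acronym title="Incomplete Test Syndrome" '
    'class="CustomClass">ITS</acronym> for the ITS department.</root>'
)

root = strip_elements(ET.fromstring(source), "acronym")
stripped_text = "".join(root.itertext())
print(stripped_text)
```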
UPDATE:
You said:
So in the first example I want it to
ignore the wrapped ITS and give me the
ITS at the end of the 1st sentence.
In the second example I want it to
return the ITS at the start of the 2nd
sentence.
This makes no sense. Your second example doesn't have "ITS" in the second sentence. I think what you meant was that the **ITS** is what you want to have extracted.
The XSL sample I gave only strips the <acronym/> tags, but after that's done you can try to find the ITS at different points in the sentence, and for that a regex might be easy (this assumes that you ONLY have to worry about the <acronym/> tags).