Regular expression to find last match in XML output - regex

I have been working for days to learn regex so that I can extract the last match out of an xml output of a test from a scientific instrument. The instrument buffer can hold multiple tests and I am only interested in the last (most recent) test. I can't figure it out!
<Ticket class="SAMPLE" serialno="6000SP210134" versions="FP6000;Main:V1.25;COM:V1.7;D:V1.11;TEC:V1.6">
<Measurement>
<SampleId>6</SampleId>
<DateTime>2022-10-28T15:16:22</DateTime>
<Value>300</Value>
<Unit>mOsmol/kg</Unit>
<DeviceCode>6000SP210134</DeviceCode>
<CheckSum>50c5656fd477cbcd3b7a5036ba98a542</CheckSum>
</Measurement>
</Ticket>
<Ticket class="SAMPLE" serialno="6000SP210134" versions="FP6000;Main:V1.25;COM:V1.7;D:V1.11;TEC:V1.6">
<Measurement>
<SampleId>7</SampleId>
<DateTime>2022-10-28T15:18:55</DateTime>
<Value>425</Value>
<Unit>mOsmol/kg</Unit>
<DeviceCode>6000SP210134</DeviceCode>
<CheckSum>50c5656fd477cbcd3b7a5036ba98a542</CheckSum>
</Measurement>
</Ticket>
I need to match and return the value from the last test <Ticket></Ticket> (the number of Tickets is variable). In this example it would be 425.
I thought this might work, but it doesn't...
\<Value>\d{2,4}<\/Value>.*\n$\
This regular expression is executed and interpreted in a lab information management system called LabVantage, not in any language like perl, php, C, etc. A regular expression is the only option I have.

LabVantage does not seem to publicly reveal their regex engine but if you have access to lookarounds then this should work:
<Value>\d{2,4}<\/Value>(?![\s\S]*<\/Value>)
<Value>\d{2,4}<\/Value> - you know what this does, you wrote it =)
(?![\s\S]*<\/Value>) - ahead of me, </Value> does not exist
https://regex101.com/r/XpbOdR/1
If lookbehinds are supported then you can get fancy like this to extract only the digits:
(?<=<Value>)\d{2,4}(?=<\/Value>(?![\s\S]*<\/Value>))
https://regex101.com/r/VCDURX/1
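If you want to try both patterns outside LabVantage, here is a minimal sketch in Python's re module. Python is only a stand-in, since the LabVantage engine is unknown. The XML is a shortened copy of the sample from the question, and the names XML_OUTPUT, LAST_VALUE_TAG and LAST_VALUE_DIGITS are made up for the illustration.

```python
import re

# Shortened copy of the instrument output from the question (two tickets).
XML_OUTPUT = """<Ticket class="SAMPLE" serialno="6000SP210134">
<Measurement>
<SampleId>6</SampleId>
<Value>300</Value>
<Unit>mOsmol/kg</Unit>
</Measurement>
</Ticket>
<Ticket class="SAMPLE" serialno="6000SP210134">
<Measurement>
<SampleId>7</SampleId>
<Value>425</Value>
<Unit>mOsmol/kg</Unit>
</Measurement>
</Ticket>
"""

# Whole <Value> element that has no further </Value> after it.
LAST_VALUE_TAG = re.compile(r"<Value>\d{2,4}</Value>(?![\s\S]*</Value>)")

# Digits only: the lookbehind and lookahead check the tags without consuming them.
LAST_VALUE_DIGITS = re.compile(
    r"(?<=<Value>)\d{2,4}(?=</Value>(?![\s\S]*</Value>))"
)

last_tag = LAST_VALUE_TAG.search(XML_OUTPUT).group(0)
last_digits = LAST_VALUE_DIGITS.search(XML_OUTPUT).group(0)
print(last_tag)
print(last_digits)
```

The negative lookahead is what rejects the first ticket: when the engine tries <Value>300</Value>, another </Value> still exists further on, so that candidate fails and only the final one survives.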

I was not able to coax LabVantage into working with a regular expression in the ways recommended above. However, if any LabVantage user is looking to solve a similar issue, it was resolved by using a Value Extraction Rule like this:
extract /regex/ extract /regex/
or
extract /regex/ extract last number
This type of expression is not explicitly made visible to the user, but it still works. The final code that did work is this:
extract /(?s).*Value>/ extract last number
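For anyone curious why that rule picks the right number, here is a rough emulation in Python. It rests on two assumptions that are not confirmed by any LabVantage documentation: that the first step keeps only the text matched by the regex, and that "extract last number" means the last run of digits in that text. The function name extract_last_value is hypothetical.

```python
import re

# Shortened sample; note the checksum that follows the last <Value> element.
XML_OUTPUT = """<Ticket>
<Measurement>
<Value>300</Value>
<CheckSum>50c5656fd477cbcd3b7a5036ba98a542</CheckSum>
</Measurement>
</Ticket>
<Ticket>
<Measurement>
<Value>425</Value>
<CheckSum>50c5656fd477cbcd3b7a5036ba98a542</CheckSum>
</Measurement>
</Ticket>
"""

def extract_last_value(text: str) -> str:
    # Step 1 (assumed behaviour): keep only what /(?s).*Value>/ matches.
    # The greedy .* with (?s) runs to the LAST "Value>" in the text,
    # so everything after the final </Value> is cut off.
    truncated = re.search(r"(?s).*Value>", text).group(0)
    # Step 2 (assumed behaviour): take the last run of digits in that text.
    return re.findall(r"\d+", truncated)[-1]

last_number = extract_last_value(XML_OUTPUT)
print(last_number)
```

Without the truncation step, the last run of digits in the buffer would come from the trailing checksum rather than from the measurement.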
Thanks all who contributed.

Related

Regex to match everything except a pattern

Regex noob here struggling with this, which I know will be easy for some of you regex gods out there!
Given the following:
title: Some title
date: 2022-08-15
tags: <value to extract>
identifier: 1234567
---------------------------
Some text
some more text
I would like a regex to match everything except the value of tags (ie the "<value to extract>" text).
For context, this is supposed to run on emacs (in case it matters).
EDIT: Just to clarify, as per @phils' question: all I care about is extracting the tags value. However, this is done via a package setting that asks for a regex string, and I don't have much control over how it gets used. It seems to expect a regex that strips what I don't need from the string rather than one that matches what I do want, which is slightly annoying. Also, since it seems to match everything with \\(.\\), I'm guessing it's using the global flag?
Please let me know if any of this isn't clear.
Emacs regular expressions can't trivially express "not foo" for arbitrary values of foo. (The likes of PCRE have non-regular extensions for zero-width negative look-ahead/behind assertions, but in Emacs that sort of functionality is generally done with the support of lisp code [1].)
You can still do it purely with regexp matching, but it's very cumbersome. An Emacs regexp which matches any line which does not begin with tags: is:
^\(?:$\|[^t]\|t[^a]\|ta[^g]\|tag[^s]\|tags[^:]\).*
or if you need to enter it in the elisp double-quoted read syntax for strings:
"^\\(?:$\\|[^t]\\|t[^a]\\|ta[^g]\\|tag[^s]\\|tags[^:]\\).*"
[1] In lisp code you would instead simply check each line to see whether it does start with tags: and, if so, skip it (which is why Emacs generally gets away without the feature you're looking for, but of course that doesn't help you here).
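To sanity-check the hand-built alternation, here is a small sketch that translates it to Python syntax and compares it with the negative lookahead that Emacs lacks. Python is used only because it supports both forms. The sample text is the one from the question, and the constant names are made up for the illustration.

```python
import re

# Python spelling of the Emacs pattern: \(?: ... \) becomes (?: ... ), \| becomes |.
NOT_TAGS_ALTERNATION = re.compile(
    r"^(?:$|[^t]|t[^a]|ta[^g]|tag[^s]|tags[^:]).*", re.MULTILINE
)

# Same intent with a negative lookahead (not available in Emacs regexps).
NOT_TAGS_LOOKAHEAD = re.compile(r"^(?!tags:).*", re.MULTILINE)

SAMPLE = """title: Some title
date: 2022-08-15
tags: <value to extract>
identifier: 1234567
---------------------------
Some text
some more text"""

alternation_lines = [m.group(0) for m in NOT_TAGS_ALTERNATION.finditer(SAMPLE)]
lookahead_lines = [m.group(0) for m in NOT_TAGS_LOOKAHEAD.finditer(SAMPLE)]

# Both patterns should skip exactly the "tags:" line on this sample.
print(alternation_lines == lookahead_lines)
print(len(alternation_lines))
```

Each branch of the alternation rules out one more character of the prefix, which is why the pattern grows with the length of the word being excluded.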
After playing around with it for a bit and taking inspiration from @phils' answer, I've come up with the following:
"^\\(?:\\(#\\+\\)?\\(?:filetags:\s+\\|tags:\s+\\|title:.*\\|identifier:.*\\|date:.*\\)\\|.*\\)"
I've also added an extra \\(#\\+\\)? to account for org meta keys which would usually have the format #+key: value.

Matching within matches by extending an existing Regex

I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to interpret the whole regex, which could be quite a long operation, and then change its internal content to produce the output, as in:
^ca[tr]
or run the original regex and then the second one over the results. I'm taking the original regex as an argument in python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches against 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See the solution including @Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.
You were not far off with what you tried; you just need a lookahead assertion rather than a lookbehind, and a parenthesis was misplaced. The right approach is to put the original pattern in parentheses and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
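Since the original regex arrives as an argument in Python, the wrapping can be sketched as a small helper. This is only an illustration: prefilter and all_matches are hypothetical names, and a non-capturing group is used instead of plain parentheses so that the wrapper does not add a capture group of its own.

```python
import re

def prefilter(original: str, prefix: str) -> str:
    """Wrap an arbitrary pattern so it can only match where `prefix` begins."""
    # (?=...) tests for the prefix without consuming it, so the original
    # pattern still matches from the same position. The (?:...) wrapper
    # keeps a top-level | in `original` from escaping the lookahead.
    return f"(?={re.escape(prefix)})(?:{original})"

def all_matches(pattern: str, text: str) -> list:
    return [m.group(0) for m in re.finditer(pattern, text)]

filtered_words = all_matches(prefilter("cat|car|bat", "ca"), "cat car bat")
filtered_four = all_matches(prefilter("[a-z]{4}", "ca"), "cart bats cave")
print(filtered_words)
print(filtered_four)
```

Because the lookahead consumes nothing, matches of [a-z]{4} stay four characters long; the prefix is part of the match rather than something in front of it.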
Ok, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to attempt to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than the regex, which seems to work in other cases.
See @Nick's comment.
I've also raised an issue on the exrex GitHub for this.

regex finding elements in xml which contain attributes whose values contain two periods

I'm searching some XML and my tool is regex (my only tools in this case are editors, so I'm using either Eclipse or Notepad++). I need to find all elements that contain attributes whose values contain two non-adjacent periods.
So it would find attr1 and attr3 in this:
<myelement attr1 = "ab.cd.ef", attr2="ab", attr3="zy.sa.xa"/>
I've tried this and variations of it in Notepad++:
^(([^\"\.])*(\")[^\"\.]*[\.][^\"\.]*[\.][^\"\.]*[\"])+$
but it isn't picking up second attributes with values containing two periods.
I'm going to keep trying but if someone can point me to an answer I'd appreciate it.
I don't think you can do this with regex,
unless you create a monster regex that will create a black hole swallowing all life on Earth (politely speaking, of course).
Bear in mind that regex has no logic, only pattern matching. For instance, a number is just a number; you can't easily say "if I get 1, then also get 3".
You can use if-then-else in regex like this:
(?(?=condition)(then1|then2|then3)|(else1|else2|else3))
But what you want is to nest if-conditions with multiple conditions for each case, like if 1 then 3 | if 2 then 4 | if 3 then 5, which creates an enormous nested pattern.
Another regex approach would be to use multiple lookarounds (lookaheads in this case), which would make your regex impossible to read.
I think you might find an XPath or XQuery expression more useful for this. That is a better approach to matching XML than regex.
I'm searching some xml and my tool is regex.
That's a bit like saying that you are cutting down trees and your tool is a screwdriver. Get the right tool for the job: an XML parser and an XPath engine.
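As a rough idea of what the parser route looks like, here is a sketch using Python's standard xml.etree.ElementTree. Several things in it are assumptions: the sample element is rewritten without the commas between attributes (with them it is not well-formed XML and a parser will reject it), a <root> wrapper is added, and "two periods not adjacent" is read as exactly two periods that do not touch. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the sample: commas between attributes removed.
DOC = '<root><myelement attr1="ab.cd.ef" attr2="ab" attr3="zy.sa.xa"/></root>'

def has_two_separated_periods(value: str) -> bool:
    # Assumed reading of the requirement: exactly two periods, not adjacent.
    return value.count(".") == 2 and ".." not in value

def find_dotted_attributes(xml_text: str) -> list:
    hits = []
    # iter() walks the root element and all of its descendants.
    for element in ET.fromstring(xml_text).iter():
        for name, value in element.attrib.items():
            if has_two_separated_periods(value):
                hits.append((element.tag, name, value))
    return hits

dotted = find_dotted_attributes(DOC)
for tag, name, value in dotted:
    print(tag, name, value)
```

The check on the attribute value is plain string logic, so it stays readable no matter how many attributes an element carries.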

How to write a regular expression pattern for this scenario

I am trying to find where special characters appear in my sample XML below.
<?xml version="1.0"?>
<PayLoad>
<requestRows>****</requestRows>
<requestRowLength>1272</requestRowLength>
<exceptionTimestamp>2012070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>201$2070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>20120(702022810680700</exceptionTimestamp>
<exceptionDetail>NO DATA AVAILABLE FOR TIME PERIOD SPECIFIED =</exceptionDetail>
</PayLoad>
I have to find the entire tags that contain the characters $, (, = or -. For this I have written the regular expression pattern below:
(<[\w\d]*>\w*(?<value>[^\w]+)\w*\d*</[\w\d]*>)
and it returns the following output (running in the Expresso tool):
<requestRows>****</requestRows>
<exceptionTimestamp>2012070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>20120(702022810680700</exceptionTimestamp>
but it should also return the two entries below:
<exceptionTimestamp>201$2070202281068-0700</exceptionTimestamp>
<exceptionDetail>NO DATA AVAILABLE FOR TIME PERIOD SPECIFIED =</exceptionDetail>
These entries are omitted because they contain more than one special character (including spaces). Can anyone please give me a correct regular expression for the above scenario?
Thanks.
I would use a lookahead for the middle part, so instead of
(<[\w\d]*>\w*(?<value>[^\w]+)\w*\d*</[\w\d]*>)
I would use
(<[\w\d]*>(?=[^<]*[^<\w])(?<value>.*)</[\w\d]*>)
Without the ?<value> part that I don't really recognise the syntax of, this becomes
(<[\w\d]*>(?=[^<]*[^<\w]).*</[\w\d]*>)
Just add capturing groups where you like if you want to save anything in particular.
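Here is a quick way to check the lookahead version against the sample, sketched in Python rather than Expresso. Two adaptations are mine: Python spells the named group (?P<value>...) instead of (?<value>...), and [\w\d] is shortened to \w because \w already includes digits. The text is searched line by line, and the names SPECIAL_TAG, PAYLOAD and flagged are made up for the example.

```python
import re

# After the opening tag, the lookahead requires at least one character
# before the next "<" that is not a word character.
SPECIAL_TAG = re.compile(r"<\w*>(?=[^<]*[^<\w])(?P<value>.*)</\w*>")

PAYLOAD = """<?xml version="1.0"?>
<PayLoad>
<requestRows>****</requestRows>
<requestRowLength>1272</requestRowLength>
<exceptionTimestamp>2012070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>201$2070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>20120(702022810680700</exceptionTimestamp>
<exceptionDetail>NO DATA AVAILABLE FOR TIME PERIOD SPECIFIED =</exceptionDetail>
</PayLoad>"""

flagged = []
for line in PAYLOAD.splitlines():
    match = SPECIAL_TAG.search(line)
    if match:
        flagged.append(match.group(0))

for entry in flagged:
    print(entry)
```

On this sample the pattern picks up every element whose text contains a non-word character, including the two entries the original pattern missed, and skips <requestRowLength>1272</requestRowLength>.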

Regex Pattern Matching Concatenation

Is it possible to concatenate the results of Regex Pattern Matching using only Regex syntax?
The specific instance is a program that allows regex syntax to pull info from a file, but I would like it to pull from several portions and concatenate the results.
For instance:
Input string: 1234567890
Desired result string: 2389
Regex Pattern match: (?<=1).+(?=4)%%(?<=7).+(?=0)
Where %% represents some form of concatenation syntax. Using "starts with" and "ends with" syntax is important since I know the field names but not the values of the fields.
Does a keyword that functions like %% exist? Is there a more clever way to do this? Must the code be changed to allow multiple regex inputs, automatically concatenating?
Again, the pieces to be concatenated may be far apart with unknown characters in between. All that is known is the information surrounding the substrings.
2011-08-08 edit: The program is written in C#, but changing the code is a major undertaking compared to finding a regex-based solution.
Without knowing exactly what you want to match and what language you're using, it's impossible to give you an exact answer. However, the usual way to approach something like this is to use grouping.
In C#:
string pattern = @"(?<=1)(.+)(?=4).+(?<=7)(.+)(?=0)";
Match m = Regex.Match(input, pattern);
// Groups[0] is the whole match; the captured pieces are Groups[1] and Groups[2].
string result = m.Groups[1].Value + m.Groups[2].Value;
The same approach can be applied to many other languages as well.
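For instance, here is the same grouping idea sketched in Python, using the input string from the question. The helper name concat_groups is made up, and returning an empty string when nothing matches is just a choice made for the example.

```python
import re

# Same pattern as above: two capture groups with arbitrary text between them.
PATTERN = re.compile(r"(?<=1)(.+)(?=4).+(?<=7)(.+)(?=0)")

def concat_groups(text: str) -> str:
    match = PATTERN.search(text)
    if match is None:
        return ""
    # match.groups() holds only the captured pieces (groups 1 and 2),
    # not the whole match, so joining them gives the concatenation.
    return "".join(match.groups())

result = concat_groups("1234567890")
print(result)
```

The concatenation happens in the calling code, not in the pattern; the regex only marks which pieces to keep.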
Edit
If you are not able to change the code, then there's no way to accomplish what you want. The reason is that in C#, the regex string itself doesn't have any power over the output. To change the result, you'd have to either change the called method of the Regex class or do some additional work afterwards. As it is, the method called most likely just returns either a Match object or a list of matching objects, neither of which will do what you want, regardless of the input regex string.