Regex for finding XML elements that contain attributes whose values contain two periods

I'm searching some XML and my tool is regex (my only tools in this case are editors, so I'm using either Eclipse or Notepad++). I need to find all elements that contain attributes whose values contain two non-adjacent periods.
So it would find attr1 and attr3 in this:
<myelement attr1 = "ab.cd.ef", attr2="ab", attr3="zy.sa.xa"/>
I've tried this and variations in notepad++
^(([^\"\.])*(\")[^\"\.]*[\.][^\"\.]*[\.][^\"\.]*[\"])+$
but it isn't picking up second attributes with values containing two periods.
I'm going to keep trying but if someone can point me to an answer I'd appreciate it.

I don't think you can do this with regex.
Unless you create a monster regex that will create a black hole swallowing all life on Earth (politely speaking, of course).
Bear in mind that regex has no logic, only pattern matching. A number is just a number; there is no simple way to say "if I get 1, then also get 3".
In engines that support conditionals (PCRE, for example), you can use if-then-else in regex, like:
(?(?=condition)(then1|then2|then3)|(else1|else2|else3))
But what you want requires nesting if conditions with multiple conditions for each case, like if 1 then 3 | if 2 then 4 | if 3 then 5, creating an enormous nested pattern.
Another regex approach would be to use multiple lookarounds (lookaheads in this case), which would make your regex impossible to read.
I think you might find XPath or XQuery expressions more useful for this. They are a better way to match XML than regex.

I'm searching some xml and my tool is regex.
That's a bit like saying that you are cutting down trees and your tool is a screwdriver. Get the right tool for the job: an XML parser and an XPath engine.
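For anyone who can step outside the editor, here is a minimal Python sketch of the parser-based approach. It assumes the input is well-formed XML (so the commas between attributes in the question's sample are removed), and the helper name attrs_with_two_periods and the wrapping <root> element are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Two periods separated by at least one non-period character.
TWO_PERIODS = re.compile(r"\.[^.]+\.")

def attrs_with_two_periods(xml_text):
    """Return (tag, attribute name, value) for every matching attribute."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        for name, value in elem.attrib.items():
            if TWO_PERIODS.search(value):
                hits.append((elem.tag, name, value))
    return hits

# Hypothetical well-formed version of the sample from the question.
sample = '<root><myelement attr1="ab.cd.ef" attr2="ab" attr3="zy.sa.xa"/></root>'
result = attrs_with_two_periods(sample)
print(result)
```

The regex is still there, but it only has to test one attribute value at a time; the parser takes care of finding elements and attributes.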

Related

Regular expression to find last match in XML output

I have been working for days to learn regex so that I can extract the last match out of an xml output of a test from a scientific instrument. The instrument buffer can hold multiple tests and I am only interested in the last (most recent) test. I can't figure it out!
<Ticket class="SAMPLE" serialno="6000SP210134" versions="FP6000;Main:V1.25;COM:V1.7;D:V1.11;TEC:V1.6">
<Measurement>
<SampleId>6</SampleId>
<DateTime>2022-10-28T15:16:22</DateTime>
<Value>300</Value>
<Unit>mOsmol/kg</Unit>
<DeviceCode>6000SP210134</DeviceCode>
<CheckSum>50c5656fd477cbcd3b7a5036ba98a542</CheckSum>
</Measurement>
</Ticket>
<Ticket class="SAMPLE" serialno="6000SP210134" versions="FP6000;Main:V1.25;COM:V1.7;D:V1.11;TEC:V1.6">
<Measurement>
<SampleId>7</SampleId>
<DateTime>2022-10-28T15:18:55</DateTime>
<Value>425</Value>
<Unit>mOsmol/kg</Unit>
<DeviceCode>6000SP210134</DeviceCode>
<CheckSum>50c5656fd477cbcd3b7a5036ba98a542</CheckSum>
</Measurement>
</Ticket>
I need to match and return the last value from the last test <Ticket></Ticket> (the number of Tickets is variable). In this example it would be 425.
I thought this might work, but it doesn't...
\<Value>\d{2,4}<\/Value>.*\n$\
This regular expression is executed and interpreted in a lab information management system called LabVantage, not in any language like perl, php, C, etc. A regular expression is the only option I have.
LabVantage does not seem to publicly reveal their regex engine but if you have access to lookarounds then this should work:
<Value>\d{2,4}<\/Value>(?![\s\S]*<\/Value>)
<Value>\d{2,4}<\/Value> - you know what this does, you wrote it =)
(?![\s\S]*<\/Value>) - ahead of me, </Value> does not exist
https://regex101.com/r/XpbOdR/1
If lookbehinds are supported then you can get fancy like this to extract only the digits:
(?<=<Value>)\d{2,4}(?=<\/Value>(?![\s\S]*<\/Value>))
https://regex101.com/r/VCDURX/1
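To see the lookaround version in action outside LabVantage, here is a small sketch using Python's re module, which supports both lookaheads and fixed-width lookbehinds. The sample text is a shortened copy of the instrument output, and whether LabVantage's engine behaves the same way is not something this can confirm.

```python
import re

# Shortened version of the instrument output from the question.
xml_output = """<Ticket class="SAMPLE" serialno="6000SP210134">
<Measurement>
<SampleId>6</SampleId>
<Value>300</Value>
</Measurement>
</Ticket>
<Ticket class="SAMPLE" serialno="6000SP210134">
<Measurement>
<SampleId>7</SampleId>
<Value>425</Value>
</Measurement>
</Ticket>
"""

# Digits inside <Value>...</Value>, but only when no later </Value> follows.
LAST_VALUE = re.compile(r"(?<=<Value>)\d{2,4}(?=</Value>(?![\s\S]*</Value>))")

match = LAST_VALUE.search(xml_output)
last_value = match.group(0) if match else None
print(last_value)
```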
I was not able to coax LabVantage to work with a regular expression in the ways recommended above. However, if any LabVantage user is looking to solve a similar issue, the way it was resolved was to use a Value Extraction Rule like this:
extract /regex/ extract /regex/
or
extract /regex/ extract last number
This type of expression is not explicitly made visible to the user, but it still works. So the final code that did work is this:
extract /(?s).*Value>/ extract last number
Thanks all who contributed.
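For readers without LabVantage, the two-stage rule above can be imitated roughly in Python. This is only a sketch: it assumes that "extract last number" means "take the last run of digits in the text produced by the first stage", which is a guess about LabVantage's behaviour, and the function name is made up.

```python
import re

def extract_last_value(text):
    """Imitate: extract /(?s).*Value>/ extract last number (assumed semantics)."""
    # Stage 1: greedy match up to and including the last "Value>" in the text.
    stage1 = re.search(r"(?s).*Value>", text)
    if stage1 is None:
        return None
    # Stage 2 (assumption): the last run of digits in the stage 1 result.
    numbers = re.findall(r"\d+", stage1.group(0))
    return numbers[-1] if numbers else None

sample = """<Ticket class="SAMPLE" serialno="6000SP210134">
<Measurement>
<Value>300</Value>
<DeviceCode>6000SP210134</DeviceCode>
</Measurement>
</Ticket>
<Ticket class="SAMPLE" serialno="6000SP210134">
<Measurement>
<Value>425</Value>
<DeviceCode>6000SP210134</DeviceCode>
</Measurement>
</Ticket>
"""

last_number = extract_last_value(sample)
print(last_number)
```

Because the first stage is greedy, it stops at the final closing </Value>, so the digits in the trailing <DeviceCode> line never reach the second stage.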

Problems with finding and replacing

Hey Stack Overflow community. I need help with a huge data file. Is it possible, with a regular expression, to take this tag:
<category_name><![CDATA[Prekiniai ženklai&gt;Adler|Kita buitinė technika&gt;Buičiai naudingi prietaisai|Kita buitinė technika&gt;Lygintuvai]]></category_name>
and somehow replace all the other data, leaving only 'Adler' or 'Lygintuvai'? I'm using Altova to edit XML files, so I can't find any other way than find-replace. I'm new to regex, so I thought maybe you could help me.
#\<category_name\>.+?gt\;([\w]+?)\|.+?gt;([\w]+?)\]\]\>\<\/category_name\>#i
\1 - Adler
\2 - Lygintuvai
PHP
regex101.com
Fields may contain alphanumeric characters without spaces.
If you want to change the set of acceptable characters, change [\w] to something else:
[a-z] - only letters
[0-9] - only digits
etc.
It's possible, but use of regular expressions to process XML will never be 100% correct (you can prove that using computer science theory), and it may also be very inefficient. For example, the solution given by Luk is incorrect because it doesn't allow whitespace in places where XML allows it. Much better to use XQuery or XSLT, both of which are designed for the job (and both work in Altova). You can then use XPath expressions to locate the element or attribute nodes you are interested in, and you can still use regular expressions (e.g. in the XPath replace() function) to process the content of text or attribute nodes.
Incidentally, your input is rather strange because it uses escape sequences like &gt; within a CDATA section, but XML escape sequences are not recognized in a CDATA section.
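If a script is an option, the same idea (let a parser find the element, then do plain string work on its text) can be sketched in Python. This assumes each category path is separated by | and each level by either a literal > or the literal text &gt;, since inside CDATA the escape is not decoded; the function name leaf_categories is made up for illustration.

```python
import xml.etree.ElementTree as ET

def leaf_categories(xml_text):
    """Return the last level of every |-separated category path."""
    elem = ET.fromstring(xml_text)
    text = elem.text or ""
    leaves = []
    for path in text.split("|"):
        # Inside CDATA, "&gt;" is literal text, so normalise it by hand.
        levels = path.replace("&gt;", ">").split(">")
        leaves.append(levels[-1].strip())
    return leaves

sample = (
    "<category_name><![CDATA[Prekiniai ženklai&gt;Adler|"
    "Kita buitinė technika&gt;Buičiai naudingi prietaisai|"
    "Kita buitinė technika&gt;Lygintuvai]]></category_name>"
)
leaves = leaf_categories(sample)
print(leaves[0], leaves[-1])
```

Unlike the [\w] pattern, this does not care whether a category name contains spaces.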

Regex capture words inside tags

Given an XML document, I'd like to be able to pick out individual key/value pairs from a particular tag:
<aaa>key0:val0 key1:val1 key2:val2</aaa>
I'd like to get back
key0:val0
key1:val1
key2:val2
So far I have
(?<=<aaa>).*(?=<\/aaa>)
Which will match everything inside, but as one result.
I also have
[^\s][\w]*:[\w]*[^\s] which will also match correctly in groups on this:
key0:val0 key1:val1 key2:val2
But not with the tags. I believe this is an issue with searching for subgroups and I'm not sure how to get around it.
Thanks!
You cannot combine the two expressions in the way you want, because you have to match each occurrence of "key:value".
So in what you came up with - (?<=<abc>)([\w]*:[\w]*[\s]*)+(?=<\/abc>) - there are two matches to think about. The overall match covers everything inside the tags, while the capturing group matches a single "key:value" occurrence. The regex engine does not keep each repetition of a repeated group; it only gives you the last one.
In Python terms, on the match object obtained after applying your regex, you have access to matcher.group(0) (the whole match) and matcher.group(1) (the last repetition of the group), because the regex has a single pair of capturing ( ) parentheses.
But what you want is the n occurrences of "key:value". So it's easier to just run the simpler \w+:\w+ regex on the string inside the tags.
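A minimal Python sketch of that two-step idea, using only the standard re module. The function name key_value_pairs is made up, and it assumes the tag has no attributes and is not nested.

```python
import re

def key_value_pairs(text, tag="aaa"):
    """Step 1: grab the text inside each <tag>...</tag>. Step 2: find the pairs."""
    tag = re.escape(tag)
    pairs = []
    for inner in re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL):
        pairs.extend(re.findall(r"\w+:\w+", inner))
    return pairs

sample = "<aaa>key0:val0 key1:val1 key2:val2</aaa>"
pairs = key_value_pairs(sample)
print(pairs)

# For comparison: a repeated capturing group only keeps its last repetition.
single = re.search(r"(?<=<aaa>)(\w+:\w+\s*)+(?=</aaa>)", sample)
last_repetition = single.group(1)
print(last_repetition)
```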
I uploaded this one at parsemarket, and I'm not sure it's what you are looking for, but maybe something like this:
(<aaa>)((\w+:\w+\s)*(\w+:\w+)*)(<\/aaa>)
AFAIK, unless you know how many k:v pairs are in the tags, you can't capture all of them in one regex. So, if there are only three, you could do something like this:
<(?:aaa)>(\w+:\w+\s*)+(\w+:\w+\s*)+(\w+:\w+\s*)+<(?:\/aaa)>
But I would think you would want to do some sort of loop with whatever language you are using. Or, as some of the comments suggest, use the parser classes in the language. I've used BeautifulSoup in Python for HTML.

Efficiently match table of regex to a string

Typically, you have a regular expression and lots of strings to process.
I have the opposite. I have one string, and I want to find all the regular expressions that match it. Let's say I have 10 million regular expressions. I'm not trying to do any replacement or rewriting of the string, I just want to find things that match.
I'd like to store these in a database. A crude way to do this would be to select all ten million lines and iterate through them. For each iteration, apply the regex and somehow (I'm a little unclear on this piece too) see if it matches. Perhaps my regex library has a function which I give it a string and a regex, and it tells me if it matches. If it does, then I print out the regex.
This would be slow. I'm wondering if I can somehow hand this off to a database, so that it just returns me a table of the regular expressions that match a given string, out of its table of 10 million.
I'm agnostic on the database used, I'd just like it to be fast. I don't need it to be "custom assembler" fast but just "let the database figure it out so I don't have to iterate on 10 million lines" fast.
I'm wondering if I can somehow hand this off to a database, so that it just returns me a table of the regular expressions that match a given string
At least mysql can do this:
SELECT regex FROM table_with_regexes WHERE
'someString' REGEXP regex;
Also, it would be helpful if you told us more about your actual problem. I don't think you wrote ten million regexes by hand; they must have been automatically generated, so tell us how.
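To try the "string on the left, stored pattern on the right" idea without a MySQL server, here is a sketch using Python's built-in sqlite3 module. SQLite has no REGEXP implementation of its own, so the code registers one; the table name regexes and column name pattern are made up, and the patterns follow Python's regex syntax rather than MySQL's.

```python
import re
import sqlite3

def regexp(pattern, text):
    """SQLite evaluates `X REGEXP Y` by calling regexp(Y, X)."""
    try:
        return 1 if re.search(pattern, text) else 0
    except re.error:
        return 0  # treat an invalid stored pattern as "no match"

def matching_patterns(conn, text):
    rows = conn.execute(
        "SELECT pattern FROM regexes WHERE ? REGEXP pattern", (text,)
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE regexes (pattern TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO regexes VALUES (?)",
    [(r"^\d+$",), (r"^[a-z]+$",), (r"ab",), (r"(unclosed",)],
)

matches = matching_patterns(conn, "abc")
print(sorted(matches))
```

Note that the database still evaluates every row, so this moves the loop into the query rather than removing it.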
In your case, I would proceed in three steps:
Step 1: Find a first SQL query
Build a SQL query that searches for the regexes matching my string.
I would start with a small regex set while building the SQL query.
Step 2: Refine it if necessary
Add more regexes and see how the SQL query performs.
I would optimize or rewrite it here if necessary.
Step 3: Use the chosen database's optimization tools
I would simply fine-tune my SQL query to respond as quickly as possible.
I would use hints for the SQL engine, indices, parallel execution, etc.
Handing off all the hard work to the database is a good approach since, IMO, it's elegant and clear.

How to enclose text patterns within xml elements, except when it is already inside a certain xml element?

I have several thousand xml files generated from java properties files prepared for translation in the TTX format. They contain quite a few variables, that I need to protect from the translators, as they often break such things. The variables are in the form of numbers or occasionally text between a pair of curly braces eg. {0}, {this}.
I need to surround these variables with an xml element if they are not already an attribute and if they are not already part of the inner text of a ut element, like so:
<ut DisplayText="{0}"><{0}></ut>
My input looks like this:
<ut Type="start"DisplayText="string"><string></ut> text string {0}
<ut DisplayText="{1}"><{1}></ut> in:
<ut DisplayText="\n"><\n/></ut> {2}.
<ut Type="end" DisplayText="resource"></resource></ut>
The correct output should be this:
<ut Type="start"DisplayText="string"><string></ut> text string <ut DisplayText="{0}">{0}</ut>
<ut DisplayText="{1}"><{1}></ut> in:
<ut DisplayText="\n"><\n/></ut> <ut DisplayText="{2}">{2}</ut>.
<ut Type="end" DisplayText="resource"></resource></ut>
My initial approach was to use a regular expression to match the term in the braces and just build the xml elements around it with pattern substitution. This approach fails when the pattern is already present, as in the first code block above.
Previous find and replace patterns (in Notepad++):
Find
({[A-Za-z0-9]*})
Replace
<ut DisplayText="\1">\1</ut>
It is beginning to look like regex is not the right tool for the job, so I would like some suggestions on better approaches to take, different tools, or even just a more complete regex that may allow me to solve this quickly and repeatably.
Update: The problem turned out to be a little more complex than previously envisioned. It seems there are also a couple more things that needed protecting, involving some rather obscure syntax, mixing variables with text in what appears to be some kind of conditional statement. From memory:
{o,choice|1#1 error|1<{0,number,integer} errors}
Where "error" and "errors" are translatable and should not be protected. The simplest solution we have at present is to run the above regex, fix the odd few errors it creates and then run a couple more normal find & replace passes for the more complex items. It could be abstracted out as regex, but right now there is not much point in doing that.
I appreciate the pointers to xslt and other editors with better regex support, in addition to the improved expressions offered. I will have a play with some of the options when time allows.
Let me know if my assumption is wrong, but from your example it seems you want to change text that is in {} and not in a <ut> element. To me this seems like an easy use of XSLT. Simply output UT elements as they are and process any text in between.
Why not try using the expression
(?<=.){[A-Za-z0-9]+}(?=.$)
This would find the { with 1 or more letters or numbers and the } when this pattern follows the tag and any number of spaces AND is followed by any number of spaces and a line break.
I ended up using a combination of the Regex in the question and manually fixing the odd error that caused. It wasn't ideal but it was quicker than trying to find the perfect solution.
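For anyone landing here later, one scriptable way to skip the placeholders that are already protected is to let the regex match whole ut elements first and only wrap what is left. Below is a minimal Python sketch; it assumes ut elements are not nested and their attribute values contain no > character, it uses + instead of * so an empty {} is not wrapped, and it does not handle the choice syntax from the update. The function name protect_placeholders is made up.

```python
import re

# First alternative: an entire <ut ...>...</ut> element, returned unchanged.
# Second alternative (group 1): a bare {placeholder} outside any ut element.
TOKEN = re.compile(r"<ut\b[^>]*>.*?</ut>|(\{[A-Za-z0-9]+\})", re.DOTALL)

def protect_placeholders(text):
    def wrap(match):
        placeholder = match.group(1)
        if placeholder is None:
            return match.group(0)  # existing ut element, keep as-is
        return f'<ut DisplayText="{placeholder}">{placeholder}</ut>'
    return TOKEN.sub(wrap, text)

line = '<ut DisplayText="{1}"><{1}></ut> in: {2}.'
protected = protect_placeholders(line)
print(protected)
```

Running it a second time leaves the text unchanged, because the newly added ut elements are themselves skipped.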