find string that is missing substring in xml files regular expression - regex

This is my reg expression that find it
(<instance_material symbol="material_)([0-9]+)(part)(.*?)(")(/)(>)
I need to find a string that does not contain the word "part"
and the xml lines are
<instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>
<instance_material symbol="material_677" target="#material_677"/>

You can use negative lookahead
^(?!.*part).*?$
^ - start of string.
(?!.*part) - condition to avoid part.
.*? - Match anything except new line.
$ - End of string
Demo

Many regex starters will encounter the problem finding a string not containing certain words. You could find more useful tips on Regular-Expression.info.
^((?!part).)*$

You need to be aware that all attempts to process XML using regular expressions are wrong, in the sense that (a) there will be some legitimate ways of writing the XML document that the regex doesn't match, and (b) there will be some ways of getting false matches, e.g. by putting nasty stuff in XML comments. Sometimes being right 99% of the time is OK of course, but don't do this in production because soon we'll have people writing on SO "I need to generate XML with the attributes in a particular order because that's what the receiving application requires."
Your regex, for example, requires the attribute to be in double rather than single quotes, and it doesn't allow whitespace around the "=" sign, or in several other places where XML allows whitespace. If there's any risk of people deliberately trying to defeat your regex, you need to consider tricks like people writing p in place of p.
Even if this is a one-off with no risk of malicious subversion, you're much better off doing this with XPath. It then becomes a simple query like //instance_materal[#symbol[not(contains(., 'part'))]]

Related

Regular expression/Regex with Java/Javascript: performance drop or infinite loop

I want here to submit a very specific performance problem that i want to understand.
Goal
I'm trying to validate a custom synthax with a regex. Usually, i'm not encountering performance issues, so i like to use it.
Case
The regex:
^(\{[^\][{}(),]+\}\s*(\[\s*(\[([^\][{}(),]+\s*(\(\s*([^\][{}(),]+\,?\s*)+\))?\,?\s*)+\]\s*){1,2}\]\s*)*)+$
A valid synthax:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]
You could find the regex and a test text here :
https://regexr.com/3jama
I hope that be sufficient enough, i don't know how to explain what i want to match more than with a regex ;-).
Issue
Applying the regex on valid text is not costing much, it's almost instant.
But when it comes to specific not valid text case, the regexr app hangs. It's not specific to regexr app since i also encountered dramatic performances with my own java code or javascript code.
Thus, my needs is to validate all along the user is typing the text. I can even imagine validating the text on click, but i cannot afford that the app will be hanging if the text submited by the user is structured as the case below, or another that produce the same performance drop.
Reproducing the issue
Just remove the trailing "]" character from the test text
So the invalid text to raise the performance drop becomes:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4
Another invalid test could be, and with no permformance drop:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]]
Request
I'll be glad if a regex guru coming by could explain me what i'm doing wrong, or why my use case isn't adapted for regex.
This answer is for the condensed regex from your comment:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+\,?)+\))?\,?)+\]){1,2}\])*)+$
The issues are similar for your original pattern.
You are facing catastrophic backtracking. Whenever the regex engine cannot complete a match, it backtracks into the string, trying to find other ways to match the pattern to certain substrings. If you have lots of ambiguous patterns, especially if they occur inside repetitions, testing all possible variations takes a looooong time. See link for a better explanation.
One of the subpatterns that you use is the following (multilined for better visualisation):
([^\][{}(),]+
(\(
([^\][{}(),]+\,?)+
\))?
\,?)+
That is supposed to match a string like actor4(syno3, syno4). Condensing this pattern a little more, you get to ([^\][{}(),]+,?)+. If you remove the ,? from it, you get ([^\][{}(),]+)+ which is an opening gate to the catasrophic backtracking, as string can be matched in quite a lot of different ways with this pattern.
I get what you try to do with this pattern - match an identifier - and maybe other other identifiers that are separated by comma. The proper way of doing this however is: ([^\][{}(),]+(?:,[^\][{}(),]+)*). Now there isn't an ambiguous way left to backtrack into this pattern.
Doing this for the whole pattern shown above (yes, there is another optional comma that has to be rolled out) and inserting it back to your complete pattern I get to:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+)*)\))?(?:\,[^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+))*\))?)*)\]){1,2}\])*)+$
Which doesn't catastrophically backtrack anymore.
You might want to do yourself a favour and split this into subpatterns that you concat together either using strings in your actual source or using defines if you are using a PCRE pattern.
Note that some regex engines allow the use of atomic groups and possessive quantifiers that further help avoiding needless backtracking. As you have used different languages in your title, you will have to check yourself, which one is available for your language of choice.

Problems with finding and replacing

Hey stackoverflow community. Ive need help with huge information file. Is it possible with regular expression to find in this tag:
<category_name><![CDATA[Prekiniai ženklai>Adler|Kita buitinė technika>Buičiai naudingi prietaisai|Kita buitinė technika>Lygintuvai]]></category_name>
Somehow replace all the other data and leave only 'Adler' or 'Lygintuvai'. Im using Altova to edit xml files, so i cant find other way then find-replace. And im new in the regex stuff. So i thought maby you can help me.
#\<category_name\>.+?gt\;([\w]+?)\|.+?gt;([\w]+?)\]\]\>\<\/category_name\>#i
\1 - Adler
\2 - Lygintuvai
PHP
regex101.com
Fields may contain alphanumeric characters without spaces.
If you want to modify the scope of acceptable characters change [\w] to something other:
[a-z] - only letters
[0-9] - only digits
etc.
It's possible, but use of regular expressions to process XML will never be 100% correct (you can prove that using computer science theory), and it may also be very inefficient. For example, the solution given by Luk is incorrect because it doesn't allow whitespace in places where XML allows it. Much better to use XQuery or XSLT, both of which are designed for the job (and both work in Altova). You can then use XPath expressions to locate the element or attribute nodes you are interested in, and you can still use regular expressions (e.g. in the XPath replace() function) to process the content of text or attribute nodes.
Incidentally, your input is rather strange because it uses escape sequences like > within a CDATA section; but XML escape sequences are not recognized in a CDATA section.

Regex exclude value from repetition

I'm working with load files and trying to write a regex that will check for any rows in the file that do not have the correct number of delimiters. Let's pretend the delimiter is % (I'm not sure this text field supports the delimiters that are in the load file). The regex I wrote that finds all rows that are correct is:
^%([^%]*%){20}$
Sometimes it's more beneficial to find any rows that do not have the correct number of delimiters, so to accomplish that, I wrote this:
(^%([^%]*%){0,19}$)|(^%([^%]*%){21,}$)
I'm concerned about the efficiency of this (and any regex I write in general), so I'm wondering if there's a better way to write it, or if the way I wrote it is fine. I thought maybe there would be some way to use alternation with the repetition tokens, such as:
{0,19}|{21,}
but that doesn't seem to work.
If it's helpful to know, I'm just searching through the files in Sublime Text, which I believe uses PCRE. I'm also open to suggestions for making the first regex better in general, although I've found it to work pretty well even in exceptionally large load files.
If your regex engine supports negative lookaheads, you can slightly modify your original regex.
^(?!%([^%]*%){20}$)
The regex above is useful for test only. If you want to capture, then you need to add .* part.
^(?!%([^%]*%){20}$).*$

Reg ex to match everything between two tags

I have a string similair to this
<td><p>alakjsdlajsdlkj</p><p><b>asdkjalsdkjaskldj</b></p><p>asdjlaksjdlaksjd</p></td>
What is the regular expression to grab everything between the tags?
I want to grab the following (including the HTML)
<p>alakjsdlajsdlkj</p><p><b>asdkjalsdkjaskldj</b></p><p>asdjlaksjdlaksjd</p>
You can't accomplish this with regular expressions. They just aren't descriptive/powerful enough, mainly in that there is no mechanism to keep track of how many of something it has seen. In short, this is because the regex mechanism has no notion of a stack (it represents finite state machines, not pushdown automata).
For example, consider the pattern <p>(.*)</p>. If you used greedy mode (match as much as possible) and have a string like <p>first</p><p>second</p>, the match will be first</p><p>second. If you went with non-greedy mode (make the smallest match possible) and get a string like <p><p>stuff</p></p>, you'll be rewarded with the match <p>stuff. So neither mode covers all cases (or any case) well.
As #kristopher points out, it's possible to have a pattern that avoids including another tag inside the match, but this will only match innermost tags.
To do what you want robustly, you'll need an actual parser. Several html parsing solutions have been created by others, or for simple parsing needs, you might be able to write your own.
if your tags nest this gets messy fast.
are you unable to use an html parser library? It would be FAR better to do so.
<([^>]+)>([^<]+)</\1>
gets you
any string wrapped in angle brackets
plus any characters up until the next tag
this doesn't handle nested or mismatched tags though
<div>test <b>nested</b></div>
will only catch the
< b >
not the div since the < div > will encounter the start of the < b > before encountering the end of its own tag.
try this, it should just match the outermost tags and return the inner string in the group
^<\w+>(.*)</\w+>$
But it does not checks for correct nesting etc. Use an appropriate framework if possible.
If you can't use an HTML parser and the td and ending td are at the beginning and end of the string:
^<td>(.*)</td>$

Regex assistance: include/exclude

Hello I am trying to figure out this RegEx expression. I have a URL that can have different querystring parameter at different location.
test.aspx?foo=bar&abc=123
test.aspx?abc=123&foo=bar
test.aspx?foo=bar&abc=123#T1
test.aspx?abc=123&foo=bar#T2
I am trying to only find the one without the #Tnumber
Here what I have so far.
test.aspx\?(?!\#T[0-9])
However it still select all of them, is there a way to have a string constant and scan it down the line?
Juniorflip
If #Tnum is always at the end, you just need to do a bit of anchoring. For example, like this:
test.aspx\?.*(?!\#T[0-9])...$
But that's very fragile as it depends on the bad URLs always ending in a very particular form and good URLs always having enough characters to soak up that end matching. A negative lookbehind assertion is somewhat better, but still fragile and less commonly supported:
test.aspx\?.*(?<!\#T[0-9])$
It's better to write a regular expression that matches what you don't want and to just invert the logic of what to do when you get a match (i.e., "if it matches throw it away", instead of "if it matches use it"). But really it's much better to delegate the parsing of the URLs to a specialized library and then just do a simpler check against the fragment identifier as a logical component instead of as a horrible RE hack.