Regex assistance: include/exclude - regex

Hello, I am trying to figure out this regular expression. I have a URL whose query string parameters can appear in different positions:
test.aspx?foo=bar&abc=123
test.aspx?abc=123&foo=bar
test.aspx?foo=bar&abc=123#T1
test.aspx?abc=123&foo=bar#T2
I am trying to match only the ones without the trailing #T followed by a number.
Here is what I have so far:
test.aspx\?(?!\#T[0-9])
However, it still selects all of them. Is there a way to keep the constant prefix and have the check applied further down the string?
Juniorflip

If #Tnum is always at the end, you just need to do a bit of anchoring. For example, like this:
test.aspx\?.*(?!\#T[0-9])...$
But that's very fragile as it depends on the bad URLs always ending in a very particular form and good URLs always having enough characters to soak up that end matching. A negative lookbehind assertion is somewhat better, but still fragile and less commonly supported:
test.aspx\?.*(?<!\#T[0-9])$
It's better to write a regular expression that matches what you don't want and to just invert the logic of what to do when you get a match (i.e., "if it matches throw it away", instead of "if it matches use it"). But really it's much better to delegate the parsing of the URLs to a specialized library and then just do a simpler check against the fragment identifier as a logical component instead of as a horrible RE hack.
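As a minimal sketch of that last suggestion, assuming Python and its standard urllib.parse module (neither appears in the question, and the helper name has_t_fragment is made up for illustration): split the URL into components first, then test only the fragment, and keep whatever does not match.

```python
import re
from urllib.parse import urlsplit

def has_t_fragment(url: str) -> bool:
    """Return True when the URL's fragment looks like T1, T2, ..."""
    fragment = urlsplit(url).fragment  # text after '#', or '' if absent
    return re.fullmatch(r"T[0-9]+", fragment) is not None

urls = [
    "test.aspx?foo=bar&abc=123",
    "test.aspx?abc=123&foo=bar",
    "test.aspx?foo=bar&abc=123#T1",
    "test.aspx?abc=123&foo=bar#T2",
]

# Inverted logic: keep whatever does NOT match the unwanted form.
wanted = [u for u in urls if not has_t_fragment(u)]
```

Here wanted ends up holding the first two URLs, regardless of the order of the query parameters.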

Related

find string that is missing substring in xml files regular expression

This is the regular expression I have; it finds the lines that contain "part":
(<instance_material symbol="material_)([0-9]+)(part)(.*?)(")(/)(>)
I need to find the lines that do not contain the word "part".
The XML lines are:
<instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>
<instance_material symbol="material_677" target="#material_677"/>
You can use a negative lookahead:
^(?!.*part).*?$
^ - start of string.
(?!.*part) - condition to avoid part.
.*? - Match anything except new line.
$ - End of string
Many people new to regex run into the problem of finding a string that does not contain a certain word. You can find more useful tips on Regular-Expressions.info.
^((?!part).)*$
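For what it's worth, here is a small sketch that runs both lookahead patterns against the two lines from the question. Python's re module is an assumption here (the question does not name a language), and the variable names are only illustrative.

```python
import re

lines = [
    '<instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>',
    '<instance_material symbol="material_677" target="#material_677"/>',
]

# One lookahead at the start, scanning the whole line for "part".
lookahead_once = re.compile(r"^(?!.*part).*?$")
# A lookahead repeated before every single character.
lookahead_each = re.compile(r"^((?!part).)*$")

without_part = [s for s in lines if lookahead_once.match(s)]
without_part_alt = [s for s in lines if lookahead_each.match(s)]
```

Both lists contain only the second line. The first form checks once per line, while the second repeats the check at every character position.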
You need to be aware that all attempts to process XML using regular expressions are wrong, in the sense that (a) there will be some legitimate ways of writing the XML document that the regex doesn't match, and (b) there will be some ways of getting false matches, e.g. by putting nasty stuff in XML comments. Sometimes being right 99% of the time is OK of course, but don't do this in production because soon we'll have people writing on SO "I need to generate XML with the attributes in a particular order because that's what the receiving application requires."
Your regex, for example, requires the attribute to be in double rather than single quotes, and it doesn't allow whitespace around the "=" sign, or in several other places where XML allows whitespace. If there's any risk of people deliberately trying to defeat your regex, you need to consider tricks like people writing a character reference such as &#x70; in place of p.
Even if this is a one-off with no risk of malicious subversion, you're much better off doing this with XPath. It then becomes a simple query like //instance_material[@symbol[not(contains(., 'part'))]]
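A rough sketch of the parser-based approach, assuming Python's standard xml.etree.ElementTree. Its XPath subset has no contains(), so the substring test is done in Python rather than in the query. The wrapper element and the third, single-quoted line are made up to show that quoting and whitespace stop mattering once a real parser is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper document; the third element is an invented variant
# with single quotes and extra whitespace.
doc = """<root>
  <instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>
  <instance_material symbol="material_677" target="#material_677"/>
  <instance_material symbol = 'material_12' target='#material_12' />
</root>"""

root = ET.fromstring(doc)

# Select elements that carry a symbol attribute, then filter in Python.
no_part = [
    el.get("symbol")
    for el in root.iterfind(".//instance_material[@symbol]")
    if "part" not in el.get("symbol")
]
```

With a full XPath 1.0 engine, the query from the answer could be used directly instead of the Python-side filter.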

How to store regex "literals" in Postgres?

I want to store regex pattern/option "literals" in a Postgres database, like:
/<pattern>/options
I think it's helpful to indicate the expected format and use of the text. Also, the application framework I'm using can coerce this kind of text into the proper Regex type.
I looked through the data types and provided extensions and didn't see anything specific. Am I missing one?
If there is no specialized type, is there a reasonable way to constrain TEXT to likely contain a regex (not to validate the regex, just to ensure text between forward-slashes). Does this work?
pattern TEXT CONSTRAINT is_regex CHECK (pattern LIKE '/%/%')
At the moment, I'm only using these literals in application code, which is why the TEXT to Regex transformation is very helpful. At some point, I might get better at CTEs and transform them back to regular TEXT (without forward-slashes or options) to be used in Postgres pattern matching functions.
PostgreSQL doesn't offer such type (as of now), but generally speaking you have a few options to preserve database integrity (I can only assume you want this to avoid worrying that the data you read from the database fails your application, because it's not a valid regular expression).
Your best bet (which you have already figured out) is to use a CHECK constraint, one way or the other. If you plan to use this pattern in multiple places, I suggest using a domain type. That way, you don't have to repeat the constraint on every column. Ironically, the best way to write such a CHECK constraint is to write a regexp pattern that matches your regexp patterns (because there are multiple regexp implementations with slight differences). It obviously won't be perfect, but it might be good enough, e.g.:
create domain likely_regexp as text
check (value ~ '^/([^/]*(\\/[^/]*)*[^\\])?/[a-z]*$');
But if you're okay with checking against PostgreSQL's own implementation, you can (ab)use the fact that a CHECK constraint fails not only when the evaluated expression is false, but also when the expression raises an error. So you can call a regexp function to detect whether the value is actually a valid regular expression. You still have to split the value into its pattern and options parts, though:
create domain pg_regexp as text
check (regexp_replace('', replace(substring(value from '^/(.*)/'), '\/', '/'),
'', substring(value from '/([^/]*)$')) = '');
https://rextester.com/YFG18381
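The same split-then-compile idea can be sketched on the application side. This is only an illustration in Python: the function name parse_regex_literal and the option letters in FLAG_MAP are assumptions, not part of any framework mentioned in the question.

```python
import re

# Assumed mapping from option letters to Python flags (illustrative only).
FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL, "x": re.VERBOSE}

def parse_regex_literal(text: str) -> re.Pattern:
    """Turn '/pattern/options' into a compiled pattern, or raise ValueError."""
    end = text.rfind("/")  # the last '/' separates pattern from options
    if not text.startswith("/") or end == 0:
        raise ValueError(f"not a /pattern/options literal: {text!r}")
    body, options = text[1:end], text[end + 1:]
    flags = 0
    for letter in options:
        if letter not in FLAG_MAP:
            raise ValueError(f"unknown option {letter!r}")
        flags |= FLAG_MAP[letter]
    try:
        # Undo the '\/' escaping, as the replace() call in the domain does.
        return re.compile(body.replace(r"\/", "/"), flags)
    except re.error as exc:
        raise ValueError(f"invalid pattern: {exc}") from exc
```

Like the pg_regexp domain, this validates against one particular regex implementation (Python's here), so a literal accepted by one side may still be rejected by the other.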

Regular expression/Regex with Java/Javascript: performance drop or infinite loop

I want to submit a very specific performance problem that I would like to understand.
Goal
I'm trying to validate a custom syntax with a regex. I don't usually run into performance issues with regexes, so I like to use them.
Case
The regex:
^(\{[^\][{}(),]+\}\s*(\[\s*(\[([^\][{}(),]+\s*(\(\s*([^\][{}(),]+\,?\s*)+\))?\,?\s*)+\]\s*){1,2}\]\s*)*)+$
A valid syntax:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]
You could find the regex and a test text here :
https://regexr.com/3jama
I hope that is sufficient; I don't know how to explain what I want to match better than with the regex itself ;-).
Issue
Applying the regex to valid text costs very little; it's almost instant.
But for certain invalid text, the regexr app hangs. This is not specific to regexr, since I also saw dramatic slowdowns with my own Java and JavaScript code.
I need to validate the text as the user types it. I could even settle for validating on click, but I cannot afford to have the app hang when the text submitted by the user is structured like the case below, or like any other case that produces the same performance drop.
Reproducing the issue
Just remove the trailing "]" character from the test text
So the invalid text to raise the performance drop becomes:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4
Another invalid test text, this one with no performance drop, could be:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]]
Request
I'd be glad if a regex guru passing by could explain what I'm doing wrong, or why my use case isn't suited to regex.
This answer is for the condensed regex from your comment:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+\,?)+\))?\,?)+\]){1,2}\])*)+$
The issues are similar for your original pattern.
You are facing catastrophic backtracking. Whenever the regex engine cannot complete a match, it backtracks into the string, trying to find other ways to match the pattern to certain substrings. If you have lots of ambiguous patterns, especially if they occur inside repetitions, testing all possible variations takes a looooong time. See link for a better explanation.
One of the subpatterns that you use is the following (multilined for better visualisation):
([^\][{}(),]+
(\(
([^\][{}(),]+\,?)+
\))?
\,?)+
That is supposed to match a string like actor4(syno3, syno4). Condensing this pattern a little more, you get to ([^\][{}(),]+,?)+. If you remove the ,? from it, you get ([^\][{}(),]+)+, which opens the gate to catastrophic backtracking, as a string can be matched in quite a lot of different ways with this pattern.
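To get a feel for "quite a lot of different ways", here is a small counting sketch (Python, with an illustrative function name). It does not run a regex engine. It only counts how many ways a run of n identifier characters can be cut into non-empty chunks, where each chunk stands for one iteration of the outer + and the inner + consumes the chunk.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_partitions(n: int) -> int:
    """Number of ways to cut n characters into consecutive non-empty chunks."""
    if n == 0:
        return 1  # nothing left to cut: exactly one way to finish
    # The first chunk takes k characters; the remainder is cut recursively.
    return sum(count_partitions(n - k) for k in range(1, n + 1))

counts = {n: count_partitions(n) for n in (1, 5, 10, 20)}
```

The count doubles with every extra character (2^(n-1) ways), so a 20-character run already allows 524288 splits that a backtracking engine may have to try before it can report a failure.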
I get what you are trying to do with this pattern: match an identifier, possibly followed by other identifiers separated by commas. The proper way of doing this, however, is: ([^\][{}(),]+(?:,[^\][{}(),]+)*). Now there is no ambiguous way left to backtrack into this pattern.
Doing this for the whole pattern shown above (yes, there is another optional comma that has to be rolled out) and inserting it back to your complete pattern I get to:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+)*)\))?(?:\,[^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+)*)\))?)*)\]){1,2}\])*)+$
Which doesn't catastrophically backtrack anymore.
You might want to do yourself a favour and split this into subpatterns that you concat together either using strings in your actual source or using defines if you are using a PCRE pattern.
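As a sketch of what that splitting could look like, here is the unrolled pattern assembled from named pieces in Python. The constant names are mine, and I use non-capturing groups throughout, so this is a variant of the pattern above rather than a character-for-character copy.

```python
import re

IDENT = r"[^\][{}(),]+"                               # one identifier
IDENT_LIST = IDENT + "(?:," + IDENT + ")*"            # ident,ident,...
SYNONYMS = r"(?:\(" + IDENT_LIST + r"\))?"            # optional (syno1,syno2)
ITEM = IDENT + SYNONYMS                               # actor2(syno1,syno2)
ITEM_LIST = ITEM + "(?:," + ITEM + ")*"               # item,item,...
INNER = r"\[" + ITEM_LIST + r"\]"                     # [item,item]
BLOCK = r"\[(?:" + INNER + r"){1,2}\]"                # [[...]] or [[...][...]]
SECTION = r"\{" + IDENT + r"\}(?:" + BLOCK + ")*"     # {Section}[[...]]...
SYNTAX = re.compile("^(?:" + SECTION + ")+$")

def is_valid(text: str) -> bool:
    """Return True when the whole text follows the custom syntax."""
    return SYNTAX.match(text) is not None
```

Each piece can be tested on its own, and every comma-separated list uses the same unrolled shape, so there is only one way to split a run of identifier characters.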
Note that some regex engines allow the use of atomic groups and possessive quantifiers that further help avoiding needless backtracking. As you have used different languages in your title, you will have to check yourself, which one is available for your language of choice.

Regex Pattern Matching Concatenation

Is it possible to concatenate the results of Regex Pattern Matching using only Regex syntax?
The specific case is a program that accepts regex syntax to pull info from a file, but I would like it to pull from several portions and concatenate the results.
For instance:
Input string: 1234567890
Desired result string: 2389
Regex Pattern match: (?<=1).+(?=4)%%(?<=7).+(?=0)
Where %% represents some form of concatenation syntax. Using starting and ending with syntax is important since I know the field names but not the values of the field.
Does a keyword that functions like %% exist? Is there a more clever way to do this? Must the code be changed to allow multiple regex inputs, automatically concatenating?
Again, the pieces to be concatenated may be far apart with unknown characters in between. All that is known is the information surrounding the substrings.
2011-08-08 edit: The program is written in C#, but changing the code is a major undertaking compared to finding a regex-based solution.
Without knowing exactly what you want to match and what language you're using, it's impossible to give you an exact answer. However, the usual way to approach something like this is to use grouping.
In C#:
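// requires: using System.Text.RegularExpressions;
string input = "1234567890";
string pattern = @"(?<=1)(.+)(?=4).+(?<=7)(.+)(?=0)";
Match m = Regex.Match(input, pattern);
// Groups[0] is the whole match; the captures start at index 1.
string result = m.Groups[1].Value + m.Groups[2].Value;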
The same approach can be applied to many other languages as well.
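For instance, a Python sketch of the same grouping idea, using the sample input from the question (the variable names are illustrative):

```python
import re

text = "1234567890"
pattern = re.compile(r"(?<=1)(.+)(?=4).+(?<=7)(.+)(?=0)")

m = pattern.search(text)
# group(0) would be the whole match; the two captures are groups 1 and 2.
result = m.group(1) + m.group(2) if m else None
```

The concatenation still happens in the host language, after the match, which is exactly the step the regex alone cannot perform.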
Edit
If you are not able to change the code, then there's no way to accomplish what you want. The reason is that in C#, the regex string itself doesn't have any power over the output. To change the result, you'd have to either change the called method of the Regex class or do some additional work afterwards. As it is, the method called most likely just returns either a Match object or a list of matching objects, neither of which will do what you want, regardless of the input regex string.

Regex in URL Rewriting to match Querystring Parameter Values in any order?

Many URL rewriting utilities allow regex matching. I need some URLs to be matched against a couple of main querystring parameter values, no matter what order they appear in. For example, consider a URL with two key parameters, ID= and Lang=, in no specific order, possibly with other non-key parameters interspersed.
An Example URL to be matched with key params in any order:
http://www.example.com/SurveyController.aspx?ID=500&Lang=4 or
http://www.example.com/SurveyController.aspx?Lang=4&ID=500
Maybe with some interspersed non-key params:
http://www.example.com/SurveyController.aspx?Lang=3&ID=1&misc=3&misc=4 or
http://www.example.com/SurveyController.aspx?ID=1&misc=4&Lang=3 or
http://www.example.com/SurveyController.aspx?misc=4&Lang=3&ID=1 or
etc
Is there a good regex pattern to match against querystring param value in any order, or is it best to duplicate some rules, or in general should I look to other means?
Note: The main querystring values will also be captured using brackets i.e. ID=(3)&Lang=(500) and substituted into the destination URL, but that's not the focus of the question.
I would suggest parsing the query string into a dictionary and working from there, but if you want regex, you can use alternation+repetition to match in any order (without inlining all possible sequences). Python example:
>>> import re
>>> p = re.compile(r'(?:[?&](?:abc=([^&]*)|xyz=([^&]*)|[^&]*))+$')
>>> p.findall('x?abc=1&jjj=2&xyz=3')
[('1', '3')]
>>> p.findall('x?abc=1&xyz=3&jjj=2')
[('1', '3')]
>>> p.findall('x?xyz=3&abc=1&jjj=2')
[('1', '3')]
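The dictionary route could look roughly like this, assuming Python's standard urllib.parse. The helper name key_params and the choice to take the first occurrence of a repeated parameter are mine.

```python
from urllib.parse import urlsplit, parse_qs

def key_params(url: str):
    """Return (ID, Lang) from the query string, or None if either is missing."""
    params = parse_qs(urlsplit(url).query)  # e.g. {'ID': ['500'], 'Lang': ['4']}
    if "ID" not in params or "Lang" not in params:
        return None
    # parse_qs keeps every occurrence; this sketch takes the first one.
    return params["ID"][0], params["Lang"][0]
```

The order of the parameters no longer matters, because the lookup is by name rather than by position.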
Regex matching depends highly on the sequential nature of a string. Position of the match is not important, but order definitely is.
This means you cannot write a regex pattern that matches its different parts in any arbitrary order. You can write a pattern that matches its parts in any pre-defined order, though - you would have to include every possible permutation in the pattern. This gets inconvenient very fast:
to match (a,b) you would need a,b|b,a
to match (a,b,c) you would need a,b,c|a,c,b|b,a,c|b,c,a|c,a,b|c,b,a
and so on
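To make the growth concrete, here is a small sketch that generates the alternation for every ordering. It is written in Python with itertools, and the helper name any_order_pattern is made up.

```python
from itertools import permutations

def any_order_pattern(parts, sep=","):
    """Build one alternation branch per ordering of the given sub-patterns."""
    branches = (sep.join(order) for order in permutations(parts))
    return "|".join(branches)

two = any_order_pattern(["a", "b"])
three = any_order_pattern(["a", "b", "c"])
five_branches = any_order_pattern(list("abcde")).count("|") + 1
```

Two parts give 2 branches, three give 6 and five give 120. The number of branches is n!, so ten parts would need 3628800 of them.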
And this means you would best try to approach the problem sequentially, matching one parameter at a time. It depends on the capabilities of your rewriting engine how this would work.
This is outside of the capabilities of (most flavours of) regex. You would indeed need to duplicate each rewrite rule for every possible order of parameters, which is practical for two and... less practical for ten.
Also, regexes wouldn't do the kind of parsing you'd need to handle all possible parameter inputs. For example:
http://www.example.com/SurveyController.aspx?ID=500&L%61ng=4
would normally be a valid synonym, and
http://www.example.com/SurveyController.aspx?Hello=3&ID=400&Lang=4&ID=500
might often be a synonym for ID 400 or 500 depending on the parser. Simple regex matches might be OK if you only want to 301 a load of deprecated old-format addresses to the shiny new ones, but they are not enough if they have to catch all possible inputs.
So for more complex cases like this, you'd be better off having a real SurveyController.aspx that looks at its parameters and redirects you where you need to go.
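To illustrate those two edge cases with an actual parser, here is a sketch using Python's urllib.parse (an assumption; the real controller would use whatever its own platform provides, and may resolve duplicates differently):

```python
from urllib.parse import urlsplit, parse_qs

def query_of(url: str) -> dict:
    """Decode the query string into {name: [values, ...]}."""
    return parse_qs(urlsplit(url).query)

# Percent-encoded parameter name: L%61ng decodes to Lang.
encoded = query_of("http://www.example.com/SurveyController.aspx?ID=500&L%61ng=4")

# Repeated parameter: both values are kept, in order of appearance.
repeated = query_of("http://www.example.com/SurveyController.aspx?Hello=3&ID=400&Lang=4&ID=500")
```

This parser reports Lang for the encoded name and both 400 and 500 for ID, leaving it to the caller to decide which value wins.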
If the underlying regular expression implementation understands both named groups and zero-width lookaheads, you may be able to make something work using something like aspx\?(?=(?:.*&)?ID=(?<ID>\d+))(?=(?:.*&)?Lang=(?<Lang>\d+)) (this is untested speculation), but the result is likely to be both unmaintainable and slower than even a naive implementation that uses multiple regexes to parse the string.
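A sketch of the lookahead idea in Python, for what it's worth. Python spells named groups (?P<name>...), and each lookahead here is allowed to skip over any parameters that come first via (?:.*&)?. The names ANY_ORDER and extract are illustrative, and a rewriting engine may well use a different regex flavour.

```python
import re

# Both lookaheads start right after the '?'; (?:.*&)? lets each one
# skip earlier parameters before finding its own key.
ANY_ORDER = re.compile(
    r"aspx\?(?=(?:.*&)?ID=(?P<ID>\d+))(?=(?:.*&)?Lang=(?P<Lang>\d+))"
)

def extract(url: str):
    """Return (ID, Lang) if both parameters are present, else None."""
    m = ANY_ORDER.search(url)
    return (m.group("ID"), m.group("Lang")) if m else None
```

Note that this only sees literal ID= and Lang= text, so percent-encoded names are missed, and with repeated parameters the greedy .* picks the last occurrence.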
I would suggest that query strings are best parsed by a simple tokenizer, or even just by split operations.