Dynamic class operations within a regular expression - regex

I am trying to write a regex which excludes certain characters from a class based on the current content of capturing groups. The specific task that made me look for such a thing was to match lowercase letters in alphabetical order.
I searched through Rex's page (https://www.rexegg.com/regex-class-operations.html) to see if there was any way to change the class' content, but was unable to find anything.
Take the following attempt as a brief example: ([a-z])[a-z--[\1]]
Though it's not a correct regular expression, it demonstrates the concept. The idea is that it would match two letters that are not the same.
Note: the expression shown follows a Python-like syntax, and can also be written as:
([a-z])[a-z&&[\1]] or ([a-z])(?![\1])[a-z]
But I am going to use the Python syntax.
In the examples above the nested brackets are optional(in certain engines), but for the ultimate goal they are necessary. The pattern I am trying to match the ordered letters with would be something like this:
^(?:([a-z])([a-z--[a-(?(2)\2|\1)]])*+)?$
The first character class matches a letter which is immediately captured by the group, meaning that the letter will be excluded from the group containing the conditional. the first time the second group tries to match, condition inside the conditional statement evaluates to false, since there has not been a second capture yet, so it "matches" the first group's content, which should result in the exclusion of the first letter from the class. In later steps the second group will be set, meaning that all the letters between 'a' and the most recently captured letter will be excluded.
I know, it seems complicated. Maybe refactoring the pattern will help, take a look at this one:
^(?:([a-z])([(?(2)\2|\1)-z])*+)?$
This example makes no use of set operations, but the idea is roughly the same. The first group matches a letter, then the class inside the second group matches anything between the captured letter and 'z', which is noted by the [(?(2)\2|\1)-z] part. The conditional is there to ensure that the lower boundary of the character interval is the most recently captured character.
This could also be written using subroutine calls, but I doubt it would solve the problem. The issue might be that the classes are precompiled (and so are subroutines), so they cannot change during the matching process.
Are you guys aware of a workaround or an engine that supports such operations? I am interested in the dynamic class operation itself rather than a different way to match alphabetically ordered letters.

Related

Title Case Conversion in SPARQL Query with REGEX

I am one inch away from the solution to my problem. I am attempting title case conversion of strings retrieved via SPARQL. I am using the REPLACE function in combination with LCASE and REGEX:
BIND (replace(lcase(?label), "(\\b[a-z](?!\\s))", ucase("$1") ) as ?title_case)
lcase(?label): all characters in the string becomes lowercase
(\\b[a-z](?!\\s)): matches the first letter of each word in the string
ucase($1): is the backreference to the first letter matched, that act as replacement after turning it into UPPER case.
Expected Result: animal husbandry methods becomes Animal Husbandry Methods
That solution is working almost right, but not quite, for reasons beyond my comprehension; check here an example at work.
When you run the query you won't notice anything different in the ?title_case, but if you edit the ucase("$1") for ucase("aaa") you see it magically replacing correctly the first letter of each word:
Result: animal husbandry methods becomes AAAnimal AAAusbandry AAAethods
It seems to me the UCASE function does not have any affect on the backreference $1
Who can explain to me why so, and what is to do to rectify this behavior?
You can use SUBSTR{} function to solve the issue.
Eg: BIND (REPLACE(LCASE(?label), "(\\b[a-z](?!\\s))", UCASE(SUBSTR(?label, 1, 1)) ) as ?title_case)
Function calls in SPARQL follow traditional conventions of most programming languages, that is that the inner functions are evaluated first, and their return values are then given as arguments to the outer function. replace here takes 3 strings, the input string, the pattern, and the replacement. ucase is interpreted independently on how the result is used, it simply converts its argument to uppercase and, surprisingly, the uppercase of $1 is $1!
In other languages, what you'd usually do is use some overload of the function that accepts a function/expression instead of the string as the replacement, so that you could call anything from within. That is not possible in SPARQL, all the replace function can do is insert the capture unmodified.
I am afraid what you want to do is not perfectly achievable in SPARQL alone. Your options are:
Use a SPARQL extension that contains a function that makes it possible, if supported by the endpoint.
If your query is a part of a larger pipeline, convert the results in another way, for example using XSLT.
Since you only care about [a-z], you can simply expand out all the letters and replace them one by one: replace(replace(lcase(?label), "(\\ba(?!\\s))", "A" ), "(\\bb(?!\\s))", "B" ) and so on. Not a very elegant or performant solution, but it gets the job done.
A shorter option is to use a pattern like ^(.*?)(\b[a-z](?!\s))(.*)$ to split the string into 3 parts, which you can extract with replacements to $1, $2 and $3, respectively. Concatenate the first part with the uppercase of the second part, and repeat the whole process for the last part. You will again have to repeat the patterns, but this time it is the same pattern so there is a potential for optimization. A downside is that you have to end this "recursion" somewhere, so you can only replace a fixed number of words.

Is it possible to erase a capture group that has already matched, making it non-participating?

In PCRE2 or any other regex engine supporting forward backreferences, is it possible to change a capture group that matched in a previous iteration of a loop into a non-participating capture group (also known as an unset capture group or non-captured group), causing conditionals that test that group to match with their "false" clause rather than their "true" clause?
For example, take the following PCRE regex:
^(?:(z)?(?(1)aa|a)){2}
When fed the string zaazaa, it matches the whole string, as desired. But when fed zaaaa, I would like it to match zaaa; instead, it matches zaaaa, the whole string. (This is just for illustration. Of course this example could be handled by ^(?:zaa|a){2} but that is beside the point. Practical usage of capture group erasure would tend to be in loops that most often do far more than 2 iterations.)
An alternative way of doing this, which also doesn't work as desired:
^(?:(?:z()|())(?:\1aa|\2a)){2}
Note that both of these work as desired when the loop is "unrolled", because they no longer have to erase a capture that has already been made:
^(?:(z)?(?(1)aa|a))(?:(z)?(?(2)aa|a))
^(?:(?:z()|())(?:\1aa|\2a))(?:(?:z()|())(?:\3aa|\4a))
So instead of being able to use the simplest form of conditional, a more complicated one must be used, which only works in this example because the "true" match of z is non-empty:
^(?:(z?)(?(?!.*$\1)aa|a)){2}
Or just using an emulated conditional:
^(?:(z?)(?:(?!.*$\1)aa|(?=.*$\1)a)){2}
I have scoured all the documentation I can find, and there seems not to even be any mention or explicit description of this behavior (that captures made within a loop persist through iterations of that loop even when they fail to be re-captured).
It's different than what I intuitively expected. The way I would implement it is that evaluating a capture group with 0 repetitions would erase/unset it (so this could happen to any capture group with a *, ?, or {0,N} quantifier), but skipping it due to being in a parallel alternative within the same group in which it gained a capture during a previous iteration would not erase it. Thus, this regex would still match words iff they contain at least one of every vowel:
\b(?:a()|e()|i()|o()|u()|\w)++\1\2\3\4\5\b
But skipping a capture group due to it being inside an unevaluated alternative of a group that is evaluated with nonzero repetitions which is nested within the group in which the capture group took on a value during a previous iteration would erase/unset it, so this regex would be able to either capture or erase group \1 on every iteration of the loop:
^(?:(?=a|(b)).(?(1)_))*$
and would match strings such as aaab_ab_b_aaaab_ab_aab_b_b_aaa. However, the way forward references are actually implemented in existing engines, it matches aaaaab_a_b_a_a_b_b_a_b_b_b_.
I would like to know the answer to this question not merely because it would be useful in constructing regexes, but because I have written my own regex engine, currently ECMAScript-compatible with some optional extensions (including molecular lookahead (?*), i.e. non-atomic lookahead, which as far as I know, no other engine has), and I would like to continue adding features from other engines, including forward/nested backreferences. Not only do I want my implementation of forward backreferences to be compatible with existing implementations, but if there isn't a way of erasing capture groups in other engines, I will probably create a way of doing it in my engine that doesn't conflict with other existing regex features.
To be clear: An answer stating that this is not possible in any mainstream engines will be acceptable, as long as it is backed up by adequate research and/or citing of sources. An answer stating that it is possible would be much easier to state, since it would require only one example.
Some information on what a non-participating capture group is:
http://blog.stevenlevithan.com/archives/npcg-javascript - this is the article that originally introduced me to the idea.
https://www.regular-expressions.info/backref2.html - the first section on this page gives a brief explanation.
In ECMAScript/Javascript regexes, backreferences to NPCGs always match (making a zero-length match). In pretty much every other regex flavor, they fail to match anything.
I found this documented in PCRE's man page, under "DIFFERENCES BETWEEN PCRE2 AND PERL":
12. There are some differences that are concerned with the settings of
captured strings when part of a pattern is repeated. For example,
matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
unset, but in PCRE2 it is set to "b".
I'm struggling to think of a practical problem that cannot be better solved with an alternative solution, but in the interests of keeping it simple, here goes:
Suppose you have a simple task well-suited to being solved by using forward references; for example, check the input string is a palindrome. This cannot be solved generally with recursion (due to the atomic nature of subroutine calls), and so we bang out the following:
/^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$/
Easy enough. Now suppose we are asked to verify that every line in the input is a palindrome. Let's try to solve this by placing the expression in a repeated group:
\A(?:^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$(?:\n|\z))+\z
Clearly that doesn't work, since the value of \2 persists from the first line to the next. This is similar to the problem you're facing, and so here are a number of ways to overcome it:
1. Enclose the entire subexpression in (?!(?! )):
\A(?:(?!(?!^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$)).+(?:\n|\z))+\z
Very easy, just shove 'em in there and you're essentially good to go. Not a great solution if you want any particular captured values to persist.
2. Branch reset group to reset the value of capture groups:
\A(?|^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$|\n()()|\z)+\z
With this technique, you can reset the value of capture groups from the first (\1 in this case) up to a certain one (\2 here). If you need to keep \1's value but wipe \2, this technique will not work.
3. Introduce a group that captures the remainder of the string from a certain position to help you later identify where you are:
\A(?:^(?:(.)(?=.*(\1(?(2)(?=\2\3\z)\2))([\s\S]*)))*+.?\2$(?:\n|\z))+\z
The whole rest of the collection of lines is saved in \3, allowing you to reliably check whether you have progressed to the next line (when (?=\2\3\z) is no longer true).
This is one of my favourite techniques because it can be used to solve tasks that seem impossible, such as the ol' matching nested brackets using forward references. With it, you can maintain any other capture information you need. The only downside is that it's horribly inefficient, especially for long subjects.
4. This doesn't really answer the question, but it solves the problem:
\A(?![\s\S]*^(?!(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$))
This is the alternative solution I was talking about. Basically, "re-write the pattern" :) Sometimes it's possible, sometimes it isn't.
With PCRE (and all as I'm aware) it's not possible to unset a capturing group but using subroutine calls since their nature doesn't remember values from the previous recursion, you are able to accomplish the same task:
(?(DEFINE)((z)?(?(2)aa|a)))^(?1){2}
See live demo here
If you are going to implement a behavior into your own regex flavor to unset a capturing group, I'd strongly suggest do not let it happen automatically. Just provide some flags.
This is partially possible in .NET's flavour of regex.
The first thing to note is that .NET records all of the captures for a given capture group, not just the latest. For instance, ^(?=(.)*) records each character in the first line as a separate capture in the group.
To actually delete captures, .NET regex has a construction known as balancing groups. The full format of this construction is (?<name1-name2>subexpression).
First, name2 must have previously been captured.
The subexpression must then match.
If name1 is present, the substring between the end of the capture of name2 and the start of the subexpression match is captured into name1.
The latest capture of name2 is then deleted. (This means that the old value could be backreferenced in the subexpression.)
The match is advanced to the end of the subexpression.
If you know you have name2 captured exactly once then it can readily be deleted using (?<-name2>); if you don't know whether you have name2 captured then you could use (?>(?<-name2>)?) or a conditional. The problem arises if you might have name2 captured more than once since then it depends on whether you can organise enough repetitions of the deletion of name2. ((?<-name2>)* doesn't work because * is equivalent to ? for zero-length matches.)
There is also another way to "erase" capture groups in .NET. Unlike the (?<-name>) method, this empties the group instead of deleting it – so instead of not matching, it will then match an empty string.
In .NET, groups with the same name can be captured multiple times, even if that name is a number. This allows PCRE expressions using balanced groups to be ported to .NET. Consider this PCRE pattern:
(?|(pattern)|())
Assuming both groups are \1 above, then using this technique, in .NET it would become:
(?:(pattern)|(?<1>))
I used this technique today to make a 38 byte .NET regex that matches strings whose length is a fourth power:
^((?=(?>^((?<3>\3|x))|\3(\3\2))*$)){2}
the above is a port of the following 35 byte PCRE regex, which uses balanced groups:
^((?=(?|^((\2|x))|\2(\2\3))*+$)){2}
(In this example, the capture group isn't actually being emptied. But this technique can be used to do anything a balanced group can do, including emptying a group.)

How to invert an arbitrary Regex expression

This question sounds like a duplicate, but I've looked at a LOT of similar questions, and none fit the bill either because they restrict their question to a very specific example, or to a specific usercase (e.g: single chars only) or because you need substitution for a successful approach, or because you'd need to use a programming language (e.g: C#'s split, or Match().Value).
I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
For example, let's say I want to find the reverse of the Regex "over" in "The cow jumps over the moon", it would match The cow jumps and also match the moon.
That's only a simple example of course. The Regex could be something more messy such as "o.*?m", in which case the matches would be: The c, ps, and oon.
Here is one possible solution I found after ages of hunting. Unfortunately, it requires the use of substitution in the replace field which I was hoping to keep clear. Also, everything else is matched, but only a character by character basis instead of big chunks.
Just to stress again, the answer should be general-purpose for any arbitrary Regex, and not specific to any particular example.
From post: I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
The answer -
A match is Not Discontinuous, it is continuous !!
Each match is a continuous, unbroken substring. So, within each match there
is no skipping anything within that substring. Whatever matched the
regular expression is included in a particular match result.
So within a single Match, there is no inverting (i.e. match not this only) that can extend past
a negative thing.
This is a Tennant of Regular Expressions.
Further, in this case, since you only want all things NOT something, you have
to consume that something in the process.
This is easily done by just capturing what you want.
So, even with multiple matches, its not good enough to say (?:(?!\bover\b).)+
because even though it will match up to (but not) over, on the next match
it will match ver ....
There are ways to avoid this that are tedious, requiring variable length lookbehinds.
But, the easiest way is to match up to over, then over, then the rest.
Several constructs can help. One is \K.
Unfortunately, there is no magical recipe to negate a pattern.
As you mentioned it in your question when you have an efficient pattern you use with a match method, to obtain the complementary, the more easy (and efficient) way is to use a split method with the same pattern.
To do it with the pattern itself, workarounds are:
1. consuming the characters that match the pattern
"other content" is the content until the next pattern or the end of the string.
alternation + capture group:
(pattern)|other content
Then you must check if the capture group exists to know which part of the alternation succeeds.
"other content" can be for example described in this way: .*?(?=pattern|$)
With PCRE and Perl, you can use backtracking control verbs to avoid the capture group, but the idea is the same:
pattern(*SKIP)(*FAIL)|other content
With this variant, you don't need to check anything after, since the first branch is forced to fail.
or without alternation:
((?:pattern)*)(other content)
variant in PCRE, Perl, or Ruby with the \K feature:
(?:pattern)*\Kother content
Where \K removes all on the left from the match result.
2. checking characters of the string one by one
(?:(?!pattern).)*
if this way is very simple to write (if the lookahead is available), it has the inconvenient to be slow since each positions of the string are tested with the lookahead.
The amount of lookahead tests can be reduced if you can use the first character of the pattern (lets say "a"):
[^a]*(?:(?!pattern)a[^a]*)*
3. list all that is not the pattern.
using character classes
Lets say your pattern is /hello/:
([^h]|h(([^eh]|$)|e(([^lh]|$)|l(([^lh]|$)|l([^oh]|$))))*
This way becomes quickly fastidious when the number of characters is important, but it can be useful for regex flavors that haven't many features like POSIX regex.

htaccess regular expression explaination

I have been tasked with changing an .htaccess file. Unfortunately, I know very little about regular expressions, and so most of the file is unreadable for me. In particular, I have these two REs...
1: ^(?!((www|web3|web4|web5|web6|cm|test)\.mydomain\.com)|(?:(?:\d+\.){3}(?:\d+))$).*$
2: ^/([^/][^/])/([^/][^/])/([^/]+)/Job-Posting/$ /Misc/jobposting\.asp\?country=$1&state=$2&city=$3
For the first one, I understand the first half, more or less. it's trying to match against something that ISN'T www.mydomain.com, or web3.mydomain.com, etc., and that it may match that zero or one times. What I'm not clear on is what the second half of that does. My research suggests that ?: implies some sort of flag, but I didn't see any example that explained what exactly that meant. Please explain what this part means, as well as provide an example that would match it.
For the second one, the comments say this is applicable for a url containing /US/NY/Rochester/Job-Posting/. From this I can infer that ^/ means one character, but again, I couldnt find that in my research so far. What is the formal definition of ^/ ? What is the significance of putting it into square brackets [^/] ?
If I can get a handle on these two RE I should be able to adapt them to my needs. Your help is much appreciated.
?: doesn't match anything in particular, it modifies the behavior of the parenthesis. The ?: means the parenthesis are non-capturing, and thus cannot be referenced in the rule. Non capturing parens are good to use when you don't need to reference the captured text because the system doesn't have to 'remember' the text, which saves resources.
the code in question:
(?:(?:\d+\.){3}(?:\d+))
matches one or more digits followed by a period times three, then one or more digit. This will match IP addresses (ex 127.0.0.1). This will also match 123456.1.1.3456789, so you might want to restrict the number of characters allowed (?:(?:\d{1,3}.){3}(?:\d{1,3})), thought I haven't tested this so take it with a grain of salt.
Info on non capturing groupings.
The second item revolves around using square brackets as a character set. Square brackets match anything noted inside them, with ^ negating the match. So [ad02] will match any of the four characters a,d,0 or 2, while [^ad02] will match any character that is not a,d,0, or 2. So, ^/ means any character that is not /.
One of the tricky things about square brackets is the number of items they will match. [^/] will match one character, but so does [ad02]. It doesn't matter how many characters you have in the set, it still obeys the modifiers on the brackets. So [^/]{3} will match any series of 3 characters that does not contain a forward slash, while [^/]{2} will match a 2 character string with the same restriction.
For more info on character sets see Character Classes or Character Sets

Find strings that do not begin with something

This has been gone over but I've not found anything that works consistently... or assist me in learning where I've gone awry.
I have file names that start with 3 or more digits and ^\d{3}(.*) works just fine.
I also have strings that start with the word 'account' and ^ACCOUNT(.*) works just fine for these.
The problem I have is all the other strings that DO NOT meet the two previous criteria. I have been using ^[^\d{3}][^ACCOUNT](.*) but occasionally it fails to catch one.
Any insights would be greatly appreciated.
^[^\d{3}][^ACCOUNT](.*)
That's definitely not what you want. Square brackets create character classes: they match one character from the list of characters in brackets. If you put a ^ then the match is inverted and it matches one character that's not listed. The meaning of ^ inside brackets is completely different from its meaning outside.
In short, [] is not at all what you want. What you can do, if your regex implementation supports it, is use a negative lookahead assertion.
^(?!\d{3}|ACCOUNT)(.*)
This negative lookahead assertion doesn't match anything itself. It merely checks that the next part of the string (.*) does not match either \d{3} or ACCOUNT.
Demorgan's law says: !(A v B) = !A ^ !B.
But unfortunately Regex itself does
not support the negation of expressions. (You always could rewrite it, but sometimes, this is a huge task).
Instead, you should look at your Programming Language, where you can negate values without problems:
let the "matching" function be "match" and you are using match("^(?:\d{3}|ACCOUNT)(.)") to determine, whether the string matches one of both conditions. Then you could simple negate the boolean return value of that matching function and you'll receive every string that does NOT match.