htaccess regular expression explaination - regex

I have been tasked with changing an .htaccess file. Unfortunately, I know very little about regular expressions, and so most of the file is unreadable for me. In particular, I have these two REs...
1: ^(?!((www|web3|web4|web5|web6|cm|test)\.mydomain\.com)|(?:(?:\d+\.){3}(?:\d+))$).*$
2: ^/([^/][^/])/([^/][^/])/([^/]+)/Job-Posting/$ /Misc/jobposting\.asp\?country=$1&state=$2&city=$3
For the first one, I understand the first half, more or less. it's trying to match against something that ISN'T www.mydomain.com, or web3.mydomain.com, etc., and that it may match that zero or one times. What I'm not clear on is what the second half of that does. My research suggests that ?: implies some sort of flag, but I didn't see any example that explained what exactly that meant. Please explain what this part means, as well as provide an example that would match it.
For the second one, the comments say this is applicable for a url containing /US/NY/Rochester/Job-Posting/. From this I can infer that ^/ means one character, but again, I couldnt find that in my research so far. What is the formal definition of ^/ ? What is the significance of putting it into square brackets [^/] ?
If I can get a handle on these two RE I should be able to adapt them to my needs. Your help is much appreciated.

?: doesn't match anything in particular, it modifies the behavior of the parenthesis. The ?: means the parenthesis are non-capturing, and thus cannot be referenced in the rule. Non capturing parens are good to use when you don't need to reference the captured text because the system doesn't have to 'remember' the text, which saves resources.
the code in question:
(?:(?:\d+\.){3}(?:\d+))
matches one or more digits followed by a period times three, then one or more digit. This will match IP addresses (ex 127.0.0.1). This will also match 123456.1.1.3456789, so you might want to restrict the number of characters allowed (?:(?:\d{1,3}.){3}(?:\d{1,3})), thought I haven't tested this so take it with a grain of salt.
Info on non capturing groupings.
The second item revolves around using square brackets as a character set. Square brackets match anything noted inside them, with ^ negating the match. So [ad02] will match any of the four characters a,d,0 or 2, while [^ad02] will match any character that is not a,d,0, or 2. So, ^/ means any character that is not /.
One of the tricky things about square brackets is the number of items they will match. [^/] will match one character, but so does [ad02]. It doesn't matter how many characters you have in the set, it still obeys the modifiers on the brackets. So [^/]{3} will match any series of 3 characters that does not contain a forward slash, while [^/]{2} will match a 2 character string with the same restriction.
For more info on character sets see Character Classes or Character Sets

Related

Dynamic class operations within a regular expression

I am trying to write a regex which excludes certain characters from a class based on the current content of capturing groups. The specific task that made me look for such a thing was to match lowercase letters in alphabetical order.
I searched through Rex's page (https://www.rexegg.com/regex-class-operations.html) to see if there was any way to change the class' content, but was unable to find anything.
Take the following attempt as a brief example: ([a-z])[a-z--[\1]]
Though it's not a correct regular expression, it demonstrates the concept. The idea is that it would match two letters that are not the same.
Note: the expression shown follows a Python-like syntax, and can also be written as:
([a-z])[a-z&&[\1]] or ([a-z])(?![\1])[a-z]
But I am going to use the Python syntax.
In the examples above the nested brackets are optional(in certain engines), but for the ultimate goal they are necessary. The pattern I am trying to match the ordered letters with would be something like this:
^(?:([a-z])([a-z--[a-(?(2)\2|\1)]])*+)?$
The first character class matches a letter which is immediately captured by the group, meaning that the letter will be excluded from the group containing the conditional. the first time the second group tries to match, condition inside the conditional statement evaluates to false, since there has not been a second capture yet, so it "matches" the first group's content, which should result in the exclusion of the first letter from the class. In later steps the second group will be set, meaning that all the letters between 'a' and the most recently captured letter will be excluded.
I know, it seems complicated. Maybe refactoring the pattern will help, take a look at this one:
^(?:([a-z])([(?(2)\2|\1)-z])*+)?$
This example makes no use of set operations, but the idea is roughly the same. The first group matches a letter, then the class inside the second group matches anything between the captured letter and 'z', which is noted by the [(?(2)\2|\1)-z] part. The conditional is there to ensure that the lower boundary of the character interval is the most recently captured character.
This could also be written using subroutine calls, but I doubt it would solve the problem. The issue might be that the classes are precompiled (and so are subroutines), so they cannot change during the matching process.
Are you guys aware of a workaround or an engine that supports such operations? I am interested in the dynamic class operation itself rather than a different way to match alphabetically ordered letters.

Why can "a*a+" and "(a{2,3})*a{2,3}" match "aaaa" while "(a{2,3})*" cannot?

My understanding of * is that it consumes as many characters as possible (greedily) but "gives back" when necessary. Therefore, in a*a+, a* would give one (or maybe more?) character back to a+ so it can match.
However, in (a{2,3})*, why doesn't the first "instance" of a{2,3} gives a character to the second "instance" so the second one can match?
Also, in (a{2,3})*a{2,3} the first part does seem to give a character to the second part.
A simple workaround for your question is to match aaaa with regex ^(a{2,3})*$.
Your problem is that:
In the case of (a{2,3})*, regex doesn't seem to consume as much
character as possible.
I suggest not to think in giving back characters. Instead, the key is acceptance.
Once regex accept your string, the matching will be over. The pattern a{2,3} only matches aa or aaa. So in the case of matching aaaa with (a{2,3})*, the greedy engine would match aaa. And then, it can't match more a{2,3} because there is only one a remained. Though it's able for regex engine to do backtrack and match an extra a{2,3}, it wouldn't. aaa is now accepted by the regex, thus regex engine would not do expensive backtracking.
If you add an $ to the end of the regex, it simply tells regex engine that a partly match is unacceptable. Moreover, it's easy to explain the (a{2,3})*a{2,3} case with accepting and backtracking.
The main problem is this:
My understanding of * is that it consumes as many characters as possible (greedily) but "gives back" when necessary
This is completely wrong. It is not what greedy means.
Greedy simply means "use the longest possible match". It does not give anything back.
Once you interpret the expressions with this new understanding everything makes sense.
a*a+ - zero or more a followed by one or more a
(a{2,3})*a{2,3} - zero or more of either two or three a followed by either two or three a (note: the KEY THING to remember is "zero or more", the first part not matching any character is considered a match)
(a{2,3})* - zero or more of either two or three a (this means that after matching three as the last single a left cannot match)
backtracking is done only if match fails however aaa is a valid match, a negative lookahead (?!a) can be use to prevent the match be followed by a a.
compare
(aaa?)*
and
(aaa?)*(?!a)

Find strings that do not begin with something

This has been gone over but I've not found anything that works consistently... or assist me in learning where I've gone awry.
I have file names that start with 3 or more digits and ^\d{3}(.*) works just fine.
I also have strings that start with the word 'account' and ^ACCOUNT(.*) works just fine for these.
The problem I have is all the other strings that DO NOT meet the two previous criteria. I have been using ^[^\d{3}][^ACCOUNT](.*) but occasionally it fails to catch one.
Any insights would be greatly appreciated.
^[^\d{3}][^ACCOUNT](.*)
That's definitely not what you want. Square brackets create character classes: they match one character from the list of characters in brackets. If you put a ^ then the match is inverted and it matches one character that's not listed. The meaning of ^ inside brackets is completely different from its meaning outside.
In short, [] is not at all what you want. What you can do, if your regex implementation supports it, is use a negative lookahead assertion.
^(?!\d{3}|ACCOUNT)(.*)
This negative lookahead assertion doesn't match anything itself. It merely checks that the next part of the string (.*) does not match either \d{3} or ACCOUNT.
Demorgan's law says: !(A v B) = !A ^ !B.
But unfortunately Regex itself does
not support the negation of expressions. (You always could rewrite it, but sometimes, this is a huge task).
Instead, you should look at your Programming Language, where you can negate values without problems:
let the "matching" function be "match" and you are using match("^(?:\d{3}|ACCOUNT)(.)") to determine, whether the string matches one of both conditions. Then you could simple negate the boolean return value of that matching function and you'll receive every string that does NOT match.

What do we need Lookahead/Lookbehind Zero Width Assertions for?

I've just learned about these two concepts in more detail. I've always been good with RegEx, and it seems I've never seen the need for these 2 zero width assertions.
I'm pretty sure I'm wrong, but I do not see why these constructs are needed. Consider this example:
Match a 'q' which is not followed by a 'u'.
2 strings will be the input:
Iraq
quit
With negative lookahead, the regex looks like this:
q(?!u)
Without it, it looks like this:
q[^u]
For the given input, both of these regex give the same results (i.e. matching Iraq but not quit) (tested with perl). The same idea applies to lookbehinds.
Am I missing a crucial feature that makes these assertions more valuable than the classic syntax?
Why your test probably worked (and why it shouldn't)
The reason you were able to match Iraq in your test might be that your string contained a \n at the end (for instance, if you read it from the shell). If you have a string that ends in q, then q[^u] cannot match it as the others said, because [^u] matches a non-u character - but the point is there has to be a character.
What do we actually need lookarounds for?
Obviously in the above case, lookaheads are not vital. You could workaround this by using q(?:[^u]|$). So we match only if q is followed by a non-u character or the end of the string. There are much more sophisticated uses for lookaheads though, which become a pain if you do them without lookaheads.
This answer tries to give an overview of some important standard situations which are best solved with lookarounds.
Let's start with looking at quoted strings. The usual way to match them is with something like "[^"]*" (not with ".*?"). After the opening ", we simply repeat as many non-quote characters as possible and then match the closing quote. Again, a negated character class is perfectly fine. But there are cases, where a negated character class doesn't cut it:
Multi-character delimiters
Now what if we don't have double-quotes to delimit our substring of interest, but a multi-character delimiter. For instance, we are looking for ---sometext---, where single and double - are allowed within sometext. Now you can't just use [^-]*, because that would forbid single -. The standard technique is to use a negative lookahead at every position, and only consume the next character, if it is not the beginning of ---. Like so:
---(?:(?!---).)*---
This might look a bit complicated if you haven't seen it before, but it's certainly nicer (and usually more efficient) than the alternatives.
Different delimiters
You get a similar case, where your delimiter is only one character but could be one of two (or more) different characters. For instance, say in our initial example, we want to allow for both single- and double-quoted strings. Of course, you could use '[^']*'|"[^"]*", but it would be nice to treat both cases without an alternative. The surrounding quotes can easily be taken care of with a backreference: (['"])[^'"]*\1. This makes sure that the match ends with the same character it began with. But now we're too restrictive - we'd like to allow " in single-quoted and ' in double-quoted strings. Something like [^\1] doesn't work, because a backreference will in general contain more than one character. So we use the same technique as above:
(['"])(?:(?!\1).)*\1
That is after the opening quote, before consuming each character we make sure that it is not the same as the opening character. We do that as long as possible, and then match the opening character again.
Overlapping matches
This is a (completely different) problem that can usually not be solved at all without lookarounds. If you search for a match globally (or want to regex-replace something globally), you may have noticed that matches can never overlap. I.e. if you search for ... in abcdefghi you get abc, def, ghi and not bcd, cde and so on. This can be problem if you want to make sure that your match is preceded (or surrounded) by something else.
Say you have a CSV file like
aaa,111,bbb,222,333,ccc
and you want to extract only fields that are entirely numerical. For simplicity, I'll assume that there is no leading or trailing whitespace anywhere. Without lookarounds, we might go with capturing and try:
(?:^|,)(\d+)(?:,|$)
So we make sure that we have the start of a field (start of string or ,), then only digits, and then the end of a field (, or end of string). Between that we capture the digits into group 1. Unfortunately, this will not give us 333 in the above example, because the , that precedes it was already part of the match ,222, - and matches cannot overlap. Lookarounds solve the problem:
(?<=^|,)\d+(?=,|$)
Or if you prefer double negation over alternation, this is equivalent to
(?<![^,])\d+(?![^,])
In addition to being able to get all matches, we get rid of the capturing which can generally improve performance. (Thanks to Adrian Pronk for this example.)
Multiple independent conditions
Another very classic example of when to use lookarounds (in particular lookaheads) is when we want to check multiple conditions on an input at the same time. Say we want to write a single regex that makes sure our input contains a digit, a lower case letter, an upper case letter, a character that is none of those, and no whitespace (say, for password security). Without lookarounds you'd have to consider all permutations of digit, lower case/upper case letter, and symbol. Like:
\S*\d\S*[a-z]\S*[A-Z]\S*[^0-9a-zA_Z]\S*|\S*\d\S*[A-Z]\S*[a-z]\S*[^0-9a-zA_Z]\S*|...
Those are only two of the 24 necessary permutations. If you also want to ensure a minimum string length in the same regex, you'd have to distribute those in all possible combinations of the \S* - it simply becomes impossible to do in a single regex.
Lookahead to the rescue! We can simply use several lookaheads at the beginning of the string to check all of these conditions:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)
Because the lookaheads don't actually consume anything, after checking each condition the engine resets to the beginning of the string and can start looking at the next one. If we wanted to add a minimum string length (say 8), we could simply append (?=.{8}). Much simpler, much more readable, much more maintainable.
Important note: This is not the best general approach to check these conditions in any real setting. If you are making the check programmatically, it's usually better to have one regex for each condition, and check them separately - this let's you return a much more useful error message. However, the above is sometimes necessary, if you have some fixed framework that lets you do validation only by supplying a single regex. In addition, it's worth knowing the general technique, if you ever have independent criteria for a string to match.
I hope these examples give you a better idea of why people would like to use lookarounds. There are a lot more applications (another classic is inserting commas into numbers), but it's important that you realise that there is a difference between (?!u) and [^u] and that there are cases where negated character classes are not powerful enough at all.
q[^u] will not match "Iraq" because it will look for another symbol.
q(?!u) however, will match "Iraq":
regex = /q[^u]/
/q[^u]/
regex.test("Iraq")
false
regex.test("Iraqf")
true
regex = /q(?!u)/
/q(?!u)/
regex.test("Iraq")
true
Well, another thing along with what others mentioned with the negative lookahead, you can match consecutive characters (e.g. you can negate ui while with [^...], you cannot negate ui but either u or i and if you try [^ui]{2}, you will also negate uu, ii and iu.
The whole point is to not "consume" the next character(s), so that it can be e.g. captured by another expression that comes afterwards.
If they're the last expression in the regex, then what you've shown are equivalent.
But e.g. q(?!u)([a-z]) would let the non-u character be part of the next group.

RegEx - Match Numbers of Variable Length

I'm trying to parse a document that has reference numbers littered throughout it.
Text text text {4:2} more incredible text {4:3} much later on
{222:115} and yet some more text.
The references will always be wrapped in brackets, and there will always be a colon between the two. I wrote an expression to find them.
{[0-9]:[0-9]}
However, this obviously fails the moment you come across a two or three digit number, and I'm having trouble figuring out what that should be. There won't ever be more than 3 digits {999:999} is the maximum size to deal with.
Anybody have an idea of a proper expression for handling this?
{[0-9]+:[0-9]+}
try adding plus(es)
What regex engine are you using? Most of them will support the following expression:
\{\d+:\d+\}
The \d is actually shorthand for [0-9], but the important part is the addition of + which means "one or more".
Try this:
{[0-9]{1,3}:[0-9]{1,3}}
The {1,3} means "match between 1 and 3 of the preceding characters".
You can specify how many times you want the previous item to match by using {min,max}.
{[0-9]{1,3}:[0-9]{1,3}}
Also, you can use \d for digits instead of [0-9] for most regex flavors:
{\d{1,3}:\d{1,3}}
You may also want to consider escaping the outer { and }, just to make it clear that they are not part of a repetition definition.