Regex match anything that is not sub-pattern - regex

I have cookies in my HTTP header like so:
Set-Cookie: frontend=ovsu0p8khivgvp29samlago1q0; adminhtml=6df3s767g199d7mmk49dgni4t7; external_no_cache=1; ZDEDebuggerPresent=php,phtml,php3
and I need to extract the 26 character string that comes after frontend (e.g. ovsu0p8khivgvp29samlago1q0). The following regular expression matches that for me:
(?<=frontend=)(.*)(?=;)
However, I am using Varnish Cache and can only use a regex replace. Therefore, to extract that cookie value (26 character frontend string) I need to match all characters that do not match that pattern (so I can replace them with '').
I've done a fair bit of Googling but so far have drawn a blank. I've tried the following
Match characters that do not match the pattern I want: [^((?<=frontend=)[A-Za-z0-9]{26}(?=;))] which matches random characters, including the ones I want to preserve
I'd be grateful if someone could point me in the right direction, or note where I might have gone wrong.

The Set-Cookie response header is a bit magical in Varnish, since the backends tend to send multiple headers with the same name. This is prohibited by the RFC, but the defacto way to do it.
If you are using Varnish 3.0 you can use the Header VMOD, it can parse the response and extract what you need:
https://github.com/varnish/libvmod-header

Use regex pattern
^Set-Cookie:.*?\bfrontend=([^;]*)
and the "26 character string that comes after frontend" will be in group 1 (usually referred to in the replacement string as $1)

Do you have control over the replacement string? If so, you can go with Ωmega's answer, and use $1 in your replacement string to write the frontend value back.
Otherwise, you could use this:
^Set-Cookie:.*(?!frontend=)|(?<=frontend=.{26}).*$
This will match everything from the start of the string, until frontend= is encountered. Or it will match everything that has frontend= exactly 26 characters to the left of it and up until the end of the string. If those 26 characters are a variable length, it would get signigicantly more complicated, because only .NET supports variable-length lookbehinds.
For your last question. Let's have a look at your regex:
[^((?<=frontend=)[A-Za-z0-9]{26}(?=;))]
Well, firstly the negative character class [^...] you tried to surround you pattern with, doesn't really work like this. It is still a character class, so it matches only a single character that is not inside that class. But it gets even more complicated (and I wonder why it matches at all). So firstly the character class should be closed by the first ]. This character class matches anything that is not (, ?, <, =, ), a letter or a digit. Then the {26} is applied to that, so we are trying to find 26 of those characters. Then the (?=;) which asserts that those 26 characters are followed by ;. Now comes what should not work. The closing ) should actually throw and error. And the final ] would just be interpreted as a literal ].
There are some regex flavors which allow for nesting of character classes (Java does). In this case, you would simply have a character class equivalent to [^a-zA-Z0-9(){}?<=;]. But as far as I could google it, Varnish uses PCRE, and in PCRE your regex should simply not compile.

Related

Regex function wont match. How to write correctly

Trying to create Regex code that starts with # and then matches any alphanumeric character.
How do I make a code for Regex to match a line that starts with # and then followed by \w ?
I believe you want the following:
^#\w+
The ^ makes sure the string actually starts with the expression that follows.
We only check for one # character so we just place that there without anything else.
The \w includes all alphanumeric characters and underscores.
The + ensures there is at least one of the preceding token (at least one alphanumeric character in your case).
You should really try to attempt to learn expressions on your own though and try before you ask. I recommend this website to learn from: https://regexr.com/

Ant regex expression

Quite a simple one in theory but can't quite get it!
I want a regex in ant which matches anything as long as it has a slash on the end.
Below is what I expect to work
<regexp id="slash.end.pattern" pattern="*/"/>
However this throws back
java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
*/
^
I have also tried escaping this to \*, but that matches a literal *.
Any help appreciated!
Your original regex pattern didn't work because * is a special character in regex that is only used to quantify other characters.
The pattern (.)*/$, which you mentioned in your comment, will match any string of characters not containing newlines, however it uses a possibly unnecessary capturing group. .*/$ should work just as well.
If you need to match newline characters, the dot . won't be enough. You could try something like [\s\S]*/$
On that note, it should be mentioned that you might not want to use $ in this pattern. Suppose you have the following string:
abc/def/
Should this be evaluated as two matches, abc/ and def/? Or is it a single match containing the whole thing? Your current approach creates a single match. If instead you would like to search for strings of characters and then stop the match as soon as a / is found, you could use something like this: [\s\S]*?/.

Autohotkey: regex not behaving as expected

I want regex to match web addresses such as http://www.example.com, example.co.uk, en.example.com etc. I've been using ^(https?://|www\.|)[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$ and testing it on http://regexpal.com/, and it seems to work exactly as it should.
However, when I put it in autohotkey, it seems to match extra things like example and example.something, when it shouldn't. It then doesn't match things like example.com/something and example.com/something.html when it should.
If RegExMatch(Clipboard, "^(https?://|www\.|)[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$")
Msgbox, it matches
else
Msgbox, it doesn't
Matching URLs, host names etc is a problem solved many times; I suggest you adapt some standard regex. Perhaps SO question: Fully qualified domain name validation is helpful.
If you're composing the regex as an exercise:
Does it really match the string example? You firmly assert the string to contain a ., so it never should. Maybe AHK doesn't escape . the standard way?
If [a-zA-Z]{2,3} should match top level domain, you forgot about .info.
You may want to allow strings of whitespace of arbitrary length at the end and beginning, if you accidentally copied some such into the clipboard. I.e. ^\s*your-regex-thingy\s*$
example.something is a match, because it begins with the empty string, follows with a sequence of 1 or more alphanumerics (or -, .), one ., 2 or 3 letters, and ends with a sequence of non-whitespace.
example.com/something.html might fail to match if the entire substring example.com is matched by the group [a-zA-Z0-9\-\.]+. It shouldn't if the regex engine is correctly implemented, though. Perhaps you need to escape +, | or some such, engines have varying conventions on such (i.e. sed and pcre have differing opinions on + and ( if I'm not mistaken.

Regex to match any strings containing Cyrillic symbols, except comments marked with //, ///, ///, etc

I want to find all strings containing at least 1 Cyrillic character (basically /.*[А-я].*/) but with exception of comments.
Comment is a string or part of a string which starts with 2 or more / characters.
Currently I get this regex which do some part of the trick:
^(?=^.*?[А-я]+).*?((?=[\/]{2,})|(^(?:(?![\/]{2,}).)*$))
But I'd like to get less bloated and faster expression.
And as additional question: could anyone explain why this one is working? I combined it by trial-and-error but I'm not sure I completely understood how it works, because when I try to change it in any part - it stops working.
The following regex will match any cyrllic character that is not preceded by a double forward slash
(?<!/{2}.*)[А-я]
It specifies that it should not be preceded by a double slash by using a negative lookbehind.
You haven't specified what flavour of regex your using, but be aware some flavours don't support lookarounds. For example PCRE (javascript) doesn't. You are using 3 of them in your regex, so i presume its ok.

regex for links - help to understand it

how do you read this regex?
#(http|https|ftp)://([A-Z0-9][A-Z0-9_-]*(?:.[A-Z0-9][A-Z0-9_-]*)+):?(d+)?/?#i
this is a regex for links, but i'm having trouble to understand it
Thanks
Depending on what language you're in, regexes need a delimiter. Seems the # (pound sign or hash) is used here. So,
#...actual regex goes here...#
In javascript you need forward slashes (/..../).
Some regex engines allow you to pass flags that influence matching process. These appear after the closing delimiter:
#...actual regex goes here...#..flags go here..
In your example, there is one flag, the i and I am guessing that means: "case insensitive" (i for insensitive). Depending on the regex engine you can have flags that influence the syntax you can use for the actual regex (for example, the dot can match either any character or any character except newlines depending upon wheter a flag was passed), flags that influence how the matching is done (for example, in javascript a g indicates the global flag, and that means matching anywhere inside the string is done, and state is preserved), flags that determine whether whitespace is allowed as indentation inside the regex. And some have a m flag indicating whether the regex will be applied on a line by line basis, or on the entire text. There is AFAIK no standard set of flags, check your regex engine documentation.
If you have multiple flags, you just concatenate them together to a string of flags and put them after the closing delimiter.
Now for the actual regex. First, you start with a parenthesized expression:
(...group...)
This is also called a group. In many regex engines, these groups have special meaning, because when a match is found you can access the bits of text that matched the expression inside the group using a special variable (or sometimes, the match is returned as an array, where each element represents a group). If you can access the bits inside groups, it is called a "capturing group".
In this particular case the group uses "alternation" or "choice" and this is indicated by the | (pipe). The pipe is part of the regex syntax and means "or". So,
(http|https|ftp)
means: match "http", or if that doesn't match, "https", of if that doesn't match, "ftp". This also brings up another reason for using parenthesis: of all special regex syntax operators, the pipe has the lowest precedence, so the parenthesis would not have been there, it would have meant: match "http" or "https" or "ftp://...etc"
So far, we've seen these "special characters": | (pipe) and ( and ). After that we get
://
These are not special characters, and any non-special characters simply match themselves.
We then get another group, which makes up almost the rest of the regex:
([A-Z0-9][A-Z0-9_-]*(?:.[A-Z0-9][A-Z0-9_-]*)+)
Inside it, we see a bracketed expression:
[A-Z0-9]
The brackets [ and ] are special, and indicate a "character class". There are other ways to denote character classes, but in all cases a character class matches a single character. Which character depends on the nature of the class. In this case, the class is defined using two ranges:
A-Z
means characters A thru Z (and anything in between) and
0-9
means characters 0 thru 9 (and anything in between).
Basically, [A-Z0-9] matches any alpha-numeric character.
Note that the dash between the boundaries of the range is only a special character inside these bracketed expressions. Paradoxically, a dash inside the brackets can also simply mean a dash if it cannot be interpreted as a range.
This is folllowed by yet another character class:
[A-Z0-9_-]
Almost the same as the previous on, it just adds the underscore and the dash. This last dash cannot be interpreted as a range separator, so it simply means a dash. This character class will match any alpha-numeric character as well as underscore and dash.
This class is followed by a * (asterisk) and this is a special character indicating a cardinality. Cardinalities specify how often the immediately preceding element may occur. These are the common cardinalities:
* (asterisk) means zero or more times.
? (question mask) means zero or once.
+ (plus) means one or more times.
Now the entire bit starts to make sense:
[A-Z0-9][A-Z0-9_-]*
means: a sequence starting with one alphanumeric caracter, optionally followed by a string of "word" characters (that is, alphanumeric, dash and underscore).
The following bit of the regex is this:
(?:.[A-Z0-9][A-Z0-9_-]*)+
I think this is trying to match the domain parts. So that if you have say:
https://mail.google.com
The .google and .com bits would be matched by this part. The initial (?: bit is meant to tell the regex engine to not create a "backreference". This is not really my stronghold, maybe someone else can explain. But the rest of that group is quite clear and resembles what we saw before. I think there is a mistake though: the dot (.) that appears immediately before the bracketed character class usually means "match any character" or "match any non-newline character", not "match a literal dot". Typically if you want a literal dot, you need to escape it. This would be the syntax in javascript and I think perl:
(\.[A-Z0-9][A-Z0-9_-]*)+
(note the backslash immediately before the dot to indicate a literal dot)
The final bits of the regex seem an attempt to match a port number:
:?(d+)?
However, the d+ bit is probably wrong: right now it matches "one or more d's". It should probably be:
:?(\d+)?
meaning: optionally match a colon (:), optionally followed by a bunch of digits. The \d is also a character class, but a predefined one. I think most regex engines use \d to denote a digit, but you should check the documentation of your engine to see the exact convention. So in say:
http://domain.server.extension:8080/
this part of the regex would match :8080 (provided you fix the d+ thing).
Finally, we see
/?
Meaning the entire thing can be followed optionally by a forward slash.
So, all in all, I don;'t think this matches a "link", rather it matches the inital part of a URL. To match an entire url, you would need a bit more, at least I don't see any expression that could match the path, resource, hash and query bits that may occur in a proper URL.
When you say you have trouble understanding it, it means you tried something and are stuck somewhere?
Please ask more specific questions.
I can give you some keywords that you can lookup them more easy, a good place for that is regular-expressions.info
(http|https|ftp) is an alternation
[A-Z0-9] is a character class
*, + and ? are quantifiers
(...) is a (capturing) group, (?:...) is a non capturing group
The # at the start and end are regex delimiters, the i at the very end is a modifier/option (match case independent).
The (d+)? at the end would match one or more (optional) letters "d". This is quite strange. I assume it should be (\d+)? that would be one or more (optional) digits.