How does this regex for FQDNs (excluding .arpa) work?

I am trying to understand how regex works. I understand it little by little. However, I don't understand this one completely. It's basically a regex for fully qualified domain names but a requirement is that the ending can't be .arpa.
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
https://regex101.com/r/hU6tP0/3
This doesn't match google.uk. If I change it to:
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{1,63}[^.arpa]$)
It works again.
But this works as well
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
Here is my thought process for
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
I see it as this
(?=
Is a positive look ahead (Can someone explain to me what this actually means?) As I understand it now, it just means that the string needs to match the regex.
^.{4,253}$)
Match all characters but it needs to be between 4 and 253 characters long.
(^([a-zA-Z0-9]{1,63}\.)
Start a capture group and make another capture group within. This capture group says that every non special character can be written 1 to 63 times or till the . is written.
+
The previous capture group can be repeated indefinitely, but it should always end with a .. This way the next capture group is started.
[a-zA-Z]{2,63}
Then as many times as you want you can write a to z with upper, but it needs to be between 2 and 63.
[^.arpa]$)
The last characters can't be .arpa.
Can someone tell me where I am going wrong?

This doesn't do what you think it does:
[^.arpa]
All that says is 'ends with a single character that isn't one of ., a, r or p' - it's a negated character class, not a negated string.
You might be thinking of a negative lookahead assertion:
(?!\.arpa)$
But if you're trying to compound multiple criteria in a regex, I'd suggest you're probably using the wrong tool for the job. It ends up complicated and hard to debug, thanks to greedy/non-greedy matching, etc.
Positive and negative lookaheads assert what does (or doesn't) come next at the current position, without consuming any characters. That can have some unexpected outcomes if the rest of the pattern matches variable widths, because the regex engine will backtrack until it finds something that matches.
A simpler example:
([\w.]+)(?!arpa)$
Applied to:
www.test.arpa
Will it match? What's in the group?
... it will match, because [\w\.]+ will consume all of it, and then the lookahead won't "see" anything.
If you use:
([\w]+)\.(?!arpa)
Instead, though, you'll capture www but you won't match test (e.g. with the g flag), because the dot after www isn't followed by arpa, but the dot after test is.
https://regex101.com/r/hU6tP0/5
It really does get complicated using negative assertions in a pattern as a result. I'd suggest simply not doing so, and applying two separate tests. It's hard for you to figure out, and it's hard for a future maintenance programmer too!
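To make the difference concrete, here is a minimal sketch in Python (an assumption: the question doesn't name an engine, and the helper name is_fqdn_not_arpa is hypothetical). It first shows that [^.arpa] only tests a single character, then applies the two-separate-tests approach suggested above.

```python
import re

# A negated character class matches exactly ONE character
# that is not '.', 'a', 'r' or 'p'.
NOT_ONE_OF = re.compile(r"[^.arpa]$")

print(bool(NOT_ONE_OF.search("domain.arpa")))   # last char is 'a', so no match
print(bool(NOT_ONE_OF.search("example.ca")))    # also ends in 'a', although it is not .arpa
print(bool(NOT_ONE_OF.search("domain.arpax")))  # 'x' is not in the class

# Two separate tests: check the shape and length first, then exclude .arpa.
FQDN_SHAPE = re.compile(r"^(?:[a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$")

def is_fqdn_not_arpa(name: str) -> bool:
    """Hypothetical helper: FQDN-shaped, 4 to 253 chars, not ending in .arpa."""
    if not 4 <= len(name) <= 253:
        return False
    if FQDN_SHAPE.match(name) is None:
        return False
    return not name.lower().endswith(".arpa")

for candidate in ["google.uk", "domain.arpa", "domain.arpanet"]:
    print(candidate, is_fqdn_not_arpa(candidate))
```

Keeping the .arpa check as a plain string test means nobody has to reason about backtracking to see what is excluded.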

This is an analysis of your regex:
(?=^.{4,253}$) # force min length: 4 chars, max length: 253 chars
( # Capturing Group 1 (CG1) - not needed
^ # Match start of the string
( # CG2 (can be a non capturing group '(?:...)')
[a-zA-Z0-9]{1,63} # any sequence of letters and numbers with length between 1 and 63
\. # a literal dot
)+ # CLOSE CG2
[a-zA-Z]{1,63} # any letter sequence with length between 1 to 63
[^.arpa] # a negated char class: any char that is not a "literal" '.','a','r','p' (last 'a' is redundant)
$ # end of the string
) # CLOSE CG1
To prevent the string from ending in .arpa you need a negative lookahead (?!...), so modify it like this:
(?=^.{4,253}$)(?!.*\.arpa$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
An online demo
Update:
I've reworked the regex to simplify it (I've also incorporated Sobrique's suggestion, adding an important detail):
/^(?=.{4,253}$)(?:[a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i
Compact version online demo
Legend
/ # JS regex delimiter
^ # start of the string
(?=.{4,253}$) # force min length: 4 chars, max length: 253 chars
(?: # Non-capturing group 1 (NCG1)
[a-z0-9]{1,63} # any letter or digit in a sequence with length from 1 to 63 chars
[.] # a literal dot '.' (more readable than \.)
)+ # CLOSE NCG1 - repeat its content one or more times
(?!arpa$) # after the last literal dot '.', the string must not end with 'arpa' (I've added '$' to Sobrique's suggestion; without it, '.arpanet' would be rejected too)
[a-z]{2,63} # a sequence of letters with length from 2 to 63
$ # end of the string
/i # close the regex delimiter and add the case-insensitive flag: [a-z] also matches [A-Z] and vice versa
var re = /^(?=.{4,253}$)(?:[a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i;
var tests = ['google.uk','domain.arpa','domain.arpa2','another.domain.arpa.net','domain.arpanet'];
var out = document.getElementById("r");
// Print each test name, followed by whether the regex accepts it
tests.forEach(function (t) {
  var ok = re.test(t);
  out.innerHTML += '"' + t + '"<br/>';
  out.innerHTML += 'Valid domain? ' + (ok ? '<span style="color:green">YES</span>' : '<span style="color:red">NO</span>') + '<br/><br/>';
});
<div id="r"></div>
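For readers who want to check the compact pattern outside a browser, here is a rough Python equivalent of the snippet. It is only a sketch: Python's re module is an assumption, and the group is written as non-capturing, as in the legend.

```python
import re

# Same pattern as the compact version, with the IGNORECASE flag standing in for /i.
FQDN_NO_ARPA = re.compile(
    r"^(?=.{4,253}$)(?:[a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$",
    re.IGNORECASE,
)

tests = ["google.uk", "domain.arpa", "domain.arpa2",
         "another.domain.arpa.net", "domain.arpanet"]

# Map each test name to whether the pattern accepts it.
results = {name: FQDN_NO_ARPA.match(name) is not None for name in tests}

for name, ok in results.items():
    print(f'"{name}" valid domain? {"YES" if ok else "NO"}')
```

Only domain.arpa and domain.arpa2 are rejected: the first by the negative lookahead, the second because the last label must consist of letters only.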

Related

PCRE Regex: Is it possible to check within only the first X characters of a string for a match

PCRE Regex: Is it possible for Regex to check for a pattern match within only the first X characters of a string, ignoring other parts of the string beyond that point?
My Regex:
I have a Regex:
/\S+V\s*/
This checks the string for non-whitespace characters which have a trailing 'V' and then a whitespace character or the end of the string.
This works. For example:
Example A:
SEBSTI FMDE OPORV AWEN STEM students into STEM
// Match found in 'OPORV' (correct)
Example B:
ARKFE SSETE BLMI EDSF BRNT CARFR (name removed) Academy Networking Event
//Match not found (correct).
Re the capitalised text: each letter, and each letter's placement, has a meaning in the source data. This is followed by generic info for humans to read ("Academy Networking Event", etc.).
My Issue:
It can theoretically occur that sometimes there are names that involve roman numerals such as:
Example C:
ARKFE SSETE BLME CARFR Academy IV Networking Event
//Match found (incorrect).
I would like my Regex above to only check the first X characters of the string.
Can this be done in PCRE Regex itself? I can't find any reference to length counting in Regex and I suspect this can't easily be achieved. String lengths are completely arbitrary. (We have no control over the source data).
Intention:
/\S+V\s*/{check within first 25 characters only}
ARKFE SSETE BLME CARFR Academy IV Networking Event
                        ^
                        \- Cut-off point. Not found so far, so stop.
//Match not found (correct).
Workaround:
The Regex is in PHP and my current solution is to cut the string in PHP, to only check the first X characters, typically the first 20 characters, but I was curious if there was a way of doing this within the Regex without needing to manipulate the string directly in PHP?
$valueSubstring = substr($coreRow['value'],0,20); /* first 20 characters only */
$virtualCount = preg_match_all('/\S+V\s*/',$valueSubstring);

The trick is to capture the end of the line after the first 25 characters in a lookahead and to check if it follows the eventual match of your subpattern:
$pattern = '~^(?=.{0,25}(.*)).*?\K\S+V\b(?=.*\1)~m';
demo
details:
^ # start of the line
(?= # open a lookahead assertion
.{0,25} # the first 25 characters (at most)
(.*) # capture the end of the line
) # close the lookahead
.*? # consume lazily the characters
\K # the match result starts here
\S+V # your pattern
\b # a word boundary (that matches between a letter and a white-space
# or the end of the string)
(?=.*\1) # check that the end of the line follows with a reference to
# the capture group 1 content.
Note that you can also write the pattern in a more readable way like this:
$pattern = '~^
(*positive_lookahead: .{0,25} (?<line_end> .* ) )
.*? \K \S+ V \b
(*positive_lookahead: .*? \g{line_end} ) ~xm';
(The alternative syntax (*positive_lookahead: ...) is available since PHP 7.3)
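The same capture-the-tail idea can be sketched in Python. This is an illustration under assumptions, not the answer's code: Python's re module has no \K, so a second capture group holds the token instead, and the names LIMIT and find_v_token are hypothetical.

```python
import re

LIMIT = 25  # assumed cut-off, as in the question's example

# Group 1 captures everything after the first LIMIT characters (the "tail").
# Group 2 is the wanted token (a capture group replaces PCRE's \K here).
# The final lookahead requires the whole tail to still follow the token,
# which can only be true if the token ends within the first LIMIT characters.
PATTERN = re.compile(r"^(?=.{0,%d}(.*)).*?(\S+V)\b(?=.*\1)" % LIMIT)

def find_v_token(line: str):
    """Return the first token ending in 'V' within the first LIMIT chars, else None."""
    m = PATTERN.search(line)
    return m.group(2) if m else None

example_a = "SEBSTI FMDE OPORV AWEN STEM students into STEM"
example_b = "ARKFE SSETE BLMI EDSF BRNT CARFR (name removed) Academy Networking Event"
example_c = "ARKFE SSETE BLME CARFR Academy IV Networking Event"

for line in (example_a, example_b, example_c):
    print(find_v_token(line))
```

In Example C the "IV" lies beyond the cut-off, so the tail captured in group 1 can no longer follow it and the match fails.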

You can match your pattern when it occurs after X chars and then skip (fail) the whole string; otherwise, match your pattern normally. So, if X=25:
^.{25,}\S+V.*(*SKIP)(*F)|\S+V\s*
See the regex demo. Details:
^.{25,}\S+V.*(*SKIP)(*F) - start of string, 25 or more chars other than line break chars, as many as possible, then one or more non-whitespaces and V, and then the rest of the string, the match is failed and skipped
| - or
\S+V\s* - match one or more non-whitespaces, V and zero or more whitespace chars.

Any V ending in the first 25 positions
^.{1,24}V\s
See regex
Any word ending in V in the first 25 positions
^.{1,23}[A-Z]V\s

If pattern repeats two times (nonconsecutive) match both patterns, regex

I have 3 values that I'm trying to match. foo, bar and 123. However I would like to match them only if they can be matched twice.
In the following line:
foo;bar;123;foo;123;
since bar is not present twice, it would only match foo and 123:
foo;bar;123;foo;123;
I understand how to specify to match exactly two matches, (foo|bar|123){2} however I need to use backreferences in order to make it work in my example.
I'm struggling putting the two concepts together and making a working solution for this.

You could use
(?<=^|;)([^\n;]+)(?=.*(?:(?<=^|;)\1(?=;|$)))
Broken down, this is
(?<=^|;) # pos. lookbehind, either start of string or ;
([^\n;]+) # not ; nor newline 1+ times
(?=.* # pos. lookahead
(?:
(?<=^|;) # same pattern as above
\1 # group 1
(?=;|$) # end or ;
)
)
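A simpler variant of the same idea, \b([^;]+)\b(?=.*\1), uses word boundaries instead of the lookbehind:
\b # word boundary
([^;]+) # anything not ; 1+ times
\b # another word boundary
(?=.*\1) # pos. lookahead, making sure the pattern is found again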
See a demo on regex101.com.
Otherwise - as said in the comments - split on the ; programmatically and use some programming logic afterwards.
Find a demo in Python for example (can be adjusted for other languages as well):
from collections import Counter
string = """
foo;bar;123;foo;123;
foo;bar;foo;bar;
foo;foo;foo;bar;bar;
"""
twins = [element
for line in string.split("\n")
for element, times in Counter(line.split(";")).most_common()
if times == 2]
print(twins)

Making sure to allow room for text that may occur between matches with a ".*", this should match any of your values that occur at least twice:
(foo|bar|123).*\1
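A quick Python check of what this pattern actually returns on the question's sample line may help. The second pattern, which moves the repetition into a lookahead, is my own variation and not part of the answer above.

```python
import re

line = "foo;bar;123;foo;123;"

# The pattern as given: a single match spanning from a value to its last repetition.
m = re.search(r"(foo|bar|123).*\1", line)
span = m.group(0)   # text covered by the whole match
value = m.group(1)  # the value that was found twice

# Variation (assumption): a lookahead checks for the repetition without consuming it,
# so each value that occurs again later is reported on its own.
repeated = re.findall(r"(foo|bar|123)(?=.*\1)", line)

print(span, value, repeated)
```

Because the first pattern consumes everything up to the repeated foo, the repeated 123 inside that span is never reported separately; the lookahead variation lists both values. A value occurring three times would be listed twice by the variation, so wrapping the result in set() may be wanted.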

Regex that only matches on odd/even indices

Is there a regex that matches a string only when it starts on an odd or an even index? My use case is a hex string in which I want to replace certain "bytes".
Now, when trying to match 20 (space), 20 in "7209" would be matched as well even though it consists of the bytes 72 and 09. I am restricted to the regex implementation of Notepad++ in this case, so I'm not able to check the match index as e.g. in Java.
My sample input looks like:
324F8D8A20561205231920
I set up a testing page here, the regex should only match the first and the last occurrence of 20, since the one in the middle starts on an odd index.

You can use the following regex to match 20 at even positions inside a hex string:
20(?=(?:[\da-fA-F]{2})*$)
See demo
I assume the string has no spaces in this case.
In case you have spaces between the values (or any other symbols), this could be an alternative (with $1XX-like replacement string):
((?:.{2})*?)20
See another demo
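The first pattern can be tried on the question's sample input with a short Python sketch. Python is used only for illustration (the question targets Notepad++), and XX is an arbitrary placeholder replacement.

```python
import re

hex_string = "324F8D8A20561205231920"  # sample input from the question

# "20" followed by an even number of hex digits up to the end of the string,
# i.e. a "20" that sits on a byte boundary (the string itself has an even length).
BYTE_20 = re.compile(r"20(?=(?:[\da-fA-F]{2})*$)")

positions = [m.start() for m in BYTE_20.finditer(hex_string)]
replaced = BYTE_20.sub("XX", hex_string)

print(positions)
print(replaced)
```

The "20" starting at index 13 is skipped because an odd number of hex digits follows it, so the lookahead cannot pair them up to the end of the string.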

This seems to work for evens:
rx <- "^(.{2})*(20)"
strings <- c("7209","2079","9720")
grepl(rx,strings) # [1] FALSE TRUE TRUE

Not sure what Notepad++ uses for regex engine - it's been a while since I used it. This works in javascript...
/^(?:..)*?(20)/
...
/^ # start regex
(?: # non capturing group
.. # any character (two times)
)*? # close group, and repeat zero or more times, un-greedily
(20) # capture `20` in group
/ # end regex

How to write regular expression to find one or more digits separated by periods without returning the last period?

How to write regular expression to find between one and three digits separated by periods without returning the last period? For example, find the string
1.1.
and it would also need to match
1.1
or simply
1
Likewise, it needs to support between one and three digits, so
11.11.11
and
111.111.111
need to work as well.
So the string won't always end with a period, but it may. Further, if it does end with a period, the last period must not be returned (so, presumably, using a positive lookahead). So 1.1., if matched, would return 1.1
Here is what I have so far, but I am struggling to find a way to NOT return the last period:
(\d{1,3}\.?)+(?=(\Z|\s|\-|\;|\:|\?|\!|\.|\,|\)))
It is returning
6.6.
but I want it to return
6.6

You require: match d.d.d.d. or d.d.dxxx, and regardless of whether it ends with a "." or not, always stop at the last d (never the dot).
What's wrong with just: (\d(\.\d)*)
If you want your dotted-digit string to be terminated by a set of characters, put a look-ahead after it, as you have in your question:
(\d(\.\d)*)(?=(\Z|\s|\-|\;|\:|\?|\!|\.|\,|\)))
If you also want it to match a stand-alone string (with or without the terminator), add a ? after the look-ahead:
(\d(\.\d)*)(?=(\Z|\s|\-|\;|\:|\?|\!|\.|\,|\)))?
For more than one digit, just replace \d with \d{1,3} etc.

The regex (\d{1,3}(?:\.\d{1,3})*)\.{0,1} should work.
Group 1 (taking Group 0 as the entire match) will hold the string you want to keep, without the trailing . if there is one.
It basically does:
Start matching 1-3 digits
Then match strings like .d, .dd, or .ddd
If the match ends with a ., it won't take it because it isn't inside the group.
Do your tests and let us know if it works with all your examples.
Edit:
Replace + with *

/\d{1,3}(\.\d{1,3})*/
Quick explanation:
\d{1,3} # Match 1-3 digits
( # Start Capture Group 1
\. # Match '.'
\d{1,3} # Match 1-3 digits
)* # End Capture Group 1 - match 0 or more times
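As a sanity check, the pattern can be run against the examples from the question. This is a Python sketch (the question doesn't name an engine), with the group written as non-capturing because only the whole match is needed here.

```python
import re

# 1-3 digits, then zero or more ".ddd" groups; a trailing dot is never consumed
# because every dot in the pattern must be followed by at least one digit.
DOTTED = re.compile(r"\d{1,3}(?:\.\d{1,3})*")

samples = ["1.1.", "1.1", "1", "11.11.11", "111.111.111", "6.6."]
found = [DOTTED.search(s).group(0) for s in samples]

for s, f in zip(samples, found):
    print(f"{s!r} -> {f!r}")
```

Note that the pattern alone does not reject longer digit runs: on "1234" it simply matches "123", so a surrounding boundary or lookahead is still needed if such input must be excluded.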

You can write your own regular expression and test it with dummy data on the following site:
http://myregexp.com/

Regex for password that requires one numeric or one non-alphanumeric character

I'm looking for a rather specific regex and I almost have it but not quite.
I want a regex that will require at least 5 characters, where at least one of those characters is either a numeric value or a non-alphanumeric character.
This is what I have so far:
^(?=.*[\d]|[!##$%\^*()_\-+=\[{\]};:|\./])(?=.*[a-z]).{5,20}$
So the problem is the "or" part. It will allow non-alphanumeric values, but still requires at least one numeric value. You can see that I have the or operator "|" between my require numerics and the non-alphanumeric, but that doesn't seem to work.
Any suggestions would be great.

Try:
^(?=.*(\d|\W)).{5,20}$
A short explanation:
^ # match the beginning of the input
(?= # start positive look ahead
.* # match any character except line breaks and repeat it zero or more times
( # start capture group 1
\d # match a digit: [0-9]
| # OR
\W # match a non-word character: [^\w]
) # end capture group 1
) # end positive look ahead
.{5,20} # match any character except line breaks and repeat it between 5 and 20 times
$ # match the end of the input
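A few quick checks of this pattern in Python (an assumption, since the question doesn't name a language; the function name is_acceptable is hypothetical):

```python
import re

# At least one digit or non-word character somewhere, and 5 to 20 characters overall.
PASSWORD = re.compile(r"^(?=.*(\d|\W)).{5,20}$")

def is_acceptable(password: str) -> bool:
    """Hypothetical helper wrapping the pattern above."""
    return PASSWORD.match(password) is not None

for pw in ["abcde", "abcd1", "abcd!", "ab1", "abcd_"]:
    print(pw, is_acceptable(pw))
```

One thing to watch: \W excludes the underscore, because _ counts as a word character, so "abcd_" is rejected by this pattern even though _ appears in the question's own character list.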

Perhaps this may work for you:
^.*[\d\W]+.*$
And use some code like this to check string size:
if (str.len >= 5 && str.len <= 20 && regex.ismatch(str, "^.*[\d\W]+.*$")) { ... }

Is it really necessary to stuff everything in a giant regex? Just use program logic (5 ≤ length(s) ≤ 20) ∧ (/[[:digit:]]/ ∨ /[^[:alpha:]]/). Far more readable syntactically and semantically, I think.
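The formula above translates almost directly into code. Here is a minimal Python sketch, assuming str.isdigit and str.isalpha (which are Unicode-aware) as stand-ins for the POSIX classes; the function name passes_policy is hypothetical.

```python
def passes_policy(password: str) -> bool:
    """Length between 5 and 20, and at least one digit or non-alphabetic character."""
    if not 5 <= len(password) <= 20:
        return False
    has_digit = any(ch.isdigit() for ch in password)          # /[[:digit:]]/
    has_non_alpha = any(not ch.isalpha() for ch in password)  # /[^[:alpha:]]/
    return has_digit or has_non_alpha

for pw in ["abcde", "abcd1", "abc e", "a1"]:
    print(repr(pw), passes_policy(pw))
```

Each condition is a separate line, so changing the length limits or the character rule later does not mean re-deriving a lookahead.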

Pretty simple solution, once S.Mark got me on the right track, just needed to merge my numeric and non-alphanumeric pieces as one.
Here's the final regex for anyone that's interested:
^(?=.*[\d!##$%\^*()_\-+=\[{\]};:|\./])(?=.*[a-z]).{5,20}$
This will allow any password between 5 and 20 characters and requires at least one letter and one numeric and/or one non-alphanumeric character.

How about like this?
^.*?[\d!##$%\^*()_\-+=\[{\]};:|\./].*$
For the length (5 to 20), please use a normal strlen function.
