Regex that only matches on odd/even indices - regex

Is there a regex that matches a string only when it starts on an odd or an even index? My use case is a hex string in which I want to replace certain "bytes".
Now, when trying to match 20 (space), 20 in "7209" would be matched as well even though it consists of the bytes 72 and 09. I am restricted to the regex implementation of Notepad++ in this case, so I'm not able to check the match index as e.g. in Java.
My sample input looks like:
324F8D8A20561205231920
I set up a testing page here, the regex should only match the first and the last occurence of 20, since the one in the middle starts on an odd index.

You can use the following regex to match 20 at even positions inside a hex string:
20(?=(?:[\da-fA-F]{2})*$)
See demo
I assume the string has no spaces in this case.
In case you have spaces between the values (or any other symbols), this could be an alternative (with $1XX-like replacement string):
((?:.{2})*?)20
See another demo

This seems to work for evens:
rx <- "^(.{2})*(20)"
strings <- c("7209","2079","9720")
grepl(rx,strings) # [1] FALSE TRUE TRUE

Not sure what Notepad++ uses for regex engine - it's been a while since I used it. This works in javascript...
/^(?:..)*?(20)/
...
/^ # start regex
(?: # non capturing group
.. # any character (two times)
)*? # close group, and repeat zero or more times, un-greedily
(20) # capture `20` in group
/ # end regex

Related

Pattern to match everything except a string of 5 digits

I only have access to a function that can match a pattern and replace it with some text:
Syntax
regexReplace('text', 'pattern', 'new text'
And I need to return only the 5 digit string from text in the following format:
CRITICAL - 192.111.6.4: rta nan, lost 100%
Created Time Tue, 5 Jul 8:45
Integration Name CheckMK Integration
Node 192.111.6.4
Metric Name POS1
Metric Value DOWN
Resource 54871
Alert Tags 54871, POS1
So from this text, I want to replace everything with "" except the "54871".
I have come up with the following:
regexReplace("{{ticket.description}}", "\w*[^\d\W]\w*", "")
Which almost works but it doesn't match the symbols. How can I change this to match any word that includes a letter or symbol, essentially.
As you can see, the pattern I have is very close, I just need to include special characters and letters, whereas currently it is only letters:
You can match the whole string but capture the 5-digit number into a capturing group and replace with the backreference to the captured group:
regexReplace("{{ticket.description}}", "^(?:[\w\W]*\s)?(\d{5})(?:\s[\w\W]*)?$", "$1")
See the regex demo.
Details:
^ - start of string
(?:[\w\W]*\s)? - an optional substring of any zero or more chars as many as possible and then a whitespace char
(\d{5}) - Group 1 ($1 contains the text captured by this group pattern): five digits
(?:\s[\w\W]*)? - an optional substring of a whitespace char and then any zero or more chars as many as possible.
$ - end of string.
The easiest regex is probably:
^(.*\D)?(\d{5})(\D.*)?$
You can then replace the string with "$2" ("\2" in other languages) to only place the contents of the second capture group (\d{5}) back.
The only issue is that . doesn't match newline characters by default. Normally you can pass a flag to change . to match ALL characters. For most regex variants this is the s (single line) flag (PCRE, Java, C#, Python). Other variants use the m (multi line) flag (Ruby). Check the documentation of the regex variant you are using for verification.
However the question suggest that you're not able to pass flags separately, in which case you could pass them as part of the regex itself.
(?s)^(.*\D)?(\d{5})(\D.*)?$
regex101 demo
(?s) - Set the s (single line) flag for the remainder of the pattern. Which enables . to match newline characters ((?m) for Ruby).
^ - Match the start of the string (\A for Ruby).
(.*\D)? - [optional] Match anything followed by a non-digit and store it in capture group 1.
(\d{5}) - Match 5 digits and store it in capture group 2.
(\D.*)? - [optional] Match a non-digit followed by anything and store it in capture group 3.
$ - Match the end of the string (\z for Ruby).
This regex will result in the last 5-digit number being stored in capture group 2. If you want to use the first 5-digit number instead, you'll have to use a lazy quantifier in (.*\D)?. Meaning that it becomes (.*?\D)?.
(?s) is supported by most regex variants, but not all. Refer to the regex variant documentation to see if it's available for you.
An example where the inline flags are not available is JavaScript. In such scenario you need to replace . with something that matches ALL characters. In JavaScript [^] can be used. For other variants this might not work and you need to use [\s\S].
With all this out of the way. Assuming a language that can use "$2" as replacement, and where you do not need to escape backslashes, and a regex variant that supports an inline (?s) flag. The answer would be:
regexReplace("{{ticket.description}}", "(?s)^(.*\D)?(\d{5})(\D.*)?$", "$2")

PCRE Regex: Is it possible to check within only the first X characters of a string for a match

PCRE Regex: Is it possible for Regex to check for a pattern match within only the first X characters of a string, ignoring other parts of the string beyond that point?
My Regex:
I have a Regex:
/\S+V\s*/
This checks the string for non-whitespace characters whoich have a trailing 'V' and then a whitespace character or the end of the string.
This works. For example:
Example A:
SEBSTI FMDE OPORV AWEN STEM students into STEM
// Match found in 'OPORV' (correct)
Example B:
ARKFE SSETE BLMI EDSF BRNT CARFR (name removed) Academy Networking Event
//Match not found (correct).
Re: The capitalised text each letter and the letters placement has a meaning in the source data. This is followed by generic info for humans to read ("Academy Networking Event", etc.)
My Issue:
It can theoretically occur that sometimes there are names that involve roman numerals such as:
Example C:
ARKFE SSETE BLME CARFR Academy IV Networking Event
//Match found (incorrect).
I would like my Regex above to only check the first X characters of the string.
Can this be done in PCRE Regex itself? I can't find any reference to length counting in Regex and I suspect this can't easily be achieved. String lengths are completely arbitary. (We have no control over the source data).
Intention:
/\S+V\s*/{check within first 25 characters only}
ARKFE SSETE BLME CARFR Academy IV Networking Event
^
\- Cut off point. Not found so far so stop.
//Match not found (correct).
Workaround:
The Regex is in PHP and my current solution is to cut the string in PHP, to only check the first X characters, typically the first 20 characters, but I was curious if there was a way of doing this within the Regex without needing to manipulate the string directly in PHP?
$valueSubstring = substr($coreRow['value'],0,20); /* first 20 characters only */
$virtualCount = preg_match_all('/\S+V\s*/',$valueSubstring);
The trick is to capture the end of the line after the first 25 characters in a lookahead and to check if it follows the eventual match of your subpattern:
$pattern = '~^(?=.{0,25}(.*)).*?\K\S+V\b(?=.*\1)~m';
demo
details:
^ # start of the line
(?= # open a lookahead assertion
.{0,25} # the twenty first chararcters
(.*) # capture the end of the line
) # close the lookahead
.*? # consume lazily the characters
\K # the match result starts here
\S+V # your pattern
\b # a word boundary (that matches between a letter and a white-space
# or the end of the string)
(?=.*\1) # check that the end of the line follows with a reference to
# the capture group 1 content.
Note that you can also write the pattern in a more readable way like this:
$pattern = '~^
(*positive_lookahead: .{0,20} (?<line_end> .* ) )
.*? \K \S+ V \b
(*positive_lookahead: .*? \g{line_end} ) ~xm';
(The alternative syntax (*positive_lookahead: ...) is available since PHP 7.3)
You can find your pattern after X chars and skip the whole string, else, match your pattern. So, if X=25:
^.{25,}\S+V.*(*SKIP)(*F)|\S+V\s*
See the regex demo. Details:
^.{25,}\S+V.*(*SKIP)(*F) - start of string, 25 or more chars other than line break chars, as many as possible, then one or more non-whitespaces and V, and then the rest of the string, the match is failed and skipped
| - or
\S+V\s* - match one or more non-whitespaces, V and zero or more whitespace chars.
Any V ending in the first 25 positions
^.{1,24}V\s
See regex
Any word ending in V in the first 25 positions
^.{1,23}[A-Z]V\s

How does this regex for FQDNs (excluding.arpa) work?

I am trying to understand how regex works. I understand it little by little. However, I don't understand this one completely. It's basically a regex for fully qualified domain names but a requirement is that the ending can't be .arpa.
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
https://regex101.com/r/hU6tP0/3
This doesn't match google.uk. If I change it to:
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{1,63}[^.arpa]$)
It works again.
But this works as well
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
Here is my thought process for
?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
I see it as this
(?=
Is a positive look ahead (Can someone explain to me what this actually means?) As I understand it now, it just means that the string needs to match the regex.
^.{4,253}$)
Match all characters but it needs to be between 4 and 253 characters long.
(^([a-zA-Z0-9]{1,63}\.)
Start a capture group and make another capture group within. This capture group says that every non special character can be written 1 to 63 times or till the . is written.
+
The previous capture group can be repeated indefinitely, but it should always end with a .. This way the next capture group is started.
[a-zA-Z]{2,63}
Then as many times as you want you can write a to z with upper, but it needs to be between 2 and 63.
[^.arpa]$)
The last characters can't be .arpa.
Can someone tell me where I am going wrong?
This doesn't do what you think it does:
[^.arpa]
All that says is 'ends with something that isn't one of the letter apr.' - it's a negated character class.
You might be thinking of a negative lookahead assertion:
(?!\.arpa)$
But if you're trying to compound multiple criteria in a regex, I'd suggest you're probably using the wrong tool for the job. It ends up complicated and hard to debug, thanks to greedy/non-greedy matching, etc.
Your 'positive/negative' lookaheads are to match a piece of a pattern that aren't surrounded by other pieces of pattern. But that can have some unexpected outcomes if you're matching variable widths, because the regex engine will backtrack until it finds something that matches.
A simpler example:
([\w.]+)(?!arpa)$
Applied to:
www.test.arpa
Will it match? What's in the group?
... it will match, because [\w\.]+ will consume all of it, and then the lookahead won't "see" anything.
If you use:
([\w]+)\.(?!arpa)
Instead though - you'll capture.... www, but you won't match test (with e.g. g flag, because the www doesn't have .arpa after it, but the test does.
https://regex101.com/r/hU6tP0/5
It really does get complicated using negative assertions in a pattern as a result. I'd suggest simply not doing so, and applying two separate tests. It's hard for you to figure out, and it's hard for a future maintenance programmer too!
This is an analysis of your regex:
(?=^.{4,253}$) # force min length: 4 chars, max length: 253 chars
( # Capturing Group 1 (CG1) - not needed
^ # Match start of the string
( # CG2 (can be a non capturing group '(?:...)')
[a-zA-Z0-9]{1,63} # any sequence of letters and numbers with length between 1 and 63
\. # a literal dot
)+ # CLOSE CG2
[a-zA-Z]{1,63} # any letter sequence with length between 1 to 63
[^.arpa] # a negated char class: any char that is not a "literal" '.','a','r','p' (last 'a' is redundant)
$ # end of the string
) # CLOSE CG1
To avoid the tail of the string to be .arpa you need to use a negative lookahead (?!...), so modify just like this:
(?=^.{4,253}$)(?!.*\.arpa$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
An online demo
Update:
I've upgraded the regex to rationalise it (i've incorporated also the Sobrique suggestion adding an important details):
/^(?=.{4,253}$)([a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i
Compact version online demo
Legenda
/ # js regex delimiter
^ # start of the string
(?=.{4,253}$) # force min length: 4 chars, max length: 253 chars
(?: # Non capturing group 1 (NCG1)
[a-z0-9]{1,63} # any letter or digit in a sequence with length from 1 to 63 chars
[.] # a literal dot '.' (more readable than \.)
)+ # CLOSE NCG1 - repeat its content one or more time
(?!arpa$) # force that after the last literal dot '.' the string does not end with 'arpa' (i've added '$' to Sobrique suggestion instead it prevents also '.arpanet' too)
[a-z]{2,63} # a sequence of letters with length from 2 to 63
$ # end of the string
/i # Close the regex delimiter and add case insensitive flag [a-z] match also [A-Z] and viceversa
var re = /^(?=.{4,253}$)([a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i;
var tests = ['google.uk','domain.arpa','domain.arpa2','another.domain.arpa.net','domain.arpanet'];
var m;
while(t = tests.pop()) {
document.getElementById("r").innerHTML += '"' + t + '"<br/>';
document.getElementById("r").innerHTML += 'Valid domain? ' + ( (t.match(re)) ? '<font color="green">YES</font>' : '<font color="red">NO</font>') + '<br/><br/>';
}
<div id="r"/>

Regular expression for crossword solution

This is a crossword problem. Example:
the solution is a 6-letter word which starts with "r" and ends with "r"
thus the pattern is "r....r"
the unknown 4 letters must be drawn from the pool of letters "a", "e", "i" and "p"
each letter must be used exactly once
we have a large list of candidate 6-letter words
Solutions: "rapier" or "repair".
Filtering for the pattern "r....r" is trivial, but finding words which also have [aeip] in the "unknown" slots is beyond me.
Is this problem amenable to a regex, or must it be done by exhaustive methods?
Try this:
r(?:(?!\1)a()|(?!\2)e()|(?!\3)i()|(?!\4)p()){4}r
...or more readably:
r
(?:
(?!\1) a () |
(?!\2) e () |
(?!\3) i () |
(?!\4) p ()
){4}
r
The empty groups serve as check marks, ticking off each letter as it's consumed. For example, if the word to be matched is repair, the e will be the first letter matched by this construct. If the regex tries to match another e later on, that alternative won't match it. The negative lookahead (?!\2) will fail because group #2 has participated in the match, and never mind that it didn't consume anything.
What's really cool is that it works just as well on strings that contain duplicate letters. Take your redeem example:
r
(?:
(?!\1) e () |
(?!\2) e () |
(?!\3) e () |
(?!\4) d ()
){4}
m
After the first e is consumed, the first alternative is effectively disabled, so the second alternative takes it instead. And so on...
Unfortunately, this technique doesn't work in all regex flavors. For one thing, they don't all treat empty/failed group captures the same. The ECMAScript spec explicitly states that references to non-participating groups should always succeed.
The regex flavor also has to support forward references--that is, backreferences that appear before the groups they refer to in the regex. (ref) It should work in .NET, Java, Perl, PCRE and Ruby, that I know of.
Assuming that you meant that the unknown letters must be among [aeip], then a suitable regex could be:
/r[aeip]{4,4}r/
What's the front end language being used to compare strings, is it java, .net ...
here is an example/psuedo code using java
String mandateLetters = "aeio"
String regPattern = "\\br["+mandateLetters+"]*r$"; // or if for specific length \\br[+mandateLetters+]{4}r$
Pattern pattern = Pattern.compile(regPattern);
Matcher matcher = pattern.matcher("is this repair ");
matcher.find();
Why not replace each '.' in your original pattern with '[aeip]'?
You'd wind up with a regex string r[aeip][aeip][aeip][aeip]r.
This could of course be shortened to r[aeip]{4,4}r, but that would be a pain to implement in the general case and probably wouldn't improve the code any.
This doesn't address the issue of duplicate letter use. If I were coding it, I'd handle that in code outside the regexp - mostly because the regexp would get more complicated than I'd care to handle.
So the "only once" part is the critical thing. Listing all permutations is obviously not feasible. If your language/environment supports lookaheads and backreferences you can make it a bit easier for yourself:
r(?=[aeip]{4,4})(.)(?!\1)(.)(?!\1|\2)(.)(?!\1|\2|\3).r
Still quite ugly, but here is how it works:
r # match an r
(?= # positive lookahead (doesn't advance position of "cursor" in input string)
[aeip]{4,4}
) # make sure that there are the four desired character ahead
(.) # match any character and capture it in group 1
(?!\1)# make sure that the next character is NOT the same as the previous one
(.) # match any character and capture it in group 2
(?!\1|\2)
# make sure that the next character is neither the first nor the second
(.) # match any character and capture it in group 3
(?!\1|\2|\3)
# same thing again for all three characters
. # match another arbitrary character
r # match an r
Working demo.
This is neither really elegant nor scalable. So you might just want to use r([aiep]{4,4})r (capturing the four critical letters) and ensure the additional condition without regex.
EDIT: In fact, the above pattern is only really useful and necessary if you just want to ensure that you have 4 non-identical characters. For your specific case, again using lookaheads, there is simpler (despite longer) solution:
r(?=[^a]*a[^a]*r)(?=[^e]*e[^e]*r)(?=[^i]*i[^i]*r)(?=[^p]*p[^p]*r)[aeip]{4,4}r
Explained:
r # match an r
(?= # lookahead: ensure that there is exactly one a until the next r
[^a]* # match an arbitrary amount of non-a characters
a # match one a
[^a]* # match an arbitrary amount of non-a characters
r # match the final r
) # end of lookahead
(?=[^e]*e[^e]*r) # ensure that there is exactly one e until the next r
(?=[^i]*i[^i]*r) # ensure that there is exactly one i until the next r
(?=[^p]*p[^p]*r) # ensure that there is exactly one p until the next r
[aeip]{4,4}r # actually match the rest to include it in the result
Working demo.
For r....m with a pool of deee, this could be adjusted as:
r(?=[^d]*d[^d]*m)(?=[^e]*(?:e[^e])*{3,3}m)[de]{4,4}m
This ensures that there is exactly one d and exactly 3 es.
Working demo.
not fully regex due to sed multi regex action
sed -n -e '/^r[aiep]\{4,\}r$/{/\([aiep]\).*\1/!p;}' YourFile
take pattern 4 letter in group aeipsourround by r, keep only line where no letter in the sub group is found twice.
A more scalable solution (no need to write \1, \2, \3 and so on for each letter or position) is to use negative lookahead to assert that each character is not occurring later:
^r(?:([aeip])(?!.*\1)){4}r$
more readable as:
^r
(?:
([aeip])
(?!.*\1)
){4}
r$
Improvements
This was a quick solution which works in the situation you gave us, but here are some additional constraints to have a robuster version:
If the "pool of letters" may share some letters with the end of string, include the end of pattern in the lookahead:
^r(?:([aeip])(?!.*\1.*\2)){4}(r$)
(may not work as intended in all regex flavors, in which case copy-paste the end of pattern instead of using \2)
If some letters must be present not only once but a different fixed number of times, add a separate lookahead for all letters sharing this number of times. For instance, "r....r" with one "a" and one "p" but two "e" would be matched by this regex (but "rapper" and "repeer" wouldn't):
^r(?:([ap])(?!.*\1.*\3)|([e])(?!.*\2.*\2.*\3)){4}(r$)
The non-capturing groups now has 2 alternatives: ([ap])(?!.*\1.*\3) which matches "a" or "p" not followed anywhere until ending by another one, and ([e])(?!.*\2.*\2.*\3) which matches "e" not followed anywhere until ending by 2 other ones (so it fails on the first one if there are 3 in total).
BTW this solution includes the above one, but the end of pattern is here shifted to \3 (also, cf. note about flavors).

How to write regular expression to find one or more digits separated by periods without returning the last period?

How to write regular expression to find between one and three digits separated by periods without returning the last period? For example, find the string
1.1.
and it would also need to match
1.1
or simply
1
Likewise, it needs to support between one and three digits, so
11.11.11
and
111.111.111
need to work as well.
So..the string won't always end with a period, but it may. Further, if it does end with a period, don't return the last period (so, using a positive lookahead). So, 1.1. if matched would return 1.1
Here is what I have so far, but I am struggling to find a way to NOT return the last period:
(\d{1,3}\.?)+(?=(\Z|\s|\-|\;|\:|\?|\!|\.|\,|\)))
It is returning
6.6.
but I want it to return
6.6
You require: match d.d.d.d. or d.d.dxxx, and regardless of whether it ends with a "." or not, always stop at the last d (never the dot).
What's wrong with just: (\d(\.\d)*)
If you want your dotted-digit string to be terminated by a set of characters, put a look-ahead after it, as you have in your question:
(\d(\.\d)*)(?=(\Z|\s|\-|\;|\:|\?|\!|\.|\,|\)))
If you also want it to match a stand-alone string (with or without the terminator), add a ? after the look-ahead:
(\d(\.\d)*)(?=(\Z|\s|\-|\;|\:|\?|\!|\.|\,|\)))?
For more than one digits, just replace \d with \d{1,3} etc.
The regex (\d{1,3}(?:\.\d{1,3})*)\.{0,1} should work.
In the Group 1 (if taken Group 0 as the entire match) will be stored the string you want to keep, without the . at the end, in case it contains it.
It basically does:
Start matching 1-3 digits
Then match strings like .d, .dd, or .ddd
If the match ends with a ., it won't take it because it isn't inside the group.
Do your tests and let us know if it works with all your examples.
Edit:
Replace + with *
/\d{1,3}(\.\d{1,3})*/
Quick explanation:
\d{1,3} # Match 1-3 digits
( # Start Capture Group 1
\. # Match '.'
\d{1,3} # Match 1-3 digits
)* # End Capture Group 1 - match 0 or more times
You can write your own Regular expression and test with dummy data on following Site:
http://myregexp.com/