Regular expression only 2 consecutive specific characters - regex

I'm trying to build a regular expression for an abstract filesystem. It should:
Start with letters [a-zA-Z], '/', or '.'
Only allow one consecutive occurrence of '/'
Only allow two consecutive occurrences of '.'
Here's what I have so far. It correctly rejects three '.'s, but it still matches when only a single '.' is typed. Any input is greatly appreciated. I tried positive and negative lookaheads for the second group but it still has the same problem.
(?!.*\/{2})(?!.*\.{3})^[A-Za-z\/\.]*$
My Regex101 link:
https://regex101.com/r/xM8oY5/1

I have added a negative lookahead whose pattern matches a single dot . surrounded by two non-dot characters.
/(?!(.*[^.])?\.([^.].*)?$)(?!.*\/{2})(?!.*\.{3})^[A-Za-z\/\.]*$/
^^^^^^^^^^^^^^^^^^^^^^^^^
(.*[^.])? -> some arbitrary characters and at least one not-dot
\. -> the dot
([^.].*)?$ -> one not-dot and some arbitrary characters
Both blocks - before and after the dot - are optional, if the single dot comes at start or end of the string.
Test it on regex101.
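As a quick check of the combined pattern, here is a minimal Python sketch. The sample paths and the helper name is_valid_path are made up for illustration, and the forward slashes are left unescaped because Python's re module does not need them escaped.

```python
import re

# The answer's pattern: three negative lookaheads, then the allowed character set.
PATH_RE = re.compile(
    r"(?!(.*[^.])?\.([^.].*)?$)"  # reject a lone '.' (no other '.' next to it)
    r"(?!.*/{2})"                 # reject '//'
    r"(?!.*\.{3})"                # reject '...'
    r"^[A-Za-z/.]*$"              # only letters, '/' and '.'
)

def is_valid_path(path: str) -> bool:
    """Hypothetical helper: True if the whole string satisfies the pattern."""
    return PATH_RE.search(path) is not None

# Made-up sample paths: the first three should pass, the rest should fail.
for sample in ["abc/def", "a/../b", "..", "a/./b", ".", "a//b", "a/.../b"]:
    print(f"{sample!r}: {is_valid_path(sample)}")
```

Note that the pattern also accepts the empty string, since every part of it is optional.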

Regex Help required for User-Agent Matching

I have used an online regex learning site (regexr) and created something that works, but with my very limited experience with regex creation, I could do with some help/advice.
In IIS10 logs, there is a list for time, date... but I am only interested in the cs(User-Agent) field.
My Regex:
(scan\-\d+)(?:\w)+\.shadowserver\.org
which matches these:
scan-02.shadowserver.org
scan-15n.shadowserver.org
scan-42o.shadowserver.org
scan-42j.shadowserver.org
scan-42b.shadowserver.org
scan-47m.shadowserver.org
scan-47a.shadowserver.org
scan-47c.shadowserver.org
scan-42a.shadowserver.org
scan-42n.shadowserver.org
scan-42o.shadowserver.org
but what I would like it to do is:
Match a single number with the option of capturing more than one: scan-2 or scan-02 with an optional letter: scan-2j or scan-02f
Append the rest of the User Agent: .shadowserver.org to the regex.
I will then add it to an existing URL Rewrite rule (as a condition) to abort the request.
Any advice/help would be very much appreciated
Tried:
To write a regex for IIS10 to block requests from a certain user-agent
Expected:
It to work on single numbers as well as double/triple numbers with or without a letter.
(scan\-\d+)(?:\w)+\.shadowserver\.org
Input Text:
scan-2.shadowserver.org
scan-02.shadowserver.org
scan-2j.shadowserver.org
scan-02j.shadowserver.org
scan-17w.shadowserver.org
scan-101p.shadowserver.org
UPDATE:
I eventually came up with this:
scan\-[0-9]+[a-z]{0,1}\.shadowserver\.org
This is an explanation of your regex pattern; if you only want the solution, go directly to the end.
(scan\-\d+)(?:\w)+
(scan\-\d+) Group 1: matches the word scan followed by a literal -. You escaped the hyphen with a \, but outside a character class an unescaped - is also a literal hyphen, so the escape is not needed here. The - is followed by \d+, which means one or more digits from 0-9 (there must be at least one). Whatever this group matches is saved in the first capturing group.
(?:\w)+ Non-capturing group: \w is one word character, equivalent to [A-Za-z0-9_], and the plus sign + after the group means match the whole group one or more times. Since the group contains only \w, it matches one or more word characters. Note that the non-capturing group is redundant here; \w+ can be used directly.
Taking two examples:
The first example: scan-02.shadowserver.org
(scan\-\d+)(?:\w)+
scan matches the word scan in scan-02, and \- matches the hyphen after it. \d+ (one or more digits) first matches 02, so the value so far is scan-02. Then (?:\w)+ must match at least one word character; it tries the period . and fails, because the period is not a word character. Is it over at this point? No: the regex engine backtracks into the previous \d+, which now matches only the 0 in scan-02, so scan-0 is saved in the first capturing group, and (?:\w)+ matches the 2. Why does the engine backtrack into \d+? Because both \d+ and (?:\w)+ carry a +, each must match at least one character (a digit and a word character respectively), and the engine tries to satisfy that literally.
The second example: scan-2.shadowserver.org
(scan\-\d+)(?:\w)+
(scan\-\d+) matches scan-2; (?:\w)+ then tries to match the period after scan-2 and fails. This is the important point: \d+ holds only one digit and cannot give any back, so the attempt at this position fails. The engine then retries from the next character of scan-2.shadowserver.org, the c, where the s in (scan\-\d+) fails to match c. It keeps advancing like this, and in the end the whole match fails.
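The two walkthroughs can be reproduced with a short Python sketch. It uses the original pattern unchanged; the variable names are only illustrative.

```python
import re

# The original pattern from the question, unchanged.
ORIGINAL = re.compile(r"(scan\-\d+)(?:\w)+\.shadowserver\.org")

# First example: \d+ gives back the '2' so that (?:\w)+ has something to match.
first = ORIGINAL.search("scan-02.shadowserver.org")
print(first.group(1))  # scan-0

# Second example: \d+ holds a single digit and cannot give anything back.
second = ORIGINAL.search("scan-2.shadowserver.org")
print(second)  # None
```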
Simple solution:
(scan-\d+[a-z]?)\.shadowserver\.org
Explanation
(scan-\d+[a-z]?) Group 1: captures the word scan, followed by a literal -, followed by \d+ (one or more digits), followed by an optional lowercase letter [a-z]?. The ? makes the [a-z] part optional; without it, [a-z] would require exactly one lowercase letter.
See regex demo
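A minimal Python sketch of the simple solution, run against the input text from the question. The helper name capture and the extra string scan-2jk.shadowserver.org are made up here to show that two trailing letters are rejected.

```python
import re

FIXED = re.compile(r"(scan-\d+[a-z]?)\.shadowserver\.org")

def capture(user_agent: str):
    """Hypothetical helper: return group 1, or None when there is no match."""
    match = FIXED.search(user_agent)
    return match.group(1) if match else None

inputs = [
    "scan-2.shadowserver.org",
    "scan-02.shadowserver.org",
    "scan-2j.shadowserver.org",
    "scan-02j.shadowserver.org",
    "scan-17w.shadowserver.org",
    "scan-101p.shadowserver.org",
]
captured = [capture(s) for s in inputs]
print(captured)

# Made-up counterexample: two letters after the digits do not match.
print(capture("scan-2jk.shadowserver.org"))
```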

Trying to match zero outside the word boundaries

I have patterns like
FQC19515_TCELL001_20190319_165944.pdf
FQC19515_TBNK001_20190319_165944.pdf
I can match word TCELL and TBNK with this RegEX
^(\D+)-(\d+)-(\d+)([A-Z1-9]+)?.*
But if I have patterns like
FLW194640_T20NK022_20190323_131348.pdf
FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf
the above regex returns
T2 and C192 instead of T20NK and C1920 respectively
Is there a general regex that matches N zeros outside of these word boundaries?
Let's consider all 4 examples of your input:
FQC19515_TCELL001_20190319_165944.pdf
FQC19515_TBNK001_20190319_165944.pdf
FLW194640_T20NK022_20190323_131348.pdf
FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf
The first group, between start of line and the first "_" (e.g. FQC19515 in row 1)
consists of:
a non-empty sequence of letters,
a non-empty sequence of digits.
So the regex matching it, including the start of line anchor and a capturing group is:
^([A-Z]+\d+)
You used \D instead of [A-Z] but I think that [A-Z] is
more specific, as it matches only letters and not e.g. "_".
The next source char is _, so the regex can also include _.
And now the more difficult part: the second group to be captured has
actually 2 variants:
a sequence of letters and a sequence of digits (after that there is
a "_"),
a sequence of letters, a sequence of digits and another sequence of
letters (after that there are digits that you want to omit).
So the most intuitive way is to define 2 alternatives, each with
a respective positive lookahead:
alternative 1: [A-Z]+\d+(?=_),
alternative 2: [A-Z]+\d+[A-Z]+(?=\d).
But there is a slightly shorter way. Notice that both alternatives start
with [A-Z]+\d+.
So we can put this fragment first and include only the rest
as a non-capturing group ((?:...)) with 2 alternatives.
All the above should be surrounded with a capturing group:
([A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
So the whole regex can be:
^([A-Z]+\d+)_([A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
with m option ("^" matches also the start of each line).
For a working example see https://regex101.com/r/GDdt10/1
Your regex: ^(\D+)-(\d+) is wrong as after a sequence of non-digits
(\D+) you specified a minus which doesn't occur in your source.
Also the second minus does not correspond to your input.
Edit
To match all your strings, I modified slightly the previous regex.
The changes are limited to the matching group No 2 (after _):
Alternative No 1: [A-Z]{2,}+(?=\d) - two or more letters, after them
there is a digit, to be omitted. It will match TCELL and TBNK.
Alternative No 2: [A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)) - the previous
content of this group. It will match two remaining cases.
So the whole regex is:
^([A-Z]+\d+)_([A-Z]{2,}+(?=\d)|[A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
For a working example see https://regex101.com/r/GDdt10/2
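Here is a rough Python sketch of the final regex applied to the four sample names. One assumption to flag: the possessive quantifier {2,}+ is written as a plain {2,} because older versions of Python's re module reject possessive quantifiers. For these four inputs the captured groups come out the same.

```python
import re

NAME_RE = re.compile(
    r"^([A-Z]+\d+)_"                       # group 1: letters then digits, then '_'
    r"([A-Z]{2,}(?=\d)"                    # group 2, alt 1: 2+ letters before a digit
    r"|[A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))",  # group 2, alt 2: letters+digits[+letters]
    re.MULTILINE,                          # the 'm' option: ^ matches at each line start
)

text = "\n".join([
    "FQC19515_TCELL001_20190319_165944.pdf",
    "FQC19515_TBNK001_20190319_165944.pdf",
    "FLW194640_T20NK022_20190323_131348.pdf",
    "FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf",
])

# findall returns one (group 1, group 2) tuple per matching line.
pairs = NAME_RE.findall(text)
for prefix, code in pairs:
    print(prefix, code)
```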
As far as I understand, you could use:
^[A-Z]+\d+_\K[A-Z0-9]{5}
Explanation:
^ # beginning of line
[A-Z]+ # 1 or more capitals
\d+_ # 1 or more digits and 1 underscore
\K # forget all we have seen until this position
[A-Z0-9]{5} # exactly 5 capitals or digits
Demo
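Python's built-in re module does not support \K, so the sketch below swaps in a capturing group for the same "keep only what follows" effect; that substitution is mine, not part of the answer. It also shows what a fixed width of 5 returns for each of the four sample names.

```python
import re

# A capturing group stands in for \K, which Python's re module lacks.
FIVE_RE = re.compile(r"^[A-Z]+\d+_([A-Z0-9]{5})")

names = [
    "FQC19515_TCELL001_20190319_165944.pdf",
    "FQC19515_TBNK001_20190319_165944.pdf",
    "FLW194640_T20NK022_20190323_131348.pdf",
    "FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf",
]

codes = [FIVE_RE.match(name).group(1) for name in names]
print(codes)  # ['TCELL', 'TBNK0', 'T20NK', 'C1920']
```

Because the width is fixed at 5, the second name yields TBNK0 rather than TBNK, so this variant only fits codes that are exactly five characters long.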

How do you specify multiples in negative character classes in regular expressions?

I am trying to write a regular expression to search for anything but digits or the * or - characters, with one caveat. Where I'm hitting a wall is that I need to be able to allow three or fewer digits to be found but not four or more, though even one * or - shouldn't be found.
This is what I have so far (for three matches):
.*?([^0-9\*-]+).*?([^0-9\*-]+).*?([^0-9\*-]+).*?
I have no idea where to insert {4,} for the digits (I've tried and it doesn't seem to work anywhere) or how to change it to do as I want.
For instance, in "Jack has* 777 1883874 -sheep-" I'd like it to return "Jack has 777 sheep". Or in "2343klj-3***.net" I'd like it to return "klj 3 .net"
You may use the following regex (replacing with a literal space, " "):
(?:[-*\s]|\d{4,})+
See the regex demo.
Details
(?:[-*\s]|\d{4,})+ - a non-capturing group matching one or more consecutive repetitions of
[-*\s] - a single whitespace, - or *
| - or
\d{4,} - 4+ digits.
Next, to remove all leading and trailing whitespace you may use
^\s+|\s+$
and replace with an empty string. ^\s+ matches 1+ whitespaces at the start of the string and \s+$ matches 1+ whitespaces at the end of the string.
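The two steps can be sketched in Python as follows, using the two sample strings from the question. The function name clean is just an illustrative choice.

```python
import re

def clean(text: str) -> str:
    """Hypothetical helper combining the two replacements described above."""
    # Step 1: collapse runs of '-', '*', whitespace and 4+ digit numbers into one space.
    collapsed = re.sub(r"(?:[-*\s]|\d{4,})+", " ", text)
    # Step 2: strip leading and trailing whitespace.
    return re.sub(r"^\s+|\s+$", "", collapsed)

print(clean("Jack has* 777 1883874 -sheep-"))  # Jack has 777 sheep
print(clean("2343klj-3***.net"))               # klj 3 .net
```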
With the help here, this is what works. It may be impossible to do it all in one regex because of the conflict of needing no spaces at the beginning and end but spaces in between each remaining grouping.
First, a find and replace using ([-*\h]|\d{4,})+ and replacing with a space.
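Second, a trim using ^\s*(.*?)\s*$ and replacing with the first captured group (the lazy .*? matters: a greedy .* would keep the trailing whitespace inside the group).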

RegEx expression to handle multiple conditions of breaking sentences

I am trying to make a regex that is used in an exception.
Therefore it must return false for these sentences (the leading digits are included in the strings):
3.{17} this is italics and should break.{18} 
4. this is another sentence and should break. 
5. This is another sentence and should break. 
And it must return true for these:
There are 2 reasons for this 1. you are here and 2. you are communicating. 
Is it 2? they wanted to know. 
1 digit at the beginning but with 1. with a period should return true.
In other words, if the beginning of the string is a number followed by a period, it should return false (even if "\{\d+\}" follows it optionally) and the character following the space does not matter. And it must return true if the number and period (or ! or ?) is embedded in the sentence followed by a lower case character, in other cases it must be false.
As a further note: this goes into a java properties file, and the value is then passed to a perl5 regex engine to return broken text.
I try to express it in one expression, but somehow I cannot get it right.
This is what I have come up with so far:
^([^0-9\.]+[\.]|
[^\.!\?]*[\?!]+[\?!\.]+|
[0-9]+[^\?!\.]+[\?!\.]+|
[^0-9]*[0-9]+[^\?!\.]+[\?!\.]+)
(\{\d+\}[\u0020\u00A0]|
[\u0020\u00A0]*)[a-z]
I seem to have arrived at an impasse and can't see what I have wrong.
Thanks for any advice.
Update:
A simpler format with look-ahead: ^(?!\d+\.)[^.!?]*[.!?]+(\{\d+\}\s|\s*)\p{Ll} based on the comments.
You may use
^(?!\d+\.)[^.!?]*[.!?]+(\{\d+\}\s|\s*)\p{Ll}
See the regex demo.
The pattern matches:
^ - start of string anchor
(?!\d+\.) - a negative lookahead that will fail the match if its pattern is matched at the start of the string: 1+ digits followed with a dot
[^.!?]* - 0+ chars other than ., ! and ?
[.!?]+ - 1 or more ., ! or ? symbols
(\{\d+\}\s|\s*) - either a { + 1 or more digits + } + a whitespace, or 0+ whitespaces (if you are not interested in the value captured with this capturing group, you may turn it into a non-capturing one by adding ?: after the opening parenthesis).
\p{Ll} - a lowercase letter (if a u modifier is used, it will also match all Unicode lowercase letters).
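To check the pattern against the six sample sentences, here is a small Python sketch. Python's re module has no \p{Ll}, so [a-z] is used as an ASCII-only stand-in; that substitution and the function name continues_sentence are assumptions made for this illustration, and the real target is the Perl 5 engine mentioned in the question.

```python
import re

# [a-z] replaces \p{Ll} here (ASCII lowercase only) because Python's re lacks \p{..}.
NO_BREAK_RE = re.compile(r"^(?!\d+\.)[^.!?]*[.!?]+(\{\d+\}\s|\s*)[a-z]")

def continues_sentence(text: str) -> bool:
    """Hypothetical helper: True means the exception applies (do not break)."""
    return NO_BREAK_RE.search(text) is not None

should_be_false = [
    "3.{17} this is italics and should break.{18} ",
    "4. this is another sentence and should break. ",
    "5. This is another sentence and should break. ",
]
should_be_true = [
    "There are 2 reasons for this 1. you are here and 2. you are communicating. ",
    "Is it 2? they wanted to know. ",
    "1 digit at the beginning but with 1. with a period should return true.",
]

for sentence in should_be_false + should_be_true:
    print(continues_sentence(sentence), repr(sentence))
```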

Match against 1 hyphen per any number of digit groups

I'm trying to come up with some regex to match against 1 hyphen per any number of digit groups. No letters ([a-z][A-Z]) are allowed.
123-356-129811231235123-1235612346123451235
/[^\d-]/g
The one above works for that string, but it will also let the following go through:
1223--1235---123123-------
I was looking at the following post How to match hyphens with Regular Expression? for an answer, but I didn't find anything close.
@Konrad Rudolph gave a good example.
Regular expression to match 7-12 digits; may contain space or hyphen
This tool is useful for me http://www.gskinner.com/RegExr/
Assuming it can't ever start with a hyphen:
^\d(-\d|\d)*$
broken down:
^ # match beginning of line
\d # match single digit
(-\d|\d)* # match hyphen & digit or just a digit (0 or more times)
$ # match end of line
That makes every hyphen have to have a digit immediately following it. Keep in mind though, that the following are examples of legal patterns:
213-123-12314-234234
1-2-3-4-5-6-7
12234234234
gskinner example
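A minimal Python sketch of this pattern against the strings mentioned so far. The helper name is_digit_groups and the inputs "-123" and "123-" are made up to show the behaviour at the edges.

```python
import re

DIGIT_GROUPS_RE = re.compile(r"^\d(-\d|\d)*$")

def is_digit_groups(value: str) -> bool:
    """Hypothetical helper: True if every hyphen sits between two digits."""
    return DIGIT_GROUPS_RE.search(value) is not None

accepted = [
    "123-356-129811231235123-1235612346123451235",
    "213-123-12314-234234",
    "1-2-3-4-5-6-7",
    "12234234234",
]
rejected = [
    "1223--1235---123123-------",  # from the question
    "-123",                        # made-up: leading hyphen
    "123-",                        # made-up: trailing hyphen
]

for value in accepted + rejected:
    print(is_digit_groups(value), value)
```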
Alternatively:
^(\d+-)+(\d+)$
So it's one or more group(s) of digits followed by hyphen + final group of digits.
Nothing very fancy, but in my tests it matched only when there were hyphen(s) with digits on both sides.
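For this alternative, a short Python sketch makes one consequence visible: because (\d+-)+ must match at least once, a string with no hyphen at all is rejected. The helper name has_hyphenated_groups is illustrative only.

```python
import re

HYPHENATED_RE = re.compile(r"^(\d+-)+(\d+)$")

def has_hyphenated_groups(value: str) -> bool:
    """Hypothetical helper: True for digit groups joined by single hyphens."""
    return HYPHENATED_RE.search(value) is not None

print(has_hyphenated_groups("213-123-12314-234234"))        # True
print(has_hyphenated_groups("1-2-3-4-5-6-7"))               # True
print(has_hyphenated_groups("12234234234"))                 # False: no hyphen at all
print(has_hyphenated_groups("1223--1235---123123-------"))  # False
```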