Regex how to match two similar numbers in separate match groups? - regex

I got the following string:
[13:49:38 INFO]: Overall : Mean tick time: 4.126 ms. Mean TPS: 20.000
The two numbers, 4.126 and 20.000, should be matched, each into its own capture group.
My current expression is (\d+.\d{3}), which matches 4.126. How can I match 20.000 into a second capture group? Adding the same capture group again makes it find nothing. So what I basically need is: "search for the first number, then ignore everything until you find the next digit."

You could use something like this: (\d+\.\d{3}).+?(\d+\.\d{3})$, which is essentially your regex (plus a minor fix) twice, with the difference that the second copy must match at the end of the string.
Another minor note: your regex has a potential issue, in that you match the decimal point with an unescaped period character. In regular expression syntax, the period means "any character", so your expression would also match 4s222. Adding a \ in front makes the regex engine treat it as a literal period and not a special character.
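As a quick check of the pattern above, here is a minimal Python sketch. The choice of Python's re module and the variable names are assumptions for illustration; the question does not say which engine is in use.

```python
import re

line = "[13:49:38 INFO]: Overall : Mean tick time: 4.126 ms. Mean TPS: 20.000"

# Escaped dots match a literal "."; the lazy .+? skips the text between the numbers,
# and $ forces the second number to sit at the end of the line.
two_numbers = re.compile(r"(\d+\.\d{3}).+?(\d+\.\d{3})$")
tick_time, tps = two_numbers.search(line).groups()
print(tick_time, tps)

# With an unescaped dot, "." matches any character, so "4s222" is accepted too.
unescaped_matches = re.fullmatch(r"\d+.\d{3}", "4s222") is not None
escaped_matches = re.fullmatch(r"\d+\.\d{3}", "4s222") is not None
```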

Related

How do I create a regex expression that does not allow the same 9 duplicate numbers in a social security number, with or without hyphens?

The first thing I tried to do, is get the regex matching what I DON'T want. This way, I could just flip it to NOT accept that same input. This is where I came up with the first part of this regex.
Accept all 9 digit numbers, where all 9 digits are identical (without dashes): "^(\d)\1{8}$". This expression works as expected (as seen here: (https://regex101.com/r/Ez8YC3/1)).
The second expression should do the same, with dashes formatted as follows xxx-xx-xxxx: "^(\d)\1{2}-(\d)\1{1}-(\d)\1{3}$". This expression works as expected (as seen here: https://regex101.com/r/bodzIX/1).
Now what I want to do at this point, is combine them together to look for BOTH conditions. However when I do that it seems to break, and only match 9 digit numbers that are identical throughout WITH dashes: "^(\d)\1{2}-(\d)\1{1}-(\d)\1{3}$|^(\d)\1{8}$". This can be seen here: https://regex101.com/r/lPnksf/1.
I may be getting a little ahead of myself here, but in order to show my work as much as possible, I also tried flipping those regex separately, which also did not work as expected.
Condition #1 flipped: "^(?!(\d)\1{8})$". Can be seen here: https://regex101.com/r/ed51yk/1.
Condition #2 flipped: "^(?!(\d)\1{2}-(\d)\1{1}-(\d)\1{3})$". Can be seen here: https://regex101.com/r/UYfoMK/1.
I would expect the two expressions (when flipped) to match any 9 digit number (with or without dashes) where the digits are not all identical. However, this does not happen at all.
This is the final regex that I came up with, which is clearly not doing what I would expect it to: "^(?!(\d)\1{2}-(\d)\1{1}-(\d)\1{3})$|^(?!(\d)\1{8})$". Can be seen here: https://regex101.com/r/9eHhF5/1
At the end of the day, I want to combine these 2 expressions, with this one (that already works as intended): "^(?!000|666|9\d\d)\d{3}-(?!00)\d\d-(?!0000)\d\d\d\d$". Can be seen here: https://regex101.com/r/AdRI8i/1.
I am still pretty new to regex, and really want to understand why I can't simply wrap the condition in (?!...) in order to match the opposite condition.
Thank you in advance
What you want to do is not flip, but reverse the regex logic.
Yes, to reverse the pattern logic, you should use a negative lookahead, but there are caveats.
First, the $ end of string anchor: if it was at the end of the "positive" regex, it must also be moved to the lookahead in the reverse pattern. So, your ^(?!(\d)\1{8})$ regex must be written as ^(?!(\d)\1{8}$). Same goes for your second regex.
Next, mind that each subsequent capturing group gets an incremented ID number, so you cannot keep the same backreferences when you "join" patterns with OR | operator. You must adjust these IDs to reflect their new values in the new regex.
So, you want to match a string that matches ^(?!000|666|9\d\d)\d{3}-(?!00)\d\d-(?!0000)\d\d\d\d$ first (let's note \d\d\d\d = \d{4}), then you can add restrictions with lookaheads:
(?!(\d)\1{8}$) - fails the match if, immediately from the current position, it matches 9 identical digits and then the end of the string
(?!(\d)\2\2-(\d)\2-(\d)\2{3}$) - (note the continued ID incrementing) fails the match if, immediately from the current position, it matches 3 digits identical to the first one, -, 2 identical digits, -, 4 identical digits, and then the end of the string.
So, to follow your logic, you can use
^(?!(\d)\1{8}$)(?!(\d)\2\2-(\d)\2-(\d)\2{3}$)(?!000|666|9\d\d)\d{3}-(?!00)\d\d-(?!0000)\d{4}$
See the regex demo
As the lookaheads are non-consuming patterns, i.e. the regex index remains at the same position after matching their pattern sequences where it was before, the 3 lookaheads will all be tried at the start of the string (see the ^ anchor). If any of the three negative lookaheads at the start fails, the whole string match will be failed right away.
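The anchor placement and the combined pattern can be checked with a short Python sketch. The use of Python's re module and the helper name is_valid_ssn are assumptions made for illustration only.

```python
import re

# Misplaced anchor: after the lookahead, "$" must match at position 0,
# so only the empty string can ever match.
misplaced = re.compile(r"^(?!(\d)\1{8})$")

# Anchor moved inside the lookahead, followed by a pattern that consumes 9 digits.
not_all_same = re.compile(r"^(?!(\d)\1{8}$)\d{9}$")

# The combined pattern from the answer above, split over two lines for readability.
ssn = re.compile(
    r"^(?!(\d)\1{8}$)(?!(\d)\2\2-(\d)\2-(\d)\2{3}$)"
    r"(?!000|666|9\d\d)\d{3}-(?!00)\d\d-(?!0000)\d{4}$"
)

def is_valid_ssn(candidate: str) -> bool:
    """Hypothetical helper: True if the combined pattern accepts the string."""
    return ssn.match(candidate) is not None

print(is_valid_ssn("123-45-6789"), is_valid_ssn("111-11-1111"))
```

Because the last part of the combined pattern requires dashes, an undashed input is rejected regardless of the first lookahead.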
This regex matches what you don't want as a social security number:
^(?:(\d)\1{8})|(?:(\d)\2{2}-\2{2}-\2{4})$
By this regex you match only what you want:
^(?!(?:(\d)\1{8})|(?:(\d)\2{2}-\2{2}-\2{4})).*$

Numbers between 99 and 9999999 regular expression

I am trying to generate a regular expression that will match any numbers within the range of 99 and 9999999. I have trouble understanding how generating number ranges generally works. I managed to find a range generator online that does the job for me, but I want to understand how it actually works.
My attempt to do this range is as follows:
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
This is supposed to match 99, any 3 digit number or any 4 digit number, but it does not work as expected. When tested it matches only numbers 99 and 3 digit numbers. Four digit numbers are not matched at all. If I only write the part for 4 digit numbers on its own as
[1-9][0-9][0-9][0-9]
It matches 4 digit numbers, but when I construct it as in the first example it does not work. Can someone give me some clarification how this actually works and how successfully to generate a regular expression for the range of 99 to 9999999.
So you want to know how this works...
Regexes have no real understanding of the values of numbers in your string; they only care how the numbers are represented, which is why looking for numbers in a range seems more awkward than it should be. The only reason your regex engine can understand a range in a character class like [0-9] at all is because of the characters' positions in a list (a character range like [&-~] is just as valid, and equally understandable to it).
So, to match a range like 99-9999999, ya gotta spell out what that looks like: literal "99", or three digits without a leading zero, or four digits without a leading zero, and so on.
But this is what your demo did, right? And it didn't work. Of your test string "9293" your regex only matched "929". What happened here is the regex engine is eager to return a complete match - as soon as it found one it returned it, even though a better/longer match might have occurred later.
Here's how that match happened. (I'll skip some details like grouping, as they're not super relevant here.)
Step 1.
The engine compares the first token in the regex (the first 9 of the alternate 99) with the first character in the string (9).
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ✅
Success, they match.
Step 2.
The engine then advances to the next token in the regex (the second 9 of 99) and the next character in the string (2), and compares them.
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ❌
Failure, no match. The engine would stop and return the failure here, but you're using alternation via |, so it knows there's an alternate expression to try.
Step 3.
The engine advances to the first token of the next alternate expression in the regex (the [1-9] of [1-9][0-9][0-9]), and rewinds the position in the string to the first character (9).
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ✅
Success, they match.
Step 4.
Continuing on: the next token, [0-9], is compared with the next character, 2.
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ✅
Match.
Step 5.
And again: the last [0-9] of this alternate is compared with the next character, 9.
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ✅
Success. The complete expression matches. There's no need to try the remaining alternate. The match here returned is:
929
As you've probably figured out, if your input string was instead "9923" then step 2 would've matched and the engine there would've stopped and returned "99".
As you've also probably figured out, if you rearrange your alternate expressions from longest to shortest
([1-9][0-9][0-9][0-9]|[1-9][0-9][0-9]|99)
the longest would be attempted first, which would match and return your expected "9293".
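The effect of alternation order described in the walkthrough can be reproduced with a small Python sketch. Python's re module is an assumption here; like the engine in the demo, it tries alternates left to right and returns the first one that succeeds.

```python
import re

shortest_first = r"(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])"
longest_first = r"([1-9][0-9][0-9][0-9]|[1-9][0-9][0-9]|99)"

# The three-digit alternate succeeds before the four-digit one is ever tried.
eager_result = re.search(shortest_first, "9293").group()

# With "9923" the very first alternate, the literal 99, already succeeds.
literal_result = re.search(shortest_first, "9923").group()

# Reordered from longest to shortest, the four-digit alternate wins.
reordered_result = re.search(longest_first, "9293").group()

print(eager_result, literal_result, reordered_result)
```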
Simplifying
It's still pretty wordy though, especially as you crank up the number of digits in your range. There are a couple things you can do to simplify it.
The character class [0-9] can be represented by the shorthand character class \d.
([1-9]\d\d\d|[1-9]\d\d|99)
And instead of repeating them use a quantifier in curly brackets like so:
([1-9]\d{3}|[1-9]\d{2}|99)
As it happens, quantifiers can also take the form of {min,max} (written without a space), so you can combine the two similar alternates:
([1-9]\d{2,3}|99)
You might expect this to land you back returning "929" again, the engine being eager and all, but quantifiers are by default greedy so they'll try to pick up as much as they can. This lends itself well to your larger desired range:
([1-9]\d{2,6}|99)
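A minimal Python sketch of the greedy behaviour, assuming Python's re module. The lazy variant with a trailing ? is not part of the answer and is included only as a contrast.

```python
import re

# Greedy {2,3} takes as many digits as it can, so all four digits are returned.
greedy_small = re.search(r"([1-9]\d{2,3}|99)", "9293").group()

# The same holds for the wider range.
greedy_wide = re.search(r"([1-9]\d{2,6}|99)", "1234567").group()

# Contrast (an addition for illustration): a lazy quantifier stops at the minimum.
lazy_wide = re.search(r"([1-9]\d{2,6}?|99)", "1234567").group()

print(greedy_small, greedy_wide, lazy_wide)
```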
Finishing up
What you do with it from here depends on what you need the regex to do. As it stands the parentheses are superfluous, there's no point in creating a capturing group of the entire regex itself. However a decision comes when you've got an input string like:
You will likely be eaten by 1000 grue.
If you're trying to pluck out how many grue are about to eat you, you might use
[1-9]\d{2,6}|99
which will return 1000.
However that sorta runs back into the original problem with your demo. If it's "12345678 grue", which is out of range, this'll match "1234567" which might not be what you want. You can make sure the number you've matched isn't immediately followed by (or preceded by) another digit by using negative lookarounds.
(?<!\d)([1-9]\d{2,6}|99)(?!\d)
(?<!\d) means "from this position, the prior character is not a digit" while (?!\d) means "from this position, the next character is not a digit."
The parentheses around the alternates are back as they're necessary for grouping here, otherwise the lookbehind would only be part of and apply in the first alternate expression and the lookahead would only be part of and apply in the second alternate.
On the other hand if you're trying to make sure the entire string only consists of a number in your range you'll want to instead use the anchors ^ and $ (start of string and end of string, respectively):
^([1-9]\d{2,6}|99)$
And finally you can trade the capturing group out for a non-capturing group (?:...), so:
^(?:[1-9]\d{2,6}|99)$
or
(?<!\d)(?:[1-9]\d{2,6}|99)(?!\d)
You'll still grab the number as the match, it just won't be repeated in a group capture. (Lookarounds are already non-capturing, no need to worry about those.)
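The two final variants can be tried out with the following Python sketch. Python's re module and the helper name in_range are assumptions for illustration.

```python
import re

# Variant for plucking a number out of surrounding text.
in_range_anywhere = re.compile(r"(?<!\d)(?:[1-9]\d{2,6}|99)(?!\d)")

# Variant for validating that the whole string is a number in range.
whole_string = re.compile(r"^(?:[1-9]\d{2,6}|99)$")

grue = in_range_anywhere.findall("You will likely be eaten by 1000 grue.")

# Eight digits are out of range: the lookarounds reject every partial match.
too_many = in_range_anywhere.findall("12345678 grue")

# Without the lookarounds the first seven digits are matched anyway.
unguarded = re.findall(r"[1-9]\d{2,6}|99", "12345678 grue")

def in_range(text: str) -> bool:
    """Hypothetical helper: True if the entire string is 99 or 100..9999999."""
    return whole_string.match(text) is not None

print(grue, too_many, unguarded)
```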
First of all, you need some boundaries for your regex (anything except a digit; in my example I use ^ and $, the beginning and end of the line or string).
Try this one:
^([1-9][0-9]{2,6}|99)$

Putting a group within a group [123[a-u]]

I'm having a lot more difficulty than I anticipated in creating a simple regex to match any specific characters, including a range of characters from the alphabet.
I've been playing with regex101 for a while now, but every combination seems to result in no matches.
Example expression:
[\n\r\t\s\(\)-]
Preferred expression:
[[a-z][a-Z]\n\r\t\s\(\)-]
Example input:
(123) 241()-127()()() abc ((((((((
Ideally the expression will capture every character except the digits
I know I could always manually input "abcdefgh".... but there has to be an easier way. I also know there are easier ways to capture numbers only, but there are some special characters and letters which I may eventually need to include as well.
In a regex, a character class can contain a range, like the [a-z] in your example, which matches any single lowercase letter from a to z. To match more than one character, add a + after it, or, to fix the number of characters matched, use {n}, where n is the count you want. So [a-z]+ matches one or more letters, and [a-z]{4} matches exactly four consecutive letters between a and z.
You can use partial intervals. For example, [a-j] will match any character from a to j, so [a-j]{2} applied to the string a6b7cd will match only cd. You can also use several intervals within the same class, like this: [a-j4-6]{4}. This regex will match ab44 but not ab47.
I overlooked a pretty small character. The term I was looking for was "alternation", apparently.
[\r\t\n]|[a-z], with the missing element being the | character. This allows the regex to match a character from either the first class or the second.
At least that's my conclusion when testing this specific example.
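For comparison, here is a Python sketch showing that ranges and individual characters can share one character class, so no nesting or alternation is needed. The pattern [a-zA-Z\s()-] is an assumed rewrite of the "preferred expression", not something given in the thread, and Python's re module is likewise an assumption.

```python
import re

text = "(123) 241()-127()()() abc (((((((("

# Ranges and single characters sit side by side in one class;
# the trailing "-" is a literal hyphen because it comes last.
non_digit_class = re.compile(r"[a-zA-Z\s()-]")
matched = "".join(non_digit_class.findall(text))

# A range must run from a lower to a higher code point, so a-Z is rejected.
try:
    re.compile(r"[a-Z]")
    reversed_range_ok = True
except re.error:
    reversed_range_ok = False

# The partial intervals from the answer above.
pairs = re.findall(r"[a-j]{2}", "a6b7cd")
accepts_ab44 = re.fullmatch(r"[a-j4-6]{4}", "ab44") is not None
accepts_ab47 = re.fullmatch(r"[a-j4-6]{4}", "ab47") is not None

print(matched)
```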

Do regular expression engines parse characters one by one?

It is kind of absurd to ask. Anyway, let me proceed.
While exploring regular expressions, I came across a scenario where the expression is
[A-Z0-9]+(\d\d\.\d+)
The input string is 123.456, and the pattern is matched as follows:
The pattern [A-Z0-9]+ could have matched up to 123, but then it would not be followed by 2 more digits (\d\d) and a literal dot character. So the engine ended up with the characters 23.456 in the first subgroup.
Does the regular expression engine check for a match by parsing one character at a time? That was my assumption.
By the look of this, it seems not. The engine must be parsing characters as well as moving the matching window back and forth in order to find a match.
Correct me if I am wrong.
A regex engine parses the string according to the pattern it is given.
Your pattern is [A-Z0-9]+(\d\d\.\d+). Given the 123.456 string, the [A-Z0-9]+ is first tried from the beginning of the string. 123 is grabbed first (since + is a greedy quantifier). Then the regex engine tries to match the rest of the string with (\d\d\.\d+) - and fails. Backtracking occurs because the regex engine knows that [A-Z0-9]+ can match a different (smaller) portion of the string, and thus, the 3 is dropped from the currently consumed chars, and (\d\d\.\d+) is retried to match 3.456, but there must be 2 digits before a dot. Backtracking happens again.
Thus, only 1 remains outside the capturing group 1 value.
Also, have a look at the steps generated by the debugger at regex101.com, which marks each backtracking step.
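The outcome of the backtracking described above can be confirmed with a few lines of Python. Python's re module is an assumption; any backtracking engine should give the same groups.

```python
import re

m = re.search(r"[A-Z0-9]+(\d\d\.\d+)", "123.456")

whole = m.group(0)        # the overall match
first_group = m.group(1)  # what (\d\d\.\d+) ended up with after backtracking

# Everything before group 1 is what [A-Z0-9]+ kept after giving back "3" and "2".
prefix = "123.456"[: m.start(1)]

print(whole, first_group, prefix)
```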

Regex to find last occurrence of pattern in a string

My string being of the form:
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
I only want to match against the last segment of whitespace before the last period(.)
So far I am able to capture whitespace but not the very last occurrence using:
\s+(?=\.\w)
How can I make it less greedy?
In a general case, you can match the last occurrence of any pattern using the following scheme:
pattern(?![\s\S]*pattern)
(?s)pattern(?!.*pattern)
pattern(?!(?s:.*)pattern)
where [\s\S]* matches any zero or more chars as many as possible. (?s) and (?s:.) can be used with regex engines that support these constructs so as to use . to match any chars.
In this case, rather than \s+(?![\s\S]*\s), you may use
\s+(?!\S*\s)
See the regex demo. Note the \s and \S are inverse classes, thus, it makes no sense using [\s\S]* here, \S* is enough.
Details:
\s+ - one or more whitespace chars
(?!\S*\s) - that are not immediately followed with any 0 or more non-whitespace chars and then a whitespace.
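Both the general scheme and the whitespace-specific pattern can be sketched in Python as follows. The helper last_occurrence is hypothetical, and Python's re module is an assumption.

```python
import re

text = "as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"

def last_occurrence(pattern: str, haystack: str):
    """Hypothetical helper: wrap `pattern` in pattern(?![\\s\\S]*pattern)."""
    return re.search(rf"(?:{pattern})(?![\s\S]*(?:{pattern}))", haystack)

# General scheme: the last "sd" in the string.
last_sd = last_occurrence("sd", text)

# Whitespace-specific pattern from the answer: \S* is enough inside the lookahead.
last_space = re.search(r"\s+(?!\S*\s)", text)
all_hits = re.findall(r"\s+(?!\S*\s)", text)

print(last_sd.start(), last_space.start())
```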
You can try like so:
(\s+)(?=\.[^.]+$)
(?=\.[^.]+$) Positive look ahead for a dot and characters except dot at the end of line.
Demo:
https://regex101.com/r/k9VwC6/3
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=((?<=\S)\s+)).*
replaced by `>\1<`
> <
As a more generalized example
This example defines several needles and finds the last occurrence of either one of them. In this example the needles are:
defined word findMyLastOccurrence
whitespaces (?<=\S)\s+
dots (?<=[^\.])\.+
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)).*
replaced by `>\1<`
>..<
Explanation:
Part 1 .*
is greedy and finds everything as long as the needles are found. Thus, it also captures all needle occurrences until the very last needle.
edit to add:
in case we are interested in the first hit, we can prevent the greediness by writing .*?
Part 2 (?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=NotNeedlePart)NeedlePart+))
defines the 'break' condition for the greedy 'find-all'. It consists of several parts:
(?=(needles))
positive lookahead: ensure that previously found everything is followed by the needles
findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=NotNeedlePart)NeedlePart+
several needles for which we are looking. Needles are patterns themselves.
In case we look for a collection of whitespaces, dots or other needleparts, the pattern we are looking for is actually: anything which is not a needlepart, followed by one or more needleparts (thus needlepart is +). See the example for whitespaces \s negated with \S, actual dot . negated with [^.]
Part 3 .*
as we aren't interested in the remainder, we match it and don't use it any further. We could capture it with parentheses and use it as another group, but that's out of scope here.
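The replacements shown above can be reproduced with re.sub in Python. Python's re module and the variable names are assumptions; the patterns are the ones from this answer.

```python
import re

text = ("as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd "
        "d.sdfsd. sdfsdf sd ..COM")

# Single needle: the last run of whitespace.
last_whitespace = re.sub(r".*(?=((?<=\S)\s+)).*", r">\1<", text)

# Several needles: whichever of them occurs last wins (here the two dots).
needles = r"findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+"
last_needle = re.sub(rf".*(?=({needles})).*", r">\1<", text)

# Lazy .*? stops at the first needle instead of the last one.
first_needle = re.sub(rf".*?(?=({needles})).*", r">\1<", text)

print(last_whitespace, last_needle, first_needle)
```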
SIMPLE SOLUTION for a COMMON PROBLEM
All of the answers that I have read through are way off topic, overly complicated, or just simply incorrect. This question is a common problem that regex offers a simple solution for.
Breaking Down the General Problem
THE STRING
The generalized problem is such that there is a string that contains several characters.
THE SUB-STRING
Within the string is a sub-string made up of a few characters. Oftentimes this is a file extension (e.g. .c, .ts, or .json) or a top-level domain (e.g. .com, .org, or .io), but it could be something as arbitrary as McDonald's Mulan Szechuan Sauce. The point is, it may not always be something simple.
THE BEFORE VARIANCE (Most important part)
The before variance is an arbitrary character, or characters, that always comes just before the sub-string. In this question, the before variance is an unknown amount of white-space. It's a variance because the amount of white-space that needs to be matched varies (or has a dynamic quantity).
Describing the Solution in Reference to the Problem
(Solution Part 1)
Oftentimes when working with regular expressions it's necessary to work in reverse.
We will start at the end of the problem described above and work backwards, so we are going to start at The Before Variance (or #3).
So, as mentioned above, The Before Variance is an unknown amount of white-space. We know that it includes white-space, but we don't know how much, so we will use the meta sequence for Any Whitespace with the one or more quantifier.
The Meta Sequence for "Any Whitespace" is \s.
The "One or More" quantifier is +
so we will start with...
NOTE: In ECMAScript regex the / characters are like quotes around a string.
const regex = /\s+/g
I also included the g to tell the engine to set the global flag to true. I won't explain flags, for the sake of brevity, but if you don't know what the global flag does, you should DuckDuckGo it.
(Solution Part 2)
Remember, we are working in reverse, so the next part to focus on is the Sub-string. In this question it is .com, but the author may want it to match against a value with variance, rather than just the static string of characters .com, therefore I will talk about that more below, but to stay focused, we will work with .com for now.
It's necessary that we use a concept here that's called ZERO LENGTH ASSERTION. We need a "zero-length assertion" because we have a sub-string that is significant, but is not what we want to match against. "Zero-length assertions" allow us to move the point in the string where the regular expression engine is looking at, without having to match any characters to get there.
The Zero-Length Assertion that we are going to use is called LOOK AHEAD, and its syntax is as follows.
Look-ahead Syntax: (?=Your-SubStr-Here)
We are going to use the look ahead to match against a variance that comes before the pattern assigned to the look-ahead, which will be our sub-string. The result looks like this:
const regex = /\s+(?=\.com)/gi
I added the insensitive flag to tell the engine not to be concerned with the case of the letters; in other words, the regular expression /\s+(?=\.cOM)/gi is the same as /\s+(?=\.Com)/gi, and both are the same as /\s+(?=\.com)/gi and /\s+(?=\.COM)/gi. Every one of the regular expressions just listed is equivalent so long as the i flag is set.
That's it! You can try the regular expression out on regex101 if you like.
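For readers outside JavaScript, here is a rough Python counterpart of /\s+(?=\.com)/gi. Treating re.findall as the stand-in for the g flag and re.IGNORECASE for the i flag is an assumption about equivalence, not something stated in this answer.

```python
import re

text = "as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"

# \s+ matches the whitespace; the lookahead checks for ".com" without consuming it.
before_com = re.compile(r"\s+(?=\.com)", re.IGNORECASE)
hits = before_com.findall(text)  # findall plays the role of the g flag

# The case of the letters in the pattern does not matter with the i flag set.
same_hits = re.findall(r"\s+(?=\.cOM)", text, re.IGNORECASE)

# The dot has to be escaped: unescaped, it would accept any character before "com".
loose = re.search(r"\s+(?=.com)", "sd xCOM", re.IGNORECASE) is not None
strict = re.search(r"\s+(?=\.com)", "sd xCOM", re.IGNORECASE) is not None

print(hits)
```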
I mentioned above working with a sub-string that has more variance than .com.
You could use (\s*)(?=\.\w{3,}) for instance.
The problem with this regex is that even though it matches .txt, .org, .json, and .unclepetespurplebeet, the regex isn't safe. When using the question's string of...
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
as an example, you can see on regex101 that there are 3 lines in the string. Those lines represent areas where the lookahead's assertion returned true. Each time the assertion was true, a possibility for an incorrect final match was created. Though only one match was returned in the end, and it was the correct match, when this is implemented in a program or website that's running in production, you can pretty much guarantee that the regex is not only going to fail, but it's going to fail horribly and you will come to hate it.
You can try this. It will capture the last white space segment - in the first capture group.
(\s+)\.[^\.]*$
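A short Python sketch of this capture-group approach, assuming Python's re module. It also runs the lookahead variant (\s+)(?=\.[^.]+$) from an earlier answer for comparison.

```python
import re

text = "as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"

# Whitespace, then a dot, then no further dots up to the end of the string.
consuming = re.search(r"(\s+)\.[^\.]*$", text)

# Same idea, but the dot and what follows are only asserted, not consumed.
asserting = re.search(r"(\s+)(?=\.[^.]+$)", text)

print(consuming.start(1), asserting.start(1))
```

In both cases the whitespace is available as group 1; the difference is that the overall match of the first pattern also includes .COM.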