Regular Expression pattern explanation - regex

I don't know what these regular expressions are doing:
(?>[^\,]*\,){3}([^\,]*)[\']?
(?>[^\,]*\,){4}([^\,]*)[\']?
Could anyone explain them in more detail?

There is an awesome site, http://regex101.com, for exactly this! It explains regular expressions and lets you test and debug them.
Your patterns match 4 (5 for the second one) comma-separated values and return the last one as a single capturing group:
(?>...) is an atomic group. Once it has matched, the engine never backtracks into it.
[^\,] matches any character except a comma.
[^\,]*\, means any number (even zero) of non-comma characters, followed by a single comma.
(?>[^\,]*\,){3} means repeat the above 3 times.
([^\,]*)[\']? means one more word without commas, captured as a group, optionally followed by a single quote (').
For example, in 1,,333,4,5 the first one will match 1,,333,4 and return 4 as the captured group. The second one will match 1,,333,4,5 and return 5 as the group.
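To check this outside regex101, here is a minimal Python sketch. It assumes plain non-capturing groups (?:...) in place of the atomic (?>...), because Python's re module only accepts atomic groups from version 3.11 on; for this input the result is the same. The helper name nth_field is made up for illustration. Note that the whole match stops before any trailing comma, since [\']? can only consume a single quote.

```python
import re

# Non-capturing groups stand in for the atomic (?>...) groups of the original.
FOURTH = re.compile(r"(?:[^,]*,){3}([^,]*)'?")
FIFTH = re.compile(r"(?:[^,]*,){4}([^,]*)'?")


def nth_field(pattern, text):
    """Return (whole match, captured group), or None if nothing matches."""
    m = pattern.match(text)
    return (m.group(0), m.group(1)) if m else None


print(nth_field(FOURTH, "1,,333,4,5"))  # whole match and the 4th value
print(nth_field(FIFTH, "1,,333,4,5"))   # whole match and the 5th value
print(nth_field(FIFTH, "a,b,c"))        # too few commas, so no match
```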
Edit: Even more description.
Regular expressions have groups. These are parts of a regular expression that can take quantifiers (how many times to repeat them, e.g. {3}) and some options. Also, after the regex has matched, we can find out what each capturing group matched.
Atomic groups, in short, take as much as they can and never give anything back. They also do not capture, so what they matched cannot be retrieved afterwards. They are used here purely for performance reasons.
So, we need to take as a group the 4th word from comma-separated values. We will do it like this:
We will take 3 times ({3}) an atomic group ((?>...)):
Which takes a word -- any number (*) of non-comma characters ([^\,])
[^...] means any symbol except described ones.
And a comma (\,) that separates that word from the next one
Now, our wanted word starts. We open a group ((...))
That will take a word as described above: [^\,]*
Then there is possibly one trailing single quote; take it if there is one ([\']? or simply '?)
? means 0 or 1 of the preceding item; here that is a single quote.
So, it starts on first word in first atomic group, takes it all, then takes a comma. After that, it is repeated 2 times more. That takes 3 first words with their commas.
After that, one non-atomic group takes the 4th word.

Related

How do I create a regex expression that does not allow the same 9 duplicate numbers in a social security number, with or without hyphens?

The first thing I tried to do, is get the regex matching what I DON'T want. This way, I could just flip it to NOT accept that same input. This is where I came up with the first part of this regex.
Accept all 9 digit numbers, where all 9 digits are identical (without dashes): "^(\d)\1{8}$". This expression works as expected (as seen here: (https://regex101.com/r/Ez8YC3/1)).
The second expression should do the same, with dashes formatted as xxx-xx-xxxx: "^(\d)\1{2}-(\d)\1{1}-(\d)\1{3}$". This expression works as expected (as seen here: https://regex101.com/r/bodzIX/1).
Now what I want to do at this point, is combine them together to look for BOTH conditions. However when I do that it seems to break, and only match 9 digit numbers that are identical throughout WITH dashes: "^(\d)\1{2}-(\d)\1{1}-(\d)\1{3}$|^(\d)\1{8}$". This can be seen here: https://regex101.com/r/lPnksf/1.
I may be getting a little ahead of myself here, but in order to show my work as much as possible, I also tried flipping those regex separately, which also did not work as expected.
Condition #1 flipped: "^(?!(\d)\1{8})$". Can be seen here: https://regex101.com/r/ed51yk/1.
Condition #2 flipped: "^(?!(\d)\1{2}-(\d)\1{1}-(\d)\1{3})$". Can be seen here: https://regex101.com/r/UYfoMK/1.
I would expect the two expressions (when flipped) to match any 9-digit number (with or without dashes) where the digits are not all identical. However, this does not happen at all.
This is the final regex that I came up with, which is clearly not doing what I would expect it to: "^(?!(\d)\1{2}-(\d)\1{1}-(\d)\1{3})$|^(?!(\d)\1{8})$". Can be seen here: https://regex101.com/r/9eHhF5/1
At the end of the day, I want to combine these 2 expressions, with this one (that already works as intended): "^(?!000|666|9\d\d)\d{3}-(?!00)\d\d-(?!0000)\d\d\d\d$". Can be seen here: https://regex101.com/r/AdRI8i/1.
I am still pretty new to regex, and really want to understand why I can't simply wrap the condition in (?!...) in order to match the opposite condition.
Thank you in advance
What you want to do is not flip, but reverse the regex logic.
Yes, to reverse the pattern logic, you should use a negative lookahead, but there are caveats.
First, the $ end of string anchor: if it was at the end of the "positive" regex, it must also be moved to the lookahead in the reverse pattern. So, your ^(?!(\d)\1{8})$ regex must be written as ^(?!(\d)\1{8}$). Same goes for your second regex.
Next, mind that each subsequent capturing group gets an incremented ID number, so you cannot keep the same backreferences when you "join" patterns with OR | operator. You must adjust these IDs to reflect their new values in the new regex.
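Both caveats can be reproduced with a small Python sketch. The variable names are made up for illustration, and the behavior shown is that of Python's re module, where a backreference to a group that did not take part in the match simply fails; other engines may treat that case differently.

```python
import re

# Caveat 1: the $ has to sit inside the lookahead.
outside = re.compile(r"^(?!(\d)\1{8})$")  # "^" directly followed by "$": only "" can match
inside = re.compile(r"^(?!(\d)\1{8}$)")   # asserts "not nine identical digits"

# Caveat 2: the fourth (\d) is group 4, so its backreference must be \4.
wrong_ids = re.compile(r"^(\d)\1{2}-(\d)\1{1}-(\d)\1{3}$|^(\d)\1{8}$")
right_ids = re.compile(r"^(\d)\1{2}-(\d)\1{1}-(\d)\1{3}$|^(\d)\4{8}$")

for name, rx in [("outside", outside), ("inside", inside)]:
    print(name, bool(rx.search("123456789")), bool(rx.search("111111111")))

for name, rx in [("wrong_ids", wrong_ids), ("right_ids", right_ids)]:
    print(name, bool(rx.search("111-11-1111")), bool(rx.search("111111111")))
```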
So, you want to match a string that matches ^(?!000|666|9\d\d)\d{3}-(?!00)\d\d-(?!0000)\d\d\d\d$ first (let's note \d\d\d\d = \d{4}), then you can add restrictions with lookaheads:
(?!(\d)\1{8}$) - fails the match if, immediately from the current position, it matches identical 9 digits and then the string end comes
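(?!(\d)\2\2-(\d)\2-(\d)\2{3}$) - (note the continued ID incrementing: this (\d) is group 2) fails the match if, immediately from the current position, it matches a digit repeated 3 times, -, any digit followed by the first digit again, -, any digit followed by the first digit 3 more times, and then the string end comes. Because the second and third (\d) accept any digit, this also rejects strings such as 111-21-3111; a stricter check for nine identical digits would be (?!(\d)\2\2-\2\2-\2{4}$).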
So, to follow your logic, you can use
^(?!(\d)\1{8}$)(?!(\d)\2\2-(\d)\2-(\d)\2{3}$)(?!000|666|9\d\d)\d{3}-(?!00)\d\d-(?!0000)\d{4}$
See the regex demo
As lookaheads are non-consuming patterns, i.e. the regex index stays where it was after their pattern has been tried, all 3 lookaheads are checked at the start of the string (see the ^ anchor). If the pattern inside any of the three negative lookaheads matches there, the whole match fails right away.
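As a sanity check, the combined pattern can be exercised from Python. This is only a sketch: the function name is_valid_ssn and the sample values are assumptions made for illustration, and the pattern is copied unchanged from above.

```python
import re

SSN = re.compile(
    r"^(?!(\d)\1{8}$)"                 # lookahead 1: nine identical digits, no dashes
    r"(?!(\d)\2\2-(\d)\2-(\d)\2{3}$)"  # lookahead 2: dashed form built on the first digit
    r"(?!000|666|9\d\d)\d{3}-(?!00)\d\d-(?!0000)\d{4}$"
)


def is_valid_ssn(text):
    """True if text passes all lookaheads and has the xxx-xx-xxxx shape."""
    return SSN.match(text) is not None


for sample in ["123-45-6789", "111-11-1111", "666-12-3456",
               "123-00-6789", "123-45-0000", "123456789"]:
    print(sample, is_valid_ssn(sample))
```

Only the first sample is accepted; the undashed one fails because the main pattern itself requires the dashes.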
With this regex you match what you don't want as a social security number:
^(?:(\d)\1{8}|(\d)\2{2}-\2{2}-\2{4})$
Demo
With this regex you match only what you want:
^(?!(?:(\d)\1{8}|(\d)\2{2}-\2{2}-\2{4})$).*$
Demo
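One thing worth checking with patterns like these is how the anchors interact with the alternation: | binds more loosely than ^ and $, so ^A|B$ means "starts with A, or ends with B" unless both alternatives share one group. A small Python sketch of the difference follows; the variable names are made up for illustration.

```python
import re

# Ungrouped: "^" belongs to the first alternative only, "$" to the second only.
ungrouped = re.compile(r"^(?:(\d)\1{8})|(?:(\d)\2{2}-\2{2}-\2{4})$")
# Grouped: both anchors apply to both alternatives.
grouped = re.compile(r"^(?:(\d)\1{8}|(\d)\2{2}-\2{2}-\2{4})$")

for sample in ["111111111", "111-11-1111", "1111111112"]:
    print(sample, bool(ungrouped.search(sample)), bool(grouped.search(sample)))
```

The last sample has ten digits, yet the ungrouped form still reports a match because it only checks the first nine characters.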

Regex search string contains with specific count of letter

I am trying to work out a regex that matches strings containing a specific count of each letter, where the groups of letters can appear in any order,
such as:
AAABBBCCCDDD
BBBAAADDDCCC
CCCAAABBBDDD
which should all be TRUE.
So far I have A{3}B{3}C{3}D{3}, which matches the first line, but the other lines need a different order.
Is there a good solution that would handle all of them?
You can match and capture a letter, then backreference that captured character. Repeat the whole thing as many times as needed, which looks to be 4 here:
(?:([A-Z])\1{2}){4}
https://regex101.com/r/vrQVgD/1
If the same character can't appear as a sequence more than once, I don't think this can be done in such a DRY manner, you'll need separate capture groups:
([A-Z])\1{2}(?!\1)([A-Z])\2{2}(?!\1|\2)([A-Z])\3{2}(?!\1|\2|\3)([A-Z])\4{2}
https://regex101.com/r/vrQVgD/2
which is essentially 4 of a variation on the below put together:
(?!\1|\2|\3)([A-Z])\4{2}
The (?!\1|\2|\3) checks that the next character hasn't occurred in any of the previously matched capture groups.
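The difference between the two patterns shows up on input where a letter repeats in a later triple. A minimal Python sketch, using re.fullmatch so the whole string has to fit; the sample strings beyond the ones in the question are made up for illustration.

```python
import re

ANY_TRIPLES = re.compile(r"(?:([A-Z])\1{2}){4}")
DISTINCT_TRIPLES = re.compile(
    r"([A-Z])\1{2}"
    r"(?!\1)([A-Z])\2{2}"
    r"(?!\1|\2)([A-Z])\3{2}"
    r"(?!\1|\2|\3)([A-Z])\4{2}"
)

samples = ["AAABBBCCCDDD", "BBBAAADDDCCC", "AAABBBAAACCC", "AAABBBCCCDD"]
results = {
    s: (bool(ANY_TRIPLES.fullmatch(s)), bool(DISTINCT_TRIPLES.fullmatch(s)))
    for s in samples
}
for s, flags in results.items():
    print(s, flags)
```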

Email-similar regex catastrophic backtracking

I'd like to match something which may be called the beginning of the e-mail, ie.
1 character (whichever letter from alphabet and digits)
0 or 1 dot
1 or more character
The repetition of {2nd and 3rd point} zero or more times
# character
The regex I've been trying to apply on Regex101 is \w(\.?\w+)*#.
I am getting the error Catastrophic backtracking. What am I doing wrong? Is the regex correct?
Catastrophic backtracking usually appears with nested quantifiers when the inner group contains at least one optional subpattern, so that the quantified group can match the same text as the subpattern before it, and the outer group is not at the end of the pattern.
Your regex causes the issue precisely because (\.?\w+)* is not at the end and contains an optional \.?, so for input without dots the expression effectively reduces to \w(\w+)*#.
The pattern should match, for example, aaa.aaaaaa.a.aa.aa but not aaa..aaaa.a
What you need is
^\w+(?:\.\w+)*#
See the regex demo
^ - start of string (to avoid partial matches)
\w+ - 1 or more word chars
(?:\.\w+)* - zero or more sequences of:
\. - a literal dot
\w+ - 1 or more word chars
# - a literal # char.
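The breakdown above can be tried from Python with a short sketch. The helper name email_start and the text after the # in the first sample are assumptions made for illustration.

```python
import re

EMAIL_START = re.compile(r"^\w+(?:\.\w+)*#")


def email_start(text):
    """Return the matched prefix up to and including '#', or None."""
    m = EMAIL_START.match(text)
    return m.group(0) if m else None


print(email_start("aaa.aaaaaa.a.aa.aa#example"))  # single dots between word runs
print(email_start("aaa..aaaa.a#"))                # two adjacent dots
print(email_start(".abc#"))                       # leading dot
```

Because the dot is now mandatory inside the repeated group, each stretch of the input can be matched in only one way, which is what removes the ambiguity.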
The problem
"Catastrophic backtracking" occurs when a part of the string could match a part of the regex in many different ways, so the engine has to retry repeatedly to determine whether or not the string actually matches. A simple case: the regex a+a+b, which matches two or more a followed by one b. If you run that on aaaaaaaaaaa, the problem arises: first, the first a+ matches everything, and the match fails at the second a+. Then the engine tries the first a+ matching all but one a, with the second a+ matching one a (this is "backtracking"), and it fails on the b. But regex engines aren't "smart" enough to know they could stop there, so they keep going in that pattern until they have tried every way of splitting the as between the first and the second a+. Some regex engines will notice they're getting stuck like this and quit after enough steps, with the error you saw.
For your specific pattern: what you have there matches any nonzero quantity of letters or digits, mixed with any quantity of . where the . cannot be first, followed by an #. The only additional limit is that there can't be two adjacent dots. Effectively, this is the same case as my example: The * applied to a section containing a + acts like multiple duplicates of that +-ed section.
Atomic grouping
You could try something with atomic grouping. That basically says "once you've found any match for this, don't backtrack into it". After all, if you've found some run of \w, it's not going to contain a \. and there's no need to keep rechecking that: dots are not letters or digits, and neither of those is a #.
In this case, the result would be \w(?>\.?\w+)*#. Note that not all regex engines support atomic grouping, though the one you linked does. If the whole string is a match, nothing will change; if it's not a match, or contains non-matching parts, the process will take fewer steps. Using #eddiem's example from the comments, it finds two matches in 166311 steps with your original, but only takes 623 steps with atomic grouping added.
Possessive quantifiers
Another option would be a possessive quantifier: \w(\.?\w+)*+# means roughly the same thing. *+, specifically, is "whatever the star matches, don't backtrack inside it". In the above case, it matches in 558 steps, but the meaning is slightly different, in that it treats all the repeats together as one atomic value, instead of as several distinct atomic values. I don't think there's a difference in this case, but there might be in some cases. Again, not supported by all regex engines.
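As an example of the "not supported by all engines" point: Python's re module accepts (?>...) and *+ only from version 3.11 on. The sketch below tries each variant and falls back to a plain rewrite when compilation fails. The helper compile_first_supported and the fallback pattern are assumptions made for illustration, not part of any library.

```python
import re


def compile_first_supported(*candidates):
    """Compile the first pattern that this Python version accepts."""
    for candidate in candidates:
        try:
            return re.compile(candidate)
        except re.error:
            continue  # e.g. "(?>" or "*+" is rejected before Python 3.11
    raise ValueError("no candidate pattern could be compiled")


SAFE = r"\w+(?:\.\w+)*#"  # plain rewrite with a mandatory dot in the group
atomic = compile_first_supported(r"\w(?>\.?\w+)*#", SAFE)
possessive = compile_first_supported(r"\w(\.?\w+)*+#", SAFE)

hard = "a" * 40  # no '#', the kind of input that hurts the original pattern
print(atomic.search(hard), possessive.search(hard))
print(atomic.search("john.doe#").group(0))
```

Whichever variant ends up compiled, the non-matching input is rejected quickly because the engine can no longer redistribute characters between repetitions.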

Regex how to match two similar numbers in separate match groups?

I got the following string:
[13:49:38 INFO]: Overall : Mean tick time: 4.126 ms. Mean TPS: 20.000
The two numbers (4.126 and 20.000) should be matched, each into its own capture group.
My current expression is (\d+.\d{3}), which matches 4.126. How can I match 20.000 into a second capture group? Adding the same capture group again makes it find nothing. So what I basically need is "search for the first number, then ignore everything until you find the next digit."
You could use something like this: (\d+\.\d{3}).+?(\d+\.\d{3})$ which is essentially your regex (plus a minor fix) twice, with the second copy anchored to the end of the string.
Another minor note: your regex has a potential issue in that it matches the decimal point with an unescaped period. In regular expressions the period means "any character", so your expression would also match 4s222. Adding a \ in front makes the regex engine treat it as a literal period rather than a special character.
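A quick check in Python, assuming the log entry is a single line as the $ anchor suggests. The variable names are made up for illustration.

```python
import re

line = "[13:49:38 INFO]: Overall : Mean tick time: 4.126 ms. Mean TPS: 20.000"

# Escaped dots, a lazy gap, and the second number anchored at the end.
two_numbers = re.compile(r"(\d+\.\d{3}).+?(\d+\.\d{3})$")
match = two_numbers.search(line)
tick_time, tps = match.groups()
print(tick_time, tps)

# The unescaped dot from the question also accepts a non-number:
loose = re.search(r"\d+.\d{3}", "4s222")
strict = re.search(r"\d+\.\d{3}", "4s222")
print(bool(loose), bool(strict))
```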

eclipse - regex: replace multiple group

someText
1
2
3
4
moreText
I would like to add a prefix before each digit,
but using (\w+\R)(\d+\R)+(\w+) and \1prefix\2\3 only prefixes the last digit and erases the others.
Is there a way to do it with a single regex, or should I write a script on the side?
The problem with your regex is the quantifier on (\d+\R)+, specifically the last +. It reads "match this group as many times as you can, as long as the overall match still succeeds". So for your text it gobbles up 1, 2, 3, and 4 until it can't gobble any more, and only the last repetition (4) is kept in the second capture group. It's in the nature of regex engines that a repeated group can't expose a variable number of captures; how would you address them anyway? So the short answer, I think, is that a single regex is the wrong tool for a fully automated process and you'll have to write a script.
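The "only the last repetition is kept" behavior is easy to observe in Python. This sketch assumes \n line endings and uses \n in place of \R, which Python's re module does not support.

```python
import re

text = "someText\n1\n2\n3\n4\nmoreText"

# The capture group sits inside the repetition, so it is overwritten each time.
pattern = re.compile(r"(\w+\n)(\d+\n)+(\w+)")
last_repetition = pattern.search(text).group(2)
replaced = pattern.sub(r"\1prefix\2\3", text)

print(repr(last_repetition))
print(repr(replaced))
```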
However, for a slightly less automated process that still incorporates your surrounding text, you could try
find: (\w+\R)((?:\d+\R)+)(\w+)
replace: \1prefix\2\3
We wrap the second group plus its greedy quantifier in an extra set of capturing parens and turn the inner group into a non-capturing one. Now we have the full run of digits in its own group and can add the prefix to the first one. The interesting side effect is that the prefixed first number then matches the first group (\w+\R), so if you run the find/replace again it hits the next number in the list, and so on until the pattern no longer matches.
This way, you should be able to run through your files hitting only the areas where you want to add this prefix, and it shouldn't take nearly as long as finding every digit in every file.
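The repeated find/replace can be simulated with a short Python loop. This is a sketch under two assumptions: \n line endings, with \n standing in for \R (which Python's re module lacks), and a made-up helper name prefix_all.

```python
import re

PATTERN = re.compile(r"(\w+\n)((?:\d+\n)+)(\w+)")


def prefix_all(text, prefix="prefix"):
    """Apply the find/replace repeatedly until the text stops changing."""
    passes = 0
    while True:
        updated = PATTERN.sub(r"\g<1>" + prefix + r"\g<2>\g<3>", text)
        if updated == text:
            return text, passes
        text, passes = updated, passes + 1


result, passes = prefix_all("someText\n1\n2\n3\n4\nmoreText")
print(result)
print(passes)
```

Each pass prefixes exactly one more number, so a block of four digits needs four passes before the pattern stops matching.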