Regex - Remove the final character - regex

I have the following Regex
(?:(?:zero|one|two|three|four|five|six|seven|eight|nine|[0-9])\s*){4,}
As you can see, it matches numbers with whitespace.
Question
How do I stop it from matching the final whitespace character?
For example:
1 2 3 4 5<whitespace>
should rather be:
1 2 3 4 5

The way you wrote the regex, trailing whitespace will always be part of the match, because \s* is consumed after every number, including the last one. You need to rewrite the pattern so that it matches one number first and then repeats a group made of whitespace followed by a number, with the quantifier's minimum lowered by one to compensate.
Schematically, it looks like
<NUMPATTERN>(?:\s+<NUMPATTERN>){3,}
See the regex demo.
In PCRE and Ruby, you may reuse a capture group's pattern with a subroutine call such as \g<1> (PCRE also accepts (?1)), which shortens the pattern:
(zero|one|two|three|four|five|six|seven|eight|nine|[0-9])(?:\s+\g<1>){3,}
See the regex demo
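To make the difference between the two shapes concrete, here is a minimal Python sketch. It assumes Python's standard re module, which has no \g<1> subroutine call, so the number pattern is kept in a variable (NUM is a name picked for this illustration) and interpolated twice.

```python
import re

# One number: a spelled-out digit word or a single digit (taken from the question).
NUM = r"(?:zero|one|two|three|four|five|six|seven|eight|nine|[0-9])"

# Original shape: whitespace is consumed after every number, including the last.
original = re.compile(rf"(?:{NUM}\s*){{4,}}")
# Rewritten shape: whitespace only ever appears between two numbers.
rewritten = re.compile(rf"{NUM}(?:\s+{NUM}){{3,}}")

text = "1 2 3 4 5 "
print(repr(original.search(text).group()))   # '1 2 3 4 5 '
print(repr(rewritten.search(text).group()))  # '1 2 3 4 5'
```

One behavioral difference to be aware of: with \s+ the numbers must be separated by at least one whitespace character, so a run like 1234 no longer matches. Using \s* inside the repeated group instead keeps that case while still excluding the trailing whitespace.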

Related

Regular Expression: Find a specific group within other groups in VB.Net

I need to write a regular expression that has to replace everything except for a single group.
E.g.
IN: OK THT PHP This is it 06222021
OUT: This is it
IN: NO MTM PYT Get this content 111111
OUT: Get this content
I wrote the following Regular Expression: (\w{0,2}\s\w{0,3}\s\w{0,3}\s)(.*?)(\s\d{6}(\s|))
This RegEx creates 4 groups. Using the first entry as an example, the groups are:
OK THT PHP
This is it
06222021
Space character
I need a way to:
Replace Groups 1, 3 and 4 with String.Empty
OR
Get Group 2 ONLY
You don't need 4 groups. A single capture group is enough: reference it as $1 in the replacement, and match 6-8 digits for the last part instead of exactly 6 (the first sample ends in 8 digits).
Note that \w{0,2} will also match an empty string; you can use \w{1,2} if there has to be at least a single word char.
^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$
^ Start of string
\w{0,2}\s\w{0,3}\s\w{0,3}\s Match three runs of word characters (up to 2, 3 and 3 chars), each followed by a whitespace char
(.*?) Capture group 1, match any char as few times as possible
\s\d{6,8} Match a whitespace char and 6-8 digits
\s? Match an optional whitespace char
$ End of string
Regex demo
Example code
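Imports System.Text.RegularExpressions

' Replace the whole match with capture group 1, the text between the codes and the digits.
Dim s As String = "OK THT PHP This is it 06222021"
Dim result As String = Regex.Replace(s, "^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$", "$1")
Console.WriteLine(result)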
Output
This is it
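For comparison, the same single-group replacement can be sketched in Python and run against both sample rows from the question. The pattern is the one from this answer; extract is a hypothetical helper name, and Python writes the group reference as \1 rather than $1.

```python
import re

# Pattern from the answer: three short codes, the wanted text, then 6-8 digits.
PATTERN = re.compile(r"^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$")

def extract(line):
    # Replace the whole match with capture group 1 only.
    return PATTERN.sub(r"\1", line)

print(extract("OK THT PHP This is it 06222021"))      # This is it
print(extract("NO MTM PYT Get this content 111111"))  # Get this content
```

A line that does not fit the pattern comes back unchanged, since there is nothing to replace.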
My approach does not work with groups and does not use a Replace operation. The match itself yields the desired result.
It uses look-around expressions. To find a pattern between two other patterns, you can use the general form
(?<=prefix)find(?=suffix)
This will only return find as the match, excluding prefix and suffix.
If we insert your expressions, we get
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6}\s?)
where I simplified (\s|) as \s?. We can also drop it completely, since we don't care about trailing spaces.
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6})
Note that this also works if we have more than 6 digits, because the lookahead is satisfied as soon as it has found 6 digits and doesn't care about what follows.
This also gives a match if other things precede our pattern like in 123 OK THT PHP This is it 06222021. We can exclude such results by specifying that the search must start at the beginning of the string with ^.
If the exact length of the words and numbers does not matter, we simply write
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+)
If the find part can contain numbers, we must specify that we want to match until the end of the line with $ (and include a possible space again).
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+\s?$)
Finally, we use a quantifier for the 3 occurrences of word-space:
(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)
This is compact and will only return This is it or Get this content.
string result = Regex.Match(input, @"(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)").Value; // input: the string to search
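If you want to try the (?<=prefix)find(?=suffix) form outside .NET, keep in mind that not every engine accepts a variable-length look-behind. As a rough sketch, Python's standard re module only allows fixed-width look-behinds, so the version below assumes the three leading codes are exactly 2, 3 and 3 characters long (true for both samples in the question, but an assumption). can_compile is a hypothetical helper that only reports whether a pattern is accepted.

```python
import re

# Fixed-width prefix (2, 3 and 3 word chars); Python's re needs a fixed-width look-behind.
FIXED = re.compile(r"(?<=^\w{2}\s\w{3}\s\w{3}\s).*?(?=\s\d+\s?$)")

def can_compile(pattern):
    # True if Python's re accepts the pattern, False if it raises re.error.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(FIXED.search("OK THT PHP This is it 06222021").group())  # This is it
# The compact .NET pattern uses \w+ inside the look-behind, which re rejects.
print(can_compile(r"(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)"))  # False
```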

Regex to extract values from look behind groups along with subsequent repetitions

In a Java program, I need to match a text input against a regular expression pattern. Simplified, the text input looks like this: "100|200|123,124,125".
Matching should produce three matches, each returning the two fixed subgroups (100 and 200) together with one value of the variable repeating subgroup (123, 124 or 125).
Match 1 - 123
Match 2 - 124
Match 3 - 125.
Each of these matches should also include 100 and 200 in two separate groups.
So basically, matches will target extracting patterns such as '100|200|123', '100|200|124', '100|200|125'.
I have used this regex: (?<=(?:(?<first>\d+)\|(?<second>\d+)\|)|,)(?<vardata>\d+)(?=,|$).
But I get this error: + A quantifier inside a look-behind makes it non-fixed width
As stated in the comments, Java regex does not allow unbounded quantifiers such as + inside a lookbehind.
However you can use this regex based on \G:
(?:(\d+)\|(\d+)\||(?<!^)\G,)(\d+)
RegEx Demo
RegEx Details:
\G asserts position at the end of the previous match or the start of the string for the first match.
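Looping over the matches gives you the comma-separated numbers in group(3), while group(1) and group(2) give you the first 2 numbers from the input string. Note that group(1) and group(2) are only populated on the first match; on later matches they are null, so save them on the first iteration.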
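To show the output the question is after, here is a small Python sketch. It is not the \G technique: Python's standard re module has no \G, so this swaps in a plain two-step approach (match the overall shape once, then split the list). SHAPE and expand are names made up for this illustration.

```python
import re

# Hypothetical shape: two fixed numbers, then a comma-separated list of numbers.
SHAPE = re.compile(r"(\d+)\|(\d+)\|(\d+(?:,\d+)*)")

def expand(text):
    # Step 1: validate the whole input and capture the three parts.
    m = SHAPE.fullmatch(text)
    if m is None:
        return []
    first, second, rest = m.groups()
    # Step 2: pair the two fixed numbers with every value in the list.
    return [(first, second, value) for value in rest.split(",")]

print(expand("100|200|123,124,125"))
# [('100', '200', '123'), ('100', '200', '124'), ('100', '200', '125')]
```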

Elastic search regex to get last 7 digits from right

I have data indexed in this format: 676767 2343423 2344444 32494444. I need a regular expression for a pattern analyzer that extracts the last 7 digits from the right. Example output: 2494444. We tried the pattern [0-9]{7}, which is not working.
In ElasticSearch, the pattern is anchored by default. That means, you cannot rely on partial matches, you need to match the entire string and capture the last consecutive 7 digits.
Use
.*([0-9]{7})
where
.* - will match any 0+ chars other than newline (as many as possible) and then will backtrack to match...
([0-9]{7}) - 7 digits placed into Capture group 1.
The Sense plug-in returns the captured value if a capturing group is defined in the regular expression pattern, so, no additional extraction work (or group accessing work) needs to be done.
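The effect of anchoring can be reproduced outside ElasticSearch. The sketch below is Python, not ElasticSearch itself: re.fullmatch is used as a stand-in for an anchored pattern, on the assumption that "anchored" means the whole value has to match.

```python
import re

value = "676767 2343423 2344444 32494444"

# fullmatch stands in for an anchored pattern: the whole string must match.
print(re.fullmatch(r"[0-9]{7}", value))  # None

# .* swallows everything, then backtracks just enough to leave 7 trailing digits.
m = re.fullmatch(r".*([0-9]{7})", value)
last_seven = m.group(1)
print(last_seven)  # 2494444
```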

regex: match number sequences without matching previous matches

I am looking through number sequences of 3 comma-delimited values and want to search for any sequence of 1,2,3. I want to match 1,2,3; 3,2,1; 2,1,3; etc. I do NOT want to match 1,1,1; 1,2,2; 1,3,3; 3,3,1; 2,3,3; using regexr.com for my regex parsing.
[123],[123],[123]
is what I started with, until I realized it matches any of the three characters in each position rather than a sequence of distinct characters.
I was researching positive/negative lookaheads but could not think of how to structure it logically so the regex would not match a previously matched number in the specified sequence.
What fundamental thing am I missing here?
You can use a lookahead and back-reference based regex:
([123]),((?!\1)[123]),((?!\1|\2)[123])
RegEx Demo
RegEx Breakup:
([123]) # match 1 or 2 or 3 and capture it in group #1
, # match literal comma
((?!\1)[123]) # match 1 or 2 or 3 if it is NOT same as group #1 & capture it in group #2
, # match literal comma
((?!\1|\2)[123]) # match 1 or 2 or 3 if it is NOT same as group #1 OR #2
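A quick way to check the pattern against the sequences from the question is a small Python sketch like the one below. It assumes each candidate is tested as a whole string, hence fullmatch; is_permutation is a hypothetical helper name.

```python
import re

# Pattern from the answer: each later position must differ from the earlier groups.
DISTINCT = re.compile(r"([123]),((?!\1)[123]),((?!\1|\2)[123])")

def is_permutation(seq):
    # True only when the three values are 1, 2 and 3 in some order.
    return DISTINCT.fullmatch(seq) is not None

for seq in ["1,2,3", "3,2,1", "2,1,3", "1,1,1", "1,2,2", "3,3,1"]:
    print(seq, is_permutation(seq))
```

The first three sequences should print True and the last three False.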
Answer #1 is @anubhava's solution; it correctly matches any sequence as long as all 3 integers are unique. However, when the sequence to search for contains a repeated integer, for example 1,2,2, you can use the following regex. Can't believe I made it this hard :P
((1,2,2)|(2,1,2)|(2,2,1))
I realised that in a situation of 2 repeated integers, only 3 possible matches are available. So, instead of trying to build a complex lookbehind/lookahead regex we can simply search for those three occurrences literally. Using the capture groups should tag what it matched.
Obviously, in a sequence of 3 repeated integers such as 3,3,3 there is only one possible match so you search for it literally.
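If you don't want to type the alternatives by hand, the literal alternation can be generated. This is only a sketch in Python: literal_alternation is a made-up helper, and it relies on itertools.permutations plus a set to drop the duplicate orderings that repeated values produce.

```python
import re
from itertools import permutations

def literal_alternation(values):
    # Distinct orderings only: a repeated value would otherwise yield duplicates.
    orderings = sorted(set(permutations(values)))
    return "|".join(",".join(re.escape(v) for v in o) for o in orderings)

pattern = literal_alternation(["1", "2", "2"])
print(pattern)  # 1,2,2|2,1,2|2,2,1
print(literal_alternation(["3", "3", "3"]))  # 3,3,3
```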

Limit number of character of capturing group

Let's say I have this text: "AAAA1 AAA11 AA111AA A1111 AAAAA AAAA1111".
I want to find all occurrences matching these 3 criteria :
-Capital letter 1 to 4 times
-Digit 1 to 4 times
-Max number of characters to be 5
so the matches would be :
{"AAAA1", "AAA11", "AA111", "A1111", "AAAA1"}
I tried
([A-Z]{1,4}[0-9]{1,4}){5}
but I knew it would fail, since it looks for my group five times.
Is there a way to limit result of the groups to 5 characters?
Thanks
You can limit the character count with a lookahead while checking the pattern with your matching part.
If you can split the input by whitespace you can use:
^(?=.{2,5}$)[A-Z]{1,4}[0-9]{1,4}$
See demo here.
If you cannot split by whitespace you can use capturing group with (?:^| )(?=.{2,5}(?=$| ))([A-Z]{1,4}[0-9]{1,4})(?=$| ) for example, or lookbehind or \K to do the split depending on your regex flavor (see demo).
PREVIOUS ANSWER, wrongly matches A1A1A, updated after @a_guest's remark.
You can use a lookahead to check for your pattern, while limiting the character count with the matching part of the regex:
(?=[A-Z]{1,4}[0-9]{1,4}).{2,5}
See demo here.
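To see how the two versions behave on the sample text, here is a short Python sketch. It uses the split-by-whitespace pattern for the updated answer and findall for the previous one; the variable names are just for illustration.

```python
import re

text = "AAAA1 AAA11 AA111AA A1111 AAAAA AAAA1111"

# Updated answer: test each whitespace-separated token as a whole.
TOKEN = re.compile(r"^(?=.{2,5}$)[A-Z]{1,4}[0-9]{1,4}$")
whole_tokens = [t for t in text.split() if TOKEN.match(t)]
print(whole_tokens)  # ['AAAA1', 'AAA11', 'A1111']

# Previous answer: the look-ahead checks the shape, .{2,5} does the matching.
PREVIOUS = re.compile(r"(?=[A-Z]{1,4}[0-9]{1,4}).{2,5}")
print(PREVIOUS.findall(text))     # ['AAAA1', 'AAA11', 'AA111', 'A1111', 'AAAA1']
print(PREVIOUS.findall("A1A1A"))  # ['A1A1A']
```

The updated pattern only accepts complete tokens, so it skips AA111AA and AAAA1111 entirely. The previous pattern returns the five partial matches listed in the question, but it also accepts A1A1A, which is the flaw the update addresses.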