How to capture recursive groups in a regex? - regex

I am trying to capture a pattern that can appear multiple times in a regex, with each occurrence in its own group. The pattern that can appear multiple times is:
(\b\\d{4}\\s*\\d{4}\\s*\\d{4}\\s*\\d{4}\b\\s*)
Please see the complete test here.
The expected output should be :
Full Match:
Group1:1111 1111 1111 1111
Group2:2222 2222 2222 2222
... GroupN...
How can this be achieved?

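If I understand the problem correctly, we want to match four digits plus a whitespace character, repeated three times, followed by another four digits. Note that the backreference \1 matches the exact text captured by group 1, so this only works when the same four digits repeat, as in 1111 1111 1111 1111. We can start with a simple expression such as: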
(\d{4}\s)\1\1(\d{4}\s?)
Demo 1
Or, to match the same four-digit group four times with the same whitespace between the groups three times, we could start with this expression:
(\d{4})(\s+)\1\2\1\2\1
Demo 2
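The difference between a backreference and a repeated sub-pattern is easy to miss, so here is a minimal Python sketch of it. The sample strings and the second pattern (`repeated`) are illustrative assumptions, not part of the original answer.

```python
import re

# \1 repeats the exact text captured by group 1, not the sub-pattern \d{4}\s.
backref = re.compile(r"(\d{4}\s)\1\1(\d{4}\s?)")
# Repeating the sub-pattern itself accepts four different digit groups.
repeated = re.compile(r"(?:\d{4}\s){3}\d{4}")

same = "1111 1111 1111 1111"
mixed = "1111 2222 3333 4444"

backref_same = backref.search(same) is not None
backref_mixed = backref.search(mixed) is not None
repeated_mixed = repeated.search(mixed) is not None
print(backref_same, backref_mixed, repeated_mixed)
```

With these inputs the backreference version matches only the string whose four groups are identical, while the repeated sub-pattern also matches the mixed one.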

Use:
(?:<select\b|\G).*?(\b\d{4}(?:\s*\d{4}){3}\b)(?=.*?</select>)
Demo
Explanation:
(?: # non-capture group
<select\b # literal "<select" followed by a word boundary
| # OR
\G # continue from the end of the previous match
) # end group
.*? # 0 or more of any character, non-greedy; you may use [\s\S]*? to also cross line breaks
( # start group 1
\b # word boundary
\d{4} # 4 digits
(?: # non-capture group
\s* # 0 or more whitespace characters
\d{4} # 4 digits
){3} # end group, repeated exactly 3 times
\b # word boundary
) # end group 1
(?= # lookahead, make sure that what follows is:
.*? # 0 or more of any character, non-greedy
</select> # end tag
) # end lookahead
Sample code (php):
preg_match_all('~(?:<select\b|\G).*?(\b\d{4}(?:\s*\d{4}){3}\b)(?=.*?</select>)~', $html, $matches);
print_r($matches[1]);
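For comparison, Python's built-in re module has no \G anchor, so the sketch below swaps in a different, two-step technique: first isolate each <select> block, then collect every 16-digit group inside it. The sample HTML and the variable names are assumptions made for illustration; the asker's real input is not shown in the thread.

```python
import re

# Hypothetical sample input; the asker's real HTML is not shown in the thread.
html = (
    "<p>9999 9999 9999 9999</p>"
    "<select><option>1111 1111 1111 1111</option>"
    "<option>2222 2222 2222 2222</option></select>"
)

# Step 1: isolate each <select>...</select> block (DOTALL lets . span newlines).
select_block = re.compile(r"<select\b.*?</select>", re.DOTALL)
# Step 2: the same sub-pattern that group 1 of the answer captures.
digit_groups = re.compile(r"\b\d{4}(?:\s*\d{4}){3}\b")

numbers = [
    number
    for block in select_block.findall(html)
    for number in digit_groups.findall(block)
]
print(numbers)
```

This prints the two numbers inside the <select> element and skips the one in the <p> element, which is the same filtering the \G pattern aims for.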

Related

Regex allow only one dash or only one space

I want an expression that allows numbers and one dash OR numbers and one space. The space or dash is optional.
I tried this
/^([0-9]+(-[0-9]+)?)|([0-9]+(\s[0-9]+)?)$/
Inputs that should be accepted:
11-222
444 99
You can put the OR in the middle of your expression: ^([0-9]+)(\s|-)([0-9]+)$ works with your examples in Notepad++.
Let's explain your regex.
^ # beginning of line
( # start group 1
[0-9]+ # 1 or more digits
( # start group 2
- # a hyphen
[0-9]+ # 1 or more digits
)? # end group 2, optional
) # end group 1
| # OR
( # start group 3
[0-9]+ # 1 or more digits
( # start group 4
\s # a space
[0-9]+ # 1 or more digits
)? # end group 4, optional
) # end group 3
$ # end of line
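The alternation (|) has the lowest precedence, so it splits the whole pattern in two: ^ followed by group 1, or group 3 followed by $. The first alternative is anchored only at the beginning of the line and the second only at the end. But you want both alternatives anchored at the beginning and at the end.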
Add a group over group 1 and 3:
^(([0-9]+(-[0-9]+)?)|([0-9]+(\s[0-9]+)?))$
You can use non-capturing groups (more efficient) instead of capturing groups:
^(?:(?:[0-9]+(?:-[0-9]+)?)|(?:[0-9]+(?:\s[0-9]+)?))$
Combine the hyphen and the space in a character class and remove the superfluous groups:
^[0-9]+(?:[-\s][0-9]+)?$
If your regex flavour supports it, change the [0-9] into \d. Finally your regex becomes:
^\d+(?:[-\s]\d+)?$
Much simpler, no?
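The anchoring problem can be checked with a short Python sketch. The extra sample strings beyond 11-222 and 444 99 are assumptions added to show where the two patterns differ.

```python
import re

# The asker's pattern: ^ binds only to the left branch, $ only to the right.
original = re.compile(r"^([0-9]+(-[0-9]+)?)|([0-9]+(\s[0-9]+)?)$")
# The simplified pattern from the end of the answer.
fixed = re.compile(r"^\d+(?:[-\s]\d+)?$")

samples = ["11-222", "444 99", "123", "11-222abc", "abc 444 99", "1-2-3"]

original_hits = [s for s in samples if original.search(s)]
fixed_hits = [s for s in samples if fixed.search(s)]
print(original_hits)
print(fixed_hits)
```

The original pattern finds a match in all six strings, because each branch is anchored on one side only. The fixed pattern accepts just 11-222, 444 99 and 123.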

Using regex on a file to pull data out. Having issues with multi-line

I am looking to get to the next line of data within a text file. Here is an example of data from the file I am working with.
0519 ABF 244 AN A1 ADV STUFF 1.0 2.0 Somestuff 018 0155 MTWTh 10:30A 11:30A 20 20 0 6.7
Somestuff 011 0145 MTWTh 12:30P 1:30P
I have been trying to move to the next line in a variety of ways: matching the line break with \n, using \s+ to consume the large gap after 6.7, and adding the m modifier (//m). None of these has produced a result yet.
Here is some example code
until regex_file.eof?
  line = regex_file.gets.chomp
  # A regex literal needs its /.../ delimiters
  if line =~ /^.*?\d{4}\s+[A-Z]+\s+\d{3}.+$/
    puts line
  end
end
Using https://rubular.com/ this particular set of code matches my desired output for the first line
0519 ABF 244 AN A1 ADV STUFF 1.0 2.0 Somestuff 018 0155 MTWTh 10:30A 11:30A 20 20 0 6.7
but it does not match the next line, and I haven't figured out how to match it.
Somestuff 011 0145 MTWTh 12:30P 1:30P
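Try something like this. The \n matches the newline at the end of the first line, and you can add your own rules after it for whatever should follow on the next line. Note that the regex then has to be applied to text that contains both lines, not to a single line returned by gets: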
^.*\d{4}\s+[A-Z]+\s+\d{3}.+\n.*$
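As a rough check of this idea, the Python sketch below applies the pattern to both sample lines joined by a newline, which stands in for reading the whole file at once. Using Python here rather than the asker's Ruby is a choice made for illustration.

```python
import re

# Sample data taken from the question; joining with "\n" stands in for
# reading the whole file into one string.
first = ("0519 ABF 244 AN A1 ADV STUFF 1.0 2.0 Somestuff 018 0155 "
         "MTWTh 10:30A 11:30A 20 20 0 6.7")
second = "Somestuff 011 0145 MTWTh 12:30P 1:30P"
text = first + "\n" + second

# MULTILINE makes ^ and $ match at line boundaries, as they always do in Ruby.
pattern = re.compile(r"^.*\d{4}\s+[A-Z]+\s+\d{3}.+\n.*$", re.MULTILINE)

match = pattern.search(text)
matched_lines = match.group(0).split("\n") if match else []
print(matched_lines)
```

Here the single match spans both lines, so splitting it on the newline gives back the first line and the line that follows it.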
I've made an arbitrary assumption about the requirements for matching the second line. It is more demanding than the requirements for matching the first that are reflected in your regex, but I thought the additional complexity would have some educational value for you.
Here is a regular expression (untested) for matching both lines. Note that you don't need ^.*? at the beginning of the regex, and in the part that matches the first line, .+$ adds nothing, so I removed both. After all, you are matching each line separately (line) and displaying the entire line if there's a match. Also, the end-of-string anchor \z is more appropriate than the end-of-line anchor ($), though either can be used.
r = /
  (?:              # begin non-capture group
    \d{4}          # match 4 digits
    \s+            # match > 0 whitespaces
    [A-Z]+         # match > 0 uppercase letters
    \s+            # match > 0 whitespaces
    \d{3}          # match 3 digits
  |                # or
    \b             # match a (zero-width) word break
    [A-Z]          # match 1 uppercase letter
    [a-z]*         # match >= 0 lowercase letters
    \s+            # match > 0 whitespaces
    \d{3}          # match 3 digits
    \s+            # match > 0 whitespaces
    \d{4}          # match 4 digits
    \s+            # match > 0 whitespaces
    [A-Za-z]+      # match > 0 letters
    (?:            # begin non-capture group
      \s+          # match > 0 whitespaces
      (?:          # begin a non-capture group
        1[012]     # match 1 followed by 0, 1 or 2
      |            # or
        0?\d       # match an optional 0 followed by any digit
      )            # end non-capture group
      :            # match a colon
      [0-5][0-9]   # match 0-5 followed by 0-9
      [AP]         # match A or P, as in 12:30P
    ){2}           # end non-capture group and execute twice
  )                # end non-capture group
/x                 # free-spacing regex definition mode
This regular expression is conventionally written as follows.
r = /(?:\d{4}\s+[A-Z]+\s+\d{3}|\b[A-Z][a-z]*\s+\d{3}\s+\d{4}\s+[A-Za-z]+(?:\s+(?:1[012]|0?\d):[0-5][0-9][AP]){2})/
You might go through the file, printing matching lines, as follows:
File.foreach(fname) { |line| puts line if line.match? r }
See IO::foreach, which is a very convenient method for reading files line by line. Note that IO class methods (such as foreach) are commonly invoked with File as their receiver. That's OK, as File.superclass #=> IO, so File inherits those methods from IO.
When used without a block foreach returns an enumerator, which is often convenient as well. If, for example, you wished to return an array of matching lines (rather than puts them), you could write:
File.foreach(fname).with_object([]) do |line, arr|
  arr << line.chomp if line.match? r
end
Your current regex:
^.*?\d{4}\s+[A-Z]+\s+\d{3}.+$
matches in this order:
the beginning of the line (^)
zero or more characters non-greedy .*?
four digits (\d{4})
one or more spaces (\s+)
one or more capital letters ([A-Z]+)
one or more spaces
three digits (\d{3})
one or more characters (.+)
the end of the line ($)
The second line of your file is:
Somestuff 011 0145 MTWTh 12:30P 1:30P
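The regex starts matching at 0145 MTWT, but the h that follows is neither an uppercase letter nor whitespace, so \s+ fails and the match never reaches \d{3}.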
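The walk-through above can be reproduced with a small Python sketch (Python rather than the asker's Ruby, for illustration). The `continuation` pattern is a hypothetical rule for lines shaped like the second sample line; the real requirements for such lines are not stated in the thread.

```python
import re

lines = [
    "0519 ABF 244 AN A1 ADV STUFF 1.0 2.0 Somestuff 018 0155 "
    "MTWTh 10:30A 11:30A 20 20 0 6.7",
    "Somestuff 011 0145 MTWTh 12:30P 1:30P",
]

# The asker's regex, unchanged.
asker = re.compile(r"^.*?\d{4}\s+[A-Z]+\s+\d{3}.+$")
# Hypothetical rule for continuation lines: name, 3 digits, 4 digits, day letters.
continuation = re.compile(r"[A-Z][a-z]*\s+\d{3}\s+\d{4}\s+[A-Za-z]+")

asker_only = [asker.search(line) is not None for line in lines]
either = [
    asker.search(line) is not None or continuation.match(line) is not None
    for line in lines
]
print(asker_only, either)
```

The asker's regex alone accepts only the first line. Trying the assumed continuation rule at the start of each line as a second option accepts both.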

Regex to find match any combination of 3 terms

I need to design a regex which will match any combination of n words, without duplicates.
E.g. the regex for the words "she" "is" "happy" would match "she is happy", "happy she is" but not "she is is happy" or "she is".
Can I do this with a regex, or should I use a custom algorithm?
This matches she, is and happy in any order, but rejects a duplicated word:
^(?=(?:(?!\bshe\b).)*\bshe\b(?:(?!\bshe\b).)*$)(?=(?:(?!\bis\b).)*\bis\b(?:(?!\bis\b).)*$)(?=(?:(?!\bhappy\b).)*\bhappy\b(?:(?!\bhappy\b).)*$).*$
DEMO
Let's explain the first part (i.e. (?=(?:(?!\bshe\b).)*\bshe\b(?:(?!\bshe\b).)*$))
This makes sure we have one and only one "she" anywhere in the phrase.
(?= # start lookahead
(?: # non capture group
(?!\bshe\b) # negative lookahead, make sure we don't have "she"
. # any character
)* # end group, may appear 0 or more times
\bshe\b # literally "she" surrounded by word boundaries
(?: # non capture group
(?!\bshe\b) # negative lookahead, make sure we don't have "she"
. # any character
)* # end group, may appear 0 or more times
$ # end of string
) # end lookahead
Same explanation for the other words "is" and "happy".
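Since the question asks about n words, the same construction can be generated from a word list. The following is a minimal Python sketch; the helper name `exactly_once_each` is hypothetical and the test strings come from the question.

```python
import re

def exactly_once_each(words):
    """Build a regex requiring every word in `words` to appear exactly once."""
    lookaheads = []
    for word in words:
        w = rf"\b{re.escape(word)}\b"
        # (?:(?!w).)* consumes characters that do not start another copy of w.
        lookaheads.append(rf"(?=(?:(?!{w}).)*{w}(?:(?!{w}).)*$)")
    return re.compile("^" + "".join(lookaheads) + ".*$")

pattern = exactly_once_each(["she", "is", "happy"])
tests = ["she is happy", "happy she is", "she is is happy", "she is"]
results = {t: pattern.search(t) is not None for t in tests}
print(results)
```

The first two strings match and the last two do not. Note that the trailing .* does not restrict other words, so a phrase such as "she is very happy" would also match; a stricter rule would need an extra check.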

Get Number Grouping with Regex

I have strings that look like this:
get_a_string_A14_for_1.23.87.19_A12_and_others
get_a_string_A14_for_1.23.827.19_A12_and_others
get_a_string_A14_for_1.23.87.1_A12_and_others
get_a_string_A14_for_2.23.87.19_A12_and_others
I want to pull the numbers 1.23.87.19, 1.23.827.19, 1.23.87.1, and 2.23.87.19. The numbers will change, but this is the basic structure of the numbers.
I have tried doing:
([0-9]\.[0-9])
[0-9]\.[0-9]{1,4}
[0-9]\.[0-9]\.[0-9]{1,4}
[0-9]\.[0-9]\.[0-9]
And more, but have not had any luck. Can someone please help, and explain what I need to do to get these number groupings?
You can use this regex:
[0-9]+(?:\.[0-9]+)+
RegEx Demo
RegEx Breakup:
[0-9]+ # Match 1 or more digits
(?: # Start of non-capturing group
\. # Match a literal dot
[0-9]+ # Match 1 or more digits
)+ # End of group; match the group 1 or more times
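A quick way to confirm the pattern against the four sample strings is a short Python sketch like this one (Python is used here only for illustration):

```python
import re

dotted_number = re.compile(r"[0-9]+(?:\.[0-9]+)+")

strings = [
    "get_a_string_A14_for_1.23.87.19_A12_and_others",
    "get_a_string_A14_for_1.23.827.19_A12_and_others",
    "get_a_string_A14_for_1.23.87.1_A12_and_others",
    "get_a_string_A14_for_2.23.87.19_A12_and_others",
]

# The group is non-capturing, so findall returns whole matches.
extracted = [m for s in strings for m in dotted_number.findall(s)]
print(extracted)
```

The digits in A14 and A12 are skipped because they are not followed by a dot and more digits, so only the dotted numbers are returned.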

Regex to fail if multiple matches found

Take the following regex:
P[0-9]{6}(\s|\.|,)
This is designed to check for a 6 digit number preceded by a "P" within a string - works fine for the most part.
The problem is that we need the match to fail if more than one match is found. Is that possible?
i.e. make Text 4 in the following screenshot fail but still keep all the others failing / passing as shown:
(this RegEx is being executed in a SQL .net CLR)
If the regex engine used by this tool is indeed the .NET engine, then you can use
^(?:(?!P[0-9]{6}[\s.,]).)*P[0-9]{6}[\s.,](?:(?!P[0-9]{6}[\s.,]).)*$
If it's the native SQL engine, then you can't do it with a single regex match because those engines don't support lookaround assertions.
Explanation:
^ # Start of string
(?: # Start of group which matches...
(?!P[0-9]{6}[\s.,]) # unless it's the start of Pnnnnnn...
. # any character
)* # any number of times
P[0-9]{6}[\s.,] # Now match Pnnnnnn exactly once
(?:(?!P[0-9]{6}[\s.,]).)* # Match anything but Pnnnnnn
$ # until the end of the string
Test it live on regex101.com.
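The behaviour can also be checked with a small Python sketch. Python's re module supports the same lookahead syntax used here; the sample strings are hypothetical, since the screenshot from the question is not available.

```python
import re

exactly_one = re.compile(
    r"^(?:(?!P[0-9]{6}[\s.,]).)*"   # anything that does not start a Pnnnnnn code
    r"P[0-9]{6}[\s.,]"              # the single allowed code
    r"(?:(?!P[0-9]{6}[\s.,]).)*$"   # anything that does not start another code
)

# Hypothetical inputs for illustration.
samples = [
    "Ref P123456, approved",
    "P123456.",
    "no reference here",
    "P12345, too short",
    "P123456, then P654321.",
]

results = {s: exactly_one.search(s) is not None for s in samples}
print(results)
```

Only the first two strings match: the third has no code, the fourth has too few digits, and the last contains two codes.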
Or use this pattern:
^(?!(.*P[0-9]{6}[\s.,]){2})(.*P[0-9]{6}[\s.,].*)$
Demo
Basically, this checks that the pattern exists and that it is not repeated.
^ Start of string
(?! Negative Look-Ahead
( Capturing Group \1
. Any character except line break
* (zero or more)(greedy)
P "P"
[0-9] Character Class [0-9]
{6} (repeated {6} times)
[\s.,] Character Class [\s.,]
) End of Capturing Group \1
{2} (repeated {2} times)
) End of Negative Look-Ahead
( Capturing Group \2
. Any character except line break
* (zero or more)(greedy)
P "P"
[0-9] Character Class [0-9]
{6} (repeated {6} times)
[\s.,] Character Class [\s.,]
. Any character except line break
* (zero or more)(greedy)
) End of Capturing Group \2
$ End of string
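As a sanity check, the lookahead pattern can be compared with simply counting matches in code, which is a different technique that works even where lookarounds are unavailable. This is a Python sketch with hypothetical sample strings, not the .NET CLR setup from the question.

```python
import re

code = r"P[0-9]{6}[\s.,]"
# The lookahead-based pattern from the answer above.
at_most_once = re.compile(r"^(?!(.*P[0-9]{6}[\s.,]){2})(.*P[0-9]{6}[\s.,].*)$")

# Hypothetical inputs for illustration.
samples = [
    "Ref P123456, approved",
    "P123456.",
    "no reference here",
    "P123456, then P654321.",
]

by_regex = [at_most_once.search(s) is not None for s in samples]
by_count = [len(re.findall(code, s)) == 1 for s in samples]
print(by_regex, by_count)
```

Both approaches agree on these inputs: the first two strings pass and the last two fail.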