Could anyone help me in understanding the groovy script below - regex

I am trying to decode some groovy script. I was able to figure out that it is a regular expression but couldn't figure out what the code is exactly.
def dirNumber = this.'Directory Number'
dirNumber?"61" + (dirNumber =~ /0([0-9]+)/)[0][1] + "#":null

According to the Regular expression operators section of https://groovy-lang.org/operators.html, =~ is the find operator: it creates a java.util.regex.Matcher from the pattern on its right and applies it to the string on its left.
So, dirNumber =~ /0([0-9]+)/ is equivalent to
Pattern.compile("0([0-9]+)").matcher(dirNumber) (the slashes only delimit Groovy's slashy string and are not part of the pattern) and evaluates to an instance of java.util.regex.Matcher.
Groovy gives you the ability to access matches by index ([0] in your code); your regular expression uses grouping, so in each match you can access groups by (1-based: 0 denotes the entire pattern) index ([1] in your code), too.
So, your code (if dirNumber is not null) extracts the first group of the first match of the regular expression.
// EDITED
You get an IndexOutOfBoundsException when the first index ([0] in your code) is outside the range of matches; when the second index ([1] in your code) is outside the range of groups, you get null, without an exception.
You can test these scenarios with a simplified version of the code above:
def dirNumber = "Xxx 15 Xxx 16 Xxx 17"
def matcher = dirNumber =~ /Xxx ([0-9]+)/
println(matcher[0][1]) // 15
println(matcher[0][5]) // null
println(matcher[5][1]) // IndexOutOfBoundsException
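For comparison, here is a rough Python sketch of what the original Groovy one-liner does. The function name format_directory_number is hypothetical, and raising IndexError on a failed match is only an assumption chosen to mirror Groovy's IndexOutOfBoundsException.

```python
import re

def format_directory_number(dir_number):
    """Hypothetical Python mirror of the Groovy expression in the question."""
    # Groovy treats both null and the empty string as false in `dirNumber ? ... : null`.
    if not dir_number:
        return None
    # `=~` has "find" semantics, so search() rather than fullmatch() is the analogue.
    match = re.search(r"0([0-9]+)", dir_number)
    if match is None:
        # Assumption: signal "no match" the way Groovy's matcher[0] would fail.
        raise IndexError("pattern did not match")
    # group(1) is the first capture group: the digits after the first 0.
    return "61" + match.group(1) + "#"

print(format_directory_number("0412345678"))  # 61412345678#
print(format_directory_number(""))            # None
```

So a number with a leading 0 has that 0 dropped, is prefixed with 61 and suffixed with #.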

Related

Regex - number of characters for sequence

I have the following pattern:
<tag-2>B1</tag-2>
<tag-3>A12</tag-3>
<tag-4>M123</tag-4>
//etc
There is always one letter and digits.
I need to create a regex which uses number from the tag and applies it to the sequence between tags. I know that I can use a backreference but I don't know how to construct a regex. Here is incomplete regex:
"^<tag-([2-9])>[A-Z][0-9]/*how to apply here number from the tag ?*/</tag-\\1>$"
Edit
The following strings are not matched:
<tag-2>11</tag-2> //missing letter
<tag-2>BB</tag-2> // missing digit
<tag-3>B123</tag-3> //too many digits
<tag-3>AA1</tag-3> //should be only one letter and two digits
<tag-4>N12</tag-4> //too few digits
Regular expressions cannot contain elements that are functions of the values of back-references (other than the back-references themselves). That's because regular expressions are static from the time they are constructed.
One could, however, extract the desired string, or conclude that the string contains no valid substring, in two steps. First attempt to match the string against /<tag-(\d+)>/, where the contents of the capture group, after being converted to an integer, equals the length of the string that begins with a capital letter and is followed by digits. That information can then be used to construct a second regular expression that is used to verify the remainder of the match and extract the desired string.
I will use Ruby to illustrate how that might be done here. The operations--and certainly the two regular expressions--should be clear even to readers who are not familiar with Ruby.
Code
R = /<tag-(\d+)>/ # a constant
def doit(str)
m = str.match(R) # obtain a MatchData object; else nil
return nil if m.nil? # finished if no match
n = m[1].to_i-1 # required number of digits
r = /\A\p{Lu}\d{#{n}}(?=<\/tag-#{m[1]}>)/
# regular expression for second match
str[m.end(0).to_i..-1][r] # extract the desired string; else nil
end
Examples
arr = <<_.each_line.map(&:chomp)
<tag-2>B1</tag-2>
<tag-3>A12</tag-3>
<tag-4>M123</tag-4>
<tag-2>11</tag-2>
<tag-2>BB</tag-2>
<tag-3>B123</tag-3>
<tag-3>AA1</tag-3>
<tag-4>N12</tag-4>
_
#=> ["<tag-2>B1</tag-2>", "<tag-3>A12</tag-3>",
# "<tag-4>M123</tag-4>", "<tag-2>11</tag-2>",
# "<tag-2>BB</tag-2>", "<tag-3>B123</tag-3>",
# "<tag-3>AA1</tag-3>", "<tag-4>N12</tag-4>"]
arr.map do |line|
s = doit(line)
s = 'nil' if s.nil?
puts "#{line.ljust(22)}: #{s}"
end
<tag-2>B1</tag-2> : B1
<tag-3>A12</tag-3> : A12
<tag-4>M123</tag-4> : M123
<tag-2>11</tag-2> : nil
<tag-2>BB</tag-2> : nil
<tag-3>B123</tag-3> : nil
<tag-3>AA1</tag-3> : nil
<tag-4>N12</tag-4> : nil
Explanation
Note that (?=<\/tag-#{m[1]}>) (part of r in the body of the method) is a positive lookahead, meaning that "<\/tag-#{m[1]}>" (with #{m[1]} substituted out) must be matched, but is not part of the match that is returned.
The step-by-step calculations are as follows.
str = "<tag-2>B1</tag-2>"
m = str.match(R)
#=> #<MatchData "<tag-2>" 1:"2">
m[0]
#=> "<tag-2>" (match)
m[1]
#=> "2" (contents of capture group 1)
m.end(0)
#=> 7 (index of str where the match ends, plus 1)
m.nil?
#=> false (do not return)
n = m[1].to_i-1
#=> 1 (number of digits required)
r = /\A\p{Lu}\d{#{n}}(?=\<\/tag\-#{m[1]}\>)/
#=> /\A\p{Lu}\d{1}(?=\<\/tag\-2\>)/
s = str[m.end(0).to_i..-1]
#=> str[7..-1]
#=> "B1</tag-2>"
s[r]
#=> "B1"
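For readers who prefer Python, the same two-step idea might be sketched as follows. The names TAG_OPEN and extract_tag_body are hypothetical, and the sketch uses [A-Z] and [0-9] as in the question rather than \p{Lu} and \d.

```python
import re

# Step 1 pattern: find the opening tag and capture its number.
TAG_OPEN = re.compile(r"<tag-(\d+)>")

def extract_tag_body(line):
    """Return the letter-plus-digits body if it has the length the tag demands, else None."""
    opening = TAG_OPEN.search(line)
    if opening is None:
        return None
    number = opening.group(1)
    n_digits = int(number) - 1  # one letter, then (tag number - 1) digits
    if n_digits < 0:
        return None
    # Step 2 pattern: built only now, because it depends on the captured number.
    body = re.compile(r"([A-Z][0-9]{%d})</tag-%s>" % (n_digits, number))
    # match(string, pos) anchors the attempt right after the opening tag.
    found = body.match(line, opening.end())
    return found.group(1) if found else None

for line in ["<tag-2>B1</tag-2>", "<tag-4>M123</tag-4>", "<tag-3>B123</tag-3>"]:
    print(line, ":", extract_tag_body(line))
```

The second pattern is compiled per line, which is the price of letting the repetition count depend on the captured value.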
It looks like you're trying to create a pattern that will interpret a number in order to determine how long a string should be. I don't know of any feature to automate this process in any regular expression engine, but it can be done in a more manual fashion by enumerating all cases which you wish to handle.
For example, tags 2 through 9 can be handled as such:
<tag-2>: ^<tag-2>[A-Z][0-9]</tag-2>$
<tag-3>: ^<tag-3>[A-Z][0-9]{2}</tag-3>$
<tag-4>: ^<tag-4>[A-Z][0-9]{3}</tag-4>$
<tag-5>: ^<tag-5>[A-Z][0-9]{4}</tag-5>$
<tag-6>: ^<tag-6>[A-Z][0-9]{5}</tag-6>$
<tag-7>: ^<tag-7>[A-Z][0-9]{6}</tag-7>$
<tag-8>: ^<tag-8>[A-Z][0-9]{7}</tag-8>$
<tag-9>: ^<tag-9>[A-Z][0-9]{8}</tag-9>$
By removing the grouping and back-references you eliminate some complications that can occur when trying to combine regular expression patterns and can produce the following:
^(<tag-2>[A-Z][0-9]</tag-2>|<tag-3>[A-Z][0-9]{2}</tag-3>|<tag-4>[A-Z][0-9]{3}</tag-4>|<tag-5>[A-Z][0-9]{4}</tag-5>|<tag-6>[A-Z][0-9]{5}</tag-6>|<tag-7>[A-Z][0-9]{6}</tag-7>|<tag-8>[A-Z][0-9]{7}</tag-8>|<tag-9>[A-Z][0-9]{8}</tag-9>)$
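If typing the alternation by hand feels error-prone, it could be generated instead. This is a minimal Python sketch; build_tag_pattern and its low/high parameters are hypothetical names, and it writes [0-9]{1} for tag 2 where the hand-written version has a bare [0-9], which is equivalent.

```python
import re

def build_tag_pattern(low=2, high=9):
    """Build the enumerated alternation for tags low..high (hypothetical helper)."""
    alternatives = [
        # Tag n wraps one letter followed by n - 1 digits.
        "<tag-%d>[A-Z][0-9]{%d}</tag-%d>" % (n, n - 1, n)
        for n in range(low, high + 1)
    ]
    return re.compile("^(?:" + "|".join(alternatives) + ")$")

TAG_PATTERN = build_tag_pattern()

print(bool(TAG_PATTERN.match("<tag-3>A12</tag-3>")))   # True
print(bool(TAG_PATTERN.match("<tag-3>B123</tag-3>")))  # False
```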

Exclude words that contain my regular expression but are not my regular expression

I am trying to find a way of excluding the words that contain my regular expression, but are not my regular expression using the search method of a Text widget object. For example, suppose I have this regular expression "(if)|(def)", and words like define, definition or elif are all found by the re.search function, but I want a regular expression that finds exactly just if and def.
This is the code I am using:
import keyword
PY_KEYS = keyword.kwlist
PY_PATTERN = "^(" + ")|(".join(PY_KEYS) + ")$"
But it still matches words like define; I want only exact words like def, even though define contains def.
I need this to highlight words in a tkinter.Text widget. The function responsible for highlighting the code is:
def highlight(self, event, pattern='', tag=KW, start=1.0, end="end", regexp=True):
    """Apply the given tag to all text that matches the given pattern
    If 'regexp' is set to True, pattern will be treated as a regular
    expression.
    """
    if not isinstance(pattern, str) or pattern == '':
        pattern = self.syntax_pattern # PY_PATTERN
    # print(pattern)
    start = self.index(start)
    end = self.index(end)
    self.mark_set("matchStart", start)
    self.mark_set("matchEnd", start)
    self.mark_set("searchLimit", end)
    count = tkinter.IntVar()
    while pattern != '':
        index = self.search(pattern, "matchEnd", "searchLimit",
                            count=count, regexp=regexp)
        # prints nothing
        print(self.search(pattern, "matchEnd", "searchLimit",
                          count=count, regexp=regexp))
        if index == "":
            break
        self.mark_set("matchStart", index)
        self.mark_set("matchEnd", "%s+%sc" % (index, count.get()))
        self.tag_add(tag, "matchStart", "matchEnd")
On the other hand, if PY_PATTERN = "\\b(" + "|".join(PY_KEYS) + ")\\b", then it highlights nothing, and you can see, if you put a print inside the function, that it's an empty string.
You can use anchors:
"^(?:if|def)$"
^ asserts position at the start of the string, and $ asserts position at the end of the string, asserting that nothing more can be matched unless the string is entirely if or def.
>>> import re
>>> for foo in ["if", "elif", "define", "def", "in"]:
...     bar = re.search("^(?:if|def)$", foo)
...     print(foo, ' ', bar)
...
if   <_sre.SRE_Match object at 0x934daa0>
elif   None
define   None
def   <_sre.SRE_Match object at 0x934daa0>
in   None
You could use word boundaries (as a raw string in Python, so that \b is not read as a backspace character):
r"\b(if|def)\b"
The answers given are fine for Python's regular expressions, but I have since found that the search method of a tkinter Text widget actually uses Tcl's regular expression syntax.
In this case, instead of wrapping the word or the regular expression with \b (or \\b if we are not using a raw string), we can use the corresponding Tcl word-boundary escape, \y (or \\y), which did the job in my case.
See my other question for more information.
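To make the difference concrete, here is a small Python sketch that builds both flavours of the pattern from keyword.kwlist. The helper names are hypothetical. Only the \b version is exercised here, since the \y version is meant to be handed to the Text widget's search method with regexp=True, which needs a running Tk instance.

```python
import keyword
import re

def python_keyword_pattern(words):
    """Whole-word pattern for Python's re module (\\b is the word boundary)."""
    return r"\b(?:" + "|".join(words) + r")\b"

def tcl_keyword_pattern(words):
    """Whole-word pattern for Tcl's regex engine (\\y is the word boundary)."""
    return r"\y(?:" + "|".join(words) + r")\y"

sample = "def define(x): return x if x else elif_value"

# 'define' and 'elif_value' are skipped because the keyword inside them
# is not delimited by word boundaries on both sides.
print(re.findall(python_keyword_pattern(keyword.kwlist), sample))
# ['def', 'return', 'if', 'else']

# Assumption: this string would be passed as the pattern argument of
# Text.search(..., regexp=True) instead of the \b version.
print(tcl_keyword_pattern(["if", "def"]))  # \y(?:if|def)\y
```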

Python RegEx query missing overlapping substrings

Python3.3, OS X 7.5
I am attempting to locate all instances of a 4-character substring defined as follows:
First character = 'N'
Second character = Anything but 'P'
Third character = 'S' or 'T'
Fourth character = Anything but 'P'
My query looks like this:
re.findall(r"N[A-OQ-Z][ST][A-OQ-Z]", text)
This works except in one particular case where two substrings overlap. That case involves the following 5-character substring:
'...NNTSY...'
The query catches the first 4-character substring ('NNTS'), but not the second 4-character substring ('NTSY').
This is my first attempt at regular expressions, and obviously I'm missing something.
You can do this if the re engine does not consume characters as it matches them, which is possible with lookahead assertions:
import re
text = '...NNTSY...'
for m in re.findall(r'(?=(N[A-OQ-Z][ST][A-OQ-Z]))', text):
print(m)
Output:
NNTS
NTSY
Having everything within the assertion works but also feels weird. Another way is taking the N out of the assertion:
for m in re.findall(r'(N(?=([A-OQ-Z][ST][A-OQ-Z])))', text):
print(''.join(m))
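If the positions of the overlapping hits matter as well, the lookahead version can be combined with finditer. This is a minimal sketch; OVERLAP_RE and overlapping_sites are hypothetical names.

```python
import re

# The whole 4-character motif sits inside a lookahead, so each match is
# zero-width and the scanner only advances one character at a time.
OVERLAP_RE = re.compile(r"(?=(N[A-OQ-Z][ST][A-OQ-Z]))")

def overlapping_sites(text):
    """Return (start index, matched substring) for every hit, overlaps included."""
    return [(m.start(), m.group(1)) for m in OVERLAP_RE.finditer(text)]

print(overlapping_sites("...NNTSY..."))
# [(3, 'NNTS'), (4, 'NTSY')]
```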
From the Python 3 documentation (emphasis added):
$ python3 -c 'import re; help(re.findall)'
Help on function findall in module re:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more capturing groups are present in the pattern, return
a list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
If you want overlapping instances, use regex.search() in a loop. You have to compile the regular expression because the API for non-compiled regular expressions doesn't take a parameter to specify the starting position.
import re

def findall_overlapping(pattern, string, flags=0):
    """Find all matches, even ones that overlap."""
    regex = re.compile(pattern, flags)
    pos = 0
    while True:
        match = regex.search(string, pos)
        if not match:
            break
        yield match
        # Restart one character after the start of this match, not after its end.
        pos = match.start() + 1
(N[^P](?:S|T)[^P])

Regex to match the longest repeating substring

I'm writing regular expression for checking if there is a substring, that contains at least 2 repeats of some pattern next to each other. I'm matching the result of regex with former string - if equal, there is such pattern. Better said by example: 1010 contains pattern 10 and it is there 2 times in continuous series. On other hand 10210 wouldn't have such pattern, because those 10 are not adjacent.
What's more, I need to find the longest pattern possible, and its length is at least 1. I have written the expression to check for it: ^.*?(.+)(\1).*?$. To find the longest pattern, I've used the non-greedy version to match something before the pattern, then the pattern is matched to group 1, and once again the same thing that has been matched for group 1 is matched. Then the rest of the string is matched, producing an equal string. But there's a problem: the regex is eager to return after finding the first pattern, and doesn't really take into account that I intend to make the substrings before and after as short as possible (leaving the rest as long as possible). So from the string 01011010 I correctly get that there's a match, but the pattern stored in group 1 is just 01, though I'd expect 101.
As I believe I can't make the pattern "more greedy" or the trash before and after even "more non-greedy", I can only come up with the idea of making the regex less eager, but I'm not sure if this is possible.
Further examples:
56712453289 - no pattern - no match with former string
22010110100 - pattern 101 - match with former string (regex resulted in 22010110100 with 101 in group 1)
5555555 - pattern 555 - match
1919191919 - pattern 1919 - match
191919191919 - pattern 191919 - match
2323191919191919 - pattern 191919 - match
What I would get using current expression (same strings used):
no pattern - no match
pattern 2 - match
pattern 555 - match
pattern 1919 - match
pattern 191919 - match
pattern 23 - match
In Perl you can do it with one expression with help of (??{ code }):
$_ = '01011010';
say /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
Output:
101
What happens here is that after a matching consecutive pair of substrings, we make sure with a negative lookahead that there is no longer pair following it.
To make the expression for the longer pair a postponed subexpression construct is used (??{ code }), which evaluates the code inside (every time) and uses the returned string as an expression.
The subexpression it constructs has the form .+?(..{N,})\1, where N is the current length of the first capturing group (length($^N), $^N contains the current value of the previous capturing group).
Thus the full expression would have the form:
(?=(.+)\1)(?!.+?(..{N,})\2)
With the magical N (and second capturing group not being a "real"/proper capturing group of the original expression).
Usage example:
use v5.10;
sub longest_rep{
$_[0] =~ /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
}
say longest_rep '01011010';
say longest_rep '010110101000110001';
say longest_rep '2323191919191919';
say longest_rep '22010110100';
Output:
101
10001
191919
101
You can do it in a single regex, you just have to pick the longest match from the list of results manually.
import re

def longestrepeating(strg):
    # Zero-width lookahead, so overlapping candidates are all reported.
    regex = re.compile(r"(?=(.+)\1)")
    matches = regex.findall(strg)
    if matches:
        return max(matches, key=len)
This gives you (since re.findall() returns a list of the matching capturing groups, even though the matches themselves are zero-length):
>>> longestrepeating("yabyababyab")
'abyab'
>>> longestrepeating("10100101")
'010'
>>> strings = ["56712453289", "22010110100", "5555555", "1919191919", 
               "191919191919", "2323191919191919"]
>>> [longestrepeating(s) for s in strings]
[None, '101', '555', '1919', '191919', '191919']
Here's a long-ish script that does what you ask. It basically goes through your input string, shortens it by one, then goes through it again. Once all possible matches are found, it returns one of the longest. It is possible to tweak it so that all the longest matches are returned, instead of just one, but I'll leave that to you.
It's pretty rudimentary code, but hopefully you'll get the gist of it.
use v5.10;
use strict;
use warnings;
while (<DATA>) {
chomp;
print "$_ : ";
my $longest = foo($_);
if ($longest) {
say $longest;
} else {
say "No matches found";
}
}
sub foo {
my $num = shift;
my @hits;
for my $i (0 .. length($num)) {
my $part = substr $num, $i;
push @hits, $part =~ /(.+)(?=\1)/g;
}
my $long = shift @hits;
for (@hits) {
if (length($long) < length) {
$long = $_;
}
}
return $long;
}
__DATA__
56712453289
22010110100
5555555
1919191919
191919191919
2323191919191919
Not sure if anyone's thought of this...
my $originalstring="pdxabababqababqh1234112341";
my $max=int(length($originalstring)/2);
my @result;
foreach my $n (reverse(1..$max)) {
@result=$originalstring=~m/(.{$n})\1/g;
last if @result;
}
print join(",",@result),"\n";
The longest doubled match cannot exceed half the length of the original string, so we count down from there.
If the matches are suspected to be small relative to the length of the original string, then this idea could be reversed... instead of counting down until we find the match, we count up until there are no more matches. Then we need to back up 1 and give that result. We would also need to put a comma after the $n in the regex.
my $n;
foreach (1..$max) {
unless (@result=$originalstring=~m/(.{$_,})\1/g) {
$n=--$_;
last;
}
}
@result=$originalstring=~m/(.{$n})\1/g;
print join(",",@result),"\n";
Regular expressions can be helpful in solving this, but I don't think you can do it as a single expression, since you want to find the longest successful match, whereas regexes just look for the first match they can find. Greediness can be used to tweak which match is found first (earlier vs. later in the string), but I can't think of a way to prefer an earlier, longer substring over a later, shorter substring while also preferring a later, longer substring over an earlier, shorter substring.
One approach using regular expressions would be to iterate over the possible lengths, in decreasing order, and quit as soon as you find a match of the specified length:
my $s = '01011010';
my $one = undef;
for(my $i = int (length($s) / 2); $i > 0; --$i)
{
if($s =~ m/(.{$i})\1/)
{
$one = $1;
last;
}
}
# now $one is '101'
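The same count-down loop can be sketched in Python for comparison. The function name longest_adjacent_repeat is hypothetical. Like the Perl version, it returns the first (leftmost) repeat of the greatest length.

```python
import re

def longest_adjacent_repeat(s):
    """Return the longest substring that is immediately followed by a copy of itself."""
    # A doubled substring cannot be longer than half the input, so start there
    # and shrink until some length produces a match.
    for n in range(len(s) // 2, 0, -1):
        match = re.search(r"(.{%d})\1" % n, s)
        if match:
            return match.group(1)
    return None

print(longest_adjacent_repeat("01011010"))          # 101
print(longest_adjacent_repeat("2323191919191919"))  # 191919
print(longest_adjacent_repeat("56712453289"))       # None
```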

2-step regular expression matching with a variable in Perl

I am looking to do a 2-step regular expression look-up in Perl, I have text that looks like this:
here is some text 9337 more text AA 2214 and some 1190 more BB stuff 8790 words
I also have a hash with the following values:
%my_hash = ( 9337 => 'AA', 2214 => 'BB', 8790 => 'CC' );
Here's what I need to do:
Find a number
Look up the text code for the number using my_hash
Check if the text code appears within 50 characters of the identified number, and if true print the result
So the output I'm looking for is:
Found 9337, matches 'AA'
Found 2214, matches 'BB'
Found 1190, no matches
Found 8790, no matches
Here's what I have so far:
while ( $text =~ /(\d+)(.{1,50})/g ) {
$num = $1;
$text_after_num = $2;
$search_for = $my_hash{$num};
if ( $text_after_num =~ /($search_for)/ ) {
print "Found $num, matches $search_for\n";
}
else {
print "Found $num, no matches\n";
}
}
This sort of works, except that the only correct match is 9337; the code doesn't match 2214. I think the reason is that the regular expression match on 9337 is including 50 characters after the number for the second-step match, and then when the regex engine starts again it is starting from a point after the 2214. Is there an easy way to fix this? I think the \G modifier can help me here, but I don't quite see how.
Any suggestions or help would be great.
You have a problem with greediness. The .{1,50} will consume as much as it can. Your regex should be /(\d+)(.+?)(?=($|\d))/
To explain, the question mark will make the multiple match non-greedy (it will stop as soon as the next pattern is matched - the next pattern gets precedence). The ?= is a lookahead operator to say "check if the next element is a digit. If so, match but do not consume." This allows the first digit to get picked up by the beginning of the regex and be put into the next matched pattern.
[EDIT]
I added an optional end value to the lookahead so that it wouldn't die on the last match.
Just use:
/\b\d+\b/g
Why match everything if you don't need to? You should use other functions to determine where the number is:
/(?=9337.{1,50}AA)/
This will fail if AA is further than 50 chars away from the end of 9337. Of course you will have to interpolate your variables to match your hash's keys and values. This was just an example for your first key/value pair.
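Putting the pieces together, the "match only the number, then inspect what follows" approach might look like this in Python. It is a sketch under assumptions: the names CODES and check_codes are hypothetical, and the 50-character window is taken from the question.

```python
import re

CODES = {"9337": "AA", "2214": "BB", "8790": "CC"}

def check_codes(text, codes, window=50):
    """For each number, report its code if that code occurs within `window` chars after it."""
    results = []
    # Match only the number itself, so nothing after it is consumed and
    # the next number is still available to the next iteration.
    for m in re.finditer(r"\b\d+\b", text):
        number = m.group()
        code = codes.get(number)  # None for numbers missing from the table
        following = text[m.end():m.end() + window]
        found = code is not None and code in following
        results.append((number, code if found else None))
    return results

text = ("here is some text 9337 more text AA 2214 and some 1190 "
        "more BB stuff 8790 words")
for number, code in check_codes(text, CODES):
    if code:
        print("Found %s, matches '%s'" % (number, code))
    else:
        print("Found %s, no matches" % number)
```

Slicing the text after each match replaces the .{1,50} group, which is what swallowed 2214 in the original loop.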