Regexr expression doesn't work in groovy - regex

I'm looking to get 4 digits that will be surrounded by spaces.
e.g. foo 2420 blah
Using RegExr I got this pattern: \b\d{4}\b
I translated this to Groovy as
def courseNum = course.text =~ $/\b\d{4}\b/$
System.out.print(courseNum.group())
This returns no matches, even though I am positive the string contains 4 digits by themselves.
What am I doing wrong?

The .group() call is what throws java.lang.IllegalStateException: No match found. The =~ operator only creates a Matcher; no match has been attempted yet when you call .group(). You just need to access the match value via the 0th index, courseNum[0].
Also, I would use a plain slashy string here, since that is all you need to define a regular expression.
def text = "New 7234 pcs"
def courseNum = text =~ /\b\d{4}\b/
print(courseNum[0])
See this Groovy demo
However, since you want 4 digits that are surrounded by spaces, you do not have to rely on \b word boundaries. Use lookarounds to require whitespace or the string start/end around the 4 digits:
/(?<!\S)\d{4}(?!\S)/
See the regex demo.
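To see how the two patterns differ, here is a small sketch in Python rather than Groovy (both constructs behave the same way in the two regex flavors). The sample string and the function names are made up for illustration:

```python
import re

WORD_BOUNDARY = re.compile(r'\b\d{4}\b')
WHITESPACE_BOUND = re.compile(r'(?<!\S)\d{4}(?!\S)')

def four_digit_words(text):
    """4-digit runs delimited by word boundaries (punctuation counts as a boundary)."""
    return WORD_BOUNDARY.findall(text)

def four_digit_tokens(text):
    """4-digit runs delimited only by whitespace or the string start/end."""
    return WHITESPACE_BOUND.findall(text)

sample = "foo 2420 blah x-1234. 56789"
print(four_digit_words(sample))   # ['2420', '1234']
print(four_digit_tokens(sample))  # ['2420']
```

The \b version also picks up 1234 because - and . are non-word characters, while the lookaround version only accepts digits with whitespace (or nothing) on both sides.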

Another good way to do this is with the findAll(regex) method.
def text = "CSE 2443, MATH 5003"
text.findAll(/\b\d{4}\b/).each {
    println it
}
Resulting in ([2443, 5003])
2443
5003
Even if nothing matches, this will not throw like your current code does. findAll returns all matches as a list (empty when there are none), so it is safer.

Related

Find all groups of 9 digits (\d{9}) up to a certain word

I have the following string extracted from a PDF file and I would like to obtain the nine digits "control class" number from it:
string = '(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)'
I want all the matches that occur before the word “Sector”, otherwise I will have undesired matches.
I’m using the “re” module, in Python 3.8.
I tried to use the negative lookbehind as follows:
(?<!Sector:)\d{9}
However, it didn’t work. I still had the matches like ‘54177846’ and ‘201874249’, which are after the ‘Sector’ word.
I also tried to “isolate” the search area between the words “Process ID” and “Sector”:
(Process ID:.*?)(\d{9})(.*Sector)
I also tried to search for the expression \d{9} only up to the “Sector” word, but it returned no results.
I had to work around it in two steps: (1) I created a regex that would find everything up to the word “Sector” (desperate_regex = '(.*)Sector') and assigned the result to a new variable, partial_text; (2) I then searched for the desired regex ('\d{9}') within the new variable.
My code is working, but it does not satisfy me. How would I find my matches with a single regex search?
Please note that the first "control class" number is truncated with the text that comes before it ("CONTROL CLASS706345519").
(PS: I'm a total newbie, and this is my first post. I hope I explained myself well. Thank you!)
The easiest way is to get the string before Sector and just search that:
import re
split_string = string.split("Sector", 1)[0]  # text before the first "Sector"
nums = re.findall(r'\d{9}', split_string)
# ['706345519', '708393673', '706855190']
Another would be to use the third-party regex module, which allows overlapping matches:
import regex as re
nums = re.findall(r'(\d{9}).*?Sector', string, overlapped=True)
# ['706345519', '708393673', '706855190']
The regex described below may be more than is required for the actual case being handled, but better safe than sorry.
If you want to match a string of exactly 9 digits, no more and no fewer, then you should use negative lookbehind and lookahead assertions to ensure that the 9 digits are neither preceded nor followed by another digit (again, in this case perhaps the OP knows that only 9-digit numbers will ever appear and this is overkill). You can also use a negative lookbehind assertion to ensure that Sector does not appear before the 9 digits. This latter assertion is a variable-length assertion, which requires the regex package from PyPI:
r'(?<!Sector.*?)(?<!\d)\d{9}(?!\d)'
(?<!Sector.*?) Assert that we haven't scanned past Sector. This handles the situation where Sector might appear multiple times in the input by ensuring that we never scan past the first occurrence.
(?<!\d) Assert that the previous character is not a digit.
\d{9} Match 9 digits.
(?!\d) Assert that the next character is not a digit.
The simplified version:
r'(?<!Sector.*?)\d{9}'
The code:
import regex as re
string = '(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)'
#print(re.findall(r'(?<!Sector.*?)\d{9}', string))
print(re.findall(r'(?<!Sector.*?)(?<!\d)\d{9}(?!\d)', string))
Prints:
['706345519', '708393673', '706855190']
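If installing the third-party regex package is not an option, a similar result can probably be had with the standard re module by limiting the search region instead of using a variable-length lookbehind. This is only a sketch: numbers_before and the shortened sample string are made up for illustration, while the pos/endpos arguments of a compiled pattern's findall are part of the standard library.

```python
import re

NINE_DIGITS = re.compile(r'(?<!\d)\d{9}(?!\d)')

def numbers_before(text, stop_word="Sector"):
    """Return the 9-digit numbers that occur before the first stop_word."""
    end = text.find(stop_word)
    if end == -1:            # stop word absent: search the whole string
        end = len(text)
    # endpos makes the pattern behave as if the string ended at `end`
    return NINE_DIGITS.findall(text, 0, end)

sample = ("Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 "
          "CODE AAZ-1585. Sector: Name:MULTIBANK SAAT: 54177846900115 "
          "SBPEAA201874249B")
print(numbers_before(sample))  # ['706345519', '708393673', '706855190']
```

The position lookup and the search are still two calls, but only one regex is involved and no intermediate substring is created.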
You could use an alternation and break if you find "Sector":
import re
text = """(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)"""
rx = re.compile(r'\d{9}|(Sector)')
results = []
for match in rx.finditer(text):
    if match.group(1):
        break
    results.append(match.group(0))
print(results)
Which yields
['706345519', '708393673', '706855190']
If either of these works, I'll add an explanation to it:
[\s\S]+(?:Process ID:\s+)(.*)(?:\s+Sector)[\s\S]+
\g<1>
Or this?
(?i)[\s\S]+(?:control\s+class\s*)(\d{9})[\s\S]+
\g<1>

Parse string using regex

I need to come up with a regular expression to parse my input string. My input string is of the format:
[alphanumeric].[alpha][numeric].[alpha][alpha][alpha].[julian date: yyyyddd]
eg:
A.A2.ABC.2014071
3.M1.MMB.2014071
I need to substring it from the 3rd position and was wondering what would be the easiest way to do it.
Desired result:
A2.ABC.2014071
M1.MMB.2014071
(?i) makes the match case-insensitive.
(?i)^[a-z\d]\.[a-z]\d\.[a-z]{3}\.\d{7}$
Here a-z means any letter from a to z, and \d means any digit from 0 to 9.
Now, if you want to remove the first section before the dot, use this regex and replace the match with $1 (or maybe \1, depending on the regex flavor):
(?i)^[a-z\d]\.([a-z]\d\.[a-z]{3}\.\d{7})$
Another option is to replace the match of the regex below with an empty string:
(?i)^[a-z\d]\.
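As a rough illustration of the capture-group variant, here is how it might look in Python (the question does not say which language hosts the regex, so the language and the name strip_first_field are assumptions):

```python
import re

# Same pattern as above: one alphanumeric, then letter+digit, three letters, yyyyddd
RECORD = re.compile(r'(?i)^[a-z\d]\.([a-z]\d\.[a-z]{3}\.\d{7})$')

def strip_first_field(record):
    """Return the record without its first field, or None if it is malformed."""
    m = RECORD.match(record)
    return m.group(1) if m else None

for rec in ("A.A2.ABC.2014071", "3.M1.MMB.2014071"):
    print(strip_first_field(rec))
```

Validating the whole record and capturing the tail in one step means malformed input is reported (as None here) instead of being silently truncated.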
If the input string is just the long form, then you want everything except the first two characters. You could arrange to substitute them with nothing:
s/^..//
Or you could arrange to capture everything except the first two characters:
/^..(.*)/
If the expression is part of a larger string, then the breakdown of the alphanumeric components becomes more important.
The details vary depending on the language that is hosting the regex. The notations written above could be Perl or PCRE (Perl Compatible Regular Expressions). Many other languages would accept these regexes too, but other languages would require tweaks.
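Use this regex (the dots are escaped so that they match only a literal dot):
\w\.[A-Z]\d\.[A-Z]{3}\.\d{7}
Use the above regex like this:
// needs java.util.regex.Matcher and java.util.regex.Pattern
String[] in = {
    "A.A2.ABC.2014071", "3.M1.MMB.2014071"
};
Pattern p = Pattern.compile("\\w\\.[A-Z]\\d\\.[A-Z]{3}\\.\\d{7}");
for (String s : in) {
    Matcher m = p.matcher(s);
    while (m.find()) {
        // drop the first field and its dot
        System.out.println("Result: " + m.group().substring(2));
    }
}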
Live demo: http://ideone.com/tns9iY

Regular expression which will match if there is no repetition

I would like to construct a regular expression that matches a password only if no character sequence repeats 4 or more times.
I have come up with a regex that matches if there is a character or group of characters repeating 4 times:
(?:([a-zA-Z\d]{1,})\1\1\1)
Is there any way to match only if the string doesn't contain the repetitions? I tried the approach suggested in "Regular expression to match a line that doesn't contain a word?", as I thought some combination of positive/negative lookaheads would do it. But I haven't found a working example yet.
By repetition I mean any number of characters anywhere in the string
Example - should not match
aaaaxbc
abababab
x14aaaabc
Example - should match
abcaxaxaz
(a appears here 4 times, but that is not a problem; I want to filter out repeating patterns)
That link was very helpful, and I was able to use it to create the regular expression from your original expression.
^(?:(?!(?<char>[a-zA-Z\d]+)\k<char>{3,}).)+$
or
^(?:(?!([a-zA-Z\d]+)\1{3,}).)+$
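A quick way to check the numbered-group form against the examples from the question is sketched below in Python (the question does not name a language, so this is only an illustration; is_acceptable and HAS_REPEAT are hypothetical names). The second pattern is a simpler alternative that searches for the repetition and negates the result; for non-empty, single-line input it should agree with the lookahead version.

```python
import re

# Pattern from the answer above, using the numbered-group form
NO_REPEAT = re.compile(r'^(?:(?!([a-zA-Z\d]+)\1{3,}).)+$')
# Alternative: look for a block repeated 4 times in a row and negate the result
HAS_REPEAT = re.compile(r'([a-zA-Z\d]+)\1{3}')

def is_acceptable(password):
    """True if no block of letters/digits is repeated 4+ times in a row."""
    return NO_REPEAT.match(password) is not None

for pw in ("aaaaxbc", "abababab", "x14aaaabc", "abcaxaxaz"):
    print(pw, is_acceptable(pw), HAS_REPEAT.search(pw) is None)
```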
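Nota Bene: the solution below doesn't answer the question exactly. It is stricter than what was asked: it rejects any string in which a single character occurs four or more times, even when the occurrences are not consecutive.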
-----
In Python:
import re
pat = r'(?:(.)(?!.*?\1.*?\1.*?\1.*\Z))+\Z'
regx = re.compile(pat)
for s in (':1*2-3=4#',
          ':1*1-3=4#5',
          ':1*1-1=4#5!6',
          ':1*1-1=1#',
          ':1*2-a=14#a~7&1{g}1'):
    m = regx.match(s)
    if m:
        print(m.group())
    else:
        print('--No match--')
result
:1*2-3=4#
:1*1-3=4#5
:1*1-1=4#5!6
--No match--
--No match--
This gives the regex engine a lot of work, because for each character of the string the pattern must verify that the character does not occur three more times in the rest of the string.
But it works, apparently.
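If the per-character rule is really what is wanted, counting may be cheaper than the lookahead-heavy regex. The sketch below is not part of the answer above; no_char_four_times is a made-up name, and it is only expected to agree with the regex on non-empty, single-line strings such as the samples.

```python
from collections import Counter

def no_char_four_times(text, limit=4):
    """True if no single character occurs `limit` or more times in text."""
    return all(count < limit for count in Counter(text).values())

for s in (':1*2-3=4#', ':1*1-1=4#5!6', ':1*1-1=1#'):
    print(s, no_char_four_times(s))
```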

Using Regex is there a way to match outside characters in a string and exclude the inside characters?

I know I can exclude outside characters in a string using look-ahead and look-behind, but I'm not sure about characters in the center.
What I want is to get a match of ABCDEF from the string ABC 123 DEF.
Is this possible with a Regex string? If not, can it be accomplished another way?
EDIT
For more clarification, in the example above I can use the regex string /ABC.*?DEF/ to sort of get what I want, but this includes everything matched by .*?. What I want is to match with something like ABC(match whatever, but then throw it out)DEF resulting in one single match of ABCDEF.
As another example, I can do the following (in pseudo-code and regex):
string myStr = "ABC 123 DEF";
string tempMatch = RegexMatch(myStr, "(?<=ABC).*?(?=DEF)"); //Returns " 123 "
string FinalString = myStr.Replace(tempMatch, ""); //Returns "ABCDEF". This is what I want
Again, is there a way to do this with a single regex string?
Since the regex replace feature in most languages does not change the string it operates on (but produces a new one), you can do it as a one-liner in most languages. Firstly, you match everything, capturing the desired parts:
^.*(ABC).*(DEF).*$
(Make sure to use the single-line/"dotall" option if your input contains line breaks!)
And then you replace this with:
$1$2
That will give you ABCDEF in one assignment.
Still, as outlined in the comments and in Mark's answer, the engine does match the stuff in between ABC and DEF. It's only the replacement convenience function that throws it out. But that is supported in pretty much every language, I would say.
Important: this approach will of course only work if your input string contains the desired pattern only once (assuming ABC and DEF are actually variable).
Example implementation in PHP:
$output = preg_replace('/^.*(ABC).*(DEF).*$/s', '$1$2', $input);
Or JavaScript (which had no single-line mode until ES2018 added the s flag, hence [\s\S] instead of the dot):
var output = input.replace(/^[\s\S]*(ABC)[\s\S]*(DEF)[\s\S]*$/, '$1$2');
Or C#:
string output = Regex.Replace(input, @"^.*(ABC).*(DEF).*$", "$1$2", RegexOptions.Singleline);
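For comparison, the same match-everything-and-replace idea might look like this in Python. This is a sketch: keep_outer is a hypothetical helper, and re.escape is added on the assumption that the two delimiters are literal strings rather than patterns.

```python
import re

def keep_outer(text, left="ABC", right="DEF"):
    """Return left+right if text contains left ... right, else text unchanged."""
    pattern = r'^.*(' + re.escape(left) + r').*(' + re.escape(right) + r').*$'
    # DOTALL plays the role of the single-line option mentioned above
    return re.sub(pattern, r'\1\2', text, flags=re.DOTALL)

print(keep_outer("ABC 123 DEF"))          # ABCDEF
print(keep_outer("xx ABC 1\n2 DEF yy"))   # ABCDEF
```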
A regular expression can contain multiple capturing groups. Each group must consist of consecutive characters so it's not possible to have a single group that captures what you want, but the groups themselves do not have to be contiguous so you can combine multiple groups to get your desired result.
Regular expression
(ABC).*(DEF)
Captures
ABC
DEF
See it online: rubular
Example C# code
string myStr = "ABC 123 DEF";
Match m = Regex.Match(myStr, "(ABC).*(DEF)");
if (m.Success)
{
string result = m.Groups[1].Value + m.Groups[2].Value; // Gives "ABCDEF"
// ...
}

Regular expression help in Perl

I have following text pattern
(2222) First Last (ab-cd/ABC1), <first.last#site.domain.com> 1224: efadsfadsfdsf
(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf
I want the number 1224 or 1234, 4657 from the above text after the text >.
I have this
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\#\w*\.domain.com>\s\d+:
which matches the text up to the :, but I want only the part after the email address, up to the :
Is there an easy regular expression to do this, or should I use split?
Thanks
Edit: The whole text is returned by a command line tool.
(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf
(3333) - Unique ID
First Last - First and last names
<first.last#site.domain.com> - Email address in format FirstName.LastName#sub.domain.com
1234, 4567 - database primary Keys
: xxxx - Headline
What I have to do is process the above, get the database IDs (in the example, 1234 and 4567 are 2 separate IDs), and query the tables
The above is the output (like this I will get many entries) from the tool which I am calling via my Perl script.
My idea was to use a regular expression to get the database id's. Guess I could use regular expression for this
You can fudge the stuff you don't care about to make the expression easier: just 'glob' the parts between the parentheticals (and the email delimiters) using non-greedy quantifiers:
/(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/ (not tested!)
There are only two capture groups: the unique ID, e.g. 3333, and the database IDs, e.g. 1234, 4657. From your pattern I can only assume the second one means "a digit string, followed by zero or more comma-separated digit strings".
Well, a simple fix is to just allow all the possible characters in a character class. Which is to say change \d to [\d, ] to allow digits, commas and space.
Your regex as it is, though, does not match the first sample line, because it has a dash - in it (ab-cd/ABC1 does not match \w*\/\w+\d*\). Also, it is not a good idea to rely too heavily on the * quantifier, because it does match the empty string (it matches zero or more times), and should only be used for things which are truly optional. Use + otherwise, which matches (1 or more times).
You have a rather strict regex, and with slight variations in your data like this, it will fail. Only you know what your data looks like, and if you actually do need a strict regex. However, if your data is somewhat consistent, you can use a loose regex simply based on the email part:
sub extract_nums {
    my $string = shift;
    if ($string =~ /<[^>]*> *([\d, ]+):/) {
        return $1 =~ /\d+/g;   # return the extracted digits in a list
        # return $1;           # just return the string as-is
    } else {
        return;                # empty list (undef in scalar context)
    }
}
This assumes, of course, that you cannot have <> tags in front of the email part of the line. It will capture any digits, commas and spaces found between a <> tag and a colon, and then return a list of any digits found in the match. You can also just return the string, as shown in the commented line.
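The same loose approach, sketched in Python for readers who want to try it outside Perl (the sample lines are copied from the question; the function is a port written for illustration, not the answerer's code):

```python
import re

AFTER_EMAIL = re.compile(r'<[^>]*> *([\d, ]+):')

def extract_nums(line):
    """Return the IDs between the <...> part and the colon as a list of strings."""
    m = AFTER_EMAIL.search(line)
    return re.findall(r'\d+', m.group(1)) if m else []

lines = [
    "(2222) First Last (ab-cd/ABC1), <first.last#site.domain.com> 1224: efadsfadsfdsf",
    "(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf",
]
for line in lines:
    print(extract_nums(line))  # ['1224'] then ['1234', '4657']
```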
There would appear to be something missing from your examples. Is this what they're supposed to look like, with email?
(1234) First Last (ab-cd/ABC1), <foo.bar#domain.com> 1224: efadsfadsfdsf
(1234) First Last (abcd/ABC12), <foo.bar#domain.com> 1234, 4657: efadsfadsfdsf
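If so, this should work (it allows a hyphen inside the parentheses, accepts any domain, and captures the first number and the optional second number after the email):
\((\d+)\)\s\w*\s\w*\s\([\w-]*\/\w+\),\s<\w*\.\w*\#[\w.]+>\s(\d+)(?:,\s(\d+))?: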
$string =~ /.*>\s*(.+):.+/;
$numbers = $1;
That's it.
Tested.
With number catching:
$string =~ /.*>\s*([0-9, ]+):.+/;
$numbers = $1;
Not tested but you get the idea.