Lookaround assertions in Perl - regex

I'm confused about the use of lookaround assertions in Perl.
For example, this one:
(?=pattern)
the positive lookahead. So here are my questions:
How are these useful? In what sort of situations are they used?
And related to question 1, why would I want to look ahead of the regex pattern? Isn't it more work, looking ahead and then executing the pattern matching again?
I need a very clear example if possible. Thanks.

To uppercase what's in between commas, you could use:
(my $x = 'a,b,c,d,e') =~ s/(?<=,)([^,]*)(?=,)/ uc($1) /eg; # a,B,C,D,e
               a,b,c,d,e
Pass 1 matches   -
Pass 2 matches     -
Pass 3 matches       -
If you didn't use lookarounds, this is what you'd get,
(my $x = 'a,b,c,d,e') =~ s/,([^,]*),/ ','.uc($1).',' /eg; # a,B,c,D,e
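               a,b,c,d,e
Pass 1 matches  ---
Pass 2 matches      ---
Not only does the lookaround avoid repetition, the substitution doesn't work without it: the comma that ends one match has already been consumed, so it can't also begin the next one.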
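The same comparison can be reproduced outside Perl. Here is a minimal sketch in Python, assuming its re module behaves like Perl for these patterns (re.sub with a function stands in for s///eg); the variable names are only illustrative:

```python
import re

text = "a,b,c,d,e"

# With lookarounds: the commas are only asserted, never consumed,
# so one comma can end a match and also begin the next one.
with_lookarounds = re.sub(r"(?<=,)([^,]*)(?=,)",
                          lambda m: m.group(1).upper(), text)

# Without lookarounds: each match consumes both commas, so the field
# right after a match has lost its leading comma and is skipped.
without_lookarounds = re.sub(r",([^,]*),",
                             lambda m: "," + m.group(1).upper() + ",", text)

print(with_lookarounds)     # a,B,C,D,e
print(without_lookarounds)  # a,B,c,D,e
```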
Another somewhat common use is as part of a string equivalent to [^CHAR].
foo(?:(?!foo|bar).)*bar # foo..bar, with no nested foo or bar
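To see what the negative lookahead buys here, a small Python sketch (the sample string is made up for illustration):

```python
import re

# (?!foo|bar). consumes one character only if neither "foo" nor "bar"
# starts at that position, so a match can't run across a nested delimiter.
pattern = re.compile(r"foo(?:(?!foo|bar).)*bar")

match = pattern.search("foo foo x bar")
print(match.group())  # foo x bar
```

The first foo is skipped because the second foo appears before any bar, so the match starts at the inner foo.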
You can use it to narrow down character classes.
\w(?<!\d) # A word char that's not a digit.
Although this can now be done using (?[ ... ]).
It's also useful in more esoteric patterns.
/a/ && /b/ && /c/
can be written as
/^(?=.*?a)(?=.*?b).*?c/s
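A rough way to check that equivalence, sketched in Python (re.DOTALL is assumed to play the role of Perl's /s; the sample strings and the helper name are hypothetical):

```python
import re

# Single pattern: two lookaheads plus one consuming part, all anchored at ^.
combined = re.compile(r"^(?=.*?a)(?=.*?b).*?c", re.DOTALL)

def has_all_separately(s):
    # Equivalent of /a/ && /b/ && /c/
    return all(re.search(ch, s) is not None for ch in "abc")

samples = ["cab", "abc", "c\nb\na", "ab", "bcx", ""]
for s in samples:
    print(repr(s), bool(combined.search(s)), has_all_separately(s))
```

Both columns should agree for every sample, since each lookahead restarts from the beginning of the string.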

A lookahead lets you check for a pattern without actually matching it.
With a(?=b), you match a only if it's followed by b. Note: it doesn't match b.
So,
1> You can extract hello (without the #) from #hello# using
(?<=#)hello(?=#)
2> You can validate passwords against requirements such as: a password must have at least 2 digits and at least 2 letters, along with any other characters
^(?=(.*\d){2})(?=(.*[a-z]){2}).*$
Try doing the above without lookahead, and you will realize its importance.
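Both patterns can be tried out quickly. A sketch in Python, with made-up test passwords:

```python
import re

# Lookbehind and lookahead require the # marks without including them.
inner = re.search(r"(?<=#)hello(?=#)", "#hello#")
print(inner.group())  # hello

# Each lookahead starts at the beginning of the string and checks one rule.
rule = re.compile(r"^(?=(.*\d){2})(?=(.*[a-z]){2}).*$")
print(bool(rule.search("a1b2")))  # True
print(bool(rule.search("ab1")))   # False: only one digit
```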

I have found lookaheads especially useful for checking multiple conditions. For example, consider a regex that checks that a password has at least one lowercase, one uppercase, one numeric, and one symbol character, and is at least 8 characters in length:
^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[^a-zA-Z0-9]).{8,}$
Try to devise a regex to do the same thing without lookahead assertions! It's possible, but it's extremely cumbersome.
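For illustration, here is how that pattern might be wrapped in Python; the function name and the sample passwords are hypothetical:

```python
import re

# One lookahead per requirement; .{8,} enforces the minimum length.
strong = re.compile(
    r"^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[^a-zA-Z0-9]).{8,}$"
)

def is_strong(password):
    return strong.search(password) is not None

print(is_strong("Passw0rd!"))   # True
print(is_strong("password1!"))  # False: no uppercase letter
print(is_strong("Pw0!"))        # False: too short
```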
Meanwhile, I've found lookbehinds especially useful for checking boundary conditions—that is, for example, matching a string of 0's, unless it's preceded by another number, like 1000067.
These are my experiences but certainly there are many more practical uses and the way everyone uses a tool can vary from person to person.

There are many reasons to use lookarounds, e.g.
limiting the substring that is considered to be matched: s/(?<=[0-9])\+(?=[0-9])/-/ instead of s/([0-9])\+([0-9])/$1-$2/.
and-ing various conditions together: /(?=\p{Uppercase}\p{Lowercase})\p{InBasicLatin}{2,}/.
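The first point matters most when matches sit next to each other. A sketch in Python, assuming the + between the assertions is meant as a literal plus sign (written \+) and that every occurrence is replaced, as with Perl's /g:

```python
import re

expr = "1+2+3"

# Lookarounds leave the digits unconsumed, so "2" can serve both matches.
with_lookaround = re.sub(r"(?<=[0-9])\+(?=[0-9])", "-", expr)

# Capturing the digits consumes them, so the second "+" has no digit
# left in front of it to match.
with_captures = re.sub(r"([0-9])\+([0-9])", r"\1-\2", expr)

print(with_lookaround)  # 1-2-3
print(with_captures)    # 1-2+3
```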

Lookaround assertions are useful when you need a pattern to help locate the match but you don't want that pattern to be part of what is matched.
Here's a simple scenario with a lookahead assertion:
Let's say I have
my $text = '98 degrees, 99 Red Balloons, 101 Dalmatians';
and I want to change the number of red balloons from its previous value to 9001, so I use
$text =~ s/\d+(?= Red Balloons)/9001/;
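The same substitution as a Python sketch. Note the space inside the lookahead, since a space separates the number from the words in the text:

```python
import re

text = "98 degrees, 99 Red Balloons, 101 Dalmatians"

# Only the digits are replaced; " Red Balloons" is asserted, not consumed.
updated = re.sub(r"\d+(?= Red Balloons)", "9001", text, count=1)
print(updated)  # 98 degrees, 9001 Red Balloons, 101 Dalmatians
```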

Related

Regular Expression to exclude numerical emailids

I have the below set of sample email IDs:
EmailAddress
1123
123.123
123_123
123#123.123
123#123.com
123#abc.com
123mbc#abc.com
123mbc#123abc.com
123mbc#123abc.123com
123mbc123#cc123abc.c123com
I need to eliminate the email IDs whose part before the # is entirely numeric.
Expected output:
123mbc#abc.com
123mbc#123abc.com
123mbc#123abc.123com
123mbc123#cc123abc.c123com
I used the Java regex below, but it eliminates everything. I have only basic knowledge of writing these expressions. Please help me correct it. Thanks in advance.
[^0-9]*#.*
Do you mean something like this? (.*[a-zA-Z].*[#]\w*\.\w*)
Breakdown:
.* = 0 or more characters
[a-zA-Z] = one letter
.* = 0 or more characters
[#] = the # sign
\w*\.\w* = any number of word characters (a-zA-Z0-9 and _) with a single . in between
This way you get the emails that contain at least one letter.
See the test at https://regex101.com/r/qV1bU4/3
Edited as suggested by ccf, with an updated breakdown.
The following regex only lets through email addresses that meet your specs:
(?m)^.*[^0-9#\r\n].*#
Note that you have to specify multi-line matching (the m flag). The solution uses the embedded flag syntax (?m); you can also call Pattern.compile with the Pattern.MULTILINE argument.
Live demo at regex101.
Explanation
Strategy: Define a basically sound email address as a single-line string containing a #, and exclude strictly numerical prefixes.
^: start-of-line anchor
#: a basically sound email address must contain the at-sign
[^...]: before the at-sign, at least one character must be neither a digit nor a CR/LF. # is also excluded, so that the non-digit character tested for cannot be the first at-sign itself.
.*: before and after the non-digit tested for, arbitrary strings are permitted (well, actually they aren't, but true syntactic validation of the email address should probably not happen here, and should definitely not be regex-based, for reasons of reliability and code maintainability). The strings need to be represented in the pattern because the pattern is anchored.
Try this one:
[^\d\s].*#.+
It will match emails that have at least one letter or symbol before the # sign.
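As a quick check against the sample data from the question, here is a sketch in Python rather than Java. It uses the anchored pattern from the answer above and applies it to one address at a time, so the (?m) flag isn't needed; the variable names are illustrative:

```python
import re

emails = [
    "1123", "123.123", "123_123", "123#123.123", "123#123.com",
    "123#abc.com", "123mbc#abc.com", "123mbc#123abc.com",
    "123mbc#123abc.123com", "123mbc123#cc123abc.c123com",
]

# At least one character that is neither a digit nor # must precede the #.
has_non_digit_prefix = re.compile(r"^.*[^0-9#\r\n].*#")

kept = [e for e in emails if has_non_digit_prefix.search(e)]
for e in kept:
    print(e)
```

This should print exactly the four addresses listed under the expected output.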

Interesting easy looking Regex

I am rephrasing my question to clear up the confusion!
I want to match if a string has certain letters. For this I use the character class:
[ACD]
and it works perfectly!
But I want to match only if the string has those letter(s) 2 or more times, either the same letter repeated or 2 different letters from the set.
For example:
[AKL] should match:
ABCVL
AAGHF
KKUI
AKL
But it should not match the following:
ABCD
KHID
LOVE
because letters from the set are there, but only once!
That's why I was trying to use:
[ACD]{2,}
But it's not working; it's probably not the right regex. Can a regex guru help me solve this puzzle?
Thanks
PS: I will use it on MySQL. A different approach is also welcome, but I'd like to use a regex for a smarter and shorter query!
To ensure that a string contains at least two occurrences of letters from a set (let's say A, K, L as in your example), you can write something like this:
[AKL].*[AKL]
Since the MySQL regex engine is a DFA, there is no need to use a negated character class like [^AKL] in place of the dot to avoid backtracking, nor a lazy quantifier, which is not supported at all.
example:
SELECT 'KKUI' REGEXP '[AKL].*[AKL]';
will return 1
You can follow this link that speaks on the particular subject of the LIKE and the REGEXP features in MySQL.
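The pattern can be checked against the examples from the question outside MySQL too. A sketch in Python, assuming the regex means the same thing there:

```python
import re

two_from_set = re.compile(r"[AKL].*[AKL]")

should_match = ["ABCVL", "AAGHF", "KKUI", "AKL"]
should_not_match = ["ABCD", "KHID", "LOVE"]

results = {s: bool(two_from_set.search(s))
           for s in should_match + should_not_match}
for s, matched in results.items():
    print(s, matched)
```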
If I understood you correctly, this is quite simple:
[A-Z].*?[A-Z]
This looks for something in your set, [A-Z], and then lazily matches characters until it (potentially) comes across the set, [A-Z], again.
As @Enigmadan pointed out, a lazy match is not necessary here: [A-Z].*[A-Z]
The expression you are using searches for two or more consecutive characters from the set ACDFGHIJKMNOPQRSTUVWXZ.
However, your regex excludes Y (UVWXZ]), so Z cannot be found, since it is not adjacent to another character from your set; the same applies to A, because B is also excluded ([ACD). For example, Z and A would match in a string like ZABCDEFGHIJKLMNOPQRSTUVWXYZA, where they are adjacent.
If those were not excluded on purpose, it would probably be better to use a range like [A-Z].
If you want 2 or more matches of [AKL], then you can use just [AKL] and check that the number of matches is >= 2.
I am not good at SQL regex, but maybe something like this?
check (dbo.RegexMatch( ['ABCVL'], '[AKL]' ) >= 2)
To put it in simple English, use [AKL] as your regex, and check that the number of matches in the string is at least 2. Here's how I would do it in Java:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns true if the string contains at least two characters from the set.
private boolean search2orMore(String string) {
    Matcher matcher = Pattern.compile("[ACD]").matcher(string);
    int counter = 0;
    while (matcher.find()) {
        counter++;
    }
    return counter >= 2;
}
You can't use [ACD]{2,} because it requires two or more consecutive characters from the set, so it fails when the matching characters are separated by other characters.
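The difference between counting matches and using {2,} can be sketched in Python like this (the helper name and its letters parameter are hypothetical, and the set AKL is taken from the question's examples):

```python
import re

def count_from_set(s, letters="AKL"):
    # Count every single character of s that belongs to the set.
    return len(re.findall("[" + re.escape(letters) + "]", s))

print(count_from_set("ABCVL") >= 2)              # True: A and L
print(count_from_set("LOVE") >= 2)               # False: only L
print(re.search(r"[AKL]{2,}", "ABCVL") is None)  # True: A and L are not adjacent
print(re.search(r"[AKL]{2,}", "AKL").group())    # AKL
```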
Your question is not very clear, but here is my trial pattern:
\b(\S*[AKL]\S*[AKL]\S*)\b
I'm pretty sure this should work in any case:
(?<l>[^AKL\n]*[AKL]+[^AKL\n]*[AKL]+[^AKL\n]*)[\n\r]
Replacing AKL with the letters you need can be done dynamically very easily; tell me if you need it.
Is this what you are looking for?
".*(.*[AKL].*){2,}.*" (without quotes)
It matches if there are at least two occurrences of your characters surrounded by anything.
It is a .NET regex, but it should be the same for anything else.
Edit
Overall, MySQL regular expression support is pretty weak.
If you only need to match your capture group a minimum of two times, then you can simply use:
select * from ... where ... regexp('([ACD].*){2,}') #could be `2,` or just `2`
If you need to match your capture group more than two times, then just change the number:
select * from ... where ... regexp('([ACD].*){3}')
#This number should match the number of matches you need
If you needed a minimum of 7 matches and you were using your previous capture group [ACDF-KM-XZ]
e.g.
select * from ... where ... regexp('([ACDF-KM-XZ].*){7,}')
Response before edit:
Your regex is trying to find at least two consecutive characters from the set [ACDFGHIJKMNOPQRSTUVWXZ].
([ACDFGHIJKMNOPQRSTUVWXZ]){2,}
The reason A and Z are not being matched in your example string (ABCDEFGHIJKLMNOPQRSTUVWXYZ) is because you are looking for two or more characters that are together that match your set. A is a single character followed by a character that does not match your set. Thus, A is not matched.
Similarly, Z is a single character preceded by a character that does not match your set. Thus, Z is not matched.
The characters B, E, L and Y do not match your set:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
If you were to do a global search in the string, only the runs CD, FGHIJK and MNOPQRSTUVWX would be matched.
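A global search over the alphabet can be reproduced with a short Python sketch (findall stands in for a global search here):

```python
import re

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# {2,} needs at least two adjacent characters from the set, so the
# isolated A and Z at the ends are never part of a match.
runs = re.findall(r"[ACDFGHIJKMNOPQRSTUVWXZ]{2,}", alphabet)
print(runs)  # ['CD', 'FGHIJK', 'MNOPQRSTUVWX']
```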

Regular expression for validating complicated username

So, the conditions are:
At least 1 character, max 20 characters
Starts with [a-zA-Z]
Contains [a-zA-Z0-9.-]
Ends with [a-zA-Z0-9]
My expression is:
^(?=[a-zA-Z])+[a-zA-Z0-9.-]*[a-zA-Z0-9]{1,20}$
It works nicely. However, it doesn't work properly with a username's length. I can enter a thirty-character username and still find a match. What's wrong with it?
I tend to find complicated regexps a poor choice when wanting to validate a string against multiple rules. They cause unreadable code that's difficult to maintain.
How about (in pseudocode)
.length >= 1 && .length <= 20
&& /^[a-z0-9.-]+$/i
&& /^[a-z]/i
&& /[a-z0-9]$/i
i.e. check the length, then check the legal character validity, then check the opening and closing characters, exactly as described in your question text.
You could also combine the first two lines so that you're only using regexps:
/^[a-z0-9.-]{1,20}$/i
&& /^[a-z]/i
&& /[a-z0-9]$/i
I'd be surprised if this was slower than a one-liner regexp, but it's certainly more readable.
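One possible rendering of that pseudocode, sketched in Python (the function name is made up, and \Z is used for the end of the string so that a trailing newline is not accepted):

```python
import re

def is_valid_username(name):
    return (
        # Length 1..20 and only legal characters.
        re.fullmatch(r"[a-z0-9.-]{1,20}", name, re.IGNORECASE) is not None
        # Starts with a letter.
        and re.match(r"[a-z]", name, re.IGNORECASE) is not None
        # Ends with a letter or digit.
        and re.search(r"[a-z0-9]\Z", name, re.IGNORECASE) is not None
    )

print(is_valid_username("Valid.UserName"))  # True
print(is_valid_username("0-Invalid"))       # False
print(is_valid_username("Invalid."))        # False
```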
If it contains only [a-zA-Z0-9.-], starts with [a-zA-Z] and ends with [a-zA-Z0-9], it doesn't start with [-0-9.] and doesn't end with [.-]
^(?![-0-9.])[a-zA-Z0-9.-]{1,20}(?<![.-])$
Note: Works only in regex flavors that support negative lookbehind.
Test at regex101
Try this:
^[a-zA-Z]$|^(?=.{2,20}$)[a-zA-Z][a-zA-Z0-9.-]*[a-zA-Z0-9]$
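You could use the regex below (the part after the first letter is optional, so that a single-letter username is accepted too),
^(?=.{1,20}$)[a-zA-Z](?:[a-zA-Z0-9.-]*[a-zA-Z0-9])?$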
If the string does not start with [a-zA-Z] the regex will fail. The rest is easier to understand.
^(?=[a-zA-Z])[a-zA-Z0-9.-]{0,19}[a-zA-Z0-9]$
The following is a fairly simple solution:
^[a-zA-Z]$|^[a-zA-Z]{1}[a-zA-Z0-9.-]{0,18}[a-zA-Z0-9]{1}$
Broken down:
Either: a single character in the group [a-zA-Z]
Or: exactly one character in the group [a-zA-Z], followed by up to 18 characters in the group [a-zA-Z0-9.-], and finally 1 character from the group [a-zA-Z0-9].
Matches correctly against the following:
Valid
Valid.UserName
Valid1-1UserName
0-Invalid
Invalid.
Invalid-ThisIsTooLong
V
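To see which of the names listed above the pattern actually accepts, here is a small Python sketch using the regex from this answer unchanged:

```python
import re

pattern = re.compile(
    r"^[a-zA-Z]$|^[a-zA-Z]{1}[a-zA-Z0-9.-]{0,18}[a-zA-Z0-9]{1}$"
)

names = ["Valid", "Valid.UserName", "Valid1-1UserName", "0-Invalid",
         "Invalid.", "Invalid-ThisIsTooLong", "V"]

accepted = [n for n in names if pattern.match(n)]
print(accepted)  # ['Valid', 'Valid.UserName', 'Valid1-1UserName', 'V']
```

Invalid-ThisIsTooLong is rejected because it is 21 characters long, one more than the 1 + 18 + 1 the pattern allows.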

Pattern matching in Perl

I am doing pattern matching on some names like the ones below:
ABCD123_HH1
ABCD123_HH1_K
Now, my code to grep the above names is below:
($name, $kind) = $dirname =~ /ABCD(\d+)\w*_([\w\d]+)/;
Now, the problem I am facing is that I get both names, ABCD123_HH1 and ABCD123_HH1_K, in $dirname. However, my variable $kind doesn't pick up the _K part of ABCD123_HH1_K. It works for the ABCD123_HH1 pattern.
I appreciate your time. Could you please tell me what can be done to capture the name with _K?
You need to add the _K part to the end of your regex and make it optional with ?:
/ABCD(\d+)_([\w\d]+(_K)?)/
I also removed the \w*, which is unnecessary and keeps you from correctly capturing HH1_K.
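The effect of the greedy \w* can be observed directly. A sketch in Python, where \w also covers letters, digits and underscores:

```python
import re

original = re.compile(r"ABCD(\d+)\w*_([\w\d]+)")
fixed = re.compile(r"ABCD(\d+)_([\w\d]+(_K)?)")

for name in ["ABCD123_HH1", "ABCD123_HH1_K"]:
    # group(1, 2) returns the ($name, $kind) pair as a tuple.
    print(name, original.search(name).group(1, 2), fixed.search(name).group(1, 2))
```

With the original pattern, \w* backtracks only as far as the last underscore, so the second group is left with just K for ABCD123_HH1_K.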
You should check for zero or more occurrences of _K.
* in Perl's regexp means zero or more times.
+ means one or more times.
Hence, in your regexp, append (_K)*.
Finally, with the greedy \w* dropped (it would otherwise swallow _HH1 and leave only K for the second group), your regexp should be this:
/ABCD(\d+)_([\w\d]+(_K)*)/
\w includes letters, numbers as well as underscores.
So you can use something as simple as this:
/ABCD\w+/

Regular Expression to match fractions and not dates

I'm trying to come up with a regular expression that will match a fraction (1/2) but not a date (5/5/2005) within a string. Any help at all would be great, all I've been able to come up with is (\d+)/(\d+) which finds matches in both strings. Thanks in advance for the help.
Assuming PCRE, use negative lookahead and lookbehind:
(?<![\/\d])(\d+)\/(\d+)(?![\/\d])
A lookahead (a (?=) group) says "match this stuff if it's followed by this other stuff." The contents of the lookahead aren't part of the match. We negate it (the (?!) group) so that the fraction only matches when it is not followed by a slash or another digit; that way, we don't match the first two parts of a date.
The complement to a lookahead is a lookbehind (a (?<=) group), which does the opposite: it matches stuff if it's preceded by this other stuff. Just like the lookahead, we can negate it (the (?<!) group) so that we match only things that don't follow something.
Together, they ensure that our fraction doesn't have other parts of fractions before or after it. It places no other arbitrary requirements on the input data. It will match the fraction 2/3 in the string "te2/3xt", unlike most of the other examples provided.
If your regex flavor uses / to delimit regular expressions, you'll have to escape the slashes in the pattern, or use a different delimiter (Perl's m{} would be a good choice here).
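A quick way to try the pattern on the strings from the question, sketched in Python, where the slash needs no escaping:

```python
import re

# Neither a slash nor a digit may touch the fraction on either side.
fraction = re.compile(r"(?<![/\d])(\d+)/(\d+)(?![/\d])")

print(fraction.findall("1/2"))       # [('1', '2')]
print(fraction.findall("5/5/2005"))  # []
print(fraction.findall("te2/3xt"))   # [('2', '3')]
```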
Edit: Apparently, none of these regexes work because the regex engine is backtracking and matching fewer numbers in order to satisfy the requirements of the regex. When I've been working on one regex for this long, I sit back and decide that maybe one giant regex is not the answer, and I write a function that uses a regex and a few other tools to do it for me. You've said you're using Ruby. This works for me:
>> def get_fraction(s)
>>   if s =~ /(\d+)\/(\d+)(\/\d+)?/
>>     if $3 == nil
>>       return $1, $2
>>     end
>>   end
>>   return nil
>> end
=> nil
>> get_fraction("1/2")
=> ["1", "2"]
>> get_fraction("1/2/3")
=> nil
This function returns the two parts of the fraction, but returns nil if it's a date (or if there's no fraction). It fails for "1/2/3 and 4/5" but I don't know if you want (or need) that to pass. In any case, I recommend that, in the future, when you ask on Stack Overflow, "How do I make a regex to match this?" you should step back first and see if you can do it using a regex and a little extra. Regular expressions are a great tool and can do a lot, but they don't always need to be used alone.
EDIT 2:
I figured out how to solve the problem without resorting to non-regex code, and updated the regex. It should work as expected now, though I haven't tested it. I also went ahead and escaped the /s since you're going to have to do it anyway.
EDIT 3:
I just fixed the bug j_random_hacker pointed out in my lookahead and lookbehind. I continue to see the amount of effort being put into this regex as proof that a pure regex solution was not necessarily the optimal solution to this problem.
Use negative lookahead and lookbehind.
/(?<![\/\d])(?:\d+)\/(?:\d+)(?![\/\d])/
EDIT: I've fixed my answer to trap for the backtracking bug identified by @j_random_hacker. As proof, I offer the following quick and dirty PHP script:
<?php
$subject = "The match should include 1/2 but not 12/34/56 but 11/23, now that's ok.";
$matches = array();
preg_match_all('/(?<![\/\d])(?:\d+)\/(?:\d+)(?![\/\d])/', $subject, $matches);
var_dump($matches);
?>
which outputs:
array(1) {
  [0]=>
  array(2) {
    [0]=>
    string(3) "1/2"
    [1]=>
    string(5) "11/23"
  }
}
Lookahead is great if you're using Perl or PCRE, but if they are unavailable in the regex engine you're using, you can use:
(^|[^/\d])(\d+)/(\d+)($|[^/\d])
The 2nd and 3rd captured segments will be the numerator and denominator.
If you do use the above in a Perl regex, remember to escape the /s -- or use a different delimiter, e.g.:
m!(?:^|[^/\d])(\d+)/(\d+)(?:$|[^/\d])!
In this case, you can use (?:...) to avoid saving the uninteresting parenthesised parts.
EDIT 18/12/2009: Chris Lutz noticed a tricky bug caused by backtracking that plagues most of these answers -- I believe this is now fixed in mine.
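The backtracking bug is easy to reproduce. A sketch in Python comparing the boundary classes with and without \d (the variable names and the sample text are made up):

```python
import re

text = "12/34/56 and 1/2"

# Without \d in the classes, the engine can give back a digit ("34" -> "3")
# and then treat the leftover "4" as the non-slash boundary character.
without_digit_guard = re.compile(r"(?:^|[^/])(\d+)/(\d+)(?:$|[^/])")
with_digit_guard = re.compile(r"(?:^|[^/\d])(\d+)/(\d+)(?:$|[^/\d])")

print(without_digit_guard.findall(text))  # [('12', '3'), ('1', '2')]
print(with_digit_guard.findall(text))     # [('1', '2')]
```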
If it's line input, you can try
^(\d+)\/(\d+)$
Otherwise, perhaps use this:
^(\d+)\/(\d+)[^\\]*.
This will work, provided digits are excluded along with the slash so that backtracking can't match part of a date: (?<![/\d])\d+/\d+(?![/\d])
Depending on the language you're working with, you might try negative lookahead or lookbehind assertions: in Perl, (?!pattern) asserts that /pattern/ can't follow the matched string.
Or, again, depending on the language, and anything you know about the context, a word-boundary match (\b in perl) might be appropriate.