Decoding regular expression in Perl - regex

Can anyone decode what this regular expression means in Perl:
while (/([0-9a-zA-Z\-]+(?:'[a-zA-Z0-9\-]+)*)/g)

Here is a breakdown of the regex:
( # start a capturing group (1)
[0-9a-zA-Z-]+ # one or more digits or letters or hyphens
(?: # start a non-capturing group
' # a literal single quote character
[a-zA-Z0-9-]+ # one or more digits or letters or hyphens
)* # repeat non-capturing group zero or more times
) # end of capturing group 1
The regex is in the form /.../g and in a while loop, which means that the code inside of the while will be run for each non-overlapping match of the regex.

There's a tool for that: YAPE::Regex::Explain
The regular expression:
(?-imsx:([0-9a-zA-Z\-]+(?:'[a-zA-Z0-9\-]+)*))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-9a-zA-Z\-]+ any character of: '0' to '9', 'a' to
'z', 'A' to 'Z', '\-' (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
[a-zA-Z0-9\-]+ any character of: 'a' to 'z', 'A' to
'Z', '0' to '9', '\-' (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------

F.J's answer is a perfect breakdown. But... he left out an important piece, which is the /g at the end. It tells the parser to continue where it left off from last time. So the while loop will continue to loop over the string repeatedly until it gets the the point where there are no other points that match.

Related

Printing in patterns in perl

I am having a great trouble to remove the errors in unicode encoded corpus.
In following form
രണവര്‍ഗ്ഗത്തിനകത്തു=ഭരണവര്‍ഗ്ഗത്തിന്:stemഅകത്തു|:suffix
ഭസ്മമാക്കിക്കളയുകയും=ഭസ്മം:stemആക്കിക്കളയുകയും|:suffix
ഭസ്മമാക്കി=ഭസ്മം:stemആക്കി|:suffix
ഭാഗത്തുനിന്നുണ്ടാകണം=ഭാഗത്ത്:stemനിന്ന്:stemഉണ്ടാകണം|:suffix,:
ഭാഗമായ=ഭാഗം:stemആയ|:suffix
ഭാര്യമാരില്‍നിന്നും=ഭാര്യമാരില്‍:stemനിന്നും|:suffix:suffix
ഭാര്യമാരുണ്ടായിരുന്നവരില്‍നിന്നു=ഭാര്യമാര്‍:stemഉണ്ടായിരുന്നവരില്‍:stemനിന്നു|:suffix,:suffix:suffix
ഭാര്യയായി=ഭാര്യ:stemആയി|:suffix
ഭാ‌ഷ്യകര്‍ത്താവായ=ഭാ‌ഷ്യകര്‍ത്താവ്:stemആയ|:suffix:suffix
ഭിത്തികളൊക്കെ=ഭിത്തികള്‍:stemഒക്കെ|:suffix
ഭിന്നതയില്ലെന്നും=ഭിന്നത:stemഇല്ല:stemഎന്നും|:suffix,:suffix0
ഭൂപ്രഭുക്കളെന്ന്=ഭൂപ്രഭുക്കള്‍:stemഎന്ന്|:suffix0
ഭൂമിയില്‍നിന്ന്=ഭൂമിയില്‍:stemനിന്ന്|:suffix
ഭൂമിയിലുള്ള=ഭൂമിയില്‍:stemഉള്ള|:suffix
ഭൂമിയെപ്പോലൊരു=ഭൂമിയെ:stemപോലെ:stemഒരു|:suffix,:suffix0
ഭൂമുഖവീക്ഷണനായി=ഭൂമുഖവീക്ഷണന്‍:stemആയി|:suffix:suffix
ഭൂസഞ്ചാരംപോലെ=ഭൂസഞ്ചാരം:stemപോലെ|:suffix
ഭേദിക്കേണ്ടതായി=ഭേദിക്കേണ്ടതാ്:stemആയി|:suffix:suffix
ഭൗതികവാദികളാണ്=ഭൗതികവാദികള്‍:stemആണ്|:suffix0
മക്കളയച്ചു=മക്കള്‍:stemഅയച്ചു|:suffix
മക്കള്‍ക്കാണ്=മക്കള്‍ക്ക്:stemആണ്|:suffix
മഞ്ചേരിയിലേക്കാണ്=മഞ്ചേരിയിലേക്ക്:stemആണ്|:suffix:suffix
മഞ്ചേശ്വരത്താണ്=മഞ്ചേശ്വരത്ത്:stemആണ്|:suffix:suffix
മഞ്ഞുവെള്ളത്തിലാഴ്ത്തി=മഞ്ഞുവെള്ളത്തില്‍:stemആഴ്ത്തി|:suffix:suffix
മടങ്ങാണിതിന്=മടങ്ങ്:stemആണ്:stemഇതിന്|:suffix,:suffix
മടിയനായിരുന്നു=മടിയന്‍:stemആയിരുന്നു|:suffix
Where I need to remove two stem together and two suffixes together. In the case of two stems I need keep first stem and convert the second into suffix. In the case of two suffixes like this :suffix:suffix, :suffix,:suffix0 I need to keep only one suffix
use strict;
use warnings qw/ all FATAL /;
use List::Util 'reduce';
while ( <> ) {
my ($word, $ss) = / \( ( /[^()]* ) \) /gx;
my #ss = split ' ', $ss;
my $str = reduce { sprintf 'S (%s) (%s)', $a, $b } #ss;
printf "%s (%s)\n", $word, $str;
}
This is the perl code I am trying to change but that code is not sufficient to handle the complexities. Is there any way to handle the kinds of errors.
**Expected output**
`ഭാര്യമാരുണ്ടായിരുന്നവരില്‍നിന്നു=ഭാര്യമാര്‍:stemഉണ്ടായിരുന്നവരില്‍:stemനിന്നു|:suffix,:suffix:suffix` to
ഭാര്യമാരുണ്ടായിരുന്നവരില്‍നിന്നു=ഭാര്യമാര്‍:stemഉണ്ടായിരുന്നവരില്‍:suffixനിന്നു|:suffix
ഭാ‌ഷ്യകര്‍ത്താവായ=ഭാ‌ഷ്യകര്‍ത്താവ്:stemആയ|:suffix:suffix to
ഭാ‌ഷ്യകര്‍ത്താവായ=ഭാ‌ഷ്യകര്‍ത്താവ്:stemആയ|:suffix
മഞ്ചേരിയിലേക്കാണ്=മഞ്ചേരിയിലേക്ക്:stemആണ്|:suffix:suffix to
മഞ്ചേരിയിലേക്കാണ്=മഞ്ചേരിയിലേക്ക്:stemആണ്|:suffix
Any one interested in helping me?
Description
^([^:]+:stem[^:]+)(?::stem(?=.*?(:suffix))|)([^:]+?\|:suffix[^:]*)(?::suffix[^:]*)*$
Replace with: \1\2\3
This regular expression will do the following:
Assumes that each line will have a suffix string this is then pattern matched and pulled into the capture group 2
If there is a second stem it is replaced with suffix
Removes all but the first suffix entries
Example
Live Demo
https://regex101.com/r/rJ9gW3/2
Sample text
ഭാര്യമാരുണ്ടായിരുന്നവരില്‍നിന്നു=ഭാര്യമാര്‍:stemഉണ്ടായിരുന്നവരില്‍:stemനിന്നു|:suffix,:suffix:suffix
ഭാ‌ഷ്യകര്‍ത്താവായ=ഭാ‌ഷ്യകര്‍ത്താവ്:stemആയ|:suffix:suffix
മഞ്ചേരിയിലേക്കാണ്=മഞ്ചേരിയിലേക്ക്:stemആണ്|:suffix:suffix
Sample Matches
ഭാര്യമാരുണ്ടായിരുന്നവരില്‍നിന്നു=ഭാര്യമാര്‍:stemഉണ്ടായിരുന്നവരില്‍:suffixനിന്നു|:suffix,
ഭാ‌ഷ്യകര്‍ത്താവായ=ഭാ‌ഷ്യകര്‍ത്താവ്:stemആയ|:suffix
മഞ്ചേരിയിലേക്കാണ്=മഞ്ചേരിയിലേക്ക്:stemആണ്|:suffix
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
^ the beginning of a "line"
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[^:]+ any character except: ':' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
:stem ':stem'
----------------------------------------------------------------------
[^:]+ any character except: ':' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
:stem ':stem'
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
.*? any character except \n (0 or more
times (matching the least amount
possible))
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
:suffix ':suffix'
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
[^:]+? any character except: ':' (1 or more
times (matching the least amount
possible))
----------------------------------------------------------------------
\| '|'
----------------------------------------------------------------------
:suffix ':suffix'
----------------------------------------------------------------------
[^:]* any character except: ':' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
:suffix ':suffix'
----------------------------------------------------------------------
[^:]* any character except: ':' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
$ before an optional \n, and the end of a
"line"
----------------------------------------------------------------------

How can I write a regular expression that matchs case insensitive emails?

I have following code to find user with an email.
User.findOne({
email: { $regex: new RegExp('^' + req.body.email.toLowerCase() + '$','i') }
})
It finds the user with a given email by lowercase letters and case insensitive search.
The problem is, we have some emails like john+doe#johndoe.com and this regular expression doesn't match those emails.
What should I add to regular expression to find that kind of emails?
The issue is that you're using the e-mail address, as req.body.email, unescaped in a regular expression.
As you noticed, characters that have a special meaning in regexes, like +, will cause problems. Even worse, when a user enters .* as their e-mail address, your query will match any user, which is a security concern.
What you want is to escape the e-mail address input so any special characters will be searched for as-is (have their "special meaning" stripped from them).
The easiest way is to use a module like regex-escape that will do that for you:
var escape = require('regex-escape');
...
User.findOne({
email: { $regex: new RegExp('^' + escape(req.body.email) + '$','i') }
})
Since the regex is already set to match case-insensitive, there's not need to lowercase the string.
Description
I use this expression, it's not perfect as there are some edge cases which will slip by but those are easy enough to test by simply sending the test email:
^[_a-z0-9-+]+(?:\.[_a-z0-9-+]+)*#[a-z0-9-]+(?:\.[a-z0-9-]+)*(?:\.[a-z]{2,4})$
By adding A-Z to each of the character classes I've made the same expression case insensitive.
^[_a-zA-Z0-9-+]+(?:\.[_a-zA-Z0-9-+]+)*#[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*(?:\.[a-zA-Z]{2,4})$
Example
Live Demo
https://regex101.com/r/uC5oG4/1
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
^ the beginning of a "line"
----------------------------------------------------------------------
[_a-z0-9-+]+ any character of: '_', 'a' to 'z', '0' to
'9', '-', '+' (1 or more times (matching
the most amount possible))
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
[_a-z0-9-+]+ any character of: '_', 'a' to 'z', '0'
to '9', '-', '+' (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
# '#'
----------------------------------------------------------------------
[a-z0-9-]+ any character of: 'a' to 'z', '0' to '9',
'-' (1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
[a-z0-9-]+ any character of: 'a' to 'z', '0' to
'9', '-' (1 or more times (matching the
most amount possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
[a-z]{2,4} any character of: 'a' to 'z' (between 2
and 4 times (matching the most amount
possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
$ before an optional \n, and the end of a
"line"
----------------------------------------------------------------------

Regex that doesn't allow to write dot(.) right after dot(.)

I'm a beginner in Qt and C++ programming. I want to use a Regular expression validator in my line edit that doesn't allow to write dot(.) right after dot(.). This is my Regex that I've used :
QRegExp reName("[a-zA-Z][a-zA-Z0-9. ]+ ")
But this is not enough for my task. Please someone help me.
I'm looking for something like this - for example :
"camp.new." (accepted)
"camp..new" (not accepted)
"ca.mp.n.e.w" (accepted)
How about:
^[a-zA-Z](?:\.?[a-zA-Z0-9 ]+)+$
Explanation:
The regular expression:
^[a-zA-Z](?:\.?[a-zA-Z0-9 ]+)+$
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
[a-zA-Z] any character of: 'a' to 'z', 'A' to 'Z'
----------------------------------------------------------------------
(?: group, but do not capture (1 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
\.? '.' (optional (matching the most amount
possible))
----------------------------------------------------------------------
[a-zA-Z0-9 ]+ any character of: 'a' to 'z', 'A' to
'Z', '0' to '9', ' ' (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
)+ end of grouping
----------------------------------------------------------------------
Generally speaking, what you want to do is to say that at each point you've got a ., it is not followed by another ., and otherwise everything is fine. A negative lookahead assertion is all you need here from the big bag of trickiness, but bear in mind that . is an RE metacharacter so there will be some backslashes too.
^(?:[^.]|\.(?!\.))*$
You might want adjust that further, of course.
In expanded form:
^ # Anchor at start
(?: # Start sub-RE
[^.] # Not a “.”
| # or...
\. (?! \. ) # a “.” if not followed by a “.”
)* # As many of the sub-RE as necessary
$ # Anchor at end
If you're RE engine anchors things anyway, you can simplify a little:
(?:[^.]|\.(?!\.))*

regex: symbols can't repeat next to eachother

I'm trying to make regex that picks all words that are a-z and with or without the symbol '.
the word needs to be at least 2 characters
cant start with the ' symbol
two ' symbols can't be next to each other
and "two character" words can't end with the ' symbol
I have being working for hours on that regex and i can't make it work:
/\b[a-z]([a-z(\')](?!\1))+\b/
it does not work and i don't know why! (the two ' symbols next to each other)
any ideas?
([a-z](?:[a-z]|'(?!'))+[a-z']|[a-z]{2})
Live # RegExPal
You probably will not need to use \b as regex is greedy and will consume all words as a whole.
This version can't be tested with RegexPal (does not recognize the lookbehind) but has custom word borders:
(?<![a-z'])([a-z](?:[a-z]|'(?!'))+[a-z']|[a-z]{2})(?![a-z'])
This should work (disclaimer: untested)
/\b(?![a-z]{2}'\b)[a-z]((?!'')['a-z])+\b/
Yours does not because you are attempting to nest a parenthesized expression inside a character class. That only adds ( and ) to the class, it will not set the value of your next \1 code.
(Edit) Added the constraint on aa'.
Assuming words are delimited by spaces:
(?:^|\s)((?:[a-z]{2})|(?:[a-z](?!.*'')[a-z']{2,}))(?:$|\s)
In action in a perl script:
my $re = qr/(?:^|\s)((?:[a-z]{2})|(?:[a-z](?!.*'')[a-z']{2,}))(?:$|\s)/;
while(<DATA>) {
chomp;
say (/$re/ ? "OK: $_" : "KO: $_");
}
__DATA__
ab
abc
a'
ab''
abc'
a''b
:!ù
output:
OK: ab
OK: abc
KO: a'
OK: ab''
OK: abc'
KO: a''b
KO: :!ù
Explanation:
The regular expression:
(?-imsx:\b((?:[a-z]{2})|(?:[a-z](?!.*'')[a-z']{2,}))\b)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
[a-z]{2} any character of: 'a' to 'z' (2 times)
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
[a-z] any character of: 'a' to 'z'
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
.* any character except \n (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
'' '\'\''
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
[a-z']{2,} any character of: 'a' to 'z', ''' (at
least 2 times (matching the most
amount possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------

REGEX - How to match set of string with at least one special character anywhere?

I have a problem in matching password using following regex.
^[A-Za-z\d[\!\#\#\$\%\^\&\*\(\)\_\+]{1,}]{6,}$
In above expression I want user to enter at least one special anywhere with remaining characters should be alphanumeric. The password length can't be less than six.
But the above expression is allowing user to enter not any special character. Could anyone please tell me how can I restrict the user to enter at least one special character?
How about:
^(?=[\w!##$%^&*()+]{6,})(?:.*[!##$%^&*()+]+.*)$
explanation:
The regular expression:
(?-imsx:^(?=[\w!##0^&*()+]{6,})(?:.*[!##0^&*()+]+.*)$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
[\w!##0^&*()+]{6,} any character of: word characters (a-z,
A-Z, 0-9, _), '!', '#', '#', '0', '^',
'&', '*', '(', ')', '+' (at least 6
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
[!##0^&*()+]+ any character of: '!', '#', '#', '0',
'^', '&', '*', '(', ')', '+' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
Instead of complicating your regex, how about iterating over the chars and counting the special ones
count = 0
for char in string:
if isspecial(char):
count = count+1
if count > 1:
reject()