Is it possible to make conditional regex of following? - regex

Hello I am wondering whether it is possible to do this type of regex:
I have certain characters representing okjects i.e. #,#,$ and operations that may be used on them like +,-,%..... every object has a different set of operations and I want my regex to find valid pairs.
So for examle I want pairs #+, #-, $+ to be matched, but yair $- not to be matched as it is invalid.
So is there any way to do this with regexes only, without doing some gymnastics inside language using regex engine?

every okject with it's own rules in []
/(#[+-]|\$[+]|#[+-])/
you need to properly escape special characters

Gymnastics is hard. Try something like /#\+|#-|\$\+/ or something like that.
Just remember, +, $, and ^ are reserved, so they'll need to be escaped.

Another approach, mix not allowed with raw combinations, but this might be slower.
/(?!\$-|\$\%)([\#\$\#][+\-\%])/, though not if there are many alternations of the first character.
my $str = '
#+, #-, $+ to be matched,
but yair $- not to be matched asit is invalid.
$% $- #% $%
';
my $regex =
qr/
(?!\$-|\$\%) # Specific combinations not allowed
(
[\#\$\#][+\-\%] # Raw combinations allowed
)
/x;
while ( $str =~ /$regex/g ) {
print "found: '$1'\n";
}
__END__
Output:
found: '#+'
found: '#-'
found: '$+'
found: '#%'

Related

Extract first word after specific word

I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
^ ^ ^^^
+--------+ |
Note 1 Note 2
Note 1: =~ returns a list of captured items, so the assignment target needs to be a list.
Note 2: allow one or more blanks after anywhere
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
.+?\b{wb}
matches the shortest non-empty sequence of characters that don't have a word break in them. The first one matches the span of spaces in your sentence; and the second one matches "except". It is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash
First, you have to write parentheses around left side expression of = operator to force array context for regexp evaluation. See m// and // in perlop documentation.[1] You can write
parentheses also around =~ binding operator to improve readability but it is not necessary because =~ has pretty high priority.
Use POSIX Character Classes word
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note I'm using x so whitespaces in regexp are ignored. Also use \b word boundary to anchor regexp correctly.
[1]: I write my ($w_after) just for convenience because you can write my ($a, $b, $c, #rest) as equivalent of (my $a, my $b, my $c, my #rest) but you can also control scope of your variables like (my $a, our $UGLY_GLOBAL, local $_, #_).
This Regex to be matched:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
^\s+ the word between two spaces
Thanks.
If you want to also take into consideration the punctuation marks, like in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this:
my ($w_after) = ($words =~ /anywhere[[:punct:]|\s]+(\S+)/);

Using regular expressions to find a word with the five letters abcde, each letter appearing exactly once, in any order, with no breaks in between

For example, the word debacle would work because of debac, but seabed would not work because: 1. there is no c in any 5-character sequence that can be formed, and 2. the letter e appears twice. As another example, feedback would work because of edbac. And remember, the solution must be done using only regular expressions.
A strategy I attempted to implement was: match the first letter if it's inside [a-e], and remember it. Then find the next letter in [a-e] but not the first letter. And so on. I wasn't sure what the syntax was (or even if some syntax existed) so my code didn't work:
open(DICT, "dictionary.txt");
#words = <DICT>;
foreach my $word(#words){
if ($word =~ /([a-e])([a-e^\1])([a-e^\1^\2])([a-e^\1^\2^\3])([a-e^\1^\2^\3^\4])/
){
print $word;
}
}
I was also thinking of using (?=regex) and \G but I wasn't sure how it would work out.
/
(?= .{0,4}a )
(?= .{0,4}b )
(?= .{0,4}c )
(?= .{0,4}d )
(?= .{0,4}e )
/xs
It's probably results in faster matching to generate a pattern from all combinations.
use Algorithm::Loops qw( NextPermute );
my #pats;
my #chars = 'a'..'e';
do { push #pats, quotemeta join '', #chars; } while NextPermute(#chars);
my $re = join '|', #pats;
abcde|abced|abdce|abdec|abecd|abedc|acbde|acbed|acdbe|acdeb|acebd|acedb|adbce|adbec|adcbe|adceb|adebc|adecb|aebcd|aebdc|aecbd|aecdb|aedbc|aedcb|bacde|baced|badce|badec|baecd|baedc|bcade|bcaed|bcdae|bcdea|bcead|bceda|bdace|bdaec|bdcae|bdcea|bdeac|bdeca|beacd|beadc|becad|becda|bedac|bedca|cabde|cabed|cadbe|cadeb|caebd|caedb|cbade|cbaed|cbdae|cbdea|cbead|cbeda|cdabe|cdaeb|cdbae|cdbea|cdeab|cdeba|ceabd|ceadb|cebad|cebda|cedab|cedba|dabce|dabec|dacbe|daceb|daebc|daecb|dbace|dbaec|dbcae|dbcea|dbeac|dbeca|dcabe|dcaeb|dcbae|dcbea|dceab|dceba|deabc|deacb|debac|debca|decab|decba|eabcd|eabdc|eacbd|eacdb|eadbc|eadcb|ebacd|ebadc|ebcad|ebcda|ebdac|ebdca|ecabd|ecadb|ecbad|ecbda|ecdab|ecdba|edabc|edacb|edbac|edbca|edcab|edcba
(This will get optimised into a trie in Perl 5.10+. Before 5.10, use Regexp::List.)
Your solution is clever but unfortunately [a-e^...] doesn't work, as you found. I don't believe there is a way to mix regular and negated character classes. I can think of a workaround using lookaheads though:
/(([a-e])(?!\2)([a-e])(?!\2)(?!\3)([a-e])(?!\2)(?!\3)(?!\4])([a-e])(?!\2)(?!\3)(?!\4])(?!\5)([a-e]))/
See it here: http://rubular.com/r/6pFrJe78b6.
UPDATE: Mob points out in the comments below, that alternation can be used to compact the above:
/(([a-e])(?!\2)([a-e])(?!\2|\3)([a-e])(?!\2|\3|\4])([a-e])(?!\2|\3|\4|\5)([a-e]))/
The new demo: http://rubular.com/r/UUS7mrz6Ze.
#! perl -lw
for (qw(debacle seabed feedback)) {
print if /([a-e])(?!\1)
([a-e])(?!\1)(?!\2)
([a-e])(?!\1)(?!\2)(?!\3)
([a-e])(?!\1)(?!\2)(?!\3)(?!\4)
([a-e])/x;
}

Regular Expression issue with * laziness

Sorry in advance that this might be a little challenging to read...
I'm trying to parse a line (actually a subject line from an IMAP server) that looks like this:
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
It's a little hard to see, but there are two =?/?= pairs in the above line. (There will always be one pair; there can theoretically be many.) In each of those =?/?= pairs, I want the third argument (as defined by a ? delimiter) extracted. (In the first pair, it's "Here is som", and in the second it's "e text.")
Here's the regex I'm using:
=\?(.+)\?.\?(.*?)\?=
I want it to return two matches, one for each =?/?= pair. Instead, it's returning the entire line as a single match. I would have thought that the ? in the (.*?), to make the * operator lazy, would have kept this from happening, but obviously it doesn't.
Any suggestions?
EDIT: Per suggestions below to replace ".?" with "[^(\?=)]?" I'm now trying to do:
=\?(.+)\?.\?([^(\?=)]*?)\?=
...but it's not working, either. (I'm unsure whether [^(\?=)]*? is the proper way to test for exclusion of a two-character sequence like "?=". Is it correct?)
Try this:
\=\?([^?]+)\?.\?(.*?)\?\=
I changed the .+ to [^?]+, which means "everything except ?"
A good practice in my experience is not to use .*? but instead do use the * without the ?, but refine the character class. In this case [^?]* to match a sequence of non-question mark characters.
You can also match more complex endmarkers this way, for instance, in this case your end-limiter is ?=, so you want to match nonquestionmarks, and questionmarks followed by non-equals:
([^?]*\?[^=])*[^?]*
At this point it becomes harder to choose though. I like that this solution is stricter, but readability decreases in this case.
One solution:
=\?(.*?)\?=\s*=\?(.*?)\?=
Explanation:
=\? # Literal characters '=?'
(.*?) # Match each character until find next one in the regular expression. A '?' in this case.
\?= # Literal characters '?='
\s* # Match spaces.
=\? # Literal characters '=?'
(.*?) # Match each character until find next one in the regular expression. A '?' in this case.
\?= # Literal characters '?='
Test in a 'perl' program:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group 1 -> %s\nGroup 2 -> %s\n], $1, $2 if m/=\?(.*?)\?=\s*=\?(.*?)\?=/;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
Running:
perl script.pl
Results:
Group 1 -> utf-8?Q?Here is som
Group 2 -> utf-8?Q?e text.
EDIT to comment:
I would use the global modifier /.../g. Regular expression would be:
/=\?(?:[^?]*\?){2}([^?]*)/g
Explanation:
=\? # Literal characters '=?'
(?:[^?]*\?){2} # Any number of characters except '?' with a '?' after them. This process twice to omit the string 'utf-8?Q?'
([^?]*) # Save in a group next characters until found a '?'
/g # Repeat this process multiple times until end of string.
Tested in a Perl script:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group -> %s\n], $1 while m/=\?(?:[^?]*\?){2}([^?]*)/g;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?= =?utf-8?Q?more text?=
Running and results:
Group -> Here is som
Group -> e text.
Group -> more text
Thanks for everyone's answers! The simplest expression that solved my issue was this:
=\?(.*?)\?.\?(.*?)\?=
The only difference between this and my originally-posted expression was the addition of a ? (non-greedy) operator on the first ".*". Critical, and I'd forgotten it.

Perl: Parse maillog to get date/recipient in a single regex statement

I'm trying to parse my maillog, which contains a number of lines which look similar to the following line:
Jun 6 17:52:06 host sendmail[30794]: p569q3sX030792: to=<person#recipient.com>, ctladdr=<apache#host.com> (48/48), delay=00:00:03, xdelay=00:00:03, mailer=esmtp, pri=121354, relay=gmail-smtp-in.l.google.com. [1.2.3.4], dsn=2.0.0, stat=Sent (OK 1307354043 x8si28599066ict.63)
The rules I'm trying to apply are:
The date is always the first 2 words
The email address always occurs between " to=person#recipient.com, " however the email address might be surrounded by <>
There are some lines in the log which do not relate to a recipient, so I'd like to ignore those lines entirely.
The following code works for either rule individually, however I'm having trouble combining them:
if($_ =~ m/\ to=([<>a-zA-Z0-9\.\#]*),\ /g) {
print "$1\n";
}
if($_ =~ /^+(\S+\s+\S+\s)/g) {
print "$1\n";
}
As always, I'm not sure whether the regex I'm using above is "best practice" so feel free to point out anything I'm doing badly there too :)
Thanks!
print substr($_, 0, 7), "$1\n" if / to=(.+?), /;
Your date is in a fixed-length format, you don't need a regular expression to match it.
For the address, what you need is the part between to= and the next ,, so a non-greedy match is just what you need.
To match either with one regex, or them using syntax (regex1|regex2) together:
((?<\ to=)[<>a-zA-Z0-9\.\#]*(?=,\ )|^\S+\s+\S+\s)
The outer brackets preserve $1 being assigned the match.
The look behind (?<\ to=) and look ahead (?=,\ ) do not capture anything, so these regexes only capture your target string.

How can I find repeated letters with a Perl regex?

I am looking for a regex that will find repeating letters. So any letter twice or more, for example:
booooooot or abbott
I won't know the letter I am looking for ahead of time.
This is a question I was asked in interviews and then asked in interviews. Not so many people get it correct.
You can find any letter, then use \1 to find that same letter a second time (or more). If you only need to know the letter, then $1 will contain it. Otherwise you can concatenate the second match onto the first.
my $str = "Foooooobar";
$str =~ /(\w)(\1+)/;
print $1;
# prints 'o'
print $1 . $2;
# prints 'oooooo'
I think you actually want this rather than the "\w" as that includes numbers and the underscore.
([a-zA-Z])\1+
Ok, ok, I can take a hint Leon. Use this for the unicode-world or for posix stuff.
([[:alpha:]])\1+
I Think using a backreference would work:
(\w)\1+
\w is basically [a-zA-Z_0-9] so if you only want to match letters between A and Z (case insensitively), use [a-zA-Z] instead.
(EDIT: or, like Tanktalus mentioned in his comment (and as others have answered as well), [[:alpha:]], which is locale-sensitive)
Use \N to refer to previous groups:
/(\w)\1+/g
You might want to take care as to what is considered to be a letter, and this depends on your locale. Using ISO Latin-1 will allow accented Western language characters to be matched as letters. In the following program, the default locale doesn't recognise é, and thus créé fails to match. Uncomment the locale setting code, and then it begins to match.
Also note that \w includes digits and the underscore character along with all the letters. To get just the letters, you need to take the complement of the non-alphanum, digits and underscore characters. This leaves only letters.
That might be easier to understand by framing it as the question:
"What regular expression matches any digit except 3?"
The answer is:
/[^\D3]/
#! /usr/local/bin/perl
use strict;
use warnings;
# uncomment the following three lines:
# use locale;
# use POSIX;
# setlocale(LC_CTYPE, 'fr_FR.ISO8859-1');
while (<DATA>) {
chomp;
if (/([^\W_0-9])\1+/) {
print "$_: dup [$1]\n";
}
else {
print "$_: nope\n";
}
}
__DATA__
100
food
créé
a::b
The following code will return all the characters, that repeat two or more times:
my $str = "SSSannnkaaarsss";
print $str =~ /(\w)\1+/g;
Just for kicks, a completely different approach:
if ( ($str ^ substr($str,1) ) =~ /\0+/ ) {
print "found ", substr($str, $-[0], $+[0]-$-[0]+1), " at offset ", $-[0];
}
FYI, aside from RegExBuddy, a real handy free site for testing regular expressions is RegExr at gskinner.com. Handles ([[:alpha:]])(\1+) nicely.
How about:
(\w)\1+
The first part makes an unnamed group around a character, then the back-reference looks for that same character.
I think this should also work:
((\w)(?=\2))+\2
/(.)\\1{2,}+/u
'u' modifier matching with unicode