Does a regular expression exist for enzymatic cleavage?

Does a regular expression exist for (theoretical) tryptic cleavage of protein sequences? The cleavage rule for trypsin is: after R or K, but not before P.
Example:
Cleavage of the sequence VGTKCCTKPESERMPCTEDYLSLILNR should result in these 3 sequences (peptides):
VGTK
CCTKPESER
MPCTEDYLSLILNR
Note that there is no cleavage after K in the second peptide (because P comes after K).
In Perl (it could just as well have been in C#, Python or Ruby):
my $seq = 'VGTRCCTKPESERMPCTEDYLSLILNR';
my @peptides = split /someRegularExpression/, $seq;
I have used this work-around (where a cut marker, =, is first inserted in the sequence and removed again if P is immediately after the cut marker):
my $seq = 'VGTRCCTKPESERMPCTEDYLSLILNR';
$seq =~ s/([RK])/$1=/g; #Main cut rule.
$seq =~ s/=P/P/g; #The exception.
my @peptides = split( /=/, $seq);
But this requires modifying a string that can be very long, and there can be millions of sequences. Is there a way to use a regular expression directly with split? If yes, what would the regular expression be?
Test platform: Windows XP 64 bit. ActivePerl 64 bit. From perl -v: v5.10.0 built for MSWin32-x64-multi-thread.

You indeed need to use the combination of a positive lookbehind and a negative lookahead. The correct (Perl) syntax is as follows:
my @peptides = split(/(?!P)(?<=[RK])/, $seq);

You could use look-around assertions to exclude those cases. Something like this should work:
split(/(?<=[RK](?!P))/, $seq)

You can use lookaheads and lookbehinds to match this stuff while still getting the correct position.
/(?<=[RK])(?!P)/
Should end up splitting on a point after an R or K that is not followed by a P.

In Python you can use the finditer method to return non-overlapping pattern matches including start and span information. You can then store the string offsets instead of rebuilding the string.
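As a minimal sketch of the finditer approach, the following uses the same zero-width pattern as the Perl answers and records offsets rather than building new strings. The function name tryptic_spans is hypothetical, chosen only for this illustration.

```python
import re

# Zero-width cut point: preceded by R or K, and not followed by P.
CLEAVAGE_SITE = re.compile(r"(?<=[RK])(?!P)")

def tryptic_spans(seq):
    """Return (start, end) offsets of each peptide without copying the sequence."""
    spans = []
    start = 0
    for match in CLEAVAGE_SITE.finditer(seq):
        cut = match.start()
        if cut > start:
            spans.append((start, cut))
            start = cut
    # Keep the tail if the sequence does not end on a cleavage site.
    if start < len(seq):
        spans.append((start, len(seq)))
    return spans

seq = "VGTKCCTKPESERMPCTEDYLSLILNR"
peptides = [seq[a:b] for a, b in tryptic_spans(seq)]
print(peptides)  # ['VGTK', 'CCTKPESER', 'MPCTEDYLSLILNR']
```

Slicing is done only at the end here to show the result; for millions of sequences the offsets alone may be enough.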

Related

Glob pattern expression for a hexadecimal number in TCL?

I am trying to understand the difference between glob and regex patterns. I need to do some pattern matching in TCL.
The purpose is to find out if a hexadecimal value has been entered.
The value may or may not start with 0x
The value shall contain between 1 and 12 hex characters i.e 0-9, a-f, A-F and these shall follow the 0x if it exists
The thing is that glob does not allow use of {a,b} to tell about how many characters to look for. Also, at start I tried to use (0x[Xx])? but I think this is not working.
It is not essential to use glob. I can see that there are subtle differences between glob and regex. I just want to know if this can be done only through regex and not glob.
Tcl's glob patterns are much simpler than regular expressions. All they support is:
* to mean any number of any character.
? to mean any single character.
[…] to mean any single character from the set (the chars inside the brackets, which may include ranges).
\x to mean a literal x (which can be any character). That's how you put a glob metacharacter in a glob pattern.
They're also always anchored at both ends. (Regular expressions are much more powerful. They're also slower. You pay for power.)
To match hex numbers like 0xF00d, you'd use a glob pattern like this:
0x[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]
(or, as an actual Tcl command; we put the pattern in {braces} to avoid needing lots of backslashes for all the brackets…)
string match {0x[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]} $value
Note that we have to match an exact number of characters. (You can shorten the pattern by using case-insensitive matching, to 0x[0-9a-f][0-9a-f][0-9a-f][0-9a-f].)
Matching hex numbers is better done with regexp or scan (which also parses the hex number). Everyone likes to forget scan for parsing, yet it's quite good at it…
regexp {^0x([[:xdigit:]]+)$} $value -> theHexDigits
scan $value "0x%x" theParsedValue
"The thing is that glob does not allow use of {a,b} to tell about how many characters to look for. Also, at start I tried to use (0x[Xx])? but I think this is not working."
A commonly used regular expression, not specific to Tcl at all, is ^(0[xX])?[A-Fa-f0-9]{1,12}$.
Update
As Donal writes, there is a power-cost tradeoff when it comes to regexp. I was curious and, for the given requirements (optional 0x prefix, range check [1,12]), found that a carefully crafted script using string operations incl. string match (see isHex1 below) outperforms regexp in this setting (see isHex2), whatever the input case:
proc isHex1 {str min max} {
    set idx [string last "0x" $str]
    if {$idx > 0} {
        return 0
    } elseif {$idx == 0} {
        set str [string range $str 2 end]
    }
    set l [string length $str]
    expr {$l >= $min && $l <= $max && [string match -nocase [string repeat {[0-9a-f]} $l] $str]}
}
proc isHex2 {str min max} {
    set regex [format {^(0x)?[[:xdigit:]]{%d,%d}$} $min $max]
    regexp $regex $str
}
isHex1 extends the idea of computing the string match pattern based on the input length (w/ or w/o prefix) and string repeat. My own timings suggest that isHex1 runs at least 40% faster than isHex2 (all using time, 10000 iterations), in a worst case (within range, final character decides). Other cases (e.g., out-of-range) are substantially faster.
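For readers more familiar with Python, here is a rough port of the two procedures. It is a sketch only: fnmatch is Python's glob matcher, not Tcl's string match, and the names is_hex_glob and is_hex_re are made up for this example. No timing claim is implied for the Python versions.

```python
import fnmatch
import re

def is_hex_glob(s, min_len, max_len):
    """Glob-style check: strip an optional leading 0x, then match one class per character."""
    idx = s.rfind("0x")
    if idx > 0:
        return False          # "0x" may only appear at the very start
    if idx == 0:
        s = s[2:]
    n = len(s)
    pattern = "[0-9a-fA-F]" * n   # glob has no {m,n}, so repeat the class n times
    return min_len <= n <= max_len and fnmatch.fnmatchcase(s, pattern)

def is_hex_re(s, min_len, max_len):
    """Regex check with a bounded quantifier."""
    regex = r"(0x)?[0-9a-fA-F]{%d,%d}" % (min_len, max_len)
    return re.fullmatch(regex, s) is not None

for value in ["0xF00d", "deadbeef", "0xdeadbeetle", "0x", "1234567890abc"]:
    print(value, is_hex_glob(value, 1, 12), is_hex_re(value, 1, 12))
```

The glob variant has to compute the pattern from the input length, which is exactly the limitation the question ran into; the regex variant states the length range directly.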
The glob syntax is described in the string match documentation. Compared to regular expressions, glob is a blunt instrument.
With regular expressions, you get the standard character classes, including [:xdigit:] to match a hexadecimal digit.
To contrast with mrcalvin's answer, a Tcl-specific regex would be: (?i)^0x[[:xdigit:]]{1,12}$
The leading (?i) means the expression will be matched case-insensitively.
If all you care about is determining if the input is a valid number, you can use string is integer:
set s 0xdeadbeef
string is integer $s ;# => 1
set s deadbeef
string is integer $s ;# => 0
set s 0xdeadbeetle
string is integer $s ;# => 0

"Variable length lookbehind not implemented" but it isn't variable length

I have a very crazy regex that I'm trying to diagnose. It is also very long, but I have cut it down to just the following script. Run using Strawberry Perl v5.26.2.
use strict;
use warnings;
my $text = "M Y H A P P Y T E X T";
my $regex = '(?i)(?<!(Mon|Fri|Sun)day |August )abcd(?-i)';
if ($text =~ m/$regex/) {
    print "true\n";
}
else {
    print "false\n";
}
This gives the error "Variable length lookbehind not implemented in regex."
I am hoping you can help with several issues:
I don't see why this error would occur, because all of the possible lookbehind values are 7 characters: "Monday ", "Friday ", "Sunday ", "August ".
I did not write this regex myself, and I am not sure how to interpret the syntax (?i) and (?-i). When I get rid of the (?i) the error actually goes away. How will perl interpret this part of the regex? I would think the first two characters are evaluated to "optional literal parentheses" except that the parentheses isn't escaped and also in that case I would get a different syntax error because the closing parentheses would then not be matched.
This behavior starts somewhere between Perl 5.16.3_64 and 5.26.1_64, at least in Strawberry Perl. The former version is fine with the code, the latter is not. Why did it start?
I have reduced your problem to this:
my $text = 'M Y H A P P Y T E X T';
my $regex = '(?<!st)A';
print ($text =~ m/$regex/i ? "true\n" : "false\n");
The error comes from the /i (case-insensitive) modifier combined with character sequences such as "ss" or "st", each of which can also be matched by a single typographic ligature character. That makes the lookbehind variable-length: /August/i matches, for instance, both AUGUST (6 characters) and auguﬆ (5 characters, the last one being U+FB06).
However if we remove /i (case insensitive) modifier then it works because typographic ligatures are not matched.
Solution: Use aa modifiers i.e.:
/(?<!st)A/iaa
Or in your regex:
my $text = 'M Y H A P P Y T E X T';
my $regex = '(?<!(Mon|Fri|Sun)day |August )abcd';
print ($text =~ m/$regex/iaa ? "true\n" : "false\n");
From perlre:
To forbid ASCII/non-ASCII matches (like "k" with "\N{KELVIN SIGN}"), specify the "a" twice, for example /aai or /aia. (The first occurrence of "a" restricts the \d, etc., and the second occurrence adds the "/i" restrictions.) But, note that code points outside the ASCII range will use Unicode rules for /i matching, so the modifier doesn't really restrict things to just ASCII; it just forbids the intermixing of ASCII and non-ASCII.
See a closely related discussion here
That's because st can be a ligature. The same happens to fi and ff:
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
my $fi = 'ﬁ';
print $fi =~ /fi/i;
So imagine something like fi|ﬁ where, indeed, the lengths of the alternatives aren't the same.
st could be represented in a 1-character stylistic ligature as ﬆ or ﬅ, so its length could be 2 or 1.
Quickly finding perl's full list of 2→1-character ligatures using a bash command:
$ perl -e 'print $^V'
v5.26.2
$ for lig in {a..z}{a..z}; do \
perl -e 'print if /(?<!'$lig')x/i' 2>/dev/null || echo $lig; done
ff fi fl ss st
These respectively represent the ﬀ, ﬁ, ﬂ, ß, and ﬆ/ﬅ ligatures. (ﬅ represents ſt, using the obsolete long s character; it matches st and it does not match ft.)
Perl also supports the remaining stylistic ligatures, ﬃ and ﬄ for ffi and ffl, though this isn't noteworthy in this context since lookbehinds already have issues with ﬀ and ﬁ/ﬂ separately.
Future releases of perl may include more stylistic ligatures, though all that remain are font-specific (e.g. Linux Libertine has stylistic ligatures for ct and ch) or debatably stylistic (such as the Dutch ĳ for ij or the obsolete Spanish ꝇ for ll). It doesn't seem appropriate to have this treatment for ligatures that are not entirely interchangeable (nobody would accept dœs for does), though there are other scenarios, such as including ß thanks to its uppercase form being SS.
Perl 5.16.3 (and similarly old versions) only stumble on ss (for ß) and fail to expand the other ligatures in lookbehinds (they have fixed width and will not match). I didn't seek out the bugfix to itemize exactly which versions are affected.
Perl 5.14 introduced ligature support, so earlier versions don't have this problem.
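The Unicode rule behind this can be seen outside Perl as well. The sketch below uses Python's str.casefold (full case folding) purely as an illustration of why one character can equal two under case-insensitive comparison; it says nothing about how Perl implements /i internally, and the helper name folds_equal is hypothetical.

```python
# Full Unicode case folding maps some single characters to two letters.
LIGATURES = {
    "\ufb00": "ff",  # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",  # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",  # LATIN SMALL LIGATURE FL
    "\u00df": "ss",  # LATIN SMALL LETTER SHARP S
    "\ufb05": "st",  # LATIN SMALL LIGATURE LONG S T
    "\ufb06": "st",  # LATIN SMALL LIGATURE ST
}

def folds_equal(a, b):
    """Compare two strings under full case folding."""
    return a.casefold() == b.casefold()

for ch, letters in LIGATURES.items():
    print(f"U+{ord(ch):04X} folds to {ch.casefold()!r}; equals {letters!r}: {folds_equal(ch, letters)}")

# Two spellings of different length that are equal case-insensitively,
# which is why a lookbehind for "August" under /i has no single fixed width.
six = "AUGUST"
five = "augu\ufb06"
print(len(six), len(five), folds_equal(six, five))
```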
Workarounds
Workarounds for /(?<!August)x/i (only the first will properly avoid Auguﬆ):
/(?<!Augus[t])(?<!Augu(?=st).)x/i (absolutely comprehensive)
/(?<!Augu(?aa:st))x/i (just the st in the lookbehind is "ASCII-safe" ²)
/(?<!(?aa)August)x/i (the whole lookbehind is "ASCII-safe" ²)
/(?<!August)x/iaa (the whole regex is "ASCII-safe" ²)
/(?<!Augus[t])x/i (breaks ligature seeking ¹)
/(?<!Augus.)x/i (slightly different, matches more)
/(?<!Augu(?-i:st))x/i (case-sensitive st in lookbehind, won't match AugusTx)
These toy with removing the case-insensitive modifier¹ or adding the ASCII-safe modifier² in various places, often requiring the regex writer to specifically know of the variable-width ligature.
The first variation (which is the only comprehensive one) matches the variable widths with two lookbehinds: first for the six-character version (no ligatures, as noted in the first quote below) and second for any ligatures, employing a forward lookahead (which has zero width!) for st (including the ligatures) and then accounting for its single-character width with a . (any one character).
Two segments of the perlre man page:
¹ Case-insensitive modifier /i & ligatures
There are a number of Unicode characters that match a sequence of
multiple characters under /i. For example, "LATIN SMALL LIGATURE
FI" should match the sequence fi. Perl is not currently able to
do this when the multiple characters are in the pattern and are
split between groupings, or when one or more are quantified. Thus
"\N{LATIN SMALL LIGATURE FI}" =~ /fi/i; # Matches [in perl 5.14+]
"\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i; # Doesn't match!
"\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i; # Doesn't match!
"\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i; # Doesn't match!
² ASCII-safe modifier /aa (perl 5.14+)
To forbid ASCII/non-ASCII matches (like k with \N{KELVIN SIGN}),
specify the a twice, for example /aai or /aia. (The first
occurrence of a restricts the \d, etc., and the second occurrence
adds the /i restrictions.) But, note that code points outside the
ASCII range will use Unicode rules for /i matching, so the modifier
doesn't really restrict things to just ASCII; it just forbids the
intermixing of ASCII and non-ASCII.
To summarize, this modifier provides protection for applications that
don't wish to be exposed to all of Unicode. Specifying it twice gives
added protection.
Put (?i) after lookbehind:
(?<!(Mon|Fri|Sun)day |August )(?i)abcd(?-i)
or
(?<!(Mon|Fri|Sun)day |August )(?i:abcd)
To me it seems to be a bug.
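The idea of scoping case-insensitivity to only part of the pattern also exists in other engines. As a hedged illustration in Python (whose re module likewise requires fixed-width lookbehind and supports the scoped form (?i:...)), the sketch below keeps the lookbehind case-sensitive and makes only abcd case-insensitive. The helper name has_match is made up for this example.

```python
import re

# Lookbehind alternatives are all 7 characters wide; only "abcd" ignores case.
PATTERN = re.compile(r"(?<!(?:Mon|Fri|Sun)day |August )(?i:abcd)")

def has_match(text):
    return PATTERN.search(text) is not None

print(has_match("Tuesday ABCD"))           # True: not preceded by an excluded word
print(has_match("Monday abcd"))            # False: excluded by the lookbehind
print(has_match("MONDAY abcd"))            # True: the lookbehind itself is case-sensitive
print(has_match("M Y H A P P Y T E X T"))  # False: no abcd at all
```

The third case shows the trade-off of this workaround: once /i no longer covers the lookbehind, differently cased weekday names are no longer excluded.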

Getting equal number of digits on both sides of a character in a string

I have a string
$test = 'xyz45sd2-32d34-sd23-456562.abc.com'
The objective is to obtain $1 = 23 and $2 = 45, i.e. an equal number of digits on both sides of the last -. Note that the number of digits is variable, and is not necessarily 2.
I have tried the following:
$test =~ s/.*(\d+)-(\d+).*//;
But
$1 contains 3
$2 contains 456562
You can try this regex:
if ($test =~ m/(\S+)-(\S+)-([a-z]*)(\d+)-(\d\d)(\d+).*/)
{
    print $4, "|", $5;
}
I assume that you need only the first 2 digits from 456562.
perl -e '"xyz45sd2-32d34-sd23-456562.abc.com" =~ /(\d{2})-(\d{2})\d*(?=\.)/; print "$1\n$2\n"'
This other entry confirms that regex does not count:
How to match word where count of characters same
Building upon GreatBigBore's idea, if there's an upper bound to the count, then you could try the or operator |. This only matches your requirement to find a match; depending on the matched count the match will be in different bins. Only one case correctly places them in $1 and $2.
(\d{3})-(\d{3})|(\d{2})-(\d{2})|(\d{1})-(\d{1})
However, if you concatenate the resulting captures as $1$3$5 and $2$4$6, you will effectively get the 2 strings you were looking for.
Another idea is to operate iteratively: repeat the search on the string, increasing the count until the match fails. (\d{1})-(\d{1}) , (\d{2})-(\d{2}) ...
A binary search over the count comes to mind, making it O(log N), N being the upper limit for the capture length.
Theoretical answer
Short answer:
What you're looking for is not possible using regular expressions.
Long Answer:
Regular expressions (as their name suggests) are a compact representation of regular languages (Type-3 grammars in the Chomsky hierarchy).
What you're looking for is not possible using regular expressions, because you're trying to write an expression that maintains some kind of count (contextual information other than beginning and end). This kind of behavior cannot be modelled as a DFA (actually, as any finite automaton). A language is regular exactly when there exists a DFA that accepts it. Since this kind of contextual information cannot be modeled in a DFA, you cannot write a regular expression for your problem.
Practical Solution
my ($lhs,$rhs) = $test =~ /^[^-]+-[^-]+-([^-]+)-([^-.]+)\S+/;
# Alternatively, and faster:
my (undef,undef,$lhs,$rhs) = split /-/, $test;
# The rest is common, no matter how $lhs and $rhs are extracted.
my @left = reverse split //, $lhs;
my @right = split //, $rhs;
my $i;
for($i=0; exists($left[$i]) and exists($right[$i]) and $left[$i] =~ /\d/ and $right[$i] =~ /\d/ ; ++$i){}
--$i;
$lhs= join "", reverse @left[0..$i];
$rhs= join "", @right[0..$i];
print $lhs, "\t", $rhs, "\n";
Edit: It's possible to improve my solution by using regular expressions to extract the required numeric portions of $lhs and $rhs instead of split, reverse and for.
As @Samveen said, it's technically not possible to do this in pure regex.
Like @Samveen's solution, here's another version:
# get left and right
my (undef,undef,$left,$right) = split /-/, $test;
# get left numbers
$left =~ s/.*?(\d+)$/$1/;
# get right numbers
$right =~ s/^(\d+).*/$1/;
# get length of both
my $right_length = length $right;
my $left_length = length $left;
if ($right_length > $left_length){
    # make right length the same as left length
    $right =~ s/(\d{$left_length}).*/$1/;
} else {
    # make left length the same as right length
    $left =~ s/.*(\d{$right_length})/$1/;
}
print $left, "\t", $right, "\n";
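The same two-step idea (let the regex find the digit runs, then do the counting in ordinary code) can be sketched in Python. The function name equal_digits_around_last_hyphen is hypothetical, and the sketch assumes "last -" means the final hyphen in the string.

```python
import re

def equal_digits_around_last_hyphen(text):
    """Return (left, right) digit strings of equal length around the last '-'."""
    left, sep, right = text.rpartition("-")
    if not sep:
        return None  # no hyphen at all
    left_digits = re.search(r"\d*$", left).group()   # digits ending the left part
    right_digits = re.match(r"\d*", right).group()   # digits starting the right part
    n = min(len(left_digits), len(right_digits))
    # Keep the n digits closest to the hyphen on each side.
    return left_digits[len(left_digits) - n:], right_digits[:n]

print(equal_digits_around_last_hyphen("xyz45sd2-32d34-sd23-456562.abc.com"))
# ('23', '45')
```

The counting happens in min() and the slices, not in the pattern, which is consistent with the theoretical answer above.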

How to check that a string is a palindrome using regular expressions?

That was an interview question that I was unable to answer:
How to check that a string is a palindrome using regular expressions?
p.s. There is already a question "How to check if the given string is palindrome?" and it gives a lot of answers in different languages, but no answer that uses regular expressions.
The answer to this question is that "it is impossible". More specifically, the interviewer is wondering if you paid attention in your computational theory class.
In your computational theory class you learned about finite state machines. A finite state machine is composed of nodes and edges. Each edge is annotated with a letter from a finite alphabet. One or more nodes are special "accepting" nodes and one node is the "start" node. As each letter is read from a given word we traverse the given edge in the machine. If we end up in an accepting state then we say that the machine "accepts" that word.
A regular expression can always be translated into an equivalent finite state machine. That is, one that accepts and rejects the same words as the regular expression (in the real world, some regexp languages allow for arbitrary functions, these don't count).
It is impossible to build a finite state machine that accepts all palindromes. The proof relies on the fact that we can easily build a string that requires an arbitrarily large number of nodes, namely the string
a^x b a^x (e.g., aba, aabaa, aaabaaa, aaaabaaaa, ...)
where a^x is a repeated x times. This requires at least x nodes because, after seeing the 'b', we have to count back x times to make sure it is a palindrome.
Finally, getting back to the original question, you could tell the interviewer that you can write a regular expression that accepts all palindromes that are smaller than some finite fixed length. If there is ever a real-world application that requires identifying palindromes then it will almost certainly not include arbitrarily long ones, thus this answer would show that you can differentiate theoretical impossibilities from real-world applications. Still, the actual regexp would be quite long, much longer than the equivalent 4-line program (easy exercise for the reader: write a program that identifies palindromes).
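For comparison, one possible version of that short program is sketched below in Python. It is only one of many ways to do it; the function name is_palindrome is chosen for this example.

```python
def is_palindrome(s):
    """Compare characters from both ends, moving inward."""
    i, j = 0, len(s) - 1
    while i < j:
        if s[i] != s[j]:
            return False
        i += 1
        j -= 1
    return True

for word in ["aba", "aabaa", "aaabaaaa", "racecar", ""]:
    print(repr(word), is_palindrome(word))
```

The two indices play the role of the unbounded counter that a finite state machine does not have.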
While the PCRE engine does support recursive regular expressions (see the answer by Peter Krauss), you cannot use a regex on the ICU engine (as used, for example, by Apple) to achieve this without extra code. You'll need to do something like the following, which detects any palindrome but requires a loop (because regular expressions can't count):
$a = "teststring";
while (length $a > 1)
{
    $a =~ /(.)(.*)(.)/;
    die "Not a palindrome: $a" unless $1 eq $3;
    $a = $2;
}
print "Palindrome";
It's not possible. Palindromes aren't defined by a regular language. (See, I DID learn something in computational theory)
With Perl regex:
/^((.)(?1)\2|.?)$/
Though, as many have pointed out, this can't be considered a regular expression if you want to be strict. Regular expressions do not support recursion.
Here's one to detect 4-letter palindromes (e.g.: deed), for any type of character:
\(.\)\(.\)\2\1
Here's one to detect 5-letter palindromes (e.g.: radar), checking for letters only:
\([a-z]\)\([a-z]\)[a-z]\2\1
So it seems we need a different regex for each possible word length.
This post on a Python mailing list includes some details as to why (Finite State Automata and pumping lemma).
Depending on how confident you are, I'd give this answer:
I wouldn't do it with a regular
expression. It's not an appropriate
use of regular expressions.
Yes, you can do it in .Net!
(?<N>.)+.?(?<-N>\k<N>)+(?(N)(?!))
You can check it here! It's a wonderful post!
StackOverflow is full of answers like "Regular expressions? nope, they don't support it. They can't support it.".
The truth is that regular expressions have nothing to do with regular grammars anymore. Modern regular expressions feature functions such as recursion and balancing groups, and the availability of their implementations is ever growing (see Ruby examples here, for instance). In my opinion, hanging onto old belief that regular expressions in our field are anything but a programming concept is just counterproductive. Instead of hating them for the word choice that is no longer the most appropriate, it is time for us to accept things and move on.
Here's a quote from Larry Wall, the creator of Perl itself:
(…) generally having to do with what we call “regular expressions”, which are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I’m not going to try to fight linguistic necessity here. I will, however, generally call them “regexes” (or “regexen”, when I’m in an Anglo-Saxon mood).
And here's a blog post by one of PHP's core developers:
As the article was quite long, here a summary of the main points:
The “regular expressions” used by programmers have very little in common with the original notion of regularity in the context of formal language theory.
Regular expressions (at least PCRE) can match all context-free languages. As such they can also match well-formed HTML and pretty much all other programming languages.
Regular expressions can match at least some context-sensitive languages.
Matching of regular expressions is NP-complete. As such you can solve any other NP problem using regular expressions.
That being said, you can match palindromes with regexes using this:
^(?'letter'[a-z])+[a-z]?(?:\k'letter'(?'-letter'))+(?(letter)(?!))$
...which obviously has nothing to do with regular grammars.
More info here: http://www.regular-expressions.info/balancing.html
As a few have already said, there's no single regexp that'll detect a general palindrome out of the box, but if you want to detect palindromes up to a certain length, you can use something like
(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1
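The bounded pattern above generalizes mechanically: one optional group per character of the first half, an optional middle character, then the backreferences in reverse order. Here is a sketch that builds such a pattern for a chosen maximum length in Python; the names bounded_palindrome_pattern and is_bounded_palindrome are hypothetical.

```python
import re

def bounded_palindrome_pattern(max_len):
    """Build a backreference pattern matching palindromes of at most max_len characters."""
    half = max_len // 2
    groups = "(.?)" * half
    backrefs = "".join("\\" + str(i) for i in range(half, 0, -1))
    middle = ".?" if max_len % 2 else ""
    return re.compile(groups + middle + backrefs, re.DOTALL)

def is_bounded_palindrome(s, max_len=11):
    # fullmatch anchors the pattern at both ends of the string.
    return bounded_palindrome_pattern(max_len).fullmatch(s) is not None

print(bounded_palindrome_pattern(11).pattern)
for word in ["racecar", "kayak", "deed", "book"]:
    print(word, is_bounded_palindrome(word))
```

With max_len=11 this produces the five-group pattern shown above. Longer palindromes are rejected, which is the price of staying within plain backreferences.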
You can also do it without using recursion:
\A(?:(.)(?=.*?((?(2)\1\2|\1))\z))*?.?\2\z
to allow a single character:
\A(?:(?:(.)(?=.*?((?(2)\1\2|\1))\z))*?.?\2|.)\z
Works with Perl, PCRE
demo
For Java:
\A(?:(.)(?=.*?(\1\2\z|(?<!(?=\2\z).{0,1000})\1\z)))*?.?\2\z
demo
It can be done in Perl now, using a recursive reference:
if ($istr =~ /^((\w)(?1)\g{-1}|\w?)$/) {
    print $istr, " is palindrome\n";
}
Adapted from the section near the end of http://perldoc.perl.org/perlretut.html
In Ruby you can use named capture groups, so something like this will work:
def palindrome?(string)
  $1 if string =~ /\A(?<p>| \w | (?: (?<l>\w) \g<p> \k<l+0> ))\z/x
end
try it, it works...
1.9.2p290 :017 > palindrome?("racecar")
=> "racecar"
1.9.2p290 :018 > palindrome?("kayak")
=> "kayak"
1.9.2p290 :019 > palindrome?("woahitworks!")
=> nil
Recursive Regular Expressions can do it!
A simple and self-evident pattern to detect a string that contains a palindrome:
(\w)(?:(?R)|\w?)\1
At rexegg.com/regex-recursion the tutorial explains how it works.
It works with any language whose regex engine supports recursion. Here is an example adapted from the same source as a proof of concept, using PHP:
$subjects=['dont','o','oo','kook','book','paper','kayak','okonoko','aaaaa','bbbb'];
$pattern='/(\w)(?:(?R)|\w?)\1/';
foreach ($subjects as $sub) {
    echo $sub." ".str_repeat('-',15-strlen($sub))."-> ";
    if (preg_match($pattern,$sub,$m))
        echo $m[0].(($m[0]==$sub)? "! a palindrome!\n": "\n");
    else
        echo "sorry, no match\n";
}
outputs
dont ------------> sorry, no match
o ---------------> sorry, no match
oo --------------> oo! a palindrome!
kook ------------> kook! a palindrome!
book ------------> oo
paper -----------> pap
kayak -----------> kayak! a palindrome!
okonoko ---------> okonoko! a palindrome!
aaaaa -----------> aaaaa! a palindrome!
bbbb ------------> bbb
Comparing
The regular expression ^((\w)(?:(?1)|\w?)\2)$ does the same job, but as a yes/no test on the whole string instead of "contains". PS: it uses a definition where "o" is not a palindrome and the hyphenated "able-elba" is not a palindrome, but "ableelba" is. Call this definition1. Call the definition under which "o" and "able-elba" are palindromes definition2.
Comparing with other "palindrome regexes":
^((.)(?:(?1)|.?)\2)$ is the base regex above without the \w restriction, so it accepts "able-elba".
^((.)(?1)?\2|.)$ (@LilDevil) uses definition2 (it accepts "o" and "able-elba", and so also differs in its recognition of the "aaaaa" and "bbbb" strings).
^((.)(?1)\2|.?)$ (@Markus) does not detect "kook" or "bbbb".
^((.)(?1)*\2|.?)$ (@Csaba) uses definition2.
NOTE: to compare you can add more words at $subjects and a line for each compared regex,
if (preg_match('/^((.)(?:(?1)|.?)\2)$/',$sub)) echo " ...reg_base($sub)!\n";
if (preg_match('/^((.)(?1)?\2|.)$/',$sub)) echo " ...reg2($sub)!\n";
if (preg_match('/^((.)(?1)\2|.?)$/',$sub)) echo " ...reg3($sub)!\n";
if (preg_match('/^((.)(?1)*\2|.?)$/',$sub)) echo " ...reg4($sub)!\n";
Here's my answer to Regex Golf's 5th level (A man, a plan). It works for up to 7 characters with the browser's Regexp (I'm using Chrome 36.0.1985.143).
^(.)(.)(?:(.).?\3?)?\2\1$
Here's one for up to 9 characters
^(.)(.)(?:(.)(?:(.).?\4?)?\3?)?\2\1$
To increase the max number of characters it'd work for, you'd repeatedly replace .? with (?:(.).?\n?)?.
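The replacement rule above can be applied mechanically. Here is a minimal Python sketch of that idea; the function name golf_pattern and its extra_levels parameter are made up for illustration, and the sketch assumes Python's re module treats these backreferences the same way the browser engine does for these inputs.

```python
import re

def golf_pattern(extra_levels: int) -> str:
    """Apply the '.?' replacement rule extra_levels times to ^(.)(.).?\\2\\1$."""
    core = ".?"
    # Groups 1 and 2 are the fixed outer pair, so nested groups are numbered
    # from 3 upward. Build from the innermost (highest number) outward.
    for n in range(2 + extra_levels, 2, -1):
        core = f"(?:(.){core}\\{n}?)?"
    return f"^(.)(.){core}\\2\\1$"

if __name__ == "__main__":
    print(golf_pattern(1))  # the 7-character version
    print(golf_pattern(2))  # the 9-character version
    print(bool(re.match(golf_pattern(3), "abcdefedcba")))
```

Each extra level raises the maximum length by two characters.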
It's actually easier to do it with string manipulation rather than regular expressions:
bool isPalindrome(String s1)
{
String s2 = s1.reverse;
return s2 == s1;
}
I realize this doesn't really answer the interview question, but you could use it to show how you know a better way of doing a task, and you aren't the typical "person with a hammer, who sees every problem as a nail."
Regarding the PCRE expression (from MizardX):
/^((.)(?1)\2|.?)$/
Have you tested it? On my PHP 5.3 under Win XP Pro it fails on: aaaba
Actually, I modified the expression slightly, to read:
/^((.)(?1)*\2|.?)$/
I think what is happening is that while the outer pair of characters are anchored, the remaining inner ones are not. This is not quite the whole answer because while it incorrectly passes on "aaaba" and "aabaacaa", it does fail correctly on "aabaaca".
I wonder whether there is a fix for this, and also,
Does the Perl example (by JF Sebastian / Zsolt) pass my tests correctly?
Csaba Gabor from Vienna
/\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/
It is valid for the Oniguruma engine (which is used in Ruby).
Taken from Pragmatic Bookshelf.
In Perl (see also Zsolt Botykai's answer):
$re = qr/
. # single letter is a palindrome
|
(.) # first letter
(??{ $re })?? # apply recursively (not interpolated yet)
\1 # last letter
/x;
while(<>) {
chomp;
say if /^$re$/; # print palindromes
}
As pointed out by ZCHudson, determining whether something is a palindrome cannot be done with a usual regexp, as the set of palindromes is not a regular language.
I totally disagree with Airsource Ltd when he says that "it's not possible" is not the kind of answer the interviewer is looking for. In my interviews, I come to this kind of question when I face a good candidate, to check whether he can find the right argument when we propose that he do something wrong. I do not want to hire someone who will try to do something the wrong way if he knows a better one.
something you can do with perl: http://www.perlmonks.org/?node_id=577368
I would explain to the interviewer that the language consisting of palindromes is not a regular language but instead context-free.
The regular expression that would match all palindromes would be infinite. Instead I would suggest he restrict himself to a maximum size of palindromes to accept; or, if all palindromes are needed, use at minimum some type of NPDA (nondeterministic pushdown automaton), or just use the simple string reversal/equals technique.
The best you can do with regexes, before you run out of capture groups:
/(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\9\8\7\6\5\4\3\2\1/
This will match all palindromes up to 19 characters in length.
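The same pattern can be generated for any number of groups instead of being typed by hand. This is a rough Python sketch, not the answer's own code: the helper names are hypothetical, it anchors the pattern with fullmatch, and it assumes at most 99 groups because that is the limit for numeric backreferences in Python's re module.

```python
import re

def bounded_palindrome_pattern(groups=9):
    """Compile (.?)...(.?).?\\N...\\1, which covers lengths up to 2*groups + 1."""
    opening = "(.?)" * groups
    # Backreferences in reverse order mirror the captured characters.
    closing = "".join(f"\\{i}" for i in range(groups, 0, -1))
    return re.compile(opening + ".?" + closing, re.DOTALL)

def is_bounded_palindrome(text, groups=9):
    """True if the whole text is a palindrome no longer than 2*groups + 1."""
    return bounded_palindrome_pattern(groups).fullmatch(text) is not None

if __name__ == "__main__":
    for word in ["racecar", "abba", "ab", "a" * 19, "a" * 20]:
        print(word, is_bounded_palindrome(word))
```

A palindrome longer than the bound is rejected, so the limit has to be chosen with the expected input in mind.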
Programmatically solving for all lengths is trivial:
str == str.reverse ? true : false
I don't have the rep to comment inline yet, but the regex provided by MizardX, and modified by Csaba, can be modified further to make it work in PCRE. The only failure I have found is the single-char string, but I can test for that separately.
/^((.)(?1)?\2|.)$/
If you can make it fail on any other strings, please comment.
#!/usr/bin/perl
use strict;
use warnings;
print "Enter your string: ";
chomp(my $str = <STDIN>);
my $half = int(length($str) / 2);    # number of mirrored character pairs
if (length($str) > 0) {
    # Build an anchored pattern such as ^(.)(.).?\2\1$ for 4 or 5 characters.
    my $r = '^' . ('(.)' x $half) . '.?';
    for (my $i = $half; $i > 0; $i--) {
        $r .= "\\$i";
    }
    $r .= '$';
    # Match against the generated pattern, not a hard-coded one.
    if ($str =~ /$r/) {
        print "$str is a palindrome\n";
    }
    else {
        print "$str not a palindrome\n";
    }
    exit(0);
}
print "$str not a palindrome\n";
From automata theory, it is impossible to match palindromes of arbitrary length with a regular expression (because that requires an unbounded amount of memory). But IT IS POSSIBLE to match palindromes of fixed length.
For example, it is possible to write a regex that matches all palindromes of length <= 5 or <= 6, etc., but not one for lengths >= 5, etc., where the upper bound is unspecified.
In Ruby you can use \b(?'word'(?'letter'[a-z])\g'word'\k'letter+0'|[a-z])\b to match palindrome words such as a, dad, radar, racecar, and redivider. PS: this regex only matches palindrome words that have an odd number of letters.
Let's see how this regex matches radar. The word boundary \b matches at the start of the string. The regex engine enters the capturing group "word". [a-z] matches r which is then stored in the stack for the capturing group "letter" at recursion level zero. Now the regex engine enters the first recursion of the group "word". (?'letter'[a-z]) matches and captures a at recursion level one. The regex enters the second recursion of the group "word". (?'letter'[a-z]) captures d at recursion level two. During the next two recursions, the group captures a and r at levels three and four. The fifth recursion fails because there are no characters left in the string for [a-z] to match. The regex engine must backtrack.
The regex engine must now try the second alternative inside the group "word". The second [a-z] in the regex matches the final r in the string. The engine now exits from a successful recursion, going one level back up to the third recursion.
After matching \g'word' the engine reaches \k'letter+0'. The backreference fails because the regex engine has already reached the end of the subject string. So it backtracks once more. The second alternative now matches the a. The regex engine exits from the third recursion.
The regex engine has again matched \g'word' and needs to attempt the backreference again. The backreference specifies +0 or the present level of recursion, which is 2. At this level, the capturing group matched d. The backreference fails because the next character in the string is r. Backtracking again, the second alternative matches d.
Now, \k'letter+0' matches the second a in the string. That's because the regex engine has arrived back at the first recursion during which the capturing group matched the first a. The regex engine exits the first recursion.
The regex engine is now back outside all recursion. At this level, the capturing group stored r. The backreference can now match the final r in the string. Since the engine is not inside any recursion any more, it proceeds with the remainder of the regex after the group. \b matches at the end of the string. The end of the regex is reached and radar is returned as the overall match.
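The walkthrough above can be mimicked with an ordinary recursive function, which may make the per-level capture easier to see. This is only an illustrative Python model, not how Oniguruma is implemented: the names match_word and is_odd_palindrome_word are invented here, the \b anchors are replaced by a whole-string check, and each call's local variable letter stands in for the capture that \k'letter+0' reads at the same recursion level.

```python
def match_word(s, pos=0):
    """Yield every index at which the group 'word' could end when started at pos."""
    if pos >= len(s) or not ("a" <= s[pos] <= "z"):
        return  # [a-z] cannot match here, so both alternatives fail
    letter = s[pos]  # this level's own capture of 'letter'
    # First alternative: letter, recursive 'word', then the same letter again.
    for end in match_word(s, pos + 1):
        if end < len(s) and s[end] == letter:
            yield end + 1
    # Second alternative (tried on backtracking): a single letter.
    yield pos + 1

def is_odd_palindrome_word(s):
    """True if 'word' can consume the entire string."""
    return any(end == len(s) for end in match_word(s))

if __name__ == "__main__":
    for word in ["a", "dad", "radar", "racecar", "redivider", "abba"]:
        print(word, is_odd_palindrome_word(word))
```

Using a generator lets the caller ask for the next possible end position, which plays the role of backtracking into a recursion that already succeeded once.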
Here is PL/SQL code that tells whether a given string is a palindrome, using regular expressions:
create or replace procedure palin_test(palin in varchar2) is
tmp varchar2(100);
i number := 0;
BEGIN
tmp := palin;
for i in 1 .. length(palin)/2 loop
if length(tmp) > 1 then
if regexp_like(tmp,'^(^.).*(\1)$') = true then
tmp := substr(palin,i+1,length(tmp)-2);
else
dbms_output.put_line('not a palindrome');
exit;
end if;
end if;
if i >= length(palin)/2 then
dbms_output.put_line('Yes ! it is a palindrome');
end if;
end loop;
end palin_test;
my $pal='malayalam';
while($pal=~/^((.)(.*)\2)$/){ #checking palindrome word; anchored so only the first and last characters are compared
$pal=$3;
}
if ($pal=~/^.?$/i){ #matches single letter or no letter
print"palindrome\n";
}
else{
print"not palindrome\n";
}
This regex will detect palindromes up to 22 characters ignoring spaces, tabs, commas, and quotes.
\b(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*\11?[ \t,'"]*\10|\10?)[ \t,'"]*\9|\9?)[ \t,'"]*\8|\8?)[ \t,'"]*\7|\7?)[ \t,'"]*\6|\6?)[ \t,'"]*\5|\5?)[ \t,'"]*\4|\4?)[ \t,'"]*\3|\3?)[ \t,'"]*\2|\2?))?[ \t,'"]*\1\b
Play with it here: https://regexr.com/4tmui
I wrote an explanation of how I got that here: https://medium.com/analytics-vidhya/coding-the-impossible-palindrome-detector-with-a-regular-expressions-cd76bc23b89b
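For comparison, here is a small Python sketch of the same "ignore separators" idea without the length limit. It is not a translation of the regex: it swaps in plain string filtering plus reversal, the function name and the ignored parameter are made up, and unlike the regex it also accepts empty and single-character input.

```python
def is_loose_palindrome(text, ignored=" \t,'\""):
    """Compare the text with its reverse after dropping the ignored characters."""
    kept = [ch for ch in text if ch not in ignored]
    return kept == kept[::-1]

if __name__ == "__main__":
    print(is_loose_palindrome("a man, a plan, a canal, panama"))
    print(is_loose_palindrome("a man, a plan, a canoe, panama"))
```

The comparison is case-sensitive, so input would need to be lowercased first if case should be ignored.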
A slight refinement of Airsource Ltd's method, in pseudocode:
WHILE string.length > 1
    IF /^(.)(.*)\1$/ matches string
        string = \2
    ELSE
        REJECT
ACCEPT
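In Python the loop might look like the following sketch. The function name is an assumption, and fullmatch is used so that the pattern has to cover the whole remaining string, which forces the captured character to be the first and the last one.

```python
import re

# One character, anything in between, then the same character again.
PEEL = re.compile(r"(.)(.*)\1", re.DOTALL)

def is_palindrome(s):
    """Strip matching outer characters until at most one character is left."""
    while len(s) > 1:
        m = PEEL.fullmatch(s)
        if m is None:
            return False  # first and last characters differ: REJECT
        s = m.group(2)    # continue with the inner part
    return True           # zero or one character left: ACCEPT

if __name__ == "__main__":
    for word in ["malayalam", "abba", "xabay", "ab", "a"]:
        print(word, is_palindrome(word))
```

Each iteration removes two characters, so the loop runs at most half the string's length.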